diff --git a/mbproxy/CLAUDE.md b/mbproxy/CLAUDE.md index 6c6e256..5d06905 100644 --- a/mbproxy/CLAUDE.md +++ b/mbproxy/CLAUDE.md @@ -21,7 +21,7 @@ The integration win is that upstream consumers (Wonderware / Historian / OPC UA ## Architecture -The full design plan is in **[`docs/design.md`](docs/design.md)** — settled 2026-05-13, updated for Phase 9 multiplexing on 2026-05-14. Headline choices the agent should keep in mind without opening that file: +The full architecture is documented under **[`docs/`](docs/)** — see the `Architecture/`, `Features/`, `Operations/`, `Reference/`, and `Testing/` pages. Headline choices the agent should keep in mind without opening those files: - **One `TcpListener` per PLC** (54 distinct ports). Each PLC has **one shared backend socket** owned by a `PlcMultiplexer`; many upstream clients are multiplexed onto that single backend via MBAP TxId rewriting (Phase 9). The H2-ECOM100's 4-client cap no longer caps upstream connections. - **Transparent by default; opt-in cached** (Phase 11). Every byte passes through unchanged except the MBAP TxId field (rewritten by the multiplexer on each request and restored on each response) and FC03/FC04 response payloads + FC06/FC16 request payloads at configured BCD addresses (re-encoded between BCD nibbles and binary integers). With Phase 11, FC03/FC04 reads for tags whose `CacheTtlMs > 0` may be served from a per-PLC in-process cache without backend traffic; the cache is **OFF by default** per tag. @@ -33,11 +33,11 @@ The full design plan is in **[`docs/design.md`](docs/design.md)** — settled 20 - **Backend disconnect cascades upstream**: when the shared backend socket dies, every attached upstream pipe is closed in the same cycle (counter `BackendDisconnectCascades`); clients reconnect on their next request. 
- **Read-only Kestrel admin port** (default 8080) exposes `GET /` (auto-refreshing HTML) and `GET /status.json` with service-wide and per-PLC counters (including Phase-9 mux fields, Phase-10 coalescing fields, and Phase-11 cache fields `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes`). -Anything beyond this short list — JSON schema, propagation table, stable log event names, status counter catalog, test plan — lives in `docs/design.md`. Open that doc before writing code; keep it in sync when decisions change. +Anything beyond this short list lives in the `docs/` tree: the appsettings.json schema in [`docs/Operations/Configuration.md`](docs/Operations/Configuration.md), config propagation in [`docs/Features/HotReload.md`](docs/Features/HotReload.md), stable log event names in [`docs/Reference/LogEvents.md`](docs/Reference/LogEvents.md), the status counter catalog in [`docs/Operations/StatusPage.md`](docs/Operations/StatusPage.md), and the simulator-backed test fixture in [`docs/Testing/Simulator.md`](docs/Testing/Simulator.md). Open the relevant page before writing code; keep it in sync when decisions change. ## Current state -**Implementation complete through Phase 11.** Phases 00–08 shipped the production-ready 1:1-model service; Phase 9 swapped the connection layer for the TxId-multiplexed model; Phase 10 added in-flight read coalescing on top; Phase 11 added an opt-in per-tag response cache (bounded staleness, OFF by default — see "Response cache" in `docs/design.md`). The service is production-ready as a Windows Service: +**Implementation complete through Phase 11.** Phases 00–08 shipped the production-ready 1:1-model service; Phase 9 swapped the connection layer for the TxId-multiplexed model; Phase 10 added in-flight read coalescing on top; Phase 11 added an opt-in per-tag response cache (bounded staleness, OFF by default — see [`docs/Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md)). 
The service is production-ready as a Windows Service: - Test count grew through Phase 11 (see `tests/Mbproxy.Tests/` for the current suite; previous baseline was 325 = 282 unit + 43 E2E). - Single-file self-contained publish (`dotnet publish -c Release -r win-x64`). @@ -49,42 +49,43 @@ Anything beyond this short list — JSON schema, propagation table, stable log e - Phase 9 per-request watchdog defends against any backend that drops or mis-echoes a response (real-world packet loss; pymodbus 3.13 simulator's concurrent-multiplexed-request bug). - `AssemblyInformationalVersion` set to `1.0.0` (CI can override via `/p:InformationalVersion=...`). -The human-facing entry point is **[`README.md`](README.md)**. All design decisions remain in [`docs/design.md`](docs/design.md). +The human-facing entry point is **[`README.md`](README.md)**. All design decisions live in the [`docs/`](docs/) tree. -Constraints that still apply to this codebase (do not change without updating the design doc): +Constraints that still apply to this codebase (do not change without updating the relevant `docs/` page): - The csproj targets **.NET 10** (`net10.0`). This is the **only** tool in `wwtools/` not pinned to .NET Framework 4.8 / x86. -- The sample test `DL260/DL205BcdQuirkTests.cs` is a pattern reference only — its types are not available in this project. ## Device quirks (read before writing Modbus code) -The DL205/DL260 family is *almost* Modbus-spec-compliant, but every category below has at least one trap. The authoritative reference is **[`DL260/dl205.md`](DL260/dl205.md)** — read it end-to-end before touching the wire protocol. Highlights that bear directly on this proxy: +The DL205/DL260 family is *almost* Modbus-spec-compliant, but every category below has at least one trap. The authoritative reference is **[`docs/Reference/dl205.md`](docs/Reference/dl205.md)** — read it end-to-end before touching the wire protocol. 
Highlights that bear directly on this proxy: - **BCD-by-default numeric encoding.** `V2000 = 1234` stores `0x1234` on the wire, not `0x04D2`. This is the entire reason this service exists. - **CDAB word order for 32-bit values.** Low word first, big-endian bytes within each word. `0xAABBCCDD` lands as `[0xCC 0xDD][0xAA 0xBB]`. - **Octal V-memory ↔ decimal Modbus translation.** `V2000` octal = decimal 1024 = Modbus PDU `0x0400`. Config addresses are PDU-decimal, **not** octal V-memory and **not** 1-based 4xxxx. - **FC03/FC04 max qty = 128** (above spec's 125). **FC16 max qty = 100** (below spec's 123). The proxy passes these through; the PLC enforces the cap with exception 03. -- **Max 4 concurrent TCP clients per ECOM100.** Direct constraint on this proxy's 1:1 connection model — see [`docs/design.md`](docs/design.md) → "Connection model" for the band-aid-vs-rearchitect decision tree if this becomes a real problem. +- **Max 4 concurrent TCP clients per ECOM100.** This is why the proxy uses a single TxId-multiplexed backend socket per PLC — see [`docs/Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) for how the connection model lifts this cap. - **No TCP keepalive from the device.** Middleboxes typically drop idle sockets at 2–5 min. With the 1:1 model, backend liveness tracks upstream client liveness; if both are idle long enough, the path dies on its own and the next request reconnects. - **Register 0 is valid** on DL205/DL260 in factory "absolute" addressing mode — don't probe-skip it. -- **As-deployed PLC parameters** (captured in `DL260/mbtcp_settings.JPG`): port 502, "Use Concept data structures (Longs/Reals)" enabled, "Swap bytes" enabled, "Use Zero Based Addressing" **unchecked**, Register type = Binary, max coil read 1976 / coil write 800 / register read 122 / register write 100. The proxy must speak Modbus as-is; these settings describe the wire it'll see. 
+- **As-deployed PLC parameters** (captured in `docs/Reference/mbtcp_settings.JPG`): port 502, "Use Concept data structures (Longs/Reals)" enabled, "Swap bytes" enabled, "Use Zero Based Addressing" **unchecked**, Register type = Binary, max coil read 1976 / coil write 800 / register read 122 / register write 100. The proxy must speak Modbus as-is; these settings describe the wire it'll see. ## Resource index | Task | Go to | | --- | --- | -| Full architecture / design plan (decisions, schema, log events, status counters, test plan) | [`docs/design.md`](docs/design.md) | -| Phase-by-phase implementation plan (parallel-safety, phase gates, per-phase test list) | [`docs/plan/README.md`](docs/plan/README.md) | -| Dashboard KPI catalogue — what's exposed today and proposed additions (rates, percentiles, availability, fleet aggregates) | [`docs/kpi.md`](docs/kpi.md) | -| DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits, exception codes, oddities) | [`DL260/dl205.md`](DL260/dl205.md) | -| pymodbus simulator profile that models those quirks as concrete register values | [`DL260/dl205.json`](DL260/dl205.json) | -| Example integration test pattern (xUnit + Shouldly + simulator fixture) | [`DL260/DL205BcdQuirkTests.cs`](DL260/DL205BcdQuirkTests.cs) | -| As-deployed PLC Modbus parameters screenshot | [`DL260/mbtcp_settings.JPG`](DL260/mbtcp_settings.JPG) | +| Architecture — listener topology, request flow, per-PLC isolation | [`docs/Architecture/Overview.md`](docs/Architecture/Overview.md) | +| Connection model — single backend socket per PLC, TxId multiplexing, request-timeout watchdog, disconnect cascade | [`docs/Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) | +| In-flight read coalescing / opt-in response cache | [`docs/Architecture/ReadCoalescing.md`](docs/Architecture/ReadCoalescing.md), [`docs/Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) | +| BCD rewriting (codec, CDAB word order, FC03/04/06/16 scope) and 
config hot-reload | [`docs/Features/BcdRewriting.md`](docs/Features/BcdRewriting.md), [`docs/Features/HotReload.md`](docs/Features/HotReload.md) | +| Operations — full appsettings.json reference, status page / JSON fields, troubleshooting playbook | [`docs/Operations/Configuration.md`](docs/Operations/Configuration.md), [`docs/Operations/StatusPage.md`](docs/Operations/StatusPage.md), [`docs/Operations/Troubleshooting.md`](docs/Operations/Troubleshooting.md) | +| Stable `mbproxy.*` log event-name catalog | [`docs/Reference/LogEvents.md`](docs/Reference/LogEvents.md) | +| DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits, exception codes, oddities) | [`docs/Reference/dl205.md`](docs/Reference/dl205.md) | +| pymodbus simulator profile that models those quirks as concrete register values | [`tests/sim/dl205.json`](tests/sim/dl205.json) | +| As-deployed PLC Modbus parameters screenshot | [`docs/Reference/mbtcp_settings.JPG`](docs/Reference/mbtcp_settings.JPG) | ## Maintenance Documentation doctrine for `wwtools/` lives in [`../DOCS-GUIDE.md`](../DOCS-GUIDE.md). The three-layer rules apply: - **[`README.md`](README.md)** is the canonical human entry point (Layer-2 per DOCS-GUIDE). It routes to deep docs; it does not duplicate them. Update it when the service's public surface or install steps change. -- This `CLAUDE.md` stays a router for LLM coding agents. Deep design decisions live in [`docs/design.md`](docs/design.md); device quirks live in [`DL260/dl205.md`](DL260/dl205.md). When you change a design decision, update `docs/design.md` first (it's the source of truth) and only mirror the change into the Architecture summary above if it shifts one of the headline bullets. +- This `CLAUDE.md` stays a router for LLM coding agents. Deep design decisions live in the [`docs/`](docs/) tree; device quirks live in [`docs/Reference/dl205.md`](docs/Reference/dl205.md). 
When you change a design decision, update the relevant page under `docs/` first (it's the source of truth) and only mirror the change into the Architecture summary above if it shifts one of the headline bullets. - When the service's task→tool mapping changes in the root index, update [`../CLAUDE.md`](../CLAUDE.md) too. -- Any further work beyond Phase 08 belongs in a new design revision (dated, in `docs/design.md`) and a new phase plan. +- Any further design changes belong in the relevant `docs/` page (`Architecture/`, `Features/`, `Operations/`, `Reference/`, or `Testing/`). diff --git a/mbproxy/DL260/DL205BcdQuirkTests.cs b/mbproxy/DL260/DL205BcdQuirkTests.cs deleted file mode 100644 index 9cb0aac..0000000 --- a/mbproxy/DL260/DL205BcdQuirkTests.cs +++ /dev/null @@ -1,56 +0,0 @@ -using Shouldly; -using Xunit; - -namespace ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests.DL205; - -/// -/// Verifies DL205/DL260 binary-coded-decimal register handling against the -/// dl205.json pymodbus profile. HR[1072] = 0x1234 on the profile represents -/// decimal 1234 (BCD nibbles). Reading it as would -/// return 0x1234 = 4660; the path decodes 1234. 
-/// -[Collection(ModbusSimulatorCollection.Name)] -[Trait("Category", "Integration")] -[Trait("Device", "DL205")] -public sealed class DL205BcdQuirkTests(ModbusSimulatorFixture sim) -{ - [Fact] - public async Task DL205_BCD16_decodes_HR1072_as_decimal_1234() - { - if (sim.SkipReason is not null) Assert.Skip(sim.SkipReason); - if (!string.Equals(Environment.GetEnvironmentVariable("MODBUS_SIM_PROFILE"), "dl205", - StringComparison.OrdinalIgnoreCase)) - { - Assert.Skip("MODBUS_SIM_PROFILE != dl205 — skipping (standard profile does not seed HR[1072])."); - } - - var options = new ModbusDriverOptions - { - Host = sim.Host, - Port = sim.Port, - UnitId = 1, - Timeout = TimeSpan.FromSeconds(2), - Tags = - [ - new ModbusTagDefinition("DL205_Count_Bcd", - ModbusRegion.HoldingRegisters, Address: 1072, - DataType: ModbusDataType.Bcd16, Writable: false), - new ModbusTagDefinition("DL205_Count_Int16", - ModbusRegion.HoldingRegisters, Address: 1072, - DataType: ModbusDataType.Int16, Writable: false), - ], - Probe = new ModbusProbeOptions { Enabled = false }, - }; - await using var driver = new ModbusDriver(options, driverInstanceId: "dl205-bcd"); - await driver.InitializeAsync("{}", TestContext.Current.CancellationToken); - - var results = await driver.ReadAsync(["DL205_Count_Bcd", "DL205_Count_Int16"], - TestContext.Current.CancellationToken); - - results[0].StatusCode.ShouldBe(0u); - results[0].Value.ShouldBe(1234, "DL205 BCD register 0x1234 represents decimal 1234 per the DirectLOGIC convention"); - - results[1].StatusCode.ShouldBe(0u); - results[1].Value.ShouldBe((short)0x1234, "same register read as Int16 returns the raw 0x1234 = 4660 value — proves BCD path is distinct"); - } -} diff --git a/mbproxy/README.md b/mbproxy/README.md index 2e62fd5..bdf916f 100644 --- a/mbproxy/README.md +++ b/mbproxy/README.md @@ -19,24 +19,20 @@ src/Mbproxy/ Main C# project (net10.0, Microsoft.NET.Sdk.Worker) tests/Mbproxy.Tests/ xUnit v3 test project (314 unit + 48 E2E tests) install/ 
PowerShell install/uninstall scripts and config template docs/ Architecture, features, operations, reference, and testing docs -DL260/ DL205/DL260 reference material and pymodbus simulator profile ``` ## Resource index | Task | Go to | |---|---| -| End-to-end architectural design (entry point — routes into focused docs below) | [`docs/design.md`](docs/design.md) | -| Phase-by-phase implementation plan and history | [`docs/plan/README.md`](docs/plan/README.md) | -| Install, upgrade, uninstall, log file locations, first-install smoke checklist | [`docs/operations.md`](docs/operations.md) | -| Dashboard KPI catalog | [`docs/kpi.md`](docs/kpi.md) | -| DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits) | [`DL260/dl205.md`](DL260/dl205.md) | -| pymodbus simulator profile (register seeds for E2E tests) | [`DL260/dl205.json`](DL260/dl205.json) | +| Architecture entry point — listener topology, request flow, per-PLC isolation | [`docs/Architecture/Overview.md`](docs/Architecture/Overview.md) | +| DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits) | [`docs/Reference/dl205.md`](docs/Reference/dl205.md) | +| pymodbus simulator profile (register seeds for E2E tests) | [`tests/sim/dl205.json`](tests/sim/dl205.json) | | Agent-oriented coding guide (architecture bullets, device quirks, phase context) | [`CLAUDE.md`](CLAUDE.md) | ## Detailed documentation -The `docs/` tree is organized by topic. Start with [`docs/design.md`](docs/design.md) for the canonical end-to-end design; jump to the focused pages below when you need depth on one area. +The `docs/` tree is organized by topic. Start with [`Architecture/Overview.md`](docs/Architecture/Overview.md) for the end-to-end picture; jump to the focused pages below when you need depth on one area. ### Architecture @@ -106,7 +102,7 @@ Edit `src/Mbproxy/appsettings.json` to configure PLCs before running. The admin ## Install -Full detail is in [`docs/operations.md`](docs/operations.md). 
Quick path: +The `install/` directory holds the publish, install, and uninstall scripts. Quick path: ```powershell # 1. Publish (produces publish-out\self-contained\ and publish-out\framework-dependent\) @@ -126,5 +122,5 @@ Invoke-WebRequest http://localhost:8080/ -UseBasicParsing Documentation doctrine for this repo: [`../DOCS-GUIDE.md`](../DOCS-GUIDE.md). - This README routes to deep docs — it does not duplicate them. -- Design decisions: [`docs/design.md`](docs/design.md) is the source of truth. +- Design decisions and rationale live in the `docs/` tree (Architecture, Features, Operations, Reference, Testing). - When the service's public surface or task→tool mapping changes, update this README and the root [`../CLAUDE.md`](../CLAUDE.md) index row. diff --git a/mbproxy/docs/Architecture/ConnectionModel.md b/mbproxy/docs/Architecture/ConnectionModel.md index 75fd10f..b355eae 100644 --- a/mbproxy/docs/Architecture/ConnectionModel.md +++ b/mbproxy/docs/Architecture/ConnectionModel.md @@ -4,7 +4,7 @@ The proxy holds one persistent backend TCP socket per PLC and multiplexes many u ## Why One Backend Connection Per PLC -An earlier design opened a fresh backend socket for each accepted upstream client (1:1 pairs). That model collapsed against the **AutomationDirect H2-ECOM100**, which caps simultaneous TCP clients at **4 per PLC** (see [`../../DL260/dl205.md`](../../DL260/dl205.md) under "Behavioural Oddities"). The fifth upstream client to attach to a busy PLC was refused at connect, with no recourse other than waiting for an existing pair to drop. +An earlier design opened a fresh backend socket for each accepted upstream client (1:1 pairs). That model collapsed against the **AutomationDirect H2-ECOM100**, which caps simultaneous TCP clients at **4 per PLC** (see [`../Reference/dl205.md`](../Reference/dl205.md) under "Behavioural Oddities"). 
The fifth upstream client to attach to a busy PLC was refused at connect, with no recourse other than waiting for an existing pair to drop. Multiplexing replaces 1:N upstream-to-backend with N:1 upstream-to-multiplexer-to-backend: @@ -244,4 +244,4 @@ The per-request timeout watchdog described above is the production defence again - [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades` counters - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.multiplex.*` structured log events - [`../Testing/Simulator.md`](../Testing/Simulator.md) — pymodbus 3.13.0 deferred-handler quirk in detail -- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205/DL260 quirks including the 4-client ECOM cap +- [`../Reference/dl205.md`](../Reference/dl205.md) — DL205/DL260 quirks including the 4-client ECOM cap diff --git a/mbproxy/docs/Architecture/Overview.md b/mbproxy/docs/Architecture/Overview.md index f6f7819..ae8e37b 100644 --- a/mbproxy/docs/Architecture/Overview.md +++ b/mbproxy/docs/Architecture/Overview.md @@ -145,6 +145,4 @@ The simulator used by the end-to-end test suite — a `pymodbus`-based stand-in - [`../Operations/Configuration.md`](../Operations/Configuration.md) — `appsettings.json` schema and tag list shape. - [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — the Kestrel admin endpoint and counter catalog. - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — stable structured log event names. -- [`../design.md`](../design.md) — canonical design decisions and rationale. - [`../Testing/Simulator.md`](../Testing/Simulator.md) — `pymodbus` DL205 simulator used by the end-to-end suite. -- [`../plan/README.md`](../plan/README.md) — phase plan with per-phase test inventory. 
diff --git a/mbproxy/docs/Architecture/ResponseCache.md b/mbproxy/docs/Architecture/ResponseCache.md index 9a47202..5462d74 100644 --- a/mbproxy/docs/Architecture/ResponseCache.md +++ b/mbproxy/docs/Architecture/ResponseCache.md @@ -405,5 +405,3 @@ configuration described above. `mbproxy.cache.*` event catalogue with event IDs. - [`../Testing/Simulator.md`](../Testing/Simulator.md) — the `pymodbus` DL205 stand-in used by the end-to-end cache tests. -- [`../design.md`](../design.md) — canonical design decisions and - rationale. diff --git a/mbproxy/docs/Features/BcdRewriting.md b/mbproxy/docs/Features/BcdRewriting.md index aab4f3a..02621f7 100644 --- a/mbproxy/docs/Features/BcdRewriting.md +++ b/mbproxy/docs/Features/BcdRewriting.md @@ -4,7 +4,7 @@ The BCD rewriter is the inline codec that translates DirectLOGIC's native Binary ## Why BCD Rewriting Exists -The DL205 / DL260 family stores numeric V-memory register values in native BCD, not binary. The decimal integer `1234` in `V2000` lands on the Modbus wire as `0x1234` (nibbles `1`, `2`, `3`, `4`) — not as the binary `0x04D2`. See [`../../DL260/dl205.md`](../../DL260/dl205.md) for the device-side rationale and the V-memory ↔ Modbus translation rules. +The DL205 / DL260 family stores numeric V-memory register values in native BCD, not binary. The decimal integer `1234` in `V2000` lands on the Modbus wire as `0x1234` (nibbles `1`, `2`, `3`, `4`) — not as the binary `0x04D2`. See [`../Reference/dl205.md`](../Reference/dl205.md) for the device-side rationale and the V-memory ↔ Modbus translation rules. Upstream consumers (Wonderware, Historian, OPC UA gateways, generic Modbus clients written against the standard) expect plain binary integers. Asking every consumer to BCD-decode the wire is brittle: each consumer would carry the same tag list, the same word-order quirks, and the same risk of drift. 
The rewriter centralises that translation so the rest of the world sees plain `Int16` / `Int32` and the proxy is the single source of truth for "which addresses are BCD." @@ -18,7 +18,7 @@ A 32-bit BCD value spans a register pair at `Address` and `Address+1` in CDAB (l - The register at `Address+1` holds the **high 4 BCD digits**. - Decoded decimal = `Decode16(high) * 10_000 + Decode16(low)`. -This follows directly from DirectLOGIC's CDAB word convention (see [`../../DL260/dl205.md`](../../DL260/dl205.md) → Word Order). +This follows directly from DirectLOGIC's CDAB word convention (see [`../Reference/dl205.md`](../Reference/dl205.md) → Word Order). Worked example — the register pair `[0x1234][0x5678]` reads on the wire as the low word `0x1234` first and the high word `0x5678` second: @@ -249,4 +249,4 @@ A few invariants the rewriter relies on and the test suite enforces: - [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) — diagnosing partial-overlap warnings - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.rewrite.*` event catalogue - [`../Testing/Simulator.md`](../Testing/Simulator.md) — the `dl205.json` simulator profile that encodes BCD test fixtures -- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205 / DL260 BCD encoding, CDAB word order, and V-memory ↔ Modbus translation +- [`../Reference/dl205.md`](../Reference/dl205.md) — DL205 / DL260 BCD encoding, CDAB word order, and V-memory ↔ Modbus translation diff --git a/mbproxy/docs/Operations/StatusPage.md b/mbproxy/docs/Operations/StatusPage.md index 6cd1c5f..262d375 100644 --- a/mbproxy/docs/Operations/StatusPage.md +++ b/mbproxy/docs/Operations/StatusPage.md @@ -317,9 +317,9 @@ curl -s http://mbproxy-host:8080/status.json | Prometheus-style scrapers should poll `/status.json` directly and translate fields into their own metric names; the service does not expose Prometheus exposition format. 
-## Where the KPIs Live +## Scope of This Document -This document covers the **endpoint surface**: what is on the wire and how each field is computed. The **dashboard composition** — which counters roll up into which Grafana panels, alerting thresholds, fleet-aggregate definitions — lives in [`../kpi.md`](../kpi.md). Keep the two documents disjoint: when a new counter is added, list it here; when a new panel or rate calculation is added, add it to `kpi.md`. +This document covers the **endpoint surface**: what is on the wire and how each field is computed. When a new counter is added, list it here. ## Related Documentation @@ -331,4 +331,3 @@ This document covers the **endpoint surface**: what is on the wire and how each - [`./Configuration.md`](./Configuration.md) — `Mbproxy.AdminPort` and other option keys. - [`./Troubleshooting.md`](./Troubleshooting.md) — using these counters to diagnose specific failure modes. - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — event-id catalogue including `mbproxy.admin.bind.failed`. -- [`../kpi.md`](../kpi.md) — dashboard catalog that consumes these counters. diff --git a/mbproxy/docs/Operations/Troubleshooting.md b/mbproxy/docs/Operations/Troubleshooting.md index 8c07d45..3d65716 100644 --- a/mbproxy/docs/Operations/Troubleshooting.md +++ b/mbproxy/docs/Operations/Troubleshooting.md @@ -101,7 +101,7 @@ The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-.log`. The l Test-NetConnection -ComputerName -Port 502 ``` -2. Verify the host/port in `appsettings.json` matches the PLC's actual settings (see `DL260/mbtcp_settings.JPG` for the as-deployed values). +2. Verify the host/port in `appsettings.json` matches the PLC's actual settings (see `docs/Reference/mbtcp_settings.JPG` for the as-deployed values). 3. If `Test-NetConnection` succeeds but the proxy still fails, inspect the upstream client count for that PLC on the status page — if it is at 4 and a new connect attempt fires, the ECOM cap is the cause. 4. 
If the PLC has rebooted, the supervisor retries automatically on the Polly backend-connect pipeline (3 attempts at 100ms / 500ms / 2000ms per upstream request). diff --git a/mbproxy/DL260/dl205.md b/mbproxy/docs/Reference/dl205.md similarity index 89% rename from mbproxy/DL260/dl205.md rename to mbproxy/docs/Reference/dl205.md index b5ff16d..2c43c1c 100644 --- a/mbproxy/DL260/dl205.md +++ b/mbproxy/docs/Reference/dl205.md @@ -267,29 +267,3 @@ Test names: `DL205_5th_TCP_connection_refused`, `DL205_socket_closes_on_malformed_MBAP`. -## References - -1. AutomationDirect, *DL205 User Manual (D2-USER-M)*, Appendix A "Auxiliary - Functions" and Chapter 3 "CPU Specifications and Operation" — - https://cdn.automationdirect.com/static/manuals/d2userm/d2userm.html -2. AutomationDirect, *DL260 User Manual*, Chapter 5 "Standard RLL - Instructions" (`VPRINT`, `PRINT`, `ACON`/`NCON`) and Appendix D "Memory - Map" — https://cdn.automationdirect.com/static/manuals/d2userm/d2userm.html -3. Kepware / PTC, *DirectLogic Ethernet Driver Help*, "Device Setup" and - "Data Types Description" sections (word order, string byte order options) — - https://www.kepware.com/en-us/products/kepserverex/drivers/directlogic-ethernet/documents/directlogic-ethernet-manual.pdf -4. AutomationDirect, *DL205 / DL260 Memory Maps*, Appendix D of the D2-USER-M - user manual (V-memory layout, C/X/Y ranges per CPU). -5. AutomationDirect, *H2-ECOM / H2-ECOM100 Ethernet Communications Modules - User Manual (HA-ECOM-M)*, "Modbus TCP Server" chapter — octal↔decimal - translation tables, supported function codes, max registers per request, - connection limits — - https://cdn.automationdirect.com/static/manuals/hxecomm/hxecomm.html -6. Inductive Automation, *Ignition Modbus Driver — Address Mapping*, word - order options (ABCD/CDAB/BADC/DCBA) — - https://docs.inductiveautomation.com/docs/8.1/ignition-modules/opc-ua/drivers/modbus-v2 -7. 
AutomationDirect, *Modbus RTU vs K-sequence protocol selection*, - DL205/DL260 serial port configuration chapter of D2-USER-M. -8. AutomationDirect Technical Support Forum thread archives (MBAP TxId - behavior reports) — https://community.automationdirect.com/ (search: - "ECOM100 transaction id"). _Unconfirmed_ operator reports only. diff --git a/mbproxy/DL260/mbtcp_settings.JPG b/mbproxy/docs/Reference/mbtcp_settings.JPG similarity index 100% rename from mbproxy/DL260/mbtcp_settings.JPG rename to mbproxy/docs/Reference/mbtcp_settings.JPG diff --git a/mbproxy/docs/Testing/Simulator.md b/mbproxy/docs/Testing/Simulator.md index bb84c90..bc9d670 100644 --- a/mbproxy/docs/Testing/Simulator.md +++ b/mbproxy/docs/Testing/Simulator.md @@ -4,9 +4,9 @@ The pymodbus DL205 simulator stands in for real DL205/DL260 hardware in the E2E ## Why a Simulator -`mbproxy` targets a fleet of AutomationDirect DL205/DL260 controllers that test machines do not have. The pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) already models the device-side quirks (BCD nibbles at known holding-register addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings) as concrete register seeds. The harness wraps that profile in an xUnit `IAsyncLifetime` fixture so every E2E test class opens against a fresh known-good DL-series target without manual setup. +`mbproxy` targets a fleet of AutomationDirect DL205/DL260 controllers that test machines do not have. The pymodbus profile at [`../../tests/sim/dl205.json`](../../tests/sim/dl205.json) already models the device-side quirks (BCD nibbles at known holding-register addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings) as concrete register seeds. The harness wraps that profile in an xUnit `IAsyncLifetime` fixture so every E2E test class opens against a fresh known-good DL-series target without manual setup. -The device-side rationale for each seed (why HR 1072 is `0x1234`, why FC03 caps at 128, etc.) 
lives in [`../../DL260/dl205.md`](../../DL260/dl205.md). The harness exists to make that profile addressable from xUnit tests; it does not duplicate the device documentation. +The device-side rationale for each seed (why HR 1072 is `0x1234`, why FC03 caps at 128, etc.) lives in [`../Reference/dl205.md`](../Reference/dl205.md). The harness exists to make that profile addressable from xUnit tests; it does not duplicate the device documentation. ## Harness Layout @@ -72,7 +72,7 @@ if (_sim.SkipReason is not null) Assert.Skip(_sim.SkipReason); ``` -The unit-test suite (any test without `[Trait("Category", "E2E")]`) runs without any Python at all. CI machines must have Python 3.10+ and PowerShell 7+; local developers running only unit tests need nothing extra. The phase-01 gate (see [`../plan/README.md`](../plan/README.md)) explicitly verifies that on a machine with Python and pymodbus installed, none of the smoke tests skip — a skip on a properly equipped CI machine is treated as an environment failure, not a test pass. +The unit-test suite (any test without `[Trait("Category", "E2E")]`) runs without any Python at all. CI machines must have Python 3.10+ and PowerShell 7+; local developers running only unit tests need nothing extra. The unit-test suite's no-skip policy explicitly verifies that on a machine with Python and pymodbus installed, none of the smoke tests skip — a skip on a properly equipped CI machine is treated as an environment failure, not a test pass. The skip reasons the fixture produces map cleanly onto the recovery action: @@ -146,7 +146,7 @@ The connection-model rationale for why the multiplexer produces multi-frame recv ## Simulator Profile -`DL260/dl205.json` is the pymodbus server config. It seeds the registers the E2E tests assert against: +`tests/sim/dl205.json` is the pymodbus server config. 
It seeds the registers the E2E tests assert against: | Address | Width | Seeded value | Used to prove | |---------|-------|--------------|---------------| @@ -155,7 +155,7 @@ The connection-model rationale for why the multiplexer produces multi-frame recv | HR 1072 | uint16 | `0x1234` (raw BCD nibbles) | Single-register FC03 BCD decode through the proxy | | HR 1080/1081 | uint16 pair | CDAB-ordered 32-bit BCD | 32-bit BCD decode across the word pair | -The full register map and the device-side rationale for each entry live in [`../../DL260/dl205.md`](../../DL260/dl205.md). +The full register map and the device-side rationale for each entry live in [`../Reference/dl205.md`](../Reference/dl205.md). Two profile-level settings are load-bearing for the harness: @@ -166,7 +166,7 @@ The `write` block in the JSON controls which ranges accept FC06/FC16. Writes out ## Alternate Profiles -The `MODBUS_SIM_PROFILE` environment variable selects an alternate profile alongside `dl205.json`. This is the seam for scenario-specific simulators — for example, a profile with `"type exception": true` to verify the proxy does not depend on the default lax pymodbus behaviour, or a profile that seeds a specific partial-overlap test case at a known address. The existing pattern is `DL260/DL205BcdQuirkTests.cs`, which already drives the simulator with profile-driven assertions. When a new scenario needs its own profile, drop the JSON alongside `dl205.json` and select it via the env var rather than swapping the default — the default profile is the contract for the smoke tests and `MultiplexerE2ETests` and should not be silently mutated. +The `MODBUS_SIM_PROFILE` environment variable selects an alternate profile alongside `dl205.json`. This is the seam for scenario-specific simulators — for example, a profile with `"type exception": true` to verify the proxy does not depend on the default lax pymodbus behaviour, or a profile that seeds a specific partial-overlap test case at a known address. 
When a new scenario needs its own profile, drop the JSON alongside `dl205.json` and select it via the env var rather than swapping the default — the default profile is the contract for the smoke tests and `MultiplexerE2ETests` and should not be silently mutated. ## Running the Simulator Standalone @@ -231,5 +231,6 @@ The read direction proves the proxy rewrote the response; the write direction pr - [Connection Model](../Architecture/ConnectionModel.md) — why the multiplexer's shared backend connection produces the multi-frame condition that triggers pymodbus's framer quirk - [Troubleshooting](../Operations/Troubleshooting.md) — hang-diagnosis pattern for tests that exceed their `[Fact(Timeout)]` - [Log Events](../Reference/LogEvents.md) — `mbproxy.multiplex.request.timeout` is the production watchdog against TxId mis-echo -- [DL205/DL260 device quirks](../../DL260/dl205.md) — device-side rationale for every register the simulator profile seeds -- [Phase plan README](../plan/README.md) — Test discipline section that codifies the 5 000 ms default and the `--blame-hang-timeout` rule +- [DL205/DL260 device quirks](../Reference/dl205.md) — device-side rationale for every register the simulator profile seeds + +Test discipline: E2E tests default to a 5 000 ms `[Fact(Timeout)]`, and `dotnet test` is run with `--blame-hang-timeout` to capture a dump on any hang. diff --git a/mbproxy/docs/design.md b/mbproxy/docs/design.md deleted file mode 100644 index 45c7378..0000000 --- a/mbproxy/docs/design.md +++ /dev/null @@ -1,306 +0,0 @@ -# mbproxy — design plan - -Architectural design for the `mbproxy` Modbus TCP proxy service: how it fronts ~54 AutomationDirect DirectLOGIC DL205/DL260 controllers, rewrites BCD tags bidirectionally inline, and recovers from listener and backend failures. Settled in a design Q&A on 2026-05-13. - -**Status:** plan; no code yet. Each decision below is load-bearing — change deliberately, not by drift. 
- -Context (what the service does and why it exists) lives in [`../CLAUDE.md`](../CLAUDE.md) under "What this is" and "Purpose: bidirectional BCD rewrite". This file is the *how*. Device quirks the design depends on live in [`../DL260/dl205.md`](../DL260/dl205.md). - -Runtime shape: **.NET 10 Generic Host** worker service registered as a **Windows Service** via `Microsoft.Extensions.Hosting.WindowsServices`. - -## Listener topology — per-PLC port (one port → one PLC) - -The host opens **one `TcpListener` per PLC** on a distinct port. Upstream clients reach a specific PLC by connecting to its assigned proxy port; no protocol-level routing is needed. - -``` -Client A ──┐ -Client B ──┼──→ proxy:5020 ──→ PLC #1 (10.0.1.1:502) - ├──→ proxy:5021 ──→ PLC #2 (10.0.1.2:502) - │ ... - └──→ proxy:5073 ──→ PLC #54 (10.0.1.54:502) -``` - -## Connection model — single backend socket per PLC, multiplexed via MBAP TxId rewriting - -Each PLC has **one persistent backend TCP socket**, owned by a `PlcMultiplexer`. Many upstream client connections share that single backend socket; the multiplexer distinguishes their in-flight requests by **rewriting the MBAP transaction ID** on each request and restoring each client's original TxId on the matching response. Implemented in [Phase 09](plan/09-txid-multiplexing.md); replaced the prior 1:1 per-upstream-client backend-socket model. - -``` -Client A ─┐ -Client B ─┼─→ proxy:5020 ─[ PlcMultiplexer ]─→ PLC #1 (10.0.1.1:502) -Client C ─┘ │ (one persistent socket) - ▼ - CorrelationMap[proxyTxId] - TxIdAllocator (16-bit space) -``` - -- **Upstream → multiplexer**: each accepted upstream socket is wrapped in an `UpstreamPipe` (read loop + bounded response channel). 
The pipe's read loop hands every parsed MBAP frame to the multiplexer's `OnUpstreamFrameAsync`, which allocates a free 16-bit `proxyTxId`, stores an `InFlightRequest` in a `CorrelationMap` keyed by that proxyTxId, BCD-rewrites the request payload, overwrites the MBAP header's TxId field with `proxyTxId`, and enqueues the frame into the per-PLC outbound channel. -- **Multiplexer → backend**: a single backend writer task drains the outbound channel and sends each frame to the PLC over the shared socket. A single backend reader task reads MBAP frames back, looks each up by `proxyTxId` in the correlation map, BCD-rewrites the response, restores each interested party's original TxId, and routes the frame to that party's `UpstreamPipe._responseChannel`. The single-writer / single-reader invariant on the backend socket eliminates the need for socket-level synchronisation. -- **Per-request timeout watchdog**: a periodic task scans the correlation map at a quarter of `Connection.BackendRequestTimeoutMs` and times out any in-flight request whose response has not arrived. Timed-out requests get a Modbus exception 0x0B (Gateway Target Device Failed To Respond) delivered to their upstream party and free their allocator slot. Without this watchdog, a single lost or mis-routed response would leak a correlation entry forever and hang the upstream pipe indefinitely. - -**Operational consequence (replaces the prior 4-client warning).** The H2-ECOM100's 4-concurrent-TCP-client cap (see [`../DL260/dl205.md`](../DL260/dl205.md) → Behavioral Oddities) no longer limits upstream-side connection count — the proxy holds exactly one slot per PLC regardless of how many upstream clients are attached. The wire-rate ceiling is unchanged (the ECOM internally serializes requests at ~2–10 ms per scan); the multiplexer shifts where serialization happens (proxy outbound queue vs PLC accept queue) rather than adding throughput. 
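The TxId rewriting described in the connection model can be sketched in a few lines. This is a minimal illustration in Python (the language of the simulator harness), not the `PlcMultiplexer` code: the class name `TxIdMux`, its method names, and the linear-probe allocator are assumptions made for the example. The only wire-level fact it relies on is that the MBAP header begins with a 2-byte big-endian transaction ID.

```python
import struct


class TxIdMux:
    """Toy model of per-PLC TxId multiplexing (hypothetical, illustration only)."""

    def __init__(self):
        self._next = 0
        # proxy_tx_id -> (client_id, original_tx_id); stands in for the CorrelationMap.
        self._in_flight = {}

    def _allocate(self):
        # Probe the 16-bit space for a proxy TxId that is not currently in flight.
        for _ in range(0x10000):
            candidate = self._next
            self._next = (self._next + 1) & 0xFFFF
            if candidate not in self._in_flight:
                return candidate
        raise RuntimeError("all 65536 proxy TxIds are in flight")

    def on_upstream_frame(self, client_id, frame):
        """Record the client's TxId and return the frame with a proxy TxId."""
        original_tx_id = struct.unpack_from(">H", frame, 0)[0]
        proxy_tx_id = self._allocate()
        self._in_flight[proxy_tx_id] = (client_id, original_tx_id)
        return struct.pack(">H", proxy_tx_id) + frame[2:]

    def on_backend_frame(self, frame):
        """Route a backend response; returns (client_id, frame) or None."""
        proxy_tx_id = struct.unpack_from(">H", frame, 0)[0]
        entry = self._in_flight.pop(proxy_tx_id, None)
        if entry is None:
            return None  # unknown TxId, or the entry was already released
        client_id, original_tx_id = entry
        return client_id, struct.pack(">H", original_tx_id) + frame[2:]
```

Two clients that both happen to use TxId 1 receive distinct proxy TxIds on the shared socket, and each response comes back carrying the TxId its own client sent. The sketch leaves out the BCD rewrite, the outbound channel, and the timeout watchdog.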
- -> ⚠ **Backend disconnect cascades upstream.** When the backend socket dies (PLC reboot, network partition, middlebox idle drop), the multiplexer closes every attached upstream pipe in the same cycle and increments `BackendDisconnectCascades` by the upstream count. Clients reconnect on their own next request and the multiplexer Polly-reconnects to the backend on the first upstream frame. - -> ⚠ **pymodbus 3.13.0 simulator quirk (test-only).** The pymodbus simulator's `ServerRequestHandler` stores a single `last_pdu` per connection and schedules deferred handlers via `asyncio.call_soon`. Two MBAP frames arriving in the same recv buffer (as the multiplexer can produce on its shared backend connection) overwrite `last_pdu` before the first handler runs, and both responses then carry the later request's TxId. The real DL260 ECOM does not suffer this — it echoes per-request TxIds correctly. Multiplexer correctness under truly concurrent backend traffic is therefore proved against a stub backend in `PlcMultiplexerTests`; the E2E suite paces requests to keep pymodbus in known-good single-PDU mode. The per-request watchdog is the production defence against any backend (real or simulated) that mis-echoes a TxId. - -## Configuration — single `appsettings.json` - -All configuration lives in one file, loaded via `Microsoft.Extensions.Configuration` and bound to typed POCOs. No sidecar YAML/CSV. - -```jsonc -{ - "Mbproxy": { - "BcdTags": { - "Global": [ - { "Address": 1072, "Width": 16 }, - { "Address": 1080, "Width": 32 } - ] - }, - "Plcs": [ - { - "Name": "Line1-Mixer", - "ListenPort": 5020, - "Host": "10.0.1.1", - "BcdTags": { - "Add": [ { "Address": 1200, "Width": 32 } ], - "Remove": [ 1080 ] - } - }, - { "Name": "Line1-Conveyor", "ListenPort": 5021, "Host": "10.0.1.2" } - // ... 
54 PLC rows - ], - "AdminPort": 8080, - "Connection": { - "BackendConnectTimeoutMs": 3000, - "BackendRequestTimeoutMs": 3000 - }, - "Resilience": { - "BackendConnect": { "MaxAttempts": 3, "BackoffMs": [100, 500, 2000] }, - "ListenerRecovery": { "InitialBackoffMs": [1000, 2000, 5000, 15000, 30000], "SteadyStateMs": 30000 } - }, - "Cache": { - "AllowLongTtl": false, // gate for any tag CacheTtlMs > 60_000 - "MaxEntriesPerPlc": 1000, - "EvictionIntervalMs": 5000 - } - } -} -``` - -A BCD tag may optionally carry `CacheTtlMs` (default 0 = off); a `PlcOptions` entry may optionally carry `DefaultCacheTtlMs` (default 0 = off). Resolution order: explicit per-tag → per-PLC default → 0. - -**Hybrid tag resolution.** For each PLC, the effective BCD tag list is `Global ∪ Add − Remove`. `Remove` matches by address; if the same address appears in both `Add` and `Global` the `Add` entry wins (this is how a width override is expressed). Validation at startup must: - -- reject duplicate addresses within a single PLC's resolved list -- reject 32-bit entries that would have their high register overlap a separate 16-bit entry -- warn on `Remove` entries that don't match any global tag (probably stale config) - -## Configuration hot-reload - -`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`, and all consumers read via `IOptionsMonitor` so a save to the config file propagates without restarting the service. Each change kind has explicit reconcile semantics: - -| Change in appsettings | Propagation | -|-----------------------|-------------| -| `BcdTags.Global` add/remove/width | Rewriter dereferences the monitor per-PDU. Next PDU sees the new map; in-flight reads/writes are not retroactively touched. | -| `Plcs[i].BcdTags.{Add,Remove}` | Same — next-PDU resolution. | -| New `Plcs[i]` entry | Listener supervisor binds the new port subject to the same eager-then-auto-recover policy. 
| -| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream client connections for that PLC. | -| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. | -| `Connection.Backend*TimeoutMs` | Next backend connect/request uses the new value. In-flight operations keep their already-applied timeout. | -| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` (Phase 11) | Tag-map reseat for the affected PLC drops the entire PLC cache; entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented in v1. | -| `Cache.AllowLongTtl`, `Cache.MaxEntriesPerPlc`, `Cache.EvictionIntervalMs` (Phase 11) | `AllowLongTtl` is enforced on next reload-validation; `MaxEntriesPerPlc` applies to subsequent inserts (existing entries not pruned); `EvictionIntervalMs` is read by each fresh eviction loop. | -| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. | - -Every accepted reload emits `mbproxy.config.reload.applied` at Information with a summary of which PLCs were added/removed and the size of the tag-list delta. - -## BCD tag shape - -```csharp -public sealed record BcdTag(ushort Address, byte Width); // Width ∈ { 16, 32 } -``` - -- **16-bit BCD** — one register holds 4 BCD digits (0–9999). Wire value `0x1234` decodes to decimal 1234. -- **32-bit BCD** — a CDAB-ordered register pair at `Address` and `Address+1`. The register at `Address` holds the **low 4 digits**; the register at `Address+1` holds the **high 4 digits**. Decoded decimal = `high * 10000 + low`. This follows directly from DirectLOGIC's CDAB word order (see [`../DL260/dl205.md`](../DL260/dl205.md) → Word Order). 
-- **Unsigned only.** DL205/DL260 BCD is non-negative in the default ladder pattern; the proxy does not implement signed BCD. -- **Holding-register and input-register addresses share the same space.** The rewriter applies the configured tag list against both FC03 and FC04 reads. - -## Read coalescing (Phase 10) - -After Phase 10, FC03 / FC04 requests are additionally subject to **in-flight read coalescing** before they reach the backend. When two or more upstream clients send the same `(unitId, fc, startAddress, qty)` tuple within the in-flight window of an already-routed request, the multiplexer attaches each late arrival to the existing `InFlightRequest.InterestedParties` list instead of opening a second backend round-trip. The single backend response is fanned out to every attached party with each party's original MBAP TxId restored individually. - -Properties: - -- **Zero post-response staleness.** Coalescing operates entirely between "first request sent to backend" and "response received from backend" (microseconds to ~10 ms typical). Once the response is fanned out, the coalescing entry dies. Coalescing alone is NOT a cache layer — the value each upstream sees is the same value an uncoalesced request would have returned within the PLC's scan-time precision. (Phase 11 layers an opt-in cache on top — see "Response cache" below.) -- **Only FC03 / FC04.** Writes (FC06 / FC16) are non-idempotent on BCD tags and never coalesced. Different function codes never share a `CoalescingKey` even at the same address (FC03 and FC04 read different Modbus tables). Different `unitId` bytes never coalesce (different PLC personalities behind a shared socket). -- **Bounded fan-out via `MaxParties`** (default 32 in `Mbproxy.Resilience.ReadCoalescing.MaxParties`). Once an entry has `MaxParties` interested clients, the next arrival opens a fresh entry — bounds the response-fanout cost per entry at O(MaxParties) and shields the backend reader task from pathological pile-on. 
-- **Hot-reloadable on/off.** `Mbproxy.Resilience.ReadCoalescing.Enabled` defaults to `true`. Flipping it to `false` at runtime leaves running coalesced entries to drain naturally; subsequent FC03/04 requests take the Phase-9 (one round-trip per upstream request) path. -- **Transparency contract preserved.** Each upstream client still sees its own original MBAP TxId on the response. The BCD rewriter runs once on the shared response buffer; per-party copies are only made when fan-out has more than one party. - -Counter accounting balance (per snapshot): `coalescedHitCount + coalescedMissCount` equals the total FC03 + FC04 requests seen since the multiplexer was constructed. Both counters increment regardless of whether the coalescing feature is enabled — `coalescedHitCount` is 0 when disabled, but every read still increments `coalescedMissCount`. **Saturation paths (allocator full, duplicate-key race) also count as a miss** even though they produce no backend round-trip — the identity above is preserved by counting every entry into the coalescing path, not every backend send. Operators wanting "actual backend round-trips opened" should subtract the multiplexer's exception-04 frames produced from saturation. - -## Response cache (Phase 11) — opt-in bounded-staleness cache - -**⚠ Design-contract pivot.** Through Phase 10 the proxy is *purely transparent* — every upstream read corresponds 1:1 to a recent backend round-trip (or, with Phase 10, to a peer's in-flight backend round-trip in the same microseconds-to-milliseconds window). Phase 11 changes that contract: the proxy gains an **opt-in per-tag response cache** that may serve upstream FC03/FC04 reads from in-process memory with bounded staleness up to the operator-configured `CacheTtlMs`. **The cache is OFF by default** (`CacheTtlMs = 0` on every BCD tag unless explicitly set); a fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. 
Operators opt tags in explicitly as their acknowledgement of the staleness window. - -### Cache contract - -- **Per-tag TTL.** Each BCD tag carries an optional `CacheTtlMs` (in `BcdTagOptions`). `CacheTtlMs = 0` (the default) disables caching for that tag. The TTL resolution order is **explicit per-tag → per-PLC `DefaultCacheTtlMs` → 0**. -- **Multi-tag read range: effective TTL = `min(TTLs)`.** When a single FC03/FC04 read covers multiple configured tags, the cache uses the smallest TTL among them. If any tag in the read range has `CacheTtlMs = 0`, the **whole read is uncached** — the conservative-by-design choice. -- **Lookup order: cache → coalesce → backend.** A cache hit short-circuits Phase 10's coalescing entirely. Only on a miss does the request engage coalescing (Phase 10) and then the Phase 9 backend send path. -- **Cache populates on demand only.** No polling, no predictive prefetch. Entries are created in the backend reader task **after** the BCD rewriter has run on the response — the cache stores **POST-rewriter bytes**, so hits never re-invoke the rewriter (CPU win + behaviour-stable). -- **Write invalidation by ADDRESS RANGE OVERLAP.** A successful FC06 / FC16 response (non-exception) invalidates every cached FC03/FC04 entry whose address range `[StartAddress, StartAddress + Qty)` overlaps the write range. A write to register 105 invalidates a cached `[100..110]` read but not a cached `[200..210]` read. Exception responses do not invalidate (the write didn't take effect). -- **Different unit IDs never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3,4})`. -- **Cache survives backend disconnects.** A cached entry's data was valid when stored; a disconnect does not retroactively invalidate it. Invalidations during a `recovering` listener state are skipped (the write never reached the backend, the cached read remains valid). -- **No persistence.** Process restart wipes the cache. 
No file/Redis backing store, no last-known-good snapshot. -- **Hot-reload flushes the entire PLC cache.** Any tag-list change to a PLC drops every cached entry for that PLC. Per-tag flush granularity is intentionally not done in v1 — the simple correctness move is "any tag-list reload → drop all entries for the affected PLC and let them re-populate." -- **TTL > 60 s requires `Cache.AllowLongTtl = true`.** Validation rejects reloads that set `CacheTtlMs > 60_000` without this opt-in. Prevents "left at 1 hour by accident" deployments. -- **LRU-bounded capacity.** Each PLC's cache is capped at `Cache.MaxEntriesPerPlc` (default 1000). When full, the next insert evicts the least-recently-used entry. A background eviction loop (interval `Cache.EvictionIntervalMs`, default 5000) also scans for expired entries. - -### Cache and the rewriter - -The BCD rewriter runs once on the cache-miss path (the backend reader task decodes the response and stores the decoded bytes in the cache). Cache hits return pre-decoded bytes directly without re-invoking the rewriter — this is both a CPU optimisation and a correctness guarantee (any future rewriter change would not retroactively re-transform an entry that was decoded against an earlier rewriter version). - -### Hot-reload semantics - -| Change | Cache behaviour | -|--------|----------------| -| Tag's `CacheTtlMs` changed (any direction, 0 → N, N → 0, N → M) | Entire PLC cache is flushed; entries re-populate on demand under the new TTL. | -| New PLC added / removed | New PLC starts with empty cache; removed PLC's cache is discarded with the multiplexer. | -| `Cache.AllowLongTtl` flipped | Validation runs on next reload; existing entries unaffected. | -| `Cache.MaxEntriesPerPlc` changed | Existing entries unaffected; cap applies to subsequent inserts. | -| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues until next dispose; subsequent loops use new interval. 
| - -### Counter accounting - -- `cacheHitCount` — FC03/FC04 requests served from the cache. -- `cacheMissCount` — FC03/FC04 requests that fell through to the coalescing/backend path. (Cache hit + Cache miss = total FC03/FC04 requests that were cache-eligible, i.e. whose resolved TTL was > 0; reads whose effective TTL is 0 increment neither.) **A "miss" does NOT mean "produced a backend round-trip."** Two upstream peers issuing the same cache-eligible read both increment `cacheMissCount`; one then opens a backend round-trip and the other coalesces onto it via the InFlightByKey path (incrementing `coalescedHitCount`). Operators reading these counters as "backend reads opened" should use `cacheMissCount − coalescedHitCount` as the lower bound on actual backend traffic. -- `cacheInvalidations` — count of cache entries invalidated by FC06/FC16 write responses. -- `cacheEntryCount` — point-in-time snapshot of `ResponseCache.Count` (Tier-2 memory-watch KPI). -- `cacheBytes` — point-in-time approximation of cached PDU bytes (Tier-2 memory-watch KPI). 
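Two rules from the cache contract above, write invalidation by address-range overlap and the `min(TTLs)` rule for multi-tag reads, can be sketched as follows. This is a hedged Python illustration, not the `ResponseCache` implementation: the function names, the `(unit_id, fc, start, qty)` key layout, and the choice to return 0 for a read that covers no configured tags are all assumptions made for the example.

```python
def ranges_overlap(start_a, qty_a, start_b, qty_b):
    """True when half-open ranges [start, start + qty) share a register."""
    return start_a < start_b + qty_b and start_b < start_a + qty_a


def effective_ttl_ms(tag_ttls_ms):
    """Smallest TTL among the covered tags; 0 means the read is uncached.

    Returning 0 for an empty list is an assumption of this sketch.
    """
    return min(tag_ttls_ms) if tag_ttls_ms else 0


def invalidate(cache, unit_id, write_start, write_qty):
    """Drop cached reads for unit_id that overlap the write range.

    cache maps hypothetical (unit_id, fc, start, qty) keys to response bytes.
    Returns the number of entries removed.
    """
    doomed = [
        key for key in cache
        if key[0] == unit_id
        and ranges_overlap(key[2], key[3], write_start, write_qty)
    ]
    for key in doomed:
        del cache[key]
    return len(doomed)
```

With entries cached for registers starting at 100 and at 200 on unit 1, a write to register 105 removes only the first, and an identical range cached under a different unit ID is left alone. A single tag with TTL 0 in the list makes `effective_ttl_ms` return 0, which matches the "whole read is uncached" rule.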
- -## Rewriter — function code scope - -The rewriter inspects and rewrites payloads only for these function codes; every other FC (coils, discrete inputs, diagnostics, exception responses) passes through byte-for-byte: - -| FC | Direction | Action | -|----|----------------|-----------------------------------------------------------------------| -| 03 | request + response | FC03 requests may be coalesced with peers before reaching the backend (see Phase-10 section above); response re-encodes covered BCD slots from raw nibbles → binary integer | -| 04 | request + response | Same coalescing eligibility as FC03; response re-encoding the same as FC03 (input-register table also surfaces V-memory) | -| 06 | request | Re-encode binary integer → BCD nibbles before forwarding | -| 06 | response | Decode BCD nibbles → binary integer on the echo (clients validate that the echoed value equals the value they sent; without this, NModbus-style clients throw on the round-trip) | -| 16 | request | Per-register over the configured slots, then forward | - -**Partial-overlap policy.** A request that touches only ONE register of a configured 32-bit BCD pair (qty=1 at the low addr, or any read/write of the high addr alone) **passes through raw** with a `mbproxy.rewrite.partial_bcd` warning. The proxy never synthesises a Modbus exception for a partial-overlap — that response code is reserved for transport failure. - -## Failure modes — transparent pass-through with Polly-bounded backend connect - -- **PLC returns a Modbus exception (codes 01–04)** → forward verbatim with the original MBAP transaction ID. The client sees the real DL205/DL260 exception. -- **Backend connect refused or initial connect timeout** → retry under a Polly resilience pipeline: 3 attempts at 100ms / 500ms / 2000ms backoff (tuned via `Resilience.BackendConnect`). If all attempts fail, the multiplexer closes the upstream client connection that triggered the connect. 
-- **Backend mid-stream broken socket** → the multiplexer's reader/writer task throws; the backend tear-down path cancels both tasks, drains the correlation map, and **cascades the disconnect by closing every attached upstream pipe**. The next upstream request to any pipe triggers a fresh backend connect through the Polly pipeline. `BackendDisconnectCascades` counter records the upstream-pipe count at each cascade event. -- **Backend request timeout** → the per-request watchdog times out any correlation entry older than `Connection.BackendRequestTimeoutMs`, delivers Modbus exception 0x0B (Gateway Target Device Failed To Respond) with the original TxId to the upstream party, and frees the proxy TxId. **No mid-request retries** — FC06 / FC16 are non-idempotent on BCD tags (a partial-applied multi-register write could leave a 32-bit BCD tag mid-transition), so every in-flight request is one-shot. The client interprets the 0x0B as a transport failure and reconnects through its normal path. -- **Partial-BCD overlap** → forward raw + warn (see Rewriter section). -- **One slow PLC does not stall the rest of the fleet.** Each PLC has its own `PlcMultiplexer`, with its own backend socket, correlation map, and outbound channel; per-PLC failures are local. A slow or dead backend on one PLC only impacts that PLC's clients. -- **Cache during backend recovery (Phase 11).** Cache hits remain valid during a `recovering` listener state — the data was fresh when cached, and recovery only affects future requests. Writes that arrive during recovery never reach the backend, so the invalidation never happens. This is consistent: the write also didn't take effect on the PLC. Cached entries simply remain until their TTL expires. - -## Startup posture — eager, continue on per-port failure - -At startup the host attempts to bind **all 54 listen sockets up front**. 
Each failure (port already in use, invalid IP, malformed PLC entry) is logged at Error and handed off to the listener supervisor (next section). The service proceeds with whichever PLCs bound on the first attempt; the rest converge in the background. Monitoring should alert on `mbproxy.startup.bind.failed` so missing PLCs aren't silently dropped, and watch for `mbproxy.listener.recovered` to confirm late binds eventually succeeded. - -## Listener auto-recovery (Polly-backed supervisor) - -Each PLC's listener runs under a **supervisor task** that owns its bind lifecycle. If a bind fails at startup, or if a listener faults at runtime (port stolen by another process, transient OS network reset), the supervisor reattempts via a Polly retry pipeline: 5 attempts at 1s / 2s / 5s / 15s / 30s backoff, then steady-state retries every 30s indefinitely (tuned via `Resilience.ListenerRecovery`). Each attempt logs at Debug; the bind that finally succeeds emits one `mbproxy.listener.recovered` Information event. - -While a supervisor is between attempts, the corresponding PLC is reported as `listener.state = recovering` on the status page. Hot-reload uses the same supervisor to bring newly-added PLCs online and to tear down removed ones — there is exactly one code path for "bring up a listener" and one for "shut a listener down." - -## Logging — Serilog, structured, console + rolling file - -Serilog wired through the Microsoft.Extensions.Logging bridge: - -- **Console sink** for interactive `--console` runs. -- **Rolling-file sink** under `%ProgramData%\mbproxy\logs\`. -- **Windows Event Log sink** for Error+ events when the service is running under `Microsoft.Extensions.Hosting.WindowsServices`. -- **Default level** Information. Properties (`Plc`, `RemoteEp`, etc.) are emitted per message via `[LoggerMessage]` templates so log lines are greppable across the fleet. 
- -Event names follow the convention `mbproxy..[.]` and are part of the operator contract — once shipped they don't churn (renames require a major version bump). The full catalog of stable event names, their levels, properties, and operator implications lives in [`Reference/LogEvents.md`](Reference/LogEvents.md); each `*LogEvents.cs` static class (e.g. `MultiplexerLogEvents`, `CoalescingLogEvents`, `CacheLogEvents`, `RewriterLogEvents`) is the source of truth. - -## Status page — read-only HTTP endpoint - -A separate **Kestrel-hosted minimal API** runs on `Mbproxy.AdminPort` (default `8080`, distinct from the Modbus listen ports). The endpoint set is intentionally narrow — read-only telemetry; **no admin actions** (kick client, force reload, restart listener) are exposed: - -- `GET /` — single self-contained HTML page rendering a table of all configured PLCs with their state and live counters. Auto-refreshes every 5s via a meta-refresh tag (no JS bundle, no external assets). -- `GET /status.json` — the same data as JSON for monitoring scrapers. - -Authentication is assumed to live at the network layer (trusted internal segment behind a firewall). Surface that assumption in deployment docs when they exist. 
- -**Service-wide fields:** - -| Field | Meaning | -|-------|---------| -| `service.uptime` | Seconds since service start | -| `service.version` | Assembly informational version | -| `service.config.lastReloadUtc` | Timestamp of last accepted hot-reload (or `null`) | -| `service.config.reloadCount` | Number of reloads accepted since start | -| `service.config.reloadRejectedCount` | Number of reloads rejected since start | -| `listeners.bound` / `listeners.configured` | Bound listener count vs configured PLC count | - -**Per-PLC fields** (one row per `Plcs[i]`): - -| Field | Meaning | -|-------|---------| -| `name`, `host`, `listenPort` | Identity from config | -| `listener.state` | `bound` / `recovering` / `stopped` | -| `listener.lastBindError` | Most recent bind failure message (when `recovering`) | -| `listener.recoveryAttempts` | Polly retry count since last successful bind | -| `clients.connected` | Currently connected upstream client count | -| `clients.remoteEndpoints` | Array of `{ remote, connectedAtUtc, pdusForwarded }` | -| `pdus.forwarded` | Total PDUs (request+response) forwarded since start | -| `pdus.byFc` | `{ fc03, fc04, fc06, fc16, other }` request counts | -| `pdus.rewrittenSlots` | Count of register slots BCD-rewritten | -| `pdus.partialBcdWarnings` | Count of partial-overlap pass-throughs | -| `backend.connects.success` / `backend.connects.failed` | Polly-final-result counters | -| `backend.exceptions.byCode` | `{ "01": n, "02": n, "03": n, "04": n }` | -| `backend.lastRoundTripMs` | EWMA of recent successful round-trip times | -| `backend.coalescedHitCount` | FC03/04 requests that attached to an already-in-flight peer (Phase 10) | -| `backend.coalescedMissCount` | FC03/04 requests that opened a fresh backend round-trip (Phase 10). 
`Hit + Miss` = total FC03/04 requests | -| `backend.coalescedResponseToDeadUpstream` | Coalesced fan-out responses skipped because the attached upstream had already disconnected (Phase 10) | -| `backend.cacheHitCount` | FC03/04 reads served from the response cache (Phase 11) | -| `backend.cacheMissCount` | FC03/04 reads that fell through to coalescing/backend after a cache miss (Phase 11) | -| `backend.cacheInvalidations` | Cache entries invalidated by overlapping FC06/FC16 write responses (Phase 11) | -| `backend.cacheEntryCount` | Point-in-time snapshot of the per-PLC cache's entry count (Phase 11, Tier-2 memory-watch) | -| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC (Phase 11, Tier-2 memory-watch) | -| `bytes.upstreamIn` / `bytes.upstreamOut` | Bytes forwarded each direction | - -Counters are `System.Threading.Interlocked` longs read atomically per request; no locking on the read path. - -## Test simulator — pymodbus DL260/DL205 server - -The pymodbus profile at [`../DL260/dl205.json`](../DL260/dl205.json) already models the DL205/DL260 quirks (BCD nibbles at known addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings, etc.) as concrete register seeds. The test infrastructure wraps it as a managed lifecycle so every integration / e2e test gets a fresh known-good DL-series target without needing real hardware. - -Harness shape (lives under `tests/sim/`): - -- **Launcher script** — `tests/sim/run-dl205-sim.ps1` provisions a Python venv under `tests/sim/.venv` on first run (`python -m venv` + `pip install pymodbus`), then launches `pymodbus.server` with the `dl205.json` profile on a configurable port. Idempotent: re-runs reuse the venv. -- **xUnit fixture** — `Mbproxy.Tests.Sim.DL205SimulatorFixture : IAsyncLifetime` that: - - `InitializeAsync`: spawns the simulator subprocess, polls `TcpClient.ConnectAsync` against the port until success or a 10 s deadline, captures stdout/stderr to test output. 
- - `DisposeAsync`: signals graceful shutdown (Ctrl-C on the process group on Windows), then `Process.Kill(entireProcessTree: true)` as a safety net. - - Exposes `Host`, `Port`, `LogTail` (last N lines of sim stderr for diagnosis). -- **Test collection** — `[CollectionDefinition(nameof(DL205SimulatorCollection))]` so the fixture is shared across all integration/e2e classes that opt in (cheap startup, expensive process churn). -- **Skip policy** — if Python or pymodbus isn't available and the auto-provision fails (no network, locked-down CI image, etc.), `InitializeAsync` records the reason and tests skip via `Assert.Skip(sim.SkipReason)`. CI must have Python 3.10+ available; local devs running only the rewriter unit tests need nothing extra. -- **Alternate profiles** — additional scenarios (e.g., a profile that seeds a specific partial-overlap test case, or a profile with strict `type exception: true` to verify the proxy doesn't depend on lax pymodbus behaviour) live alongside `dl205.json` and are selected via `MODBUS_SIM_PROFILE` env var, matching the pattern already established by [`../DL260/DL205BcdQuirkTests.cs`](../DL260/DL205BcdQuirkTests.cs). - -The simulator IS the proxy's end-to-end test bed. A standard e2e test does: - -1. Start the simulator at `127.0.0.1:`. -2. Configure the proxy with one PLC entry `Host=127.0.0.1, Port=, ListenPort=`. -3. Start the proxy (in-process via `WebApplicationFactory`-style host construction). -4. Drive a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `127.0.0.1:`. -5. Assert two directions: - - **Read**: client sees the BCD-decoded integer (proxy rewrote the response). - - **Write**: simulator's register state shows the BCD-encoded nibbles (proxy rewrote the request). - -## Testing - -- **Unit tests** — drive the BCD rewriter with synthetic Modbus PDU byte arrays. No network, no simulator. 
Cover every FC03/04/06/16 × {single 16-bit, full 32-bit pair, partial-overlap low, partial-overlap high, mixed-with-non-BCD} cell. -- **Integration tests** — drive the proxy end-to-end against the pymodbus simulator described in the previous section, using a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `proxy:` and asserting the decoded value rather than the raw register bytes. -- **Auto-recovery tests** — bind a `TcpListener` on a target port BEFORE starting the proxy, assert that the supervisor enters `recovering` state, release the port, and assert the next supervisor attempt succeeds and `mbproxy.listener.recovered` fires. Also cover the runtime-fault path by forcing the accept loop to throw and asserting the supervisor reattempts. -- **Hot-reload tests** — write a temp `appsettings.json`, start the host, mutate the file (add a PLC, remove a PLC, change a global tag width), and assert: (a) supervisor adds/removes the affected listener, (b) the rewriter on the next PDU reflects the new tag map, (c) a malformed reload is rejected without breaking the running config. Cover both `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` paths. -- **Status page tests** — start the host, induce known events (connect 2 clients, force a backend exception, trigger a partial-BCD warning), and assert `GET /status.json` returns the expected counters. The HTML page is verified separately as a smoke test that the route returns 200 with `text/html`. diff --git a/mbproxy/docs/kpi.md b/mbproxy/docs/kpi.md deleted file mode 100644 index f332312..0000000 --- a/mbproxy/docs/kpi.md +++ /dev/null @@ -1,408 +0,0 @@ -# mbproxy — Dashboard KPI catalogue - -Recommended additions to the `/status.json` and `/` admin endpoint to make a production fleet dashboard genuinely useful, grouped by tier. 
Today's `/status.json` exposes raw cumulative counters; this doc describes what's typically *also* expected when those counters land in Grafana / Wonderware / a custom HMI. - -**Scope.** This is a proposal, not a contract. The endpoint shape settled in [`design.md`](design.md) → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect. - -**Reading guide.** Each KPI has: -- **Name** — short identifier matching the proxy's existing camelCase convention. -- **Definition** — what the number means. -- **Source** — where the value comes from (existing counter, new counter, derived). -- **Widget** — typical dashboard visualisation. -- **Alert** — common threshold or anomaly rule (where applicable). -- **Effort** — implementation cost in hours (rough order-of-magnitude). - -## What's exposed today (recap) - -For context — every recommended addition below is *in addition to* this list. Today's `/status.json` carries: - -| Group | Fields | -|-------|--------| -| Service | `uptimeSeconds`, `version`, `configLastReloadUtc`, `configReloadCount`, `configReloadRejectedCount` | -| Listeners | `bound`, `configured` | -| Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` | -| Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) | -| Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` | -| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`, `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes` | -| Per-PLC bytes | `upstreamIn`, `upstreamOut` | - -Counters are **cumulative since process start**. A restart resets them. 
- ---- - -## Tier 1 — strongly recommended for production - -These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard." - -### 1.1 Rate metrics (per-PLC and fleet-wide) - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `pdus.ratePerSec.last1m` | PDU rate over the last 60 s | New per-PLC ring buffer of 1 s samples (last 60 samples) | Sparkline per PLC | None — informational | 4 h | -| `pdus.ratePerSec.last5m` | Same over 5 min | Same buffer, last 300 samples | Sparkline | None | shared | -| `errors.ratePerMin` | Sum of `exceptionsByCode.*` + `partialBcdWarnings` + `invalidBcdWarnings` per minute | Derived | Stat tile per PLC | > 10/min → page | 2 h | -| `bytes.ratePerSec.up` / `.down` | Bandwidth each direction | Derived from `bytes.upstreamIn` / `bytes.upstreamOut` deltas | Stacked area | None — informational | 2 h | -| `fleet.totalPdusPerSec` | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h | - -**Why this matters.** Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing `rate(pdus_forwarded[1m])` on a 54-row fleet is the single most informative widget on the dashboard. - -**Implementation note.** Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want the rates in `/status.json` directly, add a per-PLC `Mbproxy.Proxy.RateTracker` with a fixed-size circular buffer of one-second samples (300 slots, so the same buffer serves both the 1-minute and 5-minute windows) and expose `RatePerSec1m`, `RatePerSec5m`. 
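
**Sketch.** A minimal illustration of the rate tracker described above, assuming it is fed the cumulative PDU counter once per second by some timer. The class body, the `Sample` method, and the 300-slot default are assumptions for illustration, not the shipped `Mbproxy.Proxy.RateTracker`.

```csharp
using System;

// Hypothetical sketch: a ring buffer of per-second counter deltas.
public sealed class RateTracker
{
    private readonly long[] _deltas;   // one slot per one-second sample
    private readonly object _gate = new object();
    private int _next;                 // slot that the next sample overwrites
    private int _filled;               // number of valid slots so far
    private long _lastCumulative;
    private bool _hasBaseline;

    public RateTracker(int capacitySeconds = 300)
    {
        if (capacitySeconds <= 0) throw new ArgumentOutOfRangeException(nameof(capacitySeconds));
        _deltas = new long[capacitySeconds];
    }

    // Call once per second with the current value of the cumulative counter.
    public void Sample(long cumulative)
    {
        lock (_gate)
        {
            if (!_hasBaseline)
            {
                // The first call only establishes the baseline; there is no delta yet.
                _lastCumulative = cumulative;
                _hasBaseline = true;
                return;
            }
            // Clamp at zero so a counter reset never produces a negative rate.
            long delta = Math.Max(0L, cumulative - _lastCumulative);
            _lastCumulative = cumulative;
            _deltas[_next] = delta;
            _next = (_next + 1) % _deltas.Length;
            if (_filled < _deltas.Length) _filled++;
        }
    }

    // Average per-second rate over the most recent windowSeconds samples.
    // During warm-up the average is taken over the samples available so far.
    public double RatePerSec(int windowSeconds)
    {
        lock (_gate)
        {
            int n = Math.Min(windowSeconds, _filled);
            if (n <= 0) return 0.0;
            long sum = 0;
            for (int i = 1; i <= n; i++)
            {
                int idx = (_next - i + _deltas.Length) % _deltas.Length;
                sum += _deltas[idx];
            }
            return (double)sum / n;
        }
    }

    public double RatePerSec1m => RatePerSec(60);
    public double RatePerSec5m => RatePerSec(300);
}
```

Storing deltas rather than raw counter values keeps the window arithmetic to a simple sum; the raw cumulative counters stay untouched for scrapers that prefer to compute rates themselves.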
- -### 1.2 Latency percentiles (replacing the bare EWMA) - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `backend.roundTripMs.p50` | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h | -| `backend.roundTripMs.p95` | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared | -| `backend.roundTripMs.p99` | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared | -| `backend.roundTripMs.max1m` | Slowest single PDU in last 1 min | Same reservoir | Stat tile | > 5 s → page | shared | - -**Why this matters.** The existing `lastRoundTripMs` is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently. - -**Implementation note.** Use `Mbproxy.Proxy.LatencyReservoir` — a 256-sample reservoir with Vitter's Algorithm R for unbiased sampling under arbitrary throughput. Don't store every sample (a busy PLC at 100 PDU/s × 60 s = 6,000 samples/min × 54 PLCs = 324K samples/min, too much). 
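
**Sketch.** A minimal illustration of reservoir sampling with Algorithm R, assuming round-trip times arrive as milliseconds and that the caller invokes `Reset()` at each window boundary to approximate "last 1 min". The member names and the nearest-rank percentile rule are assumptions for illustration, not the shipped `Mbproxy.Proxy.LatencyReservoir`.

```csharp
using System;

// Hypothetical sketch: fixed-size uniform sample of a latency stream (Algorithm R).
public sealed class LatencyReservoir
{
    private readonly double[] _samples;
    private readonly Random _rng;
    private readonly object _gate = new object();
    private long _seen;   // total values offered since the last Reset

    public LatencyReservoir(int size = 256, int? seed = null)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        _samples = new double[size];
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    public void Record(double roundTripMs)
    {
        lock (_gate)
        {
            _seen++;
            if (_seen <= _samples.Length)
            {
                // Fill phase: keep the first `size` values unconditionally.
                _samples[_seen - 1] = roundTripMs;
                return;
            }
            // Replacement phase: the i-th value lands in a random slot with
            // probability size / i, which keeps the sample uniform.
            long j = (long)(_rng.NextDouble() * _seen);
            if (j < _samples.Length) _samples[j] = roundTripMs;
        }
    }

    // Nearest-rank percentile over the current sample; NaN when empty.
    public double Percentile(double p)
    {
        if (p < 0 || p > 100) throw new ArgumentOutOfRangeException(nameof(p));
        lock (_gate)
        {
            int n = (int)Math.Min(_seen, (long)_samples.Length);
            if (n == 0) return double.NaN;
            var sorted = new double[n];
            Array.Copy(_samples, sorted, n);
            Array.Sort(sorted);
            int rank = (int)Math.Ceiling(p / 100.0 * n);   // 1-based rank
            return sorted[Math.Max(rank, 1) - 1];
        }
    }

    // Start a new window; old slots are overwritten during the next fill phase.
    public void Reset()
    {
        lock (_gate) { _seen = 0; }
    }
}
```

Memory stays constant at 256 doubles per PLC regardless of throughput, which is the point of sampling instead of storing every round-trip.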
- -### 1.3 Per-PLC availability ratio - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `listener.boundRatio.last1h` | Fraction of time in `bound` state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h | -| `listener.boundRatio.sinceStart` | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared | -| `listener.timeInRecoveringMs.last1h` | Total time spent recovering in last hour | Same tracker | Stat tile | > 60s → warn | shared | - -**Why this matters.** `recoveryAttempts` tells you how many times something has flapped, but not how *much* downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly. - -**Implementation note.** Each `PlcListenerSupervisor` already has a state machine. Add a `StateDurationTracker` that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window. - -### 1.4 Liveness / staleness signals - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `pdus.lastForwardedUtc` | Wall time of the most recent forwarded PDU | New `_lastForwardedTimestamp` per PLC | Stat tile | `now - value > 5 min AND clients.connected > 0` → page | 1 h | -| `clients.lastActivityUtc` | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in remoteEndpoints | None | 1 h | -| `staleClients.count` | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h | - -**Why this matters.** Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with `clients.connected = 2` but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured. 
- -### 1.5 Service-wide fleet aggregates - -These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard. - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `fleet.plcsHealthy` | Count of PLCs in `bound` state with no errors in last 5 min | Aggregate | Big number, green | < `listeners.configured - 2` → warn | 2 h | -| `fleet.plcsRecovering` | Count in `recovering` state | Aggregate | Big number, orange | > 0 → informational | shared | -| `fleet.plcsStopped` | Count in `stopped` state | Aggregate | Big number, grey | > 0 → page | shared | -| `fleet.plcsWithActiveErrors` | Count with `errors.ratePerMin > 0` | Aggregate | Big number, red | > 0 → page | shared | -| `fleet.totalClientsConnected` | Sum of `clients.connected` | Aggregate | Stat tile | None | 1 h | -| `fleet.totalRewrittenSlotsPerSec` | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared | - -**Why this matters.** A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table. - -### 1.6 Multiplexer state — **shipped in [Phase 9](plan/09-txid-multiplexing.md)** - -The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth. 
- -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `backend.inFlightCount` | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) | -| `backend.maxInFlight` | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent — realistic only under pathological load) | (in Phase 9 scope) | -| `backend.txIdWraps` | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) | -| `backend.queueDepth` | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) | -| `backend.disconnectCascades` | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with `mbproxy.backend.failed` events | (in Phase 9 scope) | - -**Why this matters.** Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's `lastRoundTripMs` measures wire latency only; queue depth reveals proxy-side backlog). - -### 1.7 Read coalescing — **shipped in [Phase 10](plan/10-read-coalescing.md)** - -Same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric. `coalescedHitCount + coalescedMissCount` equals total FC03/04 request count per snapshot — the math always balances. 
- -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `backend.coalescedHitCount` | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) | -| `backend.coalescedMissCount` | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) | -| `backend.coalescingRatio` | `Hit / (Hit + Miss)` over the trailing window | Derived (dashboard) | Stat tile per PLC | None; a low ratio just means clients aren't synchronised on the same registers — informational | (in Phase 10 scope) | -| `backend.coalescedResponseToDeadUpstream` | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable (Tier 2 priority) | (in Phase 10 scope) | - -**Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly a 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric. - -### 1.8 Response cache — **shipped in [Phase 11](plan/11-response-cache.md)** - -Since Phase 11, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default — operators opt tags in by setting `CacheTtlMs > 0` on a `BcdTagOptions` entry (or `DefaultCacheTtlMs > 0` on a `PlcOptions` entry). 
- -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `backend.cacheHitCount` | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) | -| `backend.cacheMissCount` | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) | -| `backend.cacheHitRatio` | `Hit / (Hit + Miss)` for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) | -| `backend.cacheInvalidations` | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) | - -**Why this matters.** Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts. - ---- - -## Tier 2 — nice-to-have - -Reach for these once Tier 1 is solid. They add depth for specific operational scenarios. - -### 2.1 Connection-cap saturation warning - -> **Status: superseded by [Phase 9](plan/09-txid-multiplexing.md).** This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. After Phase 9 ships, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect — the 4-client cap on the ECOM is no longer reachable from the upstream side. 
The closest post-Phase-9 equivalent is `backend.inFlightCount` (Tier 1.6) against the 65,535 TxId-allocator ceiling, but that's realistically unreachable under any normal load. **Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.** - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `clients.atCapWarning` | Boolean: `clients.connected >= 3` (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h | -| `clients.atCapBlocked` | Boolean: `clients.connected >= 4` (cap reached) | Derived | Cell highlight | True → page | shared | - -**Why this mattered (pre-Phase-9).** The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see [design.md](design.md) → "Connection model" and [DL260/dl205.md](../DL260/dl205.md) → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments. - -### 2.2 Error breakdown / heatmap - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `partialBcd.byClient` | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h | -| `invalidBcd.byAddress` | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h | -| `exceptions.byCodeRate` | Per-exception-code rate over 5 min | Derived from `exceptionsByCode.*` | Stacked bar | Code 04 (Slave Failure) spike → PLC in PROGRAM mode? 
| 2 h | - -**Why this matters.** Once you've seen `partialBcdWarnings = 1247`, the next question is *which client* and *which tag*. Without dimensional breakdown, you have to ssh into the log file to find out. - -### 2.3 Hot-reload cadence - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `config.reloadsPerHour` | Reload events per hour | Derived from `configReloadCount` | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h | -| `config.lastReloadDelta` | Summary of what changed on last reload | Already in `mbproxy.config.reload.applied` event; surface here | Text snippet | None — informational | 2 h | - -**Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured. - -### 2.4a Response-cache memory — **shipped in [Phase 11](plan/11-response-cache.md)** - -When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC `MaxEntriesPerPlc` cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob. - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) | -| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) | - -**Why this matters.** Cache entries are short-lived (TTLs are typically seconds, not minutes). 
A `cacheEntryCount` that sits at `MaxEntriesPerPlc` for long stretches says "the LRU is constantly evicting" — either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks. `cacheBytes` is the memory-side counter: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression. - -### 2.4 Memory / process health - -| KPI | Definition | Source | Widget | Alert | Effort | -|-----|------------|--------|--------|-------|--------| -| `process.workingSetMb` | `Process.GetCurrentProcess().WorkingSet64 / 1MB` | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h | -| `process.gcCollections.gen0/1/2` | GC counts per generation | `GC.CollectionCount(n)` | Sparkline | Gen-2 frequency → memory pressure | 0.5 h | -| `process.threadCount` | `Process.Threads.Count` | New | Stat tile | > 200 → leak? | 0.5 h | - -**Why this matters.** A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch 90 % of common leak patterns. Each is one `Process` API call, no perf overhead. - ---- - -## Real-time updates via SignalR - -Today's status surface is poll-based: the HTML page uses a 5-second `meta-refresh`, and Prometheus / custom HMI scrapers hit `/status.json` on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a **live fleet dashboard with many panels open**, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping `bound → recovering`) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update. 
- -**The recommendation is additive, not replacement.** Keep `/status.json` for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change. - -### Why this is cheap to add - -The `Microsoft.AspNetCore.App` framework reference that Phase 07 added to the csproj **already includes `Microsoft.AspNetCore.SignalR`** — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on `Mbproxy.AdminPort`. No additional port, no additional listener supervision, no additional shutdown path. - -### Architecture - -``` - ┌─→ Dashboard A (subscribed to "all") -ProxyWorker / Supervisors ──┐ │ -ConfigReconciler ───────────┤ │ -ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer") -ServiceCounters ────────────┘ (background loop + │ - immediate-push paths) └─→ Dashboard C (subscribed to "service") -``` - -- **`StatusHub : Hub`** — the SignalR endpoint mounted at `/hub/status` on `AdminPort`. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates. -- **`StatusBroadcaster : IHostedService`** — the background pusher. Holds a `Timer` (or `PeriodicTimer`) that ticks at `PushIntervalMs` (default 1000 ms), builds a `StatusResponse` via the existing `StatusSnapshotBuilder`, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes `PushEventAsync(name, props)` for the immediate-push paths. -- **Immediate-push wiring** — the existing log events (`mbproxy.listener.recovered`, `mbproxy.config.reload.applied`, `mbproxy.backend.failed`, `mbproxy.rewrite.partial_bcd`, etc.) gain a fan-out call to `broadcaster.PushEventAsync(...)` so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick. 
- -### Hub contract - -**Hub URL:** `https://:/hub/status` - -**Hub groups** — clients subscribe to scopes; the server broadcasts to matching groups: - -| Group | Receives | -|-------|----------| -| `all` | Every update for every PLC + every service-level event | -| `service` | Service-level events only (`mbproxy.config.*`, `mbproxy.admin.*`, `mbproxy.startup.*`, `mbproxy.shutdown.*`) | -| `plc:` | One PLC's snapshots + that PLC's events | - -**Server-side methods** (client → server): - -| Method | Purpose | -|--------|---------| -| `Task SubscribeFleet()` | Join group `all` | -| `Task SubscribeService()` | Join group `service` | -| `Task SubscribePlc(string name)` | Join group `plc:` after validating that `name` exists in current options | -| `Task Unsubscribe()` | Leave every group; the connection stays open but receives nothing | - -**Client-side callbacks** (server → client, named `On*` per SignalR convention): - -| Callback | Payload | When | -|----------|---------|------| -| `OnSnapshot(StatusResponse snapshot)` | Full snapshot of the relevant scope (`all`, `service`, or a single PLC) | Sent once on subscribe so the dashboard has a baseline; thereafter only on initial reconnect | -| `OnPatch(StatusPatch patch)` | Delta of fields that changed since the last push | Periodic — every `PushIntervalMs` if anything changed; skipped if nothing changed | -| `OnEvent(StatusEvent ev)` | Single discrete event: `{ name, levelString, plc?, propertiesJson, timestampUtc }` | Immediately — fan-out from the existing `[LoggerMessage]` event call sites | - -`StatusPatch` carries only the fields that changed since the previous push: it's a `Dictionary` keyed by JSON path (e.g., `"plcs[2].pdus.forwarded"`, `"plcs[2].listener.state"`). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle. 
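
**Sketch.** A minimal illustration of the patch idea described above, assuming each snapshot has already been flattened into a dictionary keyed by JSON path. The `StatusPatchSketch` class, its method names, and the flattening step are assumptions for illustration; the real `StatusPatch` type is not shown here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: diff two flattened snapshots and apply the result.
public static class StatusPatchSketch
{
    // Returns the entries of `current` that are new or differ from `previous`.
    // Keys that disappeared are not reported; a full snapshot would cover those.
    public static Dictionary<string, object> Diff(
        IReadOnlyDictionary<string, object> previous,
        IReadOnlyDictionary<string, object> current)
    {
        var changed = new Dictionary<string, object>(StringComparer.Ordinal);
        foreach (var pair in current)
        {
            object old;
            if (!previous.TryGetValue(pair.Key, out old) || !object.Equals(old, pair.Value))
            {
                changed[pair.Key] = pair.Value;
            }
        }
        return changed;
    }

    // Dashboard side: overwrite the local model with the changed fields.
    public static void Apply(
        IDictionary<string, object> model,
        IReadOnlyDictionary<string, object> patch)
    {
        foreach (var pair in patch)
        {
            model[pair.Key] = pair.Value;
        }
    }
}
```

An idle fleet produces an empty dictionary, so the broadcaster can skip the push entirely when `Diff` returns no entries.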
- -### What gets pushed, and when - -| Update kind | Cadence | Volume per PLC | Channel | -|-------------|---------|----------------|---------| -| Counter increments (PDUs, bytes, rewrites) | Every `PushIntervalMs` if changed; coalesced | 1 patch / push tick / subscribed group | `OnPatch` | -| State transitions (`bound ↔ recovering ↔ stopped`) | Immediate | 1 event + 1 patch | `OnEvent` + `OnPatch` | -| Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | `OnEvent` | -| Hot-reload applied / rejected | Immediate | 1 event with `propertiesJson` summary | `OnEvent` | -| Periodic full snapshot | Every 60 s | 1 full snapshot | `OnSnapshot` | - -The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth. - -### Configuration - -Extend `appsettings.json` with: - -```jsonc -"Mbproxy": { - // ... existing keys ... - "Admin": { - "SignalR": { - "Enabled": true, - "PushIntervalMs": 1000, // patch cadence - "FullSnapshotIntervalMs": 60000, // periodic re-baseline - "MaxConcurrentClients": 32, // refuse new connections beyond this - "MaxGroupsPerClient": 8 // anti-runaway-subscription guard - } - } -} -``` - -Defaults make the feature opt-in-able-by-omission: if `SignalR.Enabled = false`, the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality — first ship with restart-required. - -### Implementation outline - -1. **Hub class** — `src/Mbproxy/Admin/StatusHub.cs`. Inherits `Hub`. Implements the four `Subscribe*` / `Unsubscribe` methods. `OnConnectedAsync` rejects if `Context.Items.Count > MaxConcurrentClients` (track in a static `ConcurrentDictionary` indexed by `ConnectionId`). -2. **Broadcaster** — `src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService`. 
Constructor takes `IHubContext<StatusHub>`, `StatusSnapshotBuilder`, `IOptionsMonitor`. The push loop is a `PeriodicTimer`-driven `while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... }` body, which is preferable to `Timer` for cancellation correctness. -3. **DTOs** — `StatusPatch` and `StatusEvent` records added to `StatusDto.cs`, registered with the source-gen `StatusJsonContext`. -4. **Event fan-out** — the existing `[LoggerMessage]` partial methods stay; add a thin `RealtimeLogEvents` wrapper class that logs AND calls `broadcaster.PushEventAsync(...)`. Call sites in supervisors / pipelines / reconciler swap to the wrapper. Keeps log-only call sites and broadcast-too call sites both readable. -5. **Hub mapping** — `AdminEndpointHost` adds `app.MapHub<StatusHub>("/hub/status")` if `SignalR.Enabled`. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint. -6. **Shutdown** — `StatusBroadcaster.StopAsync` cancels its pump and the hub's `Dispose` chain handles connection teardown. The existing `ShutdownCoordinator` deadline applies. - -### Test approach - -Use the **`Microsoft.AspNetCore.SignalR.Client`** package (NuGet) in the test csproj only. Pattern: - -```csharp -[Fact] -[Trait("Category", "E2E")] -public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException() -{ - // Arrange: start host on a random AdminPort, build a SignalR client. - var connection = new HubConnectionBuilder() - .WithUrl($"http://localhost:{adminPort}/hub/status") - .Build(); - - var patches = new ConcurrentQueue<StatusPatch>(); - connection.On<StatusPatch>("OnPatch", patches.Enqueue); - await connection.StartAsync(TestContext.Current.CancellationToken); - await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken); - - // Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1). - // ... drive request through proxy ... - - // Assert: a patch with backend.connectsFailed != 0 arrives within 500 ms. 
- var deadline = DateTime.UtcNow.AddMilliseconds(500); - while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"))) - await Task.Delay(20, TestContext.Current.CancellationToken); - - patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")); -} -``` - -Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly. - -Coverage targets for the new tests: -1. `SignalR_Subscribe_DeliversInitialSnapshot` -2. `SignalR_Patch_FiresWithinPushInterval_AfterCounterChange` -3. `SignalR_Event_FiresWithin_100ms_OfListenerRecovered` -4. `SignalR_SubscribePlc_OnlyDeliversThatPlcEvents` — verifies group filtering -5. `SignalR_MaxConcurrentClients_RefusesExcess` — capacity guard -6. `SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs` - -### Operational considerations - -- **Authentication / authorisation.** Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy. -- **Transport.** SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport. -- **Backpressure.** `Hub.Clients.Group("all").SendAsync` does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy. -- **Reconnection.** The .NET / browser SignalR clients reconnect automatically with exponential backoff. The periodic full snapshot every 60 s ensures the dashboard re-baselines after a reconnect even without explicit re-subscription logic on the client side. 
-- **Cardinality at scale.** 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The `MaxConcurrentClients` guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy. -- **CORS.** If dashboards run on a different origin (likely), enable CORS on the admin app for `/hub/status` only. Add `AdminCors.AllowedOrigins` to `appsettings.json` as an array of allowed origin strings; an empty array means same-origin only. -- **Logging.** SignalR's internal logs are noisy at Information. In `appsettings.json`, set the `Microsoft.AspNetCore.SignalR` category to `Warning` and `Microsoft.AspNetCore.Http.Connections` to `Warning` so the proxy's own event stream isn't drowned out. - -### Effort estimate - -| Work | Hours | -|------|-------| -| Hub + DTOs + broadcaster | 6 h | -| Event fan-out wiring (existing log events) | 3 h | -| AdminEndpointHost integration + appsettings binding | 2 h | -| E2E test suite (6 tests using SignalR .NET client) | 4 h | -| Documentation (this section graduates from proposal to fact; design.md update) | 1 h | -| **Total** | **~16 h** | - -This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production. - ---- - -## Implementation notes - -### Where rates and percentiles should live - -Two reasonable answers: - -1. **Compute in the proxy, expose pre-computed values in `/status.json`.** Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change. -2. **Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates.** Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar. 
- -**Recommendation:** ship Tier 1 rate metrics computed in-process for the operator who just opens `http://:8080/` in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative. - -### Counter additions vs computed values - -A few proposed KPIs require **new counters in `ProxyCounters` or `ServiceCounters`**, not just derivations: - -- `pdus.lastForwardedUtc` — new `volatile long _lastForwardedTicks` on `ProxyCounters`. -- `listener.boundRatio.*` — new `StateDurationTracker` on `PlcListenerSupervisor`. -- `partialBcd.byClient` / `invalidBcd.byAddress` — new `ConcurrentDictionary` / `ConcurrentDictionary` on `PerPlcContext`. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases). -- `process.*` — read fresh on every snapshot from `Process.GetCurrentProcess()` — no stored state. - -### Snapshot serialization cost - -`StatusResponse` is built per-request to `/status.json`. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., `invalidBcd.byAddress`) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep `/status.json` under a few hundred KB even when something goes badly wrong. 
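
As a rough illustration of the top-N cap described above, a dimensional map could be trimmed just before serialization. This is a minimal sketch, not the project's code: `TopByCount`, the sample addresses, and the counts are all invented for the example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keep only the `limit` largest counters so a
// misbehaving PLC cannot inflate /status.json without bound.
static Dictionary<ushort, long> TopByCount(IReadOnlyDictionary<ushort, long> counts, int limit) =>
    counts.OrderByDescending(kv => kv.Value)   // largest counts first
          .ThenBy(kv => kv.Key)                // deterministic order for ties
          .Take(limit)
          .ToDictionary(kv => kv.Key, kv => kv.Value);

// Invented sample data: invalid-BCD hits keyed by Modbus address.
var invalidBcdByAddress = new Dictionary<ushort, long>
{
    [1072] = 40, [1080] = 3, [2000] = 17, [2001] = 1,
};

var capped = TopByCount(invalidBcdByAddress, limit: 2);

// Re-sort for printing, because Dictionary enumeration order is not guaranteed.
Console.WriteLine(string.Join(", ",
    capped.OrderByDescending(kv => kv.Value).Select(kv => $"{kv.Key}={kv.Value}")));
// prints 1072=40, 2000=17
```

With `limit: 50` the same call would implement the "top-50 by count, drop the rest" rule; sorting a few dozen entries per PLC on each snapshot should be cheap next to the JSON serialization itself.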
- -### Dashboard widget mapping (Grafana-style cheat sheet) - -| Widget | Use for | -|--------|---------| -| **Stat (big number)** | Service-wide aggregates, counts, latest timestamps | -| **Gauge** | Ratios (availability, success rate, queue depth) | -| **Sparkline** | Rates, percentiles, time-series trends | -| **Stacked area** | Bandwidth, PDU-by-FC breakdown over time | -| **Heatmap** | Per-address / per-client dimensional breakdowns | -| **Cell-coloured table** | Per-PLC status (54 rows, one per PLC, columns of KPIs) | - -### Backwards-compat policy - -The fields currently in `/status.json` are **frozen** — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in [`design.md`](design.md) → "Status page" as the contract; new fields ship via PRs that update the contract first. - -## Cross-references - -- Field tables for what ships today: [`design.md`](design.md) → "Status page". -- Stable log event names (some KPIs are derivable by tailing these): [`design.md`](design.md) → "Logging" event-name table. -- Per-counter wiring lives in `src/Mbproxy/Proxy/ProxyCounters.cs` and `src/Mbproxy/ServiceCounters.cs`. -- The status HTML page is rendered by `src/Mbproxy/Admin/StatusHtmlRenderer.cs`; the JSON DTOs and source-gen context live in `src/Mbproxy/Admin/StatusDto.cs`. diff --git a/mbproxy/docs/operations.md b/mbproxy/docs/operations.md deleted file mode 100644 index ccb6744..0000000 --- a/mbproxy/docs/operations.md +++ /dev/null @@ -1,176 +0,0 @@ -# mbproxy operations runbook - -Day-two operations reference for the mbproxy Windows Service: install, upgrade, configuration, logs, and troubleshooting. - -## Install - -### Prerequisites - -- Windows 10 / Server 2019 or later (64-bit). -- PowerShell 5.1+ run as Administrator (the install script uses `#Requires -RunAsAdministrator`). -- The compiled publish output from `dotnet publish` (see [README.md](../README.md) for the exact command). 
-- Modbus TCP reachable from the proxy host to the PLCs on port 502. -- Port 8080 (or whatever `AdminPort` is set to) available for the status page. - -### Steps - -1. Publish the binaries on the build machine: - - ```powershell - dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true -o C:\build\mbproxy-publish - ``` - -2. Copy the publish output to the target server (or run the install script locally if you built on the server). - -3. Open an elevated PowerShell prompt and run the install script: - - ```powershell - .\install\install.ps1 -PublishOutput C:\build\mbproxy-publish -Start - ``` - - The script: - - Copies binaries to `C:\Program Files\Mbproxy\` (configurable via `-InstallPath`). - - Registers the service with `sc.exe create`. - - Sets failure-recovery: restart after 60 s on first/second failure, no action on third. - - Creates `%ProgramData%\mbproxy\logs\` and sets ACLs if needed. - - Copies `mbproxy.config.template.json` → `%ProgramData%\mbproxy\appsettings.json` **only if no config exists**. - - Registers the Windows Event Log source `mbproxy`. - - With `-Start`, starts the service and waits up to 30 s for `RUNNING` state. - -4. Edit `%ProgramData%\mbproxy\appsettings.json` to configure your PLC list and BCD tags. See the template for inline comments on every field. - -5. If you edited the config before starting, start the service: - - ```powershell - sc.exe start mbproxy - ``` - -6. Verify (smoke checklist — see [Smoke checklist](#first-install-smoke-checklist) below). - -### Re-running install on an existing installation - -The install script is idempotent. Re-running it: -- Stops the service if running. -- Overwrites the binaries. -- Updates the service config via `sc.exe config` (not `sc.exe create`). -- Preserves `%ProgramData%\mbproxy\appsettings.json` (never overwritten on update). -- Skips Event Log source creation if already registered. - -## Upgrade procedure - -1. 
Publish new binaries on the build machine (same command as install step 1). - -2. Stop the service: - - ```powershell - sc.exe stop mbproxy - ``` - - Wait for the service to reach `STOPPED` state — graceful shutdown drains in-flight PDUs (up to `Connection.GracefulShutdownTimeoutMs`, default 10 s). - -3. Copy new binaries to `C:\Program Files\Mbproxy\` (or run `install.ps1 -PublishOutput ...` to automate steps 2–4): - - ```powershell - Copy-Item -Path C:\build\mbproxy-publish\* -Destination 'C:\Program Files\Mbproxy\' -Force - ``` - -4. Start the service: - - ```powershell - sc.exe start mbproxy - ``` - -5. Check the status page to confirm the new version: - - ```powershell - Invoke-RestMethod http://localhost:8080/status.json | Select-Object -ExpandProperty service - ``` - - The `version` field should show the new build. - -## Uninstall - -```powershell -.\install\uninstall.ps1 -``` - -Options: -- `-KeepConfig` — preserves `%ProgramData%\mbproxy\appsettings.json` for re-install. -- Log files are **always archived** to `%ProgramData%\mbproxy.archived-\logs\` regardless of `-KeepConfig`. They are never deleted. - -## Configuration - -The service reads `%ProgramData%\mbproxy\appsettings.json` at startup and watches it for changes while running. Most settings are hot-reloadable; a save triggers a re-bind of `IOptionsMonitor` and a per-change-kind reconcile. - -- Full schema (every `Mbproxy:*` key, defaults, validation rules, examples): [`Operations/Configuration.md`](Operations/Configuration.md). -- Per-change-kind reconcile semantics (what propagates instantly vs. what requires a restart): [`Features/HotReload.md`](Features/HotReload.md). - -If a reload is rejected (`mbproxy.config.reload.rejected` in the log), the service continues running with the previous config. Fix the JSON and save again — the next valid file write is accepted. 
- -## Logs - -### Location - -Rolling log files live at: `C:\ProgramData\mbproxy\logs\mbproxy-.log` - -One file per day, retained for 30 days by default (controlled by `retainedFileCountLimit` in the Serilog config section). - -### Windows Event Log - -When running as a Windows Service, the `EventLogBridge` sink writes events at Error level and above to the Windows Application Event Log under source `mbproxy`. View with: - -```powershell -Get-EventLog -LogName Application -Source mbproxy -Newest 20 -``` - -Or open Event Viewer → Windows Logs → Application, filter by source `mbproxy`. - -### Log survival after uninstall - -`uninstall.ps1` **never deletes log files**. It moves `logs\` to a timestamped archive at `%ProgramData%\mbproxy.archived-\logs\` so post-crash diagnostics remain accessible. - -## Status page - -**URL:** `http://:/` (default port 8080; change via `Mbproxy.AdminPort` in `appsettings.json`). - -Routes: `GET /` (auto-refreshing HTML, no external assets) and `GET /status.json` (same data as JSON for monitoring scrapers). - -The full endpoint shape, every JSON field, counter semantics, and scraping examples live in [`Operations/StatusPage.md`](Operations/StatusPage.md). KPI catalog and dashboard guidance: [`kpi.md`](kpi.md). - -## Common failure modes - -The full diagnosis playbook — startup bind conflicts, backend connectivity, hot-reload validation errors, BCD rewrite anomalies, performance and queue-depth issues, response-cache anomalies, and graceful-shutdown problems — is keyed to log events and status counters in [`Operations/Troubleshooting.md`](Operations/Troubleshooting.md). The complete `mbproxy.*` event catalog with levels, properties, and operator implications is in [`Reference/LogEvents.md`](Reference/LogEvents.md). - -## First-install smoke checklist - -Run these commands after `install.ps1 -Start` to verify the deployment: - -```powershell -# 1. Service is running -Get-Service mbproxy | Select-Object Status, DisplayName - -# 2. 
Status page is reachable -Invoke-WebRequest http://localhost:8080/ -UseBasicParsing | Select-Object StatusCode - -# 3. JSON endpoint returns expected fields -$status = Invoke-RestMethod http://localhost:8080/status.json -$status.service | Select-Object version, uptimeSeconds -$status.listeners - -# 4. Log file exists and is recent -Get-Item "C:\ProgramData\mbproxy\logs\mbproxy-*.log" | Sort-Object LastWriteTime -Descending | Select-Object -First 1 - -# 5. No Error events in the Event Log -Get-EventLog -LogName Application -Source mbproxy -EntryType Error -Newest 5 - -# 6. Stop the service cleanly (graceful shutdown within 10 s) -$sw = [System.Diagnostics.Stopwatch]::StartNew() -sc.exe stop mbproxy -$deadline = [DateTime]::UtcNow.AddSeconds(15) -do { Start-Sleep 1 } until ((Get-Service mbproxy).Status -eq 'Stopped' -or [DateTime]::UtcNow -gt $deadline) -$sw.Stop() -Write-Host "Stop elapsed: $($sw.ElapsedMilliseconds) ms" -(Get-Service mbproxy).Status # Should be Stopped -``` - -**Note:** This checklist documents the expected steps. It was not executed on a dedicated clean VM (the proxy was developed and unit/E2E tested in-process). Run this checklist on first deployment to a production host. diff --git a/mbproxy/docs/plan/00-bootstrap.md b/mbproxy/docs/plan/00-bootstrap.md deleted file mode 100644 index 5d2b365..0000000 --- a/mbproxy/docs/plan/00-bootstrap.md +++ /dev/null @@ -1,179 +0,0 @@ -# Phase 00 — Bootstrap - -Scaffold the .NET 10 Worker Service project and the test project. Wire up Generic Host, Serilog, Windows-Service registration, and `MbproxyOptions` POCOs bound via `IOptionsMonitor`. No proxy logic yet — the service starts, logs "ready", and stops cleanly. - -**Depends on:** nothing. Must run alone. -**Parallel-safe with:** nothing. Phase 00 owns the initial `.csproj` and solution; subsequent phases append. - -## Goal - -Produce a minimal but production-shaped host that all subsequent phases plug into. 
The host must: - -- Target `.NET 10` (`net10.0`), be registered as a Windows Service via `Microsoft.Extensions.Hosting.WindowsServices`, and also run as a console under `dotnet run` for local dev. -- Load `appsettings.json` with `reloadOnChange: true`, bind the `"Mbproxy"` section to typed POCOs, and expose them via `IOptionsMonitor`. -- Use Serilog with console + rolling-file sinks under `%ProgramData%\mbproxy\logs\` (configurable, but default that location). -- Set `true` and `enable` in the csproj. These stay set forever. - -## Outputs (files created in this phase) - -``` -Mbproxy.slnx -src/Mbproxy/Mbproxy.csproj -src/Mbproxy/Program.cs -src/Mbproxy/HostingExtensions.cs # AddMbproxyOptions, AddMbproxySerilog -src/Mbproxy/Options/MbproxyOptions.cs -src/Mbproxy/Options/BcdTagOptions.cs -src/Mbproxy/Options/PlcOptions.cs -src/Mbproxy/Options/ConnectionOptions.cs -src/Mbproxy/Options/ResilienceOptions.cs -src/Mbproxy/Options/BcdTagListOptions.cs # the Global + per-PLC Add/Remove DTOs -src/Mbproxy/Workers/HeartbeatWorker.cs # one-line "service alive" worker; deleted by phase 03 -src/Mbproxy/appsettings.json # minimal default with empty Plcs array -tests/Mbproxy.Tests/Mbproxy.Tests.csproj -tests/Mbproxy.Tests/HostSmokeTests.cs -tests/Mbproxy.Tests/Options/MbproxyOptionsBindingTests.cs -.gitignore # add bin/, obj/, .vs/, *.user, tests/sim/.venv/, %ProgramData%\mbproxy\ -``` - -No other files. Phase 00 does NOT create: -- BCD codec types (phase 02) -- Proxy types (phase 03) -- Listener supervisor (phase 05) -- Status page (phase 07) - -## Tasks - -1. **Create `Mbproxy.slnx`** referencing the two csprojs. -2. **`src/Mbproxy/Mbproxy.csproj`** — ``, `TargetFramework=net10.0`, `OutputType=Exe`, `Nullable=enable`, `TreatWarningsAsErrors=true`, `ImplicitUsings=enable`. 
PackageReferences: - - `Microsoft.Extensions.Hosting` (latest stable for .NET 10) - - `Microsoft.Extensions.Hosting.WindowsServices` - - `Serilog.Extensions.Hosting` - - `Serilog.Settings.Configuration` - - `Serilog.Sinks.Console` - - `Serilog.Sinks.File` - - `Polly` (referenced now so phase 04/05 don't have to touch this csproj for the package; usage is deferred) -3. **`Options/MbproxyOptions.cs`** and siblings — typed POCOs that mirror the appsettings schema in [`../design.md`](../design.md) → Configuration. Keep them plain DTOs (`public sealed class` with init-only properties). Use `IValidateOptions` for cross-field checks at the **schema** level only (no business rules like "duplicate addresses" — those move to phase 06 along with hot-reload). -4. **`HostingExtensions.cs`** — extension methods on `IHostApplicationBuilder` named `AddMbproxyOptions(IConfiguration)` and `AddMbproxySerilog(IConfiguration)`. Keep `Program.cs` thin: read config, call the two extensions, register `HeartbeatWorker`, run. -5. **`Program.cs`** — Generic Host with `.UseWindowsService()`. `await Host.CreateApplicationBuilder(args)...Build().RunAsync()`. Honour `--console` as a no-op flag for documentation symmetry with the design (the worker SDK + UseWindowsService combo already runs in console mode under `dotnet run`). -6. **`Workers/HeartbeatWorker.cs`** — `BackgroundService` that logs `mbproxy.startup.ready` once after `Task.Delay(100)` (so Serilog has flushed) and then idles. This worker is deleted in phase 03 when the real listener supervisor takes over; it exists so phase 00's smoke test has something to assert. -7. **`appsettings.json`** — minimal, valid against the POCOs, with `Plcs: []`. Include the full key shape (`BcdTags.Global`, `AdminPort`, `Connection`, `Resilience`) so future phases just fill in values. -8. **`tests/Mbproxy.Tests/Mbproxy.Tests.csproj`** — Microsoft.NET.Sdk, `TargetFramework=net10.0`, same `Nullable`/`TreatWarningsAsErrors`. 
ProjectReference to `src/Mbproxy/Mbproxy.csproj`. PackageReferences: - - `Microsoft.NET.Test.Sdk` - - `xunit` (v3 if a stable release exists; v2 otherwise — record the decision in the csproj comment) - - `xunit.runner.visualstudio` - - `Shouldly` -9. **`HostSmokeTests.cs`** — build the host with `Host.CreateApplicationBuilder` against a synthetic config, start it on a `CancellationTokenSource` with a short deadline, assert it logged `mbproxy.startup.ready` and shut down without unhandled exceptions. -10. **`MbproxyOptionsBindingTests.cs`** — bind a hand-written `Dictionary` config source into `MbproxyOptions`, assert all fields populate correctly (including a `Plcs` entry with `BcdTags.Add` and `BcdTags.Remove`). - -## Public surface declared in this phase - -```csharp -namespace Mbproxy.Options; - -public sealed class MbproxyOptions { - public BcdTagListOptions BcdTags { get; init; } = new(); - public IReadOnlyList Plcs { get; init; } = []; - public int AdminPort { get; init; } = 8080; - public ConnectionOptions Connection { get; init; } = new(); - public ResilienceOptions Resilience { get; init; } = new(); -} - -public sealed class BcdTagListOptions { - public IReadOnlyList Global { get; init; } = []; -} - -public sealed class BcdTagOptions { - public ushort Address { get; init; } - public byte Width { get; init; } // 16 or 32 -} - -public sealed class PlcOptions { - public string Name { get; init; } = ""; - public int ListenPort { get; init; } - public string Host { get; init; } = ""; - public PlcBcdOverrides? 
BcdTags { get; init; } -} - -public sealed class PlcBcdOverrides { - public IReadOnlyList Add { get; init; } = []; - public IReadOnlyList Remove { get; init; } = []; -} - -public sealed class ConnectionOptions { - public int BackendConnectTimeoutMs { get; init; } = 3000; - public int BackendRequestTimeoutMs { get; init; } = 3000; -} - -public sealed class ResilienceOptions { - public RetryProfile BackendConnect { get; init; } = new() { MaxAttempts = 3, BackoffMs = [100, 500, 2000] }; - public RecoveryProfile ListenerRecovery { get; init; } = new() { - InitialBackoffMs = [1000, 2000, 5000, 15000, 30000], - SteadyStateMs = 30000, - }; -} - -public sealed class RetryProfile { - public int MaxAttempts { get; init; } - public IReadOnlyList BackoffMs { get; init; } = []; -} - -public sealed class RecoveryProfile { - public IReadOnlyList InitialBackoffMs { get; init; } = []; - public int SteadyStateMs { get; init; } -} -``` - -```csharp -namespace Mbproxy; - -internal static class HostingExtensions { - public static IHostApplicationBuilder AddMbproxyOptions(this IHostApplicationBuilder b); - public static IHostApplicationBuilder AddMbproxySerilog(this IHostApplicationBuilder b); -} -``` - -```csharp -namespace Mbproxy.Workers; -internal sealed class HeartbeatWorker : BackgroundService { /* logs mbproxy.startup.ready */ } -``` - -No other public types in this phase. - -## Tests required - -### Unit (`Category = Unit`, default) - -1. `MbproxyOptionsBinding_BindsGlobalBcdTags_From_appsettings` -2. `MbproxyOptionsBinding_BindsPerPlcAddAndRemove` -3. `MbproxyOptionsBinding_DefaultsAreApplied_WhenSectionMissing` (AdminPort=8080, Resilience defaults) -4. `MbproxyOptionsBinding_RejectsInvalidWidth` — IValidateOptions returns Fail for `Width != 16 && Width != 32`. Schema-level only; address-overlap validation is phase 06. -5. 
`HostSmoke_StartsAndStops_Cleanly_AndLogs_StartupReady` — uses a Serilog sink that captures events to memory; asserts the `mbproxy.startup.ready` event fired at Information. -6. `HostSmoke_ShutdownIsOrdered` — host responds to `StopAsync` within 2 s. - -### E2E (`Category = E2E`) - -None in this phase. The simulator harness is phase 01. - -## Phase gate - -- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings. -- [ ] `dotnet test --filter Category!=E2E` — all green, ≥6 tests. -- [ ] `dotnet run --project src/Mbproxy` — service starts, logs `mbproxy.startup.ready` to console within 5 s, exits cleanly on Ctrl-C. -- [ ] `appsettings.json` is a valid JSON document and parses into a populated `MbproxyOptions` instance via the test harness. -- [ ] [`../design.md`](../design.md) is unchanged (this phase introduces no new design decisions). -- [ ] Resource index entry for `docs/plan/00-bootstrap.md` is not needed (the plan README routes there). - -## Out of scope - -- BCD encode/decode logic (phase 02). -- TcpListener / Modbus framing / byte forwarding (phase 03). -- Polly retry pipelines (referenced as a NuGet, used starting in phase 04/05). -- Address-overlap / duplicate-port validation (phase 06). -- AdminPort HTTP endpoint (phase 07). -- Service install / uninstall scripts (phase 08). - -## Notes for the subagent - -- Do not create `README.md` for the tool root yet — that's a phase 08 deliverable when there's something installable to document. -- If the `xunit` v3 vs v2 question is unclear at implementation time, prefer v3 if available on NuGet — record the choice in a single-line comment at the top of the test csproj. Future phases must not silently switch. -- Use `LoggerMessage`-source-generated logging (`[LoggerMessage]`) for the heartbeat event so phases that add more log events can follow the same pattern. Set `EventId.Name = "mbproxy.startup.ready"`. 
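
For the `[LoggerMessage]` note above, a source-generated heartbeat log method might look like the following sketch. It assumes the `Microsoft.Extensions.Logging` package that the Generic Host already brings in. The class name `HeartbeatLog`, the numeric event id `1000`, and the message text are placeholders, not values taken from the design.

```csharp
using Microsoft.Extensions.Logging;

// Hypothetical container for source-generated log methods. The partial
// method body is emitted at compile time by the logging source generator.
internal static partial class HeartbeatLog
{
    [LoggerMessage(
        EventId = 1000,                        // placeholder id, not from the design
        EventName = "mbproxy.startup.ready",   // stable event name the smoke test asserts on
        Level = LogLevel.Information,
        Message = "mbproxy host is ready")]    // placeholder message text
    public static partial void StartupReady(ILogger logger);
}
```

A worker would then call `HeartbeatLog.StartupReady(logger)` once after startup. Later phases can add further partial methods to the same class so every stable event name is declared in one place.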
diff --git a/mbproxy/docs/plan/01-simulator-harness.md b/mbproxy/docs/plan/01-simulator-harness.md deleted file mode 100644 index e6cc752..0000000 --- a/mbproxy/docs/plan/01-simulator-harness.md +++ /dev/null @@ -1,108 +0,0 @@ -# Phase 01 — Simulator harness - -Wrap the existing pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) as a managed lifecycle for xUnit tests. After this phase, any test class that declares `[Collection(nameof(DL205SimulatorCollection))]` gets a running pymodbus server on a known port, with skip-safe behaviour when Python is unavailable. - -**Depends on:** Phase 00 (test project exists). -**Parallel-safe with:** Phase 02, Phase 03. (Touches only `tests/sim/` and `tests/Mbproxy.Tests/Sim/`. Disjoint from codec and proxy work.) - -## Goal - -Eliminate "did the simulator start?" as a source of flaky tests. Encode the launch / readiness-probe / shutdown / cleanup contract once, in a fixture, so phases 03 / 04 / 05 / 06 / 07 don't each reinvent it. Tests must be able to declare a dependency on the simulator and get a hot port back, OR get a clean skip if the environment can't provide one. - -## Outputs - -``` -tests/sim/run-dl205-sim.ps1 # idempotent launcher; venv-provisioning -tests/sim/README.md # how to run the simulator standalone -tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs -tests/Mbproxy.Tests/Sim/DL205SimulatorCollection.cs -tests/Mbproxy.Tests/Sim/SimulatorSmokeTests.cs # connects, sends FC03, verifies a seeded BCD register -``` - -Modifications: -- `.gitignore` already has `tests/sim/.venv/` from phase 00 — verify it's present. -- `tests/Mbproxy.Tests/Mbproxy.Tests.csproj` — add `NModbus` PackageReference (chosen for its small footprint and net10.0 compatibility; record the choice as a top-of-csproj comment). This is the Modbus TCP client used by tests against the simulator from this phase forward. - -No other files. - -## Tasks - -1. **`tests/sim/run-dl205-sim.ps1`** — pure PowerShell. 
Parameters: `-Profile ` (default `../DL260/dl205.json` relative to script), `-Port ` (default 5020). Behaviour: - - If `tests/sim/.venv` doesn't exist: `python -m venv tests/sim/.venv`, then `tests/sim/.venv/Scripts/pip.exe install "pymodbus[server]"` pinned to a known version (record version in the script + README). - - Activate the venv (`& tests/sim/.venv/Scripts/activate.ps1`). - - Exec `pymodbus.server run --modbus-config-path --modbus-server tcp --port `. Output streams to stdout/stderr; on script termination, the child server dies with it. - - Exit codes: 0 on clean exit, 1 on venv provisioning failure, 2 on pymodbus launch failure, 3 if the profile file is missing. -2. **`DL205SimulatorFixture : IAsyncLifetime`** — - - `InitializeAsync`: pick a free local port (bind/release a `TcpListener` on `IPEndPoint.Any:0`, capture the port, dispose). Spawn `pwsh -NoProfile -File -Port ` via `System.Diagnostics.Process` with `RedirectStandardOutput/Error`. Poll `new TcpClient().ConnectAsync("127.0.0.1", port)` at 100 ms intervals for up to 10 s. If the simulator never accepts a connection, capture stderr tail, set `SkipReason`, and dispose the process. - - `DisposeAsync`: send Ctrl-C to the process group (`Process.Kill(entireProcessTree: true)` on Windows is the pragmatic choice — pymodbus handles SIGTERM gracefully but Windows lacks proper signals; document the tradeoff in a comment). Wait up to 5 s for exit. - - Public surface: `string Host { get; }` (always `127.0.0.1`), `int Port { get; }`, `string? SkipReason { get; }`, `string LogTail { get; }` (last ~50 lines of stderr, for diagnosis). -3. **`DL205SimulatorCollection`** — - ```csharp - [CollectionDefinition(nameof(DL205SimulatorCollection))] - public sealed class DL205SimulatorCollection : ICollectionFixture { } - ``` - Tests that need the fixture declare `[Collection(nameof(DL205SimulatorCollection))]`. -4. **`SimulatorSmokeTests`** — `[Collection(nameof(DL205SimulatorCollection))] [Trait("Category", "E2E")]`. 
Three tests: - - `Simulator_AcceptsTcpConnection` - - `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — reads register 0, expects `0xCAFE` (the seeded marker from `dl205.json`). Uses NModbus directly. This proves the dl205.json profile is in fact loaded. - - `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — reads register 1072, expects raw `0x1234` (= 4660). This is the BCD register the proxy will rewrite later; phase 04's e2e test will read the SAME register through the proxy and assert 1234 instead. -5. **`tests/sim/README.md`** — a few lines: "Run `pwsh ./run-dl205-sim.ps1 -Port 5020` to launch the simulator standalone. Used by xUnit tests via `DL205SimulatorFixture`. Requires Python 3.10+; the script provisions a venv on first run." - -## Public surface declared in this phase - -```csharp -namespace Mbproxy.Tests.Sim; - -public sealed class DL205SimulatorFixture : IAsyncLifetime { - public string Host { get; } - public int Port { get; } - public string? SkipReason { get; } - public string LogTail { get; } - public Task InitializeAsync(); - public Task DisposeAsync(); -} - -[CollectionDefinition(nameof(DL205SimulatorCollection))] -public sealed class DL205SimulatorCollection : ICollectionFixture { } -``` - -No production code is added in this phase. - -## Tests required - -### Unit (Category = Unit) - -None in this phase. The fixture itself is a test-infrastructure component; its correctness is verified by the e2e smoke tests below. - -### E2E (Category = E2E) - -1. `Simulator_AcceptsTcpConnection` — open a TCP socket to `fixture.Host:fixture.Port` within the fixture lifetime. -2. `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — NModbus FC03, asserts `0xCAFE`. -3. `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — NModbus FC03, asserts raw `0x1234` (4660). - -When `SkipReason` is set, all three skip with `Assert.Skip(fixture.SkipReason)`. 
The phase gate explicitly verifies that on a machine WITH Python+pymodbus, none of them skip — skips are an environment failure, not a test pass. - -## Phase gate - -- [ ] `pwsh tests/sim/run-dl205-sim.ps1 -Port 5020` standalone — script provisions a venv on first run, server logs "Modbus TCP server listening" within 10 s, Ctrl-C exits cleanly. -- [ ] On second run: venv exists, script skips provisioning, server starts in < 2 s. -- [ ] On a machine WITHOUT Python: `SkipReason` is non-null and tests skip rather than fail. -- [ ] On a machine WITH Python: `SkipReason` is null, all three e2e smoke tests pass. -- [ ] `dotnet test --filter Category=E2E` is green on the dev machine. -- [ ] `dotnet test --filter Category!=E2E` still green (no regression to phase 00's tests). -- [ ] Build zero-warnings. -- [ ] `tests/sim/README.md` documents the manual launch path. - -## Out of scope - -- Multiple simultaneous simulators (one fixture instance is enough for all e2e tests via `ICollectionFixture`). -- Alternate profiles selected via `MODBUS_SIM_PROFILE` env var — defer until phase 04 actually needs a partial-overlap scenario; add the env-var support then. -- A C# pymodbus replacement / in-process Modbus mock. The pymodbus profile is the source of truth for DL-series quirks and we're not duplicating it. -- pip-mirror or offline-install support. CI is expected to have network or a pre-warmed venv; if a customer site needs offline install, that's a deployment concern (phase 08). - -## Notes for the subagent - -- Capture the chosen `pymodbus` version pin in both `run-dl205-sim.ps1` and `tests/sim/README.md` so the version isn't lost across re-provisioning. -- The free-port-picker pattern (bind on `:0`, capture port, dispose, then hand the port to the child process) has an inherent TOCTOU race — another process could grab the port between dispose and pymodbus binding. In practice this is rare; acceptable for tests. Note the trade-off in a comment. 
-- Pymodbus log output is verbose. Pipe it through a line buffer; only the last ~50 lines need to be available via `LogTail` for diagnosis. -- Do not commit the `.venv/` directory. diff --git a/mbproxy/docs/plan/02-bcd-codec.md b/mbproxy/docs/plan/02-bcd-codec.md deleted file mode 100644 index e97368e..0000000 --- a/mbproxy/docs/plan/02-bcd-codec.md +++ /dev/null @@ -1,157 +0,0 @@ -# Phase 02 — BCD codec - -Pure logic for encoding integers as DirectLOGIC BCD nibbles and decoding nibbles back. No I/O, no network, no Modbus framing. The codec exposed by this phase is what phase 04 plugs into the proxy. - -**Depends on:** Phase 00 (csproj + options POCOs). -**Parallel-safe with:** Phase 01, Phase 03. (All work lives under `src/Mbproxy/Bcd/` and `tests/Mbproxy.Tests/Bcd/` — disjoint from sim harness and proxy plumbing.) - -## Goal - -A tiny, allocation-free codec library that: -- Encodes a non-negative `int` (capped at the width's range) to either one 16-bit raw register value or a `(low, high)` register pair for 32-bit BCD per the design's CDAB digit-layout rule. -- Decodes one or two raw register values back to an `int`. -- Resolves `Global + per-PLC Add - per-PLC Remove` into an **immutable per-PLC `BcdTagMap`** that the rewriter looks up by Modbus address in O(1). - -The codec is the single source of BCD-encoding correctness in the system. Phase 04 must not reimplement any nibble math. - -## Outputs - -``` -src/Mbproxy/Bcd/BcdCodec.cs # static class: Encode16, Decode16, Encode32, Decode32 -src/Mbproxy/Bcd/BcdTag.cs # the public record (mirrors design.md exactly) -src/Mbproxy/Bcd/BcdTagMap.cs # immutable, address-keyed lookup; describes per-PLC resolved tags -src/Mbproxy/Bcd/BcdTagMapBuilder.cs # resolves global + Add - Remove into a map; runs validation -src/Mbproxy/Bcd/BcdValidationError.cs # enum + ValidationResult record - -tests/Mbproxy.Tests/Bcd/BcdCodecTests.cs -tests/Mbproxy.Tests/Bcd/BcdTagMapBuilderTests.cs -``` - -No other files. 
 The proxy plumbing layer doesn't exist yet and isn't touched. - -## Tasks - -1. **`BcdTag.cs`** — `public sealed record BcdTag(ushort Address, byte Width)` with a static factory `Create(ushort, byte)` that throws on `Width != 16 && Width != 32`. This record is the type phases 04 / 06 / 07 will use. -2. **`BcdCodec.cs`** — `internal static class` with four pure methods. Internal because the proxy is the only consumer; nothing else in the assembly should call these. - - `static ushort Encode16(int value)` — value in `[0, 9999]`; produces the 16-bit BCD register, e.g. `1234 → 0x1234`. Throws `ArgumentOutOfRangeException` if value is out of range. - - `static int Decode16(ushort raw)` — inverse. If any nibble is `>= 0xA`, throw `FormatException` with the raw value in the message; do not return a sentinel such as `int.MinValue`. The rewriter catches this and surfaces a `mbproxy.rewrite.invalid_bcd` event (event name added in phase 04). - - `static (ushort low, ushort high) Encode32(int value)` — value in `[0, 99_999_999]`; produces the CDAB pair, where `low` = low 4 BCD digits (least-significant) and `high` = high 4 BCD digits (most-significant). Decoded decimal = `Decode16(high) * 10000 + Decode16(low)`. Throws if out of range. - - `static int Decode32(ushort low, ushort high)` — inverse. Throws `FormatException` if either word has a bad nibble. -3. **`BcdTagMap.cs`** — `public sealed class BcdTagMap` wrapping a frozen address-keyed dictionary. Methods: - - `static BcdTagMap Empty { get; }` - - `bool TryGet(ushort address, out BcdTag tag)` — O(1) lookup. - - `bool TryGetForRange(ushort startAddress, ushort qty, out IEnumerable<(int offset, BcdTag tag)> hits)` — returns every BCD tag whose register footprint intersects `[startAddress, startAddress+qty)`. Offsets are relative to `startAddress`. Used by the rewriter to know which slots in a multi-register PDU to touch. - - `int Count { get; }`, `IEnumerable All { get; }` — for telemetry / status page. -4. 
**`BcdTagMapBuilder.cs`** — given `BcdTagListOptions Global` and `PlcBcdOverrides? perPlc`, produce a `(BcdTagMap, ValidationResult)`. Validation rules from design.md: - - Reject duplicate addresses within the resolved list (Add+Global after Remove). - - Reject 32-bit entries whose high register (`Address+1`) collides with any other entry's address (16-bit or 32-bit). - - Warn on `Remove` entries that don't match any address in Global (this is not a failure; the warning rides on `ValidationResult.Warnings`). - - Reject `Width` values other than 16/32 (defensive; phase 00's `IValidateOptions` should already have caught this, but the builder is the last line of defence). -5. **`BcdValidationError.cs`** — `public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth }`. `public sealed record ValidationResult(BcdTagMap Map, IReadOnlyList<BcdError> Errors, IReadOnlyList<BcdWarning> Warnings)`. Errors fail the build; warnings ride along. - -## Public surface declared in this phase - -```csharp -namespace Mbproxy.Bcd; - -public sealed record BcdTag(ushort Address, byte Width) { - public static BcdTag Create(ushort address, byte width); - public bool IsThirtyTwoBit => Width == 32; - public ushort HighRegister => (ushort)(Address + 1); // throws if Width != 32 -} - -public sealed class BcdTagMap { - public static BcdTagMap Empty { get; } - public int Count { get; } - public IEnumerable<BcdTag> All { get; } - public bool TryGet(ushort address, out BcdTag tag); - public bool TryGetForRange(ushort startAddress, ushort qty, out IReadOnlyList<RangeHit> hits); -} - -public readonly record struct RangeHit(int OffsetWords, BcdTag Tag); - -public static class BcdTagMapBuilder { - public static ValidationResult Build(BcdTagListOptions global, PlcBcdOverrides? perPlc); -} - -public sealed record ValidationResult( - BcdTagMap Map, - IReadOnlyList<BcdError> Errors, - IReadOnlyList<BcdWarning> Warnings); - -public sealed record BcdError(BcdValidationError Kind, string Message, ushort?
Address); -public sealed record BcdWarning(string Message, ushort? Address); -public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth } -``` - -```csharp -namespace Mbproxy.Bcd; -internal static class BcdCodec { - public static ushort Encode16(int value); - public static int Decode16(ushort raw); - public static (ushort low, ushort high) Encode32(int value); - public static int Decode32(ushort low, ushort high); -} -``` - -## Tests required - -### Unit (`Category = Unit`) - -`BcdCodecTests` (≥ 16 tests): - -1. `Encode16_1234_Returns_0x1234` -2. `Encode16_0_Returns_0x0000` -3. `Encode16_9999_Returns_0x9999` -4. `Encode16_10000_Throws_OutOfRange` -5. `Encode16_Negative_Throws_OutOfRange` -6. `Decode16_0x1234_Returns_1234` -7. `Decode16_0x0000_Returns_0` -8. `Decode16_0x9999_Returns_9999` -9. `Decode16_0x123A_Throws_Format` — bad nibble `A`. -10. `Encode32_12345678_Returns_LowHigh_5678_1234` — verify `low = 0x5678`, `high = 0x1234`. -11. `Encode32_0_Returns_LowHigh_0_0` -12. `Encode32_99999999_Returns_LowHigh_9999_9999` -13. `Encode32_100000000_Throws_OutOfRange` -14. `Decode32_LowHigh_5678_1234_Returns_12345678` -15. `Decode32_BadNibble_InLow_Throws` -16. `Decode32_BadNibble_InHigh_Throws` -17. `RoundTrip16_AllValuesUnder10000` — `[Theory]` with `[InlineData]` for boundary values; for the dense check use `[Theory] [MemberData]` enumerating every 100th value. The codec must be `Decode16(Encode16(v)) == v`. - -`BcdTagMapBuilderTests` (≥ 10 tests): - -1. `Build_EmptyGlobal_EmptyOverride_ReturnsEmptyMap` -2. `Build_GlobalOnly_PopulatesMap` -3. `Build_PerPlcAdd_AppendsToGlobal` -4. `Build_PerPlcRemove_DropsFromGlobal` -5. `Build_AddOverrideSameAddressAsGlobal_AddWidthWins` -6. `Build_DuplicateAddressInGlobal_ReturnsDuplicateAddressError` -7. `Build_32BitHighRegOverlaps16BitGlobal_ReturnsOverlappingHighRegisterError` -8. `Build_Remove_OfNonExistentAddress_ReturnsWarning_NotError` -9. `Build_InvalidWidth_ReturnsInvalidWidthError` -10. 
`Map_TryGetForRange_ReturnsAllHits_InOrder` — covers full overlap, partial overlap (low only, high only), and no overlap. - -### E2E (Category = E2E) - -None. The codec is pure logic. - -## Phase gate - -- [ ] Zero-warnings build. -- [ ] `dotnet test --filter Category=Unit` — all green, ≥ 26 new tests. -- [ ] `BcdCodec` is `internal`; nothing outside `Mbproxy.Bcd` calls it directly. -- [ ] `BcdTagMap` has zero allocations on `TryGet` and on the hot `TryGetForRange` path (verify via a microbench note in the test file's docstring; no benchmark project added). -- [ ] [`../design.md`](../design.md) → "BCD tag shape" matches the public record exactly; if the spec drifted during implementation, update design.md in this PR. - -## Out of scope - -- Signed BCD. Design explicitly excludes it. -- Half-byte / "BCD with sign nibble" variants used by some DL-family math instructions. Not in the design's tag shape. -- The actual PDU-byte-level rewriting (FC parsing, MBAP framing). That's phase 04. -- Telemetry counters. The codec exposes nothing to counters; phase 04 instruments the rewrite pipeline that USES the codec. - -## Notes for the subagent - -- The DirectLOGIC CDAB digit layout is the most-likely-to-confuse part of this phase. Re-read [`../design.md`](../design.md) → "BCD tag shape" and [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Word Order" before implementing `Encode32`/`Decode32`. The seeded marker in `dl205.json` for the float32 case (`HR[1056]=0x0000, HR[1057]=0x3FC0` for IEEE 1.5) confirms low-word-first; the BCD-32 case is the same word order with BCD nibble semantics inside each word. -- `BcdTagMapBuilder` is single-shot — given inputs, produce a map. There is NO `IObservable` here. Phase 06 owns reload-driven rebuilds and just calls `Build` again. -- `TryGetForRange` is on the hot path for FC03/04 responses. Implementation should pre-bucket BCD tags by 256-register window if it makes the lookup faster, but only if a microbench shows a real win. 
Don't preoptimise. diff --git a/mbproxy/docs/plan/03-proxy-plumbing.md b/mbproxy/docs/plan/03-proxy-plumbing.md deleted file mode 100644 index 2daaf53..0000000 --- a/mbproxy/docs/plan/03-proxy-plumbing.md +++ /dev/null @@ -1,129 +0,0 @@ -# Phase 03 — Proxy plumbing - -The minimum-viable proxy: one `TcpListener` per configured PLC, 1:1 upstream-client ↔ backend-socket, byte-for-byte forwarding both directions, transparent MBAP TxId / unit ID. No BCD rewriting yet — that's phase 04. No supervisor / auto-recovery — that's phase 05. - -**Depends on:** Phase 00 (host, options). -**Parallel-safe with:** Phase 02 (BCD codec lives under `src/Mbproxy/Bcd/`; this phase lives under `src/Mbproxy/Proxy/`). - -## Goal - -Stand up the listener-and-forwarder pair so an e2e test can: -1. Configure the proxy with `Plcs: [{ Host: "127.0.0.1", Port: , ListenPort: }]`. -2. Start the host. -3. Drive NModbus against `127.0.0.1:` and see the SAME bytes the simulator would return on a direct connection. - -The proxy is transparent in this phase. The BCD rewrite hook point is reserved but not wired. - -## Outputs - -``` -src/Mbproxy/Proxy/PlcListener.cs # owns one TcpListener; accepts loop -src/Mbproxy/Proxy/PlcConnectionPair.cs # one upstream socket + one backend socket; forwarder -src/Mbproxy/Proxy/IPduPipeline.cs # the rewrite hook contract (no-op impl in this phase) -src/Mbproxy/Proxy/NoopPduPipeline.cs # the no-op impl -src/Mbproxy/Proxy/ProxyWorker.cs # BackgroundService that owns all PlcListeners -src/Mbproxy/Proxy/MbapFrame.cs # MBAP header parse helpers (length, txid, unit) - -tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs # e2e against the simulator -tests/Mbproxy.Tests/Proxy/MbapFrameTests.cs # unit tests for the MBAP parser -``` - -Modifications: -- `src/Mbproxy/Program.cs` — register `ProxyWorker` as a hosted service. 
The `HeartbeatWorker` from phase 00 is DELETED in this phase (its job is replaced by ProxyWorker logging `mbproxy.startup.ready` after all listeners are bound). -- `src/Mbproxy/Workers/HeartbeatWorker.cs` — DELETED. - -## Tasks - -1. **`MbapFrame.cs`** — pure helpers, no allocations. Static methods: - - `static bool TryParseHeader(ReadOnlySpan<byte> buffer, out ushort txId, out ushort protocolId, out ushort length, out byte unitId)` — returns false if `buffer.Length < 7`. - - `static int TotalFrameLength(ushort lengthField)` — `lengthField + 6` (7 header bytes minus the 1-byte unit ID which is counted in the length field). -2. **`IPduPipeline.cs`** — the rewrite hook. Single method: - ```csharp - void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context); - ``` - `MbapDirection` is `RequestToBackend` or `ResponseToClient`. `PduContext` carries the per-pair state (counters, PLC name, configured tag map). In phase 03, the only implementation is `NoopPduPipeline` which does nothing. -3. **`NoopPduPipeline.cs`** — empty `Process` method. Registered as the default `IPduPipeline` in DI for this phase. Phase 04 replaces it with the real rewriter. -4. **`PlcConnectionPair.cs`** — owns the upstream `Socket` (or `TcpClient`) handed to it by `PlcListener.Accept`, opens a fresh backend socket to the configured PLC, and runs two `Task`s: - - **Upstream → backend**: read one full MBAP frame at a time (header → length → rest), call `pipeline.Process(RequestToBackend, header, pdu, ctx)`, write the frame to the backend. - - **Backend → upstream**: same shape, with `ResponseToClient`. - Either task ending (socket closed, exception, cancellation) tears down both sides cleanly. No retry loop; that's phase 05. - Backend connect is wrapped in a `try`/`catch` with the configured `BackendConnectTimeoutMs`. Connect failures close the upstream socket immediately and log `mbproxy.backend.failed`.
Polly bounded retries on backend connect are **deferred to phase 05** to keep this phase scope tight — note the deferral in code with `// Phase 05: wrap in Polly pipeline`. -5. **`PlcListener.cs`** — owns one `TcpListener` for one PLC. `StartAsync` binds; on bind failure, throws (caller logs `mbproxy.startup.bind.failed` and decides what to do — phase 05 will introduce the supervisor that turns this into a recoverable state). On each accept, hands the socket to a fresh `PlcConnectionPair` and runs it on the thread-pool. -6. **`ProxyWorker.cs`** — `BackgroundService`. On start: enumerates `MbproxyOptions.Plcs`, instantiates one `PlcListener` per entry, starts them all. Each bind that succeeds logs `mbproxy.startup.bind`; each that fails logs `mbproxy.startup.bind.failed` and continues to the next PLC (matching the design's "eager, continue on per-port failure" posture). After all bind attempts, logs `mbproxy.startup.ready` with `{ ListenersBound, PlcsConfigured }`. On stop: cancels and disposes all listeners and their open pairs. -7. **`Program.cs`** — remove the HeartbeatWorker registration; register `ProxyWorker`. Also register `IPduPipeline` as a singleton `NoopPduPipeline` in DI. - -## Public surface declared in this phase - -All `internal sealed class` — the proxy types are not consumed outside this assembly. The only public-shaped surfaces are the `IPduPipeline` interface and the `MbapDirection` enum (so phase 04 can implement its own pipeline cleanly). 
- -```csharp -namespace Mbproxy.Proxy; - -public interface IPduPipeline { - void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context); -} - -public enum MbapDirection { RequestToBackend, ResponseToClient } - -public sealed class PduContext { - public string PlcName { get; init; } = ""; - // Phase 04 adds: BcdTagMap, counters, logger -} - -internal sealed class NoopPduPipeline : IPduPipeline { /* no-op */ } -internal sealed class MbapFrame { /* static helpers */ } -internal sealed class PlcListener : IAsyncDisposable { /* ... */ } -internal sealed class PlcConnectionPair : IAsyncDisposable { /* ... */ } -internal sealed class ProxyWorker : BackgroundService { /* ... */ } -``` - -## Tests required - -### Unit (`Category = Unit`) - -`MbapFrameTests` (≥ 8 tests): - -1. `TryParseHeader_TooShort_ReturnsFalse` -2. `TryParseHeader_ValidFrame_ParsesAllFields` -3. `TryParseHeader_ProtocolId_NotZero_StillParses` — we don't reject non-zero protocol IDs; that's the PLC's job. -4. `TotalFrameLength_LengthField7_Returns13` -5. `TotalFrameLength_LengthFieldMax_Returns_LengthFieldPlus6` -6. Round-trip: parse a known good FC03 frame and assert each field. -7. Round-trip: parse a known good FC16 write-multiple frame. -8. Negative: a frame with `length < 2` still parses and returns the length value as-is; rejecting it is the caller's responsibility. Document in a test. - -### E2E (`Category = E2E`) - -`ProxyForwardingTests` (≥ 5 tests, `[Collection(nameof(DL205SimulatorCollection))]`): - -1. `Forward_FC03_HR0_Returns_SimulatorRawValue_0xCAFE` — proxy is transparent; client sees the raw simulator value. -2. `Forward_FC03_HR1072_Returns_RawBCD_0x1234` — the BCD register is NOT rewritten in phase 03 (NoopPduPipeline). This test will be REPLACED in phase 04 with one that asserts `1234` instead. Document the planned replacement in a comment so phase 04's agent knows what to update. -3. `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips` — proves the write path forwards correctly. -4.
`Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`. -5. `MbapTxId_IsPreservedEndToEnd` — issue 20 back-to-back FC03 reads with monotonically increasing TxIds; assert every response carries the matching TxId. -6. `BackendConnectFailure_ClosesUpstreamCleanly` — point the proxy at an unreachable backend (`127.0.0.1:1`), assert the client's socket is closed within `BackendConnectTimeoutMs + 200ms`. - -## Phase gate - -- [ ] Zero-warnings build. -- [ ] All phase 00, 02 tests still green. -- [ ] All new unit tests green (≥ 8 in MbapFrameTests). -- [ ] All new e2e tests green when the simulator is available; skip cleanly when it isn't. -- [ ] `dotnet run --project src/Mbproxy` with an appsettings.json pointing at the simulator: NModbus can read/write through the proxy and gets the simulator's raw values. -- [ ] On startup with one bad and one good PLC config, the good one binds and the bad one logs `mbproxy.startup.bind.failed`, and the service does NOT abort. (Hand the supervisor work to phase 05; this phase only proves the "continue on per-port failure" posture.) -- [ ] `mbproxy.startup.ready` is now logged by `ProxyWorker`, not by a heartbeat worker. The heartbeat worker file is deleted. - -## Out of scope - -- BCD rewriting (phase 04 replaces `NoopPduPipeline`). -- Polly retries on backend connect (phase 05 supervisor wraps this). -- Auto-recovery for failed listener binds (phase 05). -- Counter tracking / per-PLC telemetry (phase 04 starts adding counters via `PduContext`). -- Half-MBAP-frame handling (split TCP packets): rely on `NetworkStream.ReadAsync` returning short reads; loop to fill the header (7 bytes) and then loop to fill the body (`length - 1` more bytes). Test 5 above verifies this stays correct over 20 back-to-back requests. - -## Notes for the subagent - -- `Socket` vs `TcpClient`: prefer `Socket` directly so framing reads can use `ReadOnlyMemory` without `NetworkStream` allocation overhead. 
The performance difference is small but the byte-precise API matches what the rewriter in phase 04 will need. -- Frame reads use a per-pair pooled buffer of 260 bytes (MBAP header 7 + max PDU 253). Don't allocate per-frame. -- The "Phase 04 will replace test 2" pattern is intentional. Leave breadcrumbs so the next phase's agent knows exactly which test to update; do NOT silently make the test pass against a future rewriter. -- Both forwarder tasks run with the same `CancellationTokenSource`. Cancellation propagates from listener stop → pair stop → both task ends → socket dispose. diff --git a/mbproxy/docs/plan/04-rewriter-integration.md b/mbproxy/docs/plan/04-rewriter-integration.md deleted file mode 100644 index e0303f4..0000000 --- a/mbproxy/docs/plan/04-rewriter-integration.md +++ /dev/null @@ -1,146 +0,0 @@ -# Phase 04 — Rewriter integration - -Replace `NoopPduPipeline` with the real BCD rewriter. After this phase, FC03/FC04 responses have their configured BCD slots decoded to binary integers on the way to the client, and FC06/FC16 requests have their configured BCD slots encoded to nibbles on the way to the PLC. Counters and warnings come online here. - -**Depends on:** Phase 02 (codec + tag map), Phase 03 (plumbing + `IPduPipeline`). -**Parallel-safe with:** nothing (it integrates two prior phases' outputs). - -## Goal - -Wire `BcdTagMap` + `BcdCodec` into the proxy at the single hook point `IPduPipeline.Process(...)`. The rewriter is responsible for: - -- FC03 / FC04 responses: re-encode every covered slot from raw nibbles into a binary integer. -- FC06 / FC16 requests: re-encode every covered slot from binary integer into raw BCD nibbles. -- Partial-overlap of 32-bit pairs: pass through raw, emit `mbproxy.rewrite.partial_bcd` warning, increment partial-overlap counter. -- Bad BCD nibbles in a PLC response: pass through raw, emit `mbproxy.rewrite.invalid_bcd` (new event in this phase) at Warning, increment invalid-bcd counter. 
NEVER throw out of the pipeline. -- Increment per-pair counters for `pdus.forwarded`, `pdus.byFc`, `pdus.rewrittenSlots`, `pdus.partialBcdWarnings`, `pdus.invalidBcdWarnings`. - -The transparency contract holds: MBAP header bytes are untouched, length field is unchanged (re-encoded slots are the same byte width), TxId / unit ID flow through. - -## Outputs - -``` -src/Mbproxy/Proxy/BcdPduPipeline.cs # replaces NoopPduPipeline -src/Mbproxy/Proxy/PerPlcContext.cs # the per-PLC context (BcdTagMap + counters + logger) -src/Mbproxy/Proxy/ProxyCounters.cs # System.Threading.Interlocked counters -src/Mbproxy/Proxy/RewriterLogEvents.cs # [LoggerMessage] static partial methods - -tests/Mbproxy.Tests/Proxy/BcdPduPipelineTests.cs # unit tests against synthetic PDU bytes -tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs # e2e against the simulator -``` - -Modifications: -- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — replace `PduContext` (placeholder from phase 03) with `PerPlcContext`. Counters increment inline. The pipeline call site is unchanged in shape; only the context type and pipeline registration differ. -- `src/Mbproxy/Proxy/ProxyWorker.cs` — build one `PerPlcContext` per configured PLC at startup (calls `BcdTagMapBuilder.Build` and wraps the resulting map + a fresh `ProxyCounters` + a per-PLC logger). Stash the contexts in a `Dictionary` keyed by PLC name. -- `src/Mbproxy/Program.cs` — register `BcdPduPipeline` as the `IPduPipeline` singleton; remove the `NoopPduPipeline` registration. The phase 03 `NoopPduPipeline.cs` file stays (it's useful in tests as a baseline) but is no longer wired in production. -- `tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs` — update the test `Forward_FC03_HR1072_Returns_RawBCD_0x1234` (which was a phase-03 baseline) to a new test `Forward_FC03_HR1072_Returns_Decoded_1234` that asserts `1234`. The original raw-passthrough behaviour is preserved by configuring a PLC with NO BCD tags. - -## Tasks - -1. 
**`ProxyCounters.cs`** — `internal sealed class` holding `long` fields accessed via `Interlocked.Increment` / `Interlocked.Read`. Fields cover the per-PLC counter list from [`../design.md`](../design.md) → Status page → Per-PLC fields. Methods: - - `void IncrementPdusForwarded()`, `void IncrementFcCount(byte fc)`, `void AddRewrittenSlots(int n)`, `void IncrementPartialBcd()`, `void IncrementInvalidBcd()`, `void IncrementBackendException(byte code)`, `void AddBytes(long up, long down)`. - - `CounterSnapshot Snapshot()` — returns an immutable record with all the values; consumed by phase 07's status page. -2. **`PerPlcContext.cs`** — `internal sealed class` holding `string PlcName`, `BcdTagMap TagMap`, `ProxyCounters Counters`, `ILogger Logger`. Constructed once per PLC at startup; lifetime = lifetime of the listener. -3. **`BcdPduPipeline.cs`** — implements `IPduPipeline`. Behaviour per direction: - - **`RequestToBackend`**: inspect the PDU's function code byte (`pdu[0]`): - - FC06: read `(address, value)` from `pdu[1..]`. If `TagMap.TryGet(address)` and Width=16, replace value bytes with `BcdCodec.Encode16(value)`. If Width=32 and this is the LOW address, it's a single-register write to half a 32-bit tag — pass through raw + warn (the design's partial-overlap policy). If `address` is the HIGH register of a 32-bit pair, same partial-pass-through + warn. The PDU length is unchanged. - - FC16: `TryGetForRange(start, qty)`; for each hit, re-encode the relevant register-pair-or-singleton. Partial-overlap warnings emitted per offending slot. - - All other FCs: no-op. - - **`ResponseToClient`**: inspect `pdu[0]`: - - FC03 / FC04: `TryGetForRange(echoedStart, byteCount/2)`. The start address isn't in the response (Modbus FC03 response = `[fc, byteCount, ...data]`), so the rewriter needs the matching request — see Task 4. - - All other FCs: no-op. 
- - Exceptions from `BcdCodec.Decode*` are caught and turned into `mbproxy.rewrite.invalid_bcd` warnings; the byte is passed through unchanged. -4. **Request → response correlation.** The rewriter on a response needs the original request's start-address and quantity. Since the proxy is 1:1 per-client (no multiplexing), `PlcConnectionPair` keeps the last-issued request's `(fc, address, quantity)` in a per-pair slot. When the response arrives, the rewriter is invoked with that slot's contents as part of `PerPlcContext`. (We do NOT support pipelined multi-PDU requests on one socket in this phase; if a client tries, the slot is overwritten and the second response could mis-decode. Document the limitation; phase 08 may revisit if real clients pipeline.) -5. **`RewriterLogEvents.cs`** — `[LoggerMessage]` source-generated definitions: - - `mbproxy.rewrite.partial_bcd` — Warning, params: PlcName, Address, ClientStart, ClientQty. - - `mbproxy.rewrite.invalid_bcd` — Warning, params: PlcName, Address, RawValue, Direction. - - `mbproxy.exception.passthrough` — Information, params: PlcName, Fc, ExceptionCode. (Moved here from a phase-03 TODO.) - -## Public surface declared in this phase - -```csharp -namespace Mbproxy.Proxy; - -internal sealed class BcdPduPipeline : IPduPipeline { /* full impl */ } -internal sealed class PerPlcContext { public string PlcName; public BcdTagMap TagMap; public ProxyCounters Counters; public ILogger Logger; } -internal sealed class ProxyCounters { - public void IncrementPdusForwarded(); - public void IncrementFcCount(byte fc); - public void AddRewrittenSlots(int n); - public void IncrementPartialBcd(); - public void IncrementInvalidBcd(); - public void IncrementBackendException(byte code); - public void AddBytes(long up, long down); - public CounterSnapshot Snapshot(); -} -public sealed record CounterSnapshot(/* mirrors design.md per-PLC status fields */); -``` - -Nothing else becomes public. 
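The FC06 branch of Task 3 can be sketched as below. This is a minimal illustration under stated assumptions, not the shipped rewriter: `Encode16` is a local stand-in for `BcdCodec.Encode16`, while `RewriteFc06` and the `isBcd16` lookup delegate are hypothetical names that replace the `BcdTagMap` lookup, counters, and logging. The 32-bit partial-overlap case is left out.

```csharp
using System;
using System.Buffers.Binary;

// Stand-in for BcdCodec.Encode16: packs a decimal value 0..9999 into four BCD nibbles.
static ushort Encode16(int value)
{
    if (value < 0 || value > 9999)
        throw new ArgumentOutOfRangeException(nameof(value));
    int raw = 0;
    for (int shift = 0; shift < 16; shift += 4)
    {
        raw |= (value % 10) << shift; // least-significant decimal digit goes in the low nibble
        value /= 10;
    }
    return (ushort)raw;
}

// Hypothetical helper: rewrites an FC06 request PDU in place.
// PDU layout: [fc, addrHi, addrLo, valueHi, valueLo], big-endian fields, 5 bytes total.
// Returns true only when the value bytes were re-encoded; the length never changes.
static bool RewriteFc06(Span<byte> pdu, Func<ushort, bool> isBcd16)
{
    if (pdu.Length != 5 || pdu[0] != 0x06) return false;
    ushort address = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(1, 2));
    if (!isBcd16(address)) return false;   // not a configured 16-bit BCD tag: pass through
    ushort value = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(3, 2));
    if (value > 9999) return false;        // out of BCD range: pass through raw, caller warns
    BinaryPrimitives.WriteUInt16BigEndian(pdu.Slice(3, 2), Encode16(value));
    return true;
}

// Usage: write decimal 9876 (0x2694 on the wire) to HR200, configured as 16-bit BCD.
byte[] request = { 0x06, 0x00, 0xC8, 0x26, 0x94 };
bool rewritten = RewriteFc06(request, addr => addr == 200);
Console.WriteLine($"{rewritten} {Convert.ToHexString(request)}");
```

Because the value field is rewritten in place within the same two bytes, the PDU length and therefore the MBAP `length` field stay untouched, which is what the transparency contract relies on.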
- -## Tests required - -### Unit (`Category = Unit`) - -`BcdPduPipelineTests` (≥ 20 tests). Each test builds a synthetic PDU byte array + a `PerPlcContext` with a hand-rolled `BcdTagMap`, calls `pipeline.Process`, and asserts the resulting bytes. - -Coverage matrix: - -| FC | Tag scenario | Expected | Counter delta | -|----|--------------|----------|---------------| -| 03 response | single 16-bit BCD at the read address | bytes replaced with binary-encoded value | `RewrittenSlots += 1` | -| 03 response | full 32-bit BCD pair within read range | both register-bytes replaced with binary-encoded 32-bit value | `RewrittenSlots += 2` | -| 03 response | partial 32-bit (low only, qty=1 at low addr) | bytes unchanged | `PartialBcd += 1` | -| 03 response | partial 32-bit (high only, qty=1 at high addr) | bytes unchanged | `PartialBcd += 1` | -| 03 response | mixed: 16-bit + non-BCD in same read | only the 16-bit slot rewritten | `RewrittenSlots += 1` | -| 03 response | bad nibble (0x12A4) at a 16-bit BCD slot | bytes unchanged | `InvalidBcd += 1` | -| 04 response | 16-bit BCD at the read address | same as FC03 | `RewrittenSlots += 1` | -| 06 request | write to 16-bit BCD address | binary integer in payload → BCD nibbles | `RewrittenSlots += 1` | -| 06 request | write to the LOW addr of a 32-bit pair (qty=1) | bytes unchanged (partial) | `PartialBcd += 1` | -| 06 request | write to the HIGH addr of a 32-bit pair | bytes unchanged (partial) | `PartialBcd += 1` | -| 06 request | write value outside `[0,9999]` for 16-bit | `mbproxy.rewrite.invalid_bcd` Warning; bytes unchanged | `InvalidBcd += 1` | -| 16 request | write multi covering one 16-bit BCD + 3 non-BCD | only the 16-bit slot re-encoded | `RewrittenSlots += 1` | -| 16 request | write multi covering one full 32-bit pair | both registers re-encoded as the CDAB pair | `RewrittenSlots += 2` | -| 16 request | write multi crossing into one half of a 32-bit pair | partial slot passed through; warn | `PartialBcd += 1` | -| 01 / 
02 / 05 / 15 | any | no-op | none | -| 03 exception response | exception 02 returned by PLC | bytes unchanged, no rewriting attempted | `BackendExceptions[2] += 1`, `mbproxy.exception.passthrough` logged | - -Additional: -- Counter snapshot reflects increments exactly (no off-by-one). -- Empty `BcdTagMap` produces zero rewrites for any FC. - -### E2E (`Category = E2E`, `[Collection(nameof(DL205SimulatorCollection))]`) - -`RewriterE2ETests` (≥ 6 tests, all against the dl205.json simulator profile): - -1. `Read_HR1072_AsBcd_ReturnsDecoded_1234` — configure the BCD tag at addr 1072 width 16; assert `1234`. -2. `Read_HR1072_AsRaw_WhenNotConfigured_Returns_0x1234` — no BCD tags configured; assert raw `4660`. (Verifies the pipeline is opt-in per tag.) -3. `Write_HR200_AsBcd_StoresEncoded_0x9876` — configure addr 200 width 16. Write decimal 9876 through proxy; read raw from sim, expect `0x9876` (39030). -4. `Read_HR1056_HR1057_AsBcd32_ReturnsDecoded_From_CDAB` — seed an alternate profile (or write via proxy first if the default profile's float32 markers aren't suitable BCD32 fixtures). Verify the CDAB layout end-to-end. -5. `Partial_FC03_OnHighRegisterOf_32BitPair_PassesThroughRaw_AndLogsWarning` — use the in-memory Serilog sink to verify `mbproxy.rewrite.partial_bcd` was logged. -6. `MbapTxId_StillPreserved_AfterRewriting_20Consecutive` — same as phase 03's test 5, but with BCD rewrite in the path. Proves rewriting doesn't tamper with the MBAP header. - -## Phase gate - -- [ ] Zero-warnings build. -- [ ] All phase 00–03 tests still green (with the phase-03 placeholder test renamed/repurposed as described). -- [ ] All new unit tests green (≥ 16 in BcdPduPipelineTests + counter snapshot tests). -- [ ] All new e2e tests green when simulator is available. -- [ ] PDU rewriting NEVER changes the MBAP `length` field; verify in a unit test that re-encoded PDUs are exactly the same byte length as the originals. 
-- [ ] `ProxyCounters` is allocation-free per increment on the hot path. The `Snapshot()` call may allocate (it's used only by the status page, off the hot path). -- [ ] Log event names match [`../design.md`](../design.md) → Logging table exactly (including the new `mbproxy.rewrite.invalid_bcd` event added here — update design.md in this PR to add the row). - -## Out of scope - -- Auto-recovery of failed listener binds (phase 05). -- Backend-connect retry pipeline (phase 05). -- Counter exposure via HTTP (phase 07). -- Hot-reload of the per-PLC `BcdTagMap` (phase 06). -- Pipelined / multi-PDU-in-flight on a single client socket. The proxy serialises by the design's 1:1 model; if a real client pipelines, document as a known limitation. - -## Notes for the subagent - -- The Modbus FC03/04 response does NOT carry the start address — only the byte count and the register data. You must remember the last request's `(startAddress, quantity)` per `PlcConnectionPair`. This is fine because the proxy is 1:1 and one client = one in-flight request at a time. -- For FC16 requests, the wire format is `[fc, startHi, startLo, qtyHi, qtyLo, byteCount, ...data]`. The PDU passed to the pipeline starts at `fc`. Compute slot offsets from `startAddress + (offsetInData / 2)`. -- Update [`../design.md`](../design.md) → Logging events table to add the new `mbproxy.rewrite.invalid_bcd` event. Do this in the same PR; the doc and the code stay in sync. -- The `mbproxy.exception.passthrough` event was specified in design.md but not wired in phase 03. This phase wires it. If during phase 03 it was already wired by mistake, leave it and remove the TODO comment. 
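The FC16 offset arithmetic from the notes above can be sketched as follows. This is an illustration only: `Fc16Slots` is a hypothetical helper, not part of the declared surface, and it only maps each written register to its byte offset inside the PDU. Matching those addresses against the `BcdTagMap` and re-encoding is left out.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical helper: lists (register address, byte offset within the PDU) for every
// register carried by an FC16 request.
// PDU layout: [fc, startHi, startLo, qtyHi, qtyLo, byteCount, ...data].
static List<(ushort Address, int PduOffset)> Fc16Slots(ReadOnlySpan<byte> pdu)
{
    var slots = new List<(ushort, int)>();
    if (pdu.Length < 6 || pdu[0] != 0x10) return slots;
    ushort start = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(1, 2));
    ushort qty = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(3, 2));
    int byteCount = pdu[5];
    // Malformed frame: report no slots so the caller rewrites nothing.
    if (byteCount != qty * 2 || pdu.Length != 6 + byteCount) return slots;
    for (int offsetInData = 0; offsetInData < byteCount; offsetInData += 2)
    {
        ushort address = (ushort)(start + offsetInData / 2); // one register per two data bytes
        slots.Add((address, 6 + offsetInData));              // data starts after the 6-byte prefix
    }
    return slots;
}

// Usage: FC16 writing three registers starting at HR201.
byte[] frame = { 0x10, 0x00, 0xC9, 0x00, 0x03, 0x06, 0x00, 0x01, 0x00, 0x02, 0x00, 0x03 };
foreach ((ushort reg, int at) in Fc16Slots(frame))
    Console.WriteLine($"HR{reg} at PDU offset {at}");
```

Returning an empty list for a malformed frame keeps the sketch in line with the rule that the pipeline never throws; a production version would likely avoid the list allocation on the hot path.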
diff --git a/mbproxy/docs/plan/05-listener-supervisor.md b/mbproxy/docs/plan/05-listener-supervisor.md deleted file mode 100644 index 85d57e5..0000000 --- a/mbproxy/docs/plan/05-listener-supervisor.md +++ /dev/null @@ -1,125 +0,0 @@ -# Phase 05 — Listener supervisor + auto-recovery - -Wrap each `PlcListener` in a Polly-backed supervisor task. Failed binds (at startup or runtime) are retried per the design's recovery profile. Backend-connect Polly retries that were deferred from phase 03 land here too. - -**Depends on:** Phase 03 (PlcListener, PlcConnectionPair). -**Parallel-safe with:** nothing (changes ProxyWorker, listener lifecycle, and connection-pair connect path simultaneously). - -## Goal - -Eliminate "startup race lost a port, service degraded for hours" as a real failure mode. After this phase, a port temporarily in use at boot will bind once it frees; a backend connect transient failure retries within a tight budget instead of immediately dropping the upstream client. - -State per listener: `bound` / `recovering` / `stopped`. Reported on the status page (phase 07) via counters and a state field. - -## Outputs - -``` -src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs # owns one PlcListener; retry pipeline -src/Mbproxy/Proxy/Supervision/SupervisorState.cs # enum + state-snapshot record -src/Mbproxy/Proxy/Supervision/PolicyFactory.cs # builds Polly ResiliencePipelines from ResilienceOptions - -tests/Mbproxy.Tests/Proxy/Supervision/SupervisorTests.cs # port-conflict recovery, runtime-fault recovery -tests/Mbproxy.Tests/Proxy/Supervision/BackendConnectRetryTests.cs # Polly retry on backend connect -tests/Mbproxy.Tests/Proxy/Supervision/PolicyFactoryTests.cs # unit -``` - -Modifications: -- `src/Mbproxy/Proxy/ProxyWorker.cs` — owns a `Dictionary` instead of raw `PlcListener` instances. Stop/start of an individual listener now flows through the supervisor. 
-- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — backend connect now goes through a Polly pipeline built from `ResilienceOptions.BackendConnect`. Remove the `// Phase 05: wrap in Polly pipeline` TODO from phase 03. -- `src/Mbproxy/Proxy/ProxyCounters.cs` — add `RecoveryAttempts` counter and `LastBindError` (last failure message, up to 256 chars). Update `CounterSnapshot` to include them. -- `src/Mbproxy/Proxy/RewriterLogEvents.cs` (or a sibling `SupervisorLogEvents.cs`) — add `[LoggerMessage]` definitions for `mbproxy.listener.recovered` (Info, `Plc`, `Port`, `AttemptCount`) and `mbproxy.backend.failed` (Warning, `Plc`, `Reason`). The latter event name already exists in design.md. - -## Tasks - -1. **`PolicyFactory.cs`** — converts `ResilienceOptions.BackendConnect` and `ResilienceOptions.ListenerRecovery` into `Polly.ResiliencePipeline` instances. Pipelines use `RetryStrategyOptions` with `DelayGenerator` reading from the configured `BackoffMs` arrays. Listener recovery uses a 5-step initial backoff then steady-state at `SteadyStateMs` indefinitely (model as a custom delay generator that returns the steady-state value once the attempt index exceeds the initial array length). -2. **`SupervisorState.cs`** — `enum SupervisorState { Bound, Recovering, Stopped }` and a `record SupervisorSnapshot(SupervisorState State, string? LastBindError, int RecoveryAttempts)`. -3. **`PlcListenerSupervisor.cs`** — - - Constructor: takes a `PlcOptions`, a `PerPlcContext`, the recovery `ResiliencePipeline`, and an `IPduPipeline`. Internally instantiates `PlcListener` lazily inside the retry loop. - - `StartAsync(CancellationToken)`: launches a supervisor task. Inside the task: call `_listener.StartAsync()`. On success, transition to `Bound`, log `mbproxy.startup.bind` (first attempt) or `mbproxy.listener.recovered` (subsequent), and `await _listener.RunAsync(ct)` — which returns when the listener's accept loop ends.
- - On exception or normal-but-faulted return from the listener: transition to `Recovering`, log `mbproxy.startup.bind.failed`, increment `RecoveryAttempts`, dispose the failed listener, await Polly's next delay, retry. - - `StopAsync`: transition to `Stopped`, cancel the supervisor token, await the supervisor task. - - `Snapshot()`: returns `SupervisorSnapshot` for the status page. -4. **`PlcConnectionPair.cs` backend-connect retry** — wrap `Socket.ConnectAsync(host, port, ct)` in a `ResiliencePipeline.ExecuteAsync` built from `ResilienceOptions.BackendConnect`. After all attempts exhausted, close the upstream socket (as before) and log `mbproxy.backend.failed`. Crucial: backend-connect retries happen ONCE per upstream client connection (not per request); a connect failure terminates the pair. -5. **`ProxyWorker.cs`** — change to owning supervisors instead of raw listeners. Startup creates one supervisor per `PlcOptions`, starts them all in parallel (`await Task.WhenAll(...)` of their start tasks). The "ready" log event now fires after every supervisor has either reached `Bound` or entered `Recovering`. Shutdown stops all supervisors in parallel; clamp the total shutdown time at 5 s. - -## Public surface declared in this phase - -```csharp -namespace Mbproxy.Proxy.Supervision; - -internal sealed class PlcListenerSupervisor : IAsyncDisposable { - public string PlcName { get; } - public Task StartAsync(CancellationToken ct); - public Task StopAsync(CancellationToken ct); - public SupervisorSnapshot Snapshot(); -} - -public sealed record SupervisorSnapshot(SupervisorState State, string? 
LastBindError, int RecoveryAttempts); -public enum SupervisorState { Bound, Recovering, Stopped } - -internal static class PolicyFactory { - public static ResiliencePipeline BuildBackendConnect(RetryProfile profile, ILogger logger); - public static ResiliencePipeline BuildListenerRecovery(RecoveryProfile profile, ILogger logger); -} -``` - -`SupervisorSnapshot` is `public` because phase 07 (status page) consumes it. Everything else stays `internal`. - -## Tests required - -### Unit (`Category = Unit`) - -`PolicyFactoryTests` (≥ 4 tests): - -1. `BuildBackendConnect_ProducesPipeline_With3Attempts_Default` -2. `BuildBackendConnect_Backoff_MatchesConfig` — fake `TimeProvider`, assert delay sequence. -3. `BuildListenerRecovery_InitialBackoffFollowedBySteadyState` — drive 10 attempts, assert delays match. -4. `BuildBackendConnect_NoRetry_OnNonTransientException` — `SocketException` with WSAECONNREFUSED is retried; `ArgumentException` is not. - -### Integration (`Category = Unit`; uses real sockets but no simulator) - -`SupervisorTests` (≥ 5 tests): - -1. `Supervisor_StartsListener_AndTransitionsToBound` -2. `Supervisor_StartFails_WhenPortInUse_TransitionsToRecovering` — bind a `TcpListener` on a free port first, then start the supervisor on the same port; assert `State == Recovering` and `LastBindError` is populated within 100 ms. -3. `Supervisor_Recovers_WhenPortFrees` — same setup as test 2, then dispose the blocking listener; assert the supervisor transitions to `Bound` and emits `mbproxy.listener.recovered` within `InitialBackoffMs[0] + 500ms`. Use an in-memory Serilog sink to verify the log event. -4. `Supervisor_RuntimeFault_TriggersRecovery` — replace the listener implementation with a faulting fake (or use reflection to force `_listener` to be one) and assert recovery kicks in. -5. 
`Supervisor_Stop_CleanlyTransitionsTo_Stopped_AndCancelsRetry` — supervisor in `Recovering` state, call `StopAsync`, assert it returns within 1 s without waiting out the next backoff window. - -`BackendConnectRetryTests` (≥ 3 tests): - -1. `BackendConnect_RetriesPerPipeline_OnConnectionRefused` — point a `PlcConnectionPair` at `127.0.0.1:1`, assert it sees exactly 3 connect attempts with the configured delays. -2. `BackendConnect_Succeeds_OnSecondAttempt_WhenBackendBecomesReachable` — start the pair against a closed port, open a listener on that port mid-backoff, assert connect succeeds and the pair runs. -3. `BackendConnect_AllAttemptsFail_ClosesUpstream` — pair gets a fresh upstream socket, never reaches a backend, the upstream socket is closed within `BackoffMs.Sum() + tolerance`. - -### E2E (`Category = E2E`) - -`SupervisorE2ETests` (≥ 2 tests, against the simulator): - -1. `E2E_Recovery_When_BlockingListenerReleasesPort` — same shape as the unit recovery test, but with the simulator on the backend; confirms the supervisor doesn't disrupt the simulator-facing path during recovery. -2. `E2E_RecoveryAttempts_CounterIncrements_Visible_OnSnapshot` — drives the supervisor into recovery and back, then asserts `counters.RecoveryAttempts > 0`. Phase 07 will surface this on the HTTP endpoint; here we just verify the counter snapshot. - -## Phase gate - -- [ ] Zero-warnings build. -- [ ] All phase 00–04 tests still green. -- [ ] All new unit + integration tests green. -- [ ] E2E recovery test green when simulator is available. -- [ ] `mbproxy.listener.recovered` event log includes `AttemptCount` field. -- [ ] No deadlocks under StopAsync while supervisor is mid-backoff (verify by the test above). -- [ ] Backend-connect failures from phase 03 are now wrapped in Polly; the TODO comment from phase 03 is gone. -- [ ] [`../design.md`](../design.md) → "Listener auto-recovery" matches implementation. 
If during implementation the backoff arrays needed tweaking, update design.md in this PR. - -## Out of scope - -- Hot-reload-driven add/remove of supervisors (phase 06 owns reconcile). -- HTTP exposure of supervisor state (phase 07). -- Restart-from-crash diagnostics, Windows EventLog integration (phase 08). -- Adaptive backoff (e.g., jitter, exponential beyond the configured array). Stick to the configured schedule. - -## Notes for the subagent - -- Polly v8 (`Polly.Core`) is the target — `ResiliencePipeline` and `RetryStrategyOptions`, not the v7 `Policy.Handle<>()` fluent API. If the package version pinned in phase 00 turns out to be v7, bump it in this phase and note the bump in the csproj comment. -- The supervisor task uses one `CancellationTokenSource` per supervisor instance. Cancelling it must cancel both the Polly delay AND the inner `_listener.RunAsync` cleanly. Polly's `ResiliencePipeline.ExecuteAsync(ct)` honours the token; double-check the listener does too. -- Do not introduce a generic "task supervisor" abstraction. `PlcListenerSupervisor` is the only thing supervising in this codebase; YAGNI on the framework. -- The supervisor must NOT swallow exceptions from `_listener.RunAsync` other than `OperationCanceledException`. Log them at Warning with the exception, then enter the recovery loop. Operators reading logs need to see WHY a listener died, not just that it was restarted. diff --git a/mbproxy/docs/plan/06-hot-reload.md b/mbproxy/docs/plan/06-hot-reload.md deleted file mode 100644 index 2132b01..0000000 --- a/mbproxy/docs/plan/06-hot-reload.md +++ /dev/null @@ -1,158 +0,0 @@ -# Phase 06 — Configuration hot-reload - -Subscribe to `IOptionsMonitor.OnChange` and reconcile the running supervisors + per-PLC tag maps + connection settings against the new config — without restarting the host. - -**Depends on:** Phase 05 (supervisor lifecycle). 
-**Parallel-safe with:** nothing (touches the widest cross-cut: supervisors + tag maps + counters + DI options). - -## Goal - -An `appsettings.json` save propagates per the design's reconcile table: - -| Change | Action | -|--------|--------| -| `BcdTags.Global` add/remove/width | Rebuild every PLC's `BcdTagMap`, swap atomically. Next PDU sees it. | -| `Plcs[i].BcdTags.{Add,Remove}` | Rebuild that PLC's `BcdTagMap` only. | -| New `Plcs[i]` | Create supervisor + context, start it. | -| Removed `Plcs[i]` | Stop supervisor, close all client connections to it. | -| Changed `ListenPort` / `Host` | Stop + start the supervisor (remove + add semantics). | -| `Connection.Backend*TimeoutMs` | Take effect on the next backend connect / request. | -| Invalid reload | Reject as a whole; keep current state; log `mbproxy.config.reload.rejected`. | - -Validation runs FIRST. A reload that would produce duplicate `ListenPort` values, or a `BcdTagMapBuilder.Build` error for any PLC, is rejected atomically before any state mutates. - -## Outputs - -``` -src/Mbproxy/Configuration/ConfigReconciler.cs # OnChange handler; orchestrates the apply -src/Mbproxy/Configuration/ReloadValidator.cs # cross-PLC validation (duplicate ports, etc.) -src/Mbproxy/Configuration/ReloadPlan.cs # immutable diff record between current and new - -tests/Mbproxy.Tests/Configuration/ReloadValidatorTests.cs -tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs -tests/Mbproxy.Tests/Configuration/HotReloadE2ETests.cs # real appsettings.json mutation, real host -``` - -Modifications: -- `src/Mbproxy/Proxy/ProxyWorker.cs` — accept a `ConfigReconciler` and forward `IOptionsMonitor.OnChange` to it; on startup, also seed the reconciler with the initial snapshot. -- `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs` — expose a `Task ReplaceContextAsync(PerPlcContext newCtx, CancellationToken ct)` that atomically swaps the BCD tag map and counters without restarting the listener.
Old in-flight connections finish on the old map; new connections use the new map. (Document the brief transition window in comments.) -- Add `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` `[LoggerMessage]` events. -- `src/Mbproxy/Options/MbproxyOptions.cs` — wire `IValidateOptions` to call the schema-level validator only. Cross-PLC validation (duplicate ports, etc.) is handled by `ReloadValidator` because it requires inspecting multiple `Plcs[i]` together, which `IValidateOptions` doesn't naturally express. - -## Tasks - -1. **`ReloadPlan.cs`** — immutable record describing the diff: - ```csharp - public sealed record ReloadPlan( - IReadOnlyList ToAdd, - IReadOnlyList ToRemove, // PLC names - IReadOnlyList<(string Name, PlcOptions New)> ToRestart, // port or host changed - IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat, // tag map changed - ConnectionOptions Connection); - ``` - Computed by a pure function `ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next)`; PLC identity is keyed on `Name` (NOT on `ListenPort`, which is mutable). -2. **`ReloadValidator.cs`** — single static method `Validate(MbproxyOptions next, out IReadOnlyList errors)`: - - PLC names are unique and non-empty. - - `ListenPort` values are unique. - - For each PLC, `BcdTagMapBuilder.Build(global, perPlc).Errors` is empty. - - `AdminPort` doesn't collide with any `Plcs[i].ListenPort`. - - All ports are in `[1, 65535]`. -3. **`ConfigReconciler.cs`** — subscribes via constructor-injected `IOptionsMonitor.OnChange`. On change: - - Snapshot the new options. - - Run `ReloadValidator.Validate`. On failure: log `mbproxy.config.reload.rejected` with the error list; do nothing else. - - Compute `ReloadPlan` against the current snapshot. - - Apply the plan in order: - 1. Stop supervisors in `ToRemove` (concurrently). - 2. Stop+restart supervisors in `ToRestart` (concurrently). - 3. 
Build new `PerPlcContext` for each `ToReseat` entry and call `supervisor.ReplaceContextAsync(newCtx)`. - 4. Build supervisors for `ToAdd`, start them. - - On success: log `mbproxy.config.reload.applied` with summary (`PlcsAdded`, `PlcsRemoved`, `PlcsReseated`, `TagListDelta`). Record `lastReloadUtc` and bump `reloadCount` on a service-wide counter (consumed by phase 07). - - On any step throwing: best-effort log the partial-apply state at Error, then continue. The host stays up. (The validator should have caught most failure modes; a runtime failure here is a true bug.) -4. **`ProxyWorker.cs`** updates — register the reconciler with the host and wire startup to use it for the initial snapshot. - -## Public surface declared in this phase - -```csharp -namespace Mbproxy.Configuration; - -internal sealed class ConfigReconciler : IDisposable { - public ConfigReconciler(IOptionsMonitor monitor, /* dependencies */); - public Task ApplyAsync(MbproxyOptions next, CancellationToken ct); // exposed for tests - public void Dispose(); -} - -public sealed record ReloadPlan( - IReadOnlyList ToAdd, - IReadOnlyList ToRemove, - IReadOnlyList<(string Name, PlcOptions New)> ToRestart, - IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat, - ConnectionOptions Connection) { - public static ReloadPlan Compute(MbproxyOptions current, MbproxyOptions next); -} - -internal static class ReloadValidator { - public static bool Validate(MbproxyOptions next, out IReadOnlyList errors); -} -``` - -## Tests required - -### Unit (`Category = Unit`) - -`ReloadValidatorTests` (≥ 6 tests): - -1. `Validate_DuplicatePlcName_Fails` -2. `Validate_DuplicateListenPort_Fails` -3. `Validate_AdminPortCollidesWith_PlcListenPort_Fails` -4. `Validate_PerPlc_BcdMapBuildError_Fails` -5. `Validate_PortOutOfRange_Fails` -6. `Validate_HappyPath_Passes` - -`ReloadPlanTests` (≥ 5 tests): - -1. `Compute_AddOnePlc_OnlyToAddPopulated` -2. `Compute_RemoveOnePlc_OnlyToRemovePopulated` -3. 
`Compute_ChangePort_GoesToToRestart_NotToReseat` -4. `Compute_ChangePerPlcTagOverride_GoesToToReseat` -5. `Compute_ChangeGlobalTagList_AllPlcsReseat_NoRestart` - -`ConfigReconcilerTests` (≥ 4 tests, using a fake `IOptionsMonitor` + fake supervisor factory): - -1. `Apply_HappyPath_StartsAndStopsSupervisors_PerPlan` -2. `Apply_ValidationFails_NoMutationOccurs_AndLogsRejected` -3. `Apply_ReseatTagMap_DoesNotRestartSupervisor` -4. `Apply_ConcurrentReloads_Are_Serialised` — two rapid changes get processed in order, no interleaving. - -### E2E (`Category = E2E`) - -`HotReloadE2ETests` (≥ 4 tests, using a real `Host.CreateApplicationBuilder` + temp appsettings.json file): - -1. `E2E_AddPlcAtRuntime_NewListenerBinds_AndIsReachable` — start the host with one PLC, write a new appsettings adding a second PLC pointing at the simulator on a fresh listen port, drive NModbus against the new proxy port within 2 s. -2. `E2E_RemovePlcAtRuntime_ClosesUpstreamConnections` — start with two PLCs and a connected client, write appsettings removing one; client's socket closes within 1 s. -3. `E2E_ChangeGlobalBcdTagList_RewriteReflectsImmediately` — start with addr 1072 NOT in BCD list, read raw 0x1234. Write appsettings adding it. Read again, get decoded 1234. -4. `E2E_InvalidReload_DoesNotMutateRunningState` — start happy, write a broken appsettings (duplicate ListenPort), assert the host keeps running with the OLD config and `mbproxy.config.reload.rejected` is logged. - -## Phase gate - -- [ ] Zero-warnings build. -- [ ] All phase 00–05 tests still green. -- [ ] All new unit tests green. -- [ ] All e2e hot-reload tests green when the simulator is available. -- [ ] `mbproxy.config.reload.applied` / `.rejected` events match the design's properties list. -- [ ] A misconfigured reload (duplicate ports) is rejected atomically — the assertion in test E2E_4 verifies no partial mutation. 
-- [ ] The reconciler serializes concurrent `OnChange` notifications (`SemaphoreSlim` or equivalent) so two file saves in quick succession don't race. -- [ ] Counters `service.config.reloadCount` and `service.config.reloadRejectedCount` are bumped correctly. - -## Out of scope - -- Watching for files OTHER than `appsettings.json` (env files, dotnet user-secrets, etc.). The default config source set established in phase 00 is the contract. -- Reloading Serilog log levels at runtime. Possible but not in this phase. -- A reload audit log file. The accept/reject events are sufficient. -- Online schema migrations (e.g., renaming a key in an older config to a new one). Reject-the-whole-thing is the simpler contract. - -## Notes for the subagent - -- `IOptionsMonitor.OnChange` can fire MULTIPLE times for a single file save on some platforms (text editors saving via rename-and-replace can trigger 2-3 events). Debounce inside the reconciler — a 250 ms quiescent window after the last `OnChange` before computing the plan. Document the choice in code. -- The reconciler must NOT block the `OnChange` callback thread for I/O (`StopAsync` etc.). Use `Channel` or a `Task.Run`-style hand-off so the callback returns immediately. -- When a supervisor restart is in progress (e.g., port changed), reject further reloads briefly with a queued "retry after current applies" — OR just serialise everything via a single semaphore and accept that a backed-up reload queue gets all changes eventually. Pick the simpler option (semaphore); document it. -- `BcdTagMapBuilder.Build` is the validator for tag-list well-formedness; do not duplicate that validation in `ReloadValidator`. The validator just calls `Build` and checks the `Errors` list. 
diff --git a/mbproxy/docs/plan/07-status-page.md b/mbproxy/docs/plan/07-status-page.md deleted file mode 100644 index 9f545bc..0000000 --- a/mbproxy/docs/plan/07-status-page.md +++ /dev/null @@ -1,147 +0,0 @@ -# Phase 07 — Status page - -Stand up the read-only Kestrel-hosted admin endpoint on `Mbproxy.AdminPort`. Two routes — `GET /` (self-contained HTML, meta-refresh 5 s) and `GET /status.json` (the same data as JSON). No admin actions, no auth. - -**Depends on:** Phase 05 (supervisor snapshots), Phase 06 (config reload counters). -**Parallel-safe with:** nothing (touches DI registration + needs counters from both 05 and 06). - -## Goal - -A single port that an operator can open in a browser and see, at a glance: - -- Service uptime, version, last-reload timestamp + counts. -- Every configured PLC's listener state (`bound` / `recovering` / `stopped`), last bind error, currently connected clients and their per-client PDU counts, PDU counts by function code, BCD slots rewritten, partial-overlap warnings, backend exception counts by code, last round-trip ms, bytes upstream/downstream. - -Same data is exposed as `/status.json` for scraping (Prometheus textfile, custom Nagios check, etc.). - -## Outputs - -``` -src/Mbproxy/Admin/AdminEndpointHost.cs # owns the Kestrel server lifecycle -src/Mbproxy/Admin/StatusSnapshotBuilder.cs # composes per-PLC + service-wide snapshots -src/Mbproxy/Admin/StatusDto.cs # the wire DTOs for /status.json -src/Mbproxy/Admin/StatusHtmlRenderer.cs # builds the single-page HTML -src/Mbproxy/Admin/AssemblyVersionAccessor.cs # cached version string - -tests/Mbproxy.Tests/Admin/StatusSnapshotBuilderTests.cs -tests/Mbproxy.Tests/Admin/AdminEndpointTests.cs # HTTP-level; live Kestrel + HttpClient -``` - -Modifications: -- `src/Mbproxy/Mbproxy.csproj` — add `Microsoft.AspNetCore.App` framework reference (the Worker SDK doesn't include ASP.NET Core by default). 
-- `src/Mbproxy/Program.cs` — register `AdminEndpointHost` as a hosted service; wire it through DI alongside the proxy worker. AdminPort comes from `IOptionsMonitor`. -- `src/Mbproxy/Proxy/ProxyCounters.cs` — extend with per-client counters: `IReadOnlyList Snapshot()` includes connected clients with `Remote`, `ConnectedAtUtc`, `PdusForwarded`, `LastRoundTripMs`. -- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — record connect time, expose `RemoteEndpoint`, track round-trip time per request (EWMA via `LastRoundTripMs` field). -- Service-wide counters introduced here: `ServiceCounters` with `UptimeStartedAtUtc`, `LastReloadUtc`, `ReloadCount`, `ReloadRejectedCount`. Wired into `ConfigReconciler` (bump on apply / reject) and the service start path (set started-at). - -## Tasks - -1. **`StatusDto.cs`** — record types matching the design's per-PLC + service-wide field tables verbatim. Use `System.Text.Json` source generation (`JsonSerializerContext`) to keep the response allocation-light: - ```csharp - [JsonSerializable(typeof(StatusResponse))] - internal partial class StatusJsonContext : JsonSerializerContext; - ``` -2. **`StatusSnapshotBuilder.cs`** — pulls from injected `ProxyWorker` (or a slim view of it), `ConfigReconciler`, `ServiceCounters`, and each `PlcListenerSupervisor`. Builds a `StatusResponse` record. Pure logic; no I/O. The builder is `sealed` and constructed once via DI; calling `Build()` is the only operation. -3. **`StatusHtmlRenderer.cs`** — pure function `string Render(StatusResponse status)`. Produces a single HTML document with: - - `` for auto-refresh. - - A header line with service version + uptime + last-reload info. - - A table per PLC. Columns match the per-PLC field set; `listener.state` is colour-coded inline (CSS in a `