mbproxy: initial commit through Phase 9 (TxId multiplexing)

Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-14 01:49:35 -04:00
parent 2e937228a0
commit 56eee3c563
105 changed files with 18430 additions and 0 deletions
+179
@@ -0,0 +1,179 @@
# Phase 00 — Bootstrap
Scaffold the .NET 10 Worker Service project and the test project. Wire up Generic Host, Serilog, Windows-Service registration, and `MbproxyOptions` POCOs bound via `IOptionsMonitor`. No proxy logic yet — the service starts, logs "ready", and stops cleanly.
**Depends on:** nothing. Must run alone.
**Parallel-safe with:** nothing. Phase 00 owns the initial `.csproj` and solution; subsequent phases append.
## Goal
Produce a minimal but production-shaped host that all subsequent phases plug into. The host must:
- Target `.NET 10` (`net10.0`), be registered as a Windows Service via `Microsoft.Extensions.Hosting.WindowsServices`, and also run as a console under `dotnet run` for local dev.
- Load `appsettings.json` with `reloadOnChange: true`, bind the `"Mbproxy"` section to typed POCOs, and expose them via `IOptionsMonitor<MbproxyOptions>`.
- Use Serilog with console + rolling-file sinks under `%ProgramData%\mbproxy\logs\` (configurable, but default that location).
- Set `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` and `<Nullable>enable</Nullable>` in the csproj. These stay set forever.
## Outputs (files created in this phase)
```
Mbproxy.slnx
src/Mbproxy/Mbproxy.csproj
src/Mbproxy/Program.cs
src/Mbproxy/HostingExtensions.cs # AddMbproxyOptions, AddMbproxySerilog
src/Mbproxy/Options/MbproxyOptions.cs
src/Mbproxy/Options/BcdTagOptions.cs
src/Mbproxy/Options/PlcOptions.cs
src/Mbproxy/Options/ConnectionOptions.cs
src/Mbproxy/Options/ResilienceOptions.cs
src/Mbproxy/Options/BcdTagListOptions.cs # the Global + per-PLC Add/Remove DTOs
src/Mbproxy/Workers/HeartbeatWorker.cs # one-line "service alive" worker; deleted by phase 03
src/Mbproxy/appsettings.json # minimal default with empty Plcs array
tests/Mbproxy.Tests/Mbproxy.Tests.csproj
tests/Mbproxy.Tests/HostSmokeTests.cs
tests/Mbproxy.Tests/Options/MbproxyOptionsBindingTests.cs
.gitignore # add bin/, obj/, .vs/, *.user, tests/sim/.venv/, %ProgramData%\mbproxy\
```
No other files. Phase 00 does NOT create:
- BCD codec types (phase 02)
- Proxy types (phase 03)
- Listener supervisor (phase 05)
- Status page (phase 07)
## Tasks
1. **Create `Mbproxy.slnx`** referencing the two csprojs.
2. **`src/Mbproxy/Mbproxy.csproj`** — `<Project Sdk="Microsoft.NET.Sdk.Worker">`, `TargetFramework=net10.0`, `OutputType=Exe`, `Nullable=enable`, `TreatWarningsAsErrors=true`, `ImplicitUsings=enable`. PackageReferences:
- `Microsoft.Extensions.Hosting` (latest stable for .NET 10)
- `Microsoft.Extensions.Hosting.WindowsServices`
- `Serilog.Extensions.Hosting`
- `Serilog.Settings.Configuration`
- `Serilog.Sinks.Console`
- `Serilog.Sinks.File`
- `Polly` (referenced now so phase 04/05 don't have to touch this csproj for the package; usage is deferred)
3. **`Options/MbproxyOptions.cs`** and siblings — typed POCOs that mirror the appsettings schema in [`../design.md`](../design.md) → Configuration. Keep them plain DTOs (`public sealed class` with init-only properties). Use `IValidateOptions<MbproxyOptions>` for cross-field checks at the **schema** level only (no business rules like "duplicate addresses" — those move to phase 06 along with hot-reload).
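4. **`HostingExtensions.cs`** — extension methods on `IHostApplicationBuilder` named `AddMbproxyOptions()` and `AddMbproxySerilog()`. Both read from the builder's own `Configuration`, so neither takes an `IConfiguration` argument (this matches the signatures under "Public surface"). Keep `Program.cs` thin: create the builder, call the two extensions, register `HeartbeatWorker`, run.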
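5. **`Program.cs`** — Generic Host with the Windows Service lifetime. `await Host.CreateApplicationBuilder(args)...Build().RunAsync()`. Because `Host.CreateApplicationBuilder` returns a `HostApplicationBuilder`, register the lifetime with `builder.Services.AddWindowsService()`; `.UseWindowsService()` is the equivalent on the older `IHostBuilder` API and is not available on this builder. Honour `--console` as a no-op flag for documentation symmetry with the design (the worker SDK + Windows Service lifetime combo already runs in console mode under `dotnet run`).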
6. **`Workers/HeartbeatWorker.cs`** — `BackgroundService` that logs `mbproxy.startup.ready` once after `Task.Delay(100)` (so Serilog has flushed) and then idles. This worker is deleted in phase 03 when the real listener supervisor takes over; it exists so phase 00's smoke test has something to assert.
7. **`appsettings.json`** — minimal, valid against the POCOs, with `Plcs: []`. Include the full key shape (`BcdTags.Global`, `AdminPort`, `Connection`, `Resilience`) so future phases just fill in values.
8. **`tests/Mbproxy.Tests/Mbproxy.Tests.csproj`** — Microsoft.NET.Sdk, `TargetFramework=net10.0`, same `Nullable`/`TreatWarningsAsErrors`. ProjectReference to `src/Mbproxy/Mbproxy.csproj`. PackageReferences:
- `Microsoft.NET.Test.Sdk`
- `xunit` (v3 if a stable release exists; v2 otherwise — record the decision in the csproj comment)
- `xunit.runner.visualstudio`
- `Shouldly`
9. **`HostSmokeTests.cs`** — build the host with `Host.CreateApplicationBuilder` against a synthetic config, start it on a `CancellationTokenSource` with a short deadline, assert it logged `mbproxy.startup.ready` and shut down without unhandled exceptions.
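10. **`MbproxyOptionsBindingTests.cs`** — bind a hand-written `Dictionary<string, string?>` config source into `MbproxyOptions` (the nullable value type matches what `AddInMemoryCollection` expects, which keeps the warnings-as-errors build clean), assert all fields populate correctly (including a `Plcs` entry with `BcdTags.Add` and `BcdTags.Remove`).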
## Public surface declared in this phase
```csharp
namespace Mbproxy.Options;
public sealed class MbproxyOptions {
public BcdTagListOptions BcdTags { get; init; } = new();
public IReadOnlyList<PlcOptions> Plcs { get; init; } = [];
public int AdminPort { get; init; } = 8080;
public ConnectionOptions Connection { get; init; } = new();
public ResilienceOptions Resilience { get; init; } = new();
}
public sealed class BcdTagListOptions {
public IReadOnlyList<BcdTagOptions> Global { get; init; } = [];
}
public sealed class BcdTagOptions {
public ushort Address { get; init; }
public byte Width { get; init; } // 16 or 32
}
public sealed class PlcOptions {
public string Name { get; init; } = "";
public int ListenPort { get; init; }
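public string Host { get; init; } = "";
public int Port { get; init; } = 502; // backend Modbus TCP port; phase 03's test config sets `Port: <simPort>`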
public PlcBcdOverrides? BcdTags { get; init; }
}
public sealed class PlcBcdOverrides {
public IReadOnlyList<BcdTagOptions> Add { get; init; } = [];
public IReadOnlyList<ushort> Remove { get; init; } = [];
}
public sealed class ConnectionOptions {
public int BackendConnectTimeoutMs { get; init; } = 3000;
public int BackendRequestTimeoutMs { get; init; } = 3000;
}
public sealed class ResilienceOptions {
public RetryProfile BackendConnect { get; init; } = new() { MaxAttempts = 3, BackoffMs = [100, 500, 2000] };
public RecoveryProfile ListenerRecovery { get; init; } = new() {
InitialBackoffMs = [1000, 2000, 5000, 15000, 30000],
SteadyStateMs = 30000,
};
}
public sealed class RetryProfile {
public int MaxAttempts { get; init; }
public IReadOnlyList<int> BackoffMs { get; init; } = [];
}
public sealed class RecoveryProfile {
public IReadOnlyList<int> InitialBackoffMs { get; init; } = [];
public int SteadyStateMs { get; init; }
}
```
```csharp
namespace Mbproxy;
internal static class HostingExtensions {
public static IHostApplicationBuilder AddMbproxyOptions(this IHostApplicationBuilder b);
public static IHostApplicationBuilder AddMbproxySerilog(this IHostApplicationBuilder b);
}
```
```csharp
namespace Mbproxy.Workers;
internal sealed class HeartbeatWorker : BackgroundService { /* logs mbproxy.startup.ready */ }
```
No other public types in this phase.
## Tests required
### Unit (`Category = Unit`, default)
1. `MbproxyOptionsBinding_BindsGlobalBcdTags_From_appsettings`
2. `MbproxyOptionsBinding_BindsPerPlcAddAndRemove`
3. `MbproxyOptionsBinding_DefaultsAreApplied_WhenSectionMissing` (AdminPort=8080, Resilience defaults)
4. `MbproxyOptionsBinding_RejectsInvalidWidth` — IValidateOptions returns Fail for `Width != 16 && Width != 32`. Schema-level only; address-overlap validation is phase 06.
5. `HostSmoke_StartsAndStops_Cleanly_AndLogs_StartupReady` — uses a Serilog sink that captures events to memory; asserts the `mbproxy.startup.ready` event fired at Information.
6. `HostSmoke_ShutdownIsOrdered` — host responds to `StopAsync` within 2 s.
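The schema-level `Width` rule from test 4 can be sketched as a pure check. This is a minimal sketch, not the phase's implementation: `WidthSchemaCheck` and `FindInvalidWidths` are hypothetical names, and `BcdTagOptions` is redeclared locally only so the block stands alone. An `IValidateOptions<MbproxyOptions>` implementation could pass the returned messages to `ValidateOptionsResult.Fail` when the list is non-empty.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in for the BcdTagOptions POCO declared in this phase.
public sealed class BcdTagOptions
{
    public ushort Address { get; init; }
    public byte Width { get; init; }
}

// Hypothetical helper: schema-level check only, no address-overlap rules.
public static class WidthSchemaCheck
{
    // Returns one message per entry whose Width is neither 16 nor 32.
    public static IReadOnlyList<string> FindInvalidWidths(IEnumerable<BcdTagOptions> tags)
    {
        var failures = new List<string>();
        foreach (var tag in tags)
        {
            if (tag.Width != 16 && tag.Width != 32)
            {
                failures.Add($"BCD tag at address {tag.Address} has Width={tag.Width}; expected 16 or 32.");
            }
        }
        return failures;
    }
}
```

Keeping the rule in a pure function makes `MbproxyOptionsBinding_RejectsInvalidWidth` testable without building a host.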
### E2E (`Category = E2E`)
None in this phase. The simulator harness is phase 01.
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings.
- [ ] `dotnet test --filter Category!=E2E` — all green, ≥6 tests.
- [ ] `dotnet run --project src/Mbproxy` — service starts, logs `mbproxy.startup.ready` to console within 5 s, exits cleanly on Ctrl-C.
- [ ] `appsettings.json` is a valid JSON document and parses into a populated `MbproxyOptions` instance via the test harness.
- [ ] [`../design.md`](../design.md) is unchanged (this phase introduces no new design decisions).
- [ ] Resource index entry for `docs/plan/00-bootstrap.md` is not needed (the plan README routes there).
## Out of scope
- BCD encode/decode logic (phase 02).
- TcpListener / Modbus framing / byte forwarding (phase 03).
- Polly retry pipelines (referenced as a NuGet, used starting in phase 04/05).
- Address-overlap / duplicate-port validation (phase 06).
- AdminPort HTTP endpoint (phase 07).
- Service install / uninstall scripts (phase 08).
## Notes for the subagent
- Do not create `README.md` for the tool root yet — that's a phase 08 deliverable when there's something installable to document.
- If the `xunit` v3 vs v2 question is unclear at implementation time, prefer v3 if available on NuGet — record the choice in a single-line comment at the top of the test csproj. Future phases must not silently switch.
- Use `LoggerMessage`-source-generated logging (`[LoggerMessage]`) for the heartbeat event so phases that add more log events can follow the same pattern. Set `EventId.Name = "mbproxy.startup.ready"`.
+108
@@ -0,0 +1,108 @@
# Phase 01 — Simulator harness
Wrap the existing pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) as a managed lifecycle for xUnit tests. After this phase, any test class that declares `[Collection(nameof(DL205SimulatorCollection))]` gets a running pymodbus server on a known port, with skip-safe behaviour when Python is unavailable.
**Depends on:** Phase 00 (test project exists).
**Parallel-safe with:** Phase 02, Phase 03. (Touches only `tests/sim/` and `tests/Mbproxy.Tests/Sim/`. Disjoint from codec and proxy work.)
## Goal
Eliminate "did the simulator start?" as a source of flaky tests. Encode the launch / readiness-probe / shutdown / cleanup contract once, in a fixture, so phases 03 / 04 / 05 / 06 / 07 don't each reinvent it. Tests must be able to declare a dependency on the simulator and get a hot port back, OR get a clean skip if the environment can't provide one.
## Outputs
```
tests/sim/run-dl205-sim.ps1 # idempotent launcher; venv-provisioning
tests/sim/README.md # how to run the simulator standalone
tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs
tests/Mbproxy.Tests/Sim/DL205SimulatorCollection.cs
tests/Mbproxy.Tests/Sim/SimulatorSmokeTests.cs # connects, sends FC03, verifies a seeded BCD register
```
Modifications:
- `.gitignore` already has `tests/sim/.venv/` from phase 00 — verify it's present.
- `tests/Mbproxy.Tests/Mbproxy.Tests.csproj` — add `NModbus` PackageReference (chosen for its small footprint and net10.0 compatibility; record the choice as a top-of-csproj comment). This is the Modbus TCP client used by tests against the simulator from this phase forward.
No other files.
## Tasks
1. **`tests/sim/run-dl205-sim.ps1`** — pure PowerShell. Parameters: `-Profile <path>` (default `../../DL260/dl205.json` relative to script, the same file this plan links to), `-Port <int>` (default 5020). Behaviour:
- If `tests/sim/.venv` doesn't exist: `python -m venv tests/sim/.venv`, then `tests/sim/.venv/Scripts/pip.exe install "pymodbus[server]"` pinned to a known version (record version in the script + README).
- Activate the venv (`& tests/sim/.venv/Scripts/activate.ps1`).
- Exec `pymodbus.server run --modbus-config-path <Profile> --modbus-server tcp --port <Port>`. Output streams to stdout/stderr; on script termination, the child server dies with it.
- Exit codes: 0 on clean exit, 1 on venv provisioning failure, 2 on pymodbus launch failure, 3 if the profile file is missing.
2. **`DL205SimulatorFixture : IAsyncLifetime`** —
- `InitializeAsync`: pick a free local port (bind/release a `TcpListener` on `IPAddress.Any`, port 0, capture the assigned port, dispose). Spawn `pwsh -NoProfile -File <run-dl205-sim.ps1> -Port <picked>` via `System.Diagnostics.Process` with `RedirectStandardOutput/Error`. Poll `new TcpClient().ConnectAsync("127.0.0.1", port)` at 100 ms intervals for up to 10 s. If the simulator never accepts a connection, capture stderr tail, set `SkipReason`, and dispose the process.
- `DisposeAsync`: terminate the process tree with `Process.Kill(entireProcessTree: true)`. A graceful Ctrl-C to the process group would be preferable, and pymodbus handles SIGTERM gracefully, but Windows lacks proper signals, so the hard kill is the pragmatic choice; document the tradeoff in a comment. Wait up to 5 s for exit.
- Public surface: `string Host { get; }` (always `127.0.0.1`), `int Port { get; }`, `string? SkipReason { get; }`, `string LogTail { get; }` (last ~50 lines of stderr, for diagnosis).
3. **`DL205SimulatorCollection`** —
```csharp
[CollectionDefinition(nameof(DL205SimulatorCollection))]
public sealed class DL205SimulatorCollection : ICollectionFixture<DL205SimulatorFixture> { }
```
Tests that need the fixture declare `[Collection(nameof(DL205SimulatorCollection))]`.
4. **`SimulatorSmokeTests`** — `[Collection(nameof(DL205SimulatorCollection))] [Trait("Category", "E2E")]`. Three tests:
- `Simulator_AcceptsTcpConnection`
- `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — reads register 0, expects `0xCAFE` (the seeded marker from `dl205.json`). Uses NModbus directly. This proves the dl205.json profile is in fact loaded.
- `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — reads register 1072, expects raw `0x1234` (= 4660). This is the BCD register the proxy will rewrite later; phase 04's e2e test will read the SAME register through the proxy and assert 1234 instead.
5. **`tests/sim/README.md`** — a few lines: "Run `pwsh ./run-dl205-sim.ps1 -Port 5020` to launch the simulator standalone. Used by xUnit tests via `DL205SimulatorFixture`. Requires Python 3.10+; the script provisions a venv on first run."
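The port-picking and readiness-probe steps of the fixture can be sketched as below. This is a sketch under assumptions: `SimulatorProbe`, `PickFreePort`, and `WaitForPortAsync` are hypothetical names, and the listener binds the loopback address because the fixture only ever connects to `127.0.0.1`. Process spawning and stderr capture are left out.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper for the fixture's InitializeAsync.
public static class SimulatorProbe
{
    // Bind to port 0 so the OS assigns a free port, read it back, then release it.
    // TOCTOU caveat: another process can take the port before the child binds it.
    public static int PickFreePort()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        try
        {
            return ((IPEndPoint)listener.LocalEndpoint).Port;
        }
        finally
        {
            listener.Stop();
        }
    }

    // Retries a TCP connect until it succeeds or the deadline passes.
    public static async Task<bool> WaitForPortAsync(string host, int port, TimeSpan timeout, TimeSpan interval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                using var client = new TcpClient();
                await client.ConnectAsync(host, port);
                return true; // something is accepting connections
            }
            catch (SocketException)
            {
                await Task.Delay(interval); // not up yet; try again
            }
        }
        return false;
    }
}
```

A `false` result is the point where the fixture would set `SkipReason` instead of throwing, so dependent tests skip rather than fail.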
## Public surface declared in this phase
```csharp
namespace Mbproxy.Tests.Sim;
public sealed class DL205SimulatorFixture : IAsyncLifetime {
public string Host { get; }
public int Port { get; }
public string? SkipReason { get; }
public string LogTail { get; }
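public ValueTask InitializeAsync(); // xUnit v3 shape (v3 is also what provides Assert.Skip); Task under v2
public ValueTask DisposeAsync();    // xUnit v3 shape; Task under v2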
}
[CollectionDefinition(nameof(DL205SimulatorCollection))]
public sealed class DL205SimulatorCollection : ICollectionFixture<DL205SimulatorFixture> { }
```
No production code is added in this phase.
## Tests required
### Unit (Category = Unit)
None in this phase. The fixture itself is a test-infrastructure component; its correctness is verified by the e2e smoke tests below.
### E2E (Category = E2E)
1. `Simulator_AcceptsTcpConnection` — open a TCP socket to `fixture.Host:fixture.Port` within the fixture lifetime.
2. `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — NModbus FC03, asserts `0xCAFE`.
3. `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — NModbus FC03, asserts raw `0x1234` (4660).
When `SkipReason` is set, all three skip with `Assert.Skip(fixture.SkipReason)`. The phase gate explicitly verifies that on a machine WITH Python+pymodbus, none of them skip — skips are an environment failure, not a test pass.
## Phase gate
- [ ] `pwsh tests/sim/run-dl205-sim.ps1 -Port 5020` standalone — script provisions a venv on first run, server logs "Modbus TCP server listening" within 10 s, Ctrl-C exits cleanly.
- [ ] On second run: venv exists, script skips provisioning, server starts in < 2 s.
- [ ] On a machine WITHOUT Python: `SkipReason` is non-null and tests skip rather than fail.
- [ ] On a machine WITH Python: `SkipReason` is null, all three e2e smoke tests pass.
- [ ] `dotnet test --filter Category=E2E` is green on the dev machine.
- [ ] `dotnet test --filter Category!=E2E` still green (no regression to phase 00's tests).
- [ ] Build zero-warnings.
- [ ] `tests/sim/README.md` documents the manual launch path.
## Out of scope
- Multiple simultaneous simulators (one fixture instance is enough for all e2e tests via `ICollectionFixture`).
- Alternate profiles selected via `MODBUS_SIM_PROFILE` env var — defer until phase 04 actually needs a partial-overlap scenario; add the env-var support then.
- A C# pymodbus replacement / in-process Modbus mock. The pymodbus profile is the source of truth for DL-series quirks and we're not duplicating it.
- pip-mirror or offline-install support. CI is expected to have network or a pre-warmed venv; if a customer site needs offline install, that's a deployment concern (phase 08).
## Notes for the subagent
- Capture the chosen `pymodbus` version pin in both `run-dl205-sim.ps1` and `tests/sim/README.md` so the version isn't lost across re-provisioning.
- The free-port-picker pattern (bind on `:0`, capture port, dispose, then hand the port to the child process) has an inherent TOCTOU race — another process could grab the port between dispose and pymodbus binding. In practice this is rare; acceptable for tests. Note the trade-off in a comment.
- Pymodbus log output is verbose. Pipe it through a line buffer; only the last ~50 lines need to be available via `LogTail` for diagnosis.
- Do not commit the `.venv/` directory.
+157
@@ -0,0 +1,157 @@
# Phase 02 — BCD codec
Pure logic for encoding integers as DirectLOGIC BCD nibbles and decoding nibbles back. No I/O, no network, no Modbus framing. The codec exposed by this phase is what phase 04 plugs into the proxy.
**Depends on:** Phase 00 (csproj + options POCOs).
**Parallel-safe with:** Phase 01, Phase 03. (All work lives under `src/Mbproxy/Bcd/` and `tests/Mbproxy.Tests/Bcd/` — disjoint from sim harness and proxy plumbing.)
## Goal
A tiny, allocation-free codec library that:
- Encodes a non-negative `int` (which must fit the width's range: `[0, 9999]` for 16-bit, `[0, 99_999_999]` for 32-bit) to either one 16-bit raw register value or a `(low, high)` register pair for 32-bit BCD per the design's CDAB digit-layout rule.
- Decodes one or two raw register values back to an `int`.
- Resolves `Global + per-PLC Add - per-PLC Remove` into an **immutable per-PLC `BcdTagMap`** that the rewriter looks up by Modbus address in O(1).
The codec is the single source of BCD-encoding correctness in the system. Phase 04 must not reimplement any nibble math.
## Outputs
```
src/Mbproxy/Bcd/BcdCodec.cs # static class: Encode16, Decode16, Encode32, Decode32
src/Mbproxy/Bcd/BcdTag.cs # the public record (mirrors design.md exactly)
src/Mbproxy/Bcd/BcdTagMap.cs # immutable, address-keyed lookup; describes per-PLC resolved tags
src/Mbproxy/Bcd/BcdTagMapBuilder.cs # resolves global + Add - Remove into a map; runs validation
src/Mbproxy/Bcd/BcdValidationError.cs # enum + ValidationResult record
tests/Mbproxy.Tests/Bcd/BcdCodecTests.cs
tests/Mbproxy.Tests/Bcd/BcdTagMapBuilderTests.cs
```
No other files. The proxy plumbing layer doesn't exist yet and isn't touched.
## Tasks
1. **`BcdTag.cs`** — `public sealed record BcdTag(ushort Address, byte Width)` with a static factory `Create(ushort, byte)` that throws on `Width != 16 && Width != 32`. This record is the type phases 04 / 06 / 07 will use.
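2. **`BcdCodec.cs`** — `internal static class` with four pure methods. Internal because the proxy is the only production consumer; nothing else in the assembly should call these. `BcdCodecTests` lives in a separate assembly, so it needs `InternalsVisibleTo` for `Mbproxy.Tests` on the production project (add it here if an earlier phase has not).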
- `static ushort Encode16(int value)` — value in `[0, 9999]`; produces the 16-bit BCD register, e.g. `1234 → 0x1234`. Throws `ArgumentOutOfRangeException` if value is out of range.
- `static int Decode16(ushort raw)` — inverse. If any nibble is `>= 0xA`, throw `FormatException` with the raw value in the message; do not return a sentinel such as `int.MinValue`. The rewriter catches this and surfaces a `mbproxy.rewrite.invalid_bcd` event (event name added in phase 04).
- `static (ushort low, ushort high) Encode32(int value)` — value in `[0, 99_999_999]`; produces the CDAB pair, where `low` = low 4 BCD digits (least-significant) and `high` = high 4 BCD digits (most-significant). Decoded decimal = `Decode16(high) * 10000 + Decode16(low)`. Throws `ArgumentOutOfRangeException` if out of range.
- `static int Decode32(ushort low, ushort high)` — inverse. Throws `FormatException` if either word has a bad nibble.
3. **`BcdTagMap.cs`** — `public sealed class BcdTagMap` wrapping a frozen address-keyed dictionary. Methods:
- `static BcdTagMap Empty { get; }`
- `bool TryGet(ushort address, out BcdTag tag)` — O(1) lookup.
- `bool TryGetForRange(ushort startAddress, ushort qty, out IReadOnlyList<RangeHit> hits)` — returns every BCD tag whose register footprint intersects `[startAddress, startAddress+qty)`. Each `RangeHit.OffsetWords` is relative to `startAddress`. Used by the rewriter to know which slots in a multi-register PDU to touch.
- `int Count { get; }`, `IEnumerable<BcdTag> All { get; }` — for telemetry / status page.
4. **`BcdTagMapBuilder.cs`** — given `BcdTagListOptions Global` and `PlcBcdOverrides? perPlc`, produce a `ValidationResult` that carries the resolved `BcdTagMap` plus any errors and warnings. Validation rules from design.md:
- Reject duplicate addresses within the resolved list (Add+Global after Remove).
- Reject 32-bit entries whose high register (`Address+1`) collides with any other entry's address (16-bit or 32-bit).
- Warn on `Remove` entries that don't match any address in Global (this is not a failure; the warning rides on `ValidationResult.Warnings`).
- Reject `Width` values other than 16/32 (defensive; phase 00's `IValidateOptions` should already have caught this, but the builder is the last line of defence).
5. **`BcdValidationError.cs`** — `public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth }`. `public sealed record ValidationResult(BcdTagMap Map, IReadOnlyList<BcdError> Errors, IReadOnlyList<BcdWarning> Warnings)`. Errors fail the build; warnings ride along.
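The resolution and validation rules in task 4 can be sketched as follows. This is a sketch, assuming an ordering the text does not fully pin down: Global is seeded first, `Remove` is applied to it, and `Add` is layered on top so that an `Add` at an existing Global address replaces it. `TagResolver` and its tuple return shape are hypothetical, and `BcdTag` is redeclared locally so the block stands alone; errors are plain strings here rather than `BcdError` records.

```csharp
using System;
using System.Collections.Generic;

// Local stand-in for the BcdTag record declared in this phase.
public sealed record BcdTag(ushort Address, byte Width);

// Hypothetical resolver: Global, minus Remove, plus Add, then validation.
public static class TagResolver
{
    public static (Dictionary<ushort, BcdTag> Map, List<string> Errors, List<string> Warnings) Resolve(
        IEnumerable<BcdTag> global, IEnumerable<BcdTag> add, IEnumerable<ushort> remove)
    {
        var map = new Dictionary<ushort, BcdTag>();
        var errors = new List<string>();
        var warnings = new List<string>();

        // 1. Seed from Global; a repeated address inside Global is an error.
        foreach (var tag in global)
        {
            if (!map.TryAdd(tag.Address, tag))
                errors.Add($"DuplicateAddress: {tag.Address}");
        }

        // 2. Apply Remove; a miss is only a warning.
        foreach (var address in remove)
        {
            if (!map.Remove(address))
                warnings.Add($"Remove {address} matches nothing in Global");
        }

        // 3. Apply Add; an Add at a surviving Global address replaces that entry.
        var seenAdds = new HashSet<ushort>();
        foreach (var tag in add)
        {
            if (!seenAdds.Add(tag.Address))
                errors.Add($"DuplicateAddress: {tag.Address}");
            map[tag.Address] = tag;
        }

        // 4. Validate widths and 32-bit high-register collisions on the resolved set.
        foreach (var tag in map.Values)
        {
            if (tag.Width != 16 && tag.Width != 32)
                errors.Add($"InvalidWidth: {tag.Address}");
            else if (tag.Width == 32 && tag.Address < ushort.MaxValue
                     && map.ContainsKey((ushort)(tag.Address + 1)))
                errors.Add($"OverlappingHighRegister: {tag.Address}");
        }

        return (map, errors, warnings);
    }
}
```

For example, resolving Global `{100/16, 200/32}` with Add `{100/32}` and Remove `{999}` yields a 32-bit entry at 100, no errors, and one warning for the unmatched `Remove`.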
## Public surface declared in this phase
```csharp
namespace Mbproxy.Bcd;
public sealed record BcdTag(ushort Address, byte Width) {
public static BcdTag Create(ushort address, byte width);
public bool IsThirtyTwoBit => Width == 32;
public ushort HighRegister => IsThirtyTwoBit
    ? (ushort)(Address + 1)
    : throw new InvalidOperationException("HighRegister requires Width == 32.");
}
public sealed class BcdTagMap {
public static BcdTagMap Empty { get; }
public int Count { get; }
public IEnumerable<BcdTag> All { get; }
public bool TryGet(ushort address, out BcdTag tag);
public bool TryGetForRange(ushort startAddress, ushort qty, out IReadOnlyList<RangeHit> hits);
}
public readonly record struct RangeHit(int OffsetWords, BcdTag Tag);
public static class BcdTagMapBuilder {
public static ValidationResult Build(BcdTagListOptions global, PlcBcdOverrides? perPlc);
}
public sealed record ValidationResult(
BcdTagMap Map,
IReadOnlyList<BcdError> Errors,
IReadOnlyList<BcdWarning> Warnings);
public sealed record BcdError(BcdValidationError Kind, string Message, ushort? Address);
public sealed record BcdWarning(string Message, ushort? Address);
public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth }
```
```csharp
namespace Mbproxy.Bcd;
internal static class BcdCodec {
public static ushort Encode16(int value);
public static int Decode16(ushort raw);
public static (ushort low, ushort high) Encode32(int value);
public static int Decode32(ushort low, ushort high);
}
```
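One possible shape for the four codec methods is sketched below. It follows the ranges, exceptions, and low-word-first pairing stated in the tasks, but it is an illustration rather than the phase's implementation: the class is named `BcdCodecSketch` and made `public` only so the block stands alone.

```csharp
using System;

// Illustrative codec; the real BcdCodec is internal and lives in Mbproxy.Bcd.
public static class BcdCodecSketch
{
    // Packs up to four decimal digits into four BCD nibbles, e.g. 1234 -> 0x1234.
    public static ushort Encode16(int value)
    {
        if (value < 0 || value > 9999)
            throw new ArgumentOutOfRangeException(nameof(value), value, "16-bit BCD holds 0..9999.");
        int raw = 0;
        for (int shift = 0; shift < 16; shift += 4)
        {
            raw |= (value % 10) << shift; // lowest decimal digit goes in the lowest nibble
            value /= 10;
        }
        return (ushort)raw;
    }

    // Unpacks four BCD nibbles; any nibble above 9 is rejected.
    public static int Decode16(ushort raw)
    {
        int result = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9)
                throw new FormatException($"Invalid BCD nibble in 0x{raw:X4}.");
            result = result * 10 + nibble;
        }
        return result;
    }

    // Splits into low and high four-digit halves; low is the least-significant word.
    public static (ushort low, ushort high) Encode32(int value)
    {
        if (value < 0 || value > 99_999_999)
            throw new ArgumentOutOfRangeException(nameof(value), value, "32-bit BCD holds 0..99999999.");
        return (Encode16(value % 10_000), Encode16(value / 10_000));
    }

    // Inverse of Encode32; a bad nibble in either word surfaces as FormatException.
    public static int Decode32(ushort low, ushort high)
        => Decode16(high) * 10_000 + Decode16(low);
}
```

None of the methods allocate on the success path, which fits the "allocation-free" goal; the exception paths do allocate, which seems acceptable since they are not the hot path.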
## Tests required
### Unit (`Category = Unit`)
`BcdCodecTests` (≥ 16 tests):
1. `Encode16_1234_Returns_0x1234`
2. `Encode16_0_Returns_0x0000`
3. `Encode16_9999_Returns_0x9999`
4. `Encode16_10000_Throws_OutOfRange`
5. `Encode16_Negative_Throws_OutOfRange`
6. `Decode16_0x1234_Returns_1234`
7. `Decode16_0x0000_Returns_0`
8. `Decode16_0x9999_Returns_9999`
9. `Decode16_0x123A_Throws_Format` — bad nibble `A`.
10. `Encode32_12345678_Returns_LowHigh_5678_1234` — verify `low = 0x5678`, `high = 0x1234`.
11. `Encode32_0_Returns_LowHigh_0_0`
12. `Encode32_99999999_Returns_LowHigh_9999_9999`
13. `Encode32_100000000_Throws_OutOfRange`
14. `Decode32_LowHigh_5678_1234_Returns_12345678`
15. `Decode32_BadNibble_InLow_Throws`
16. `Decode32_BadNibble_InHigh_Throws`
17. `RoundTrip16_AllValuesUnder10000`: `[Theory]` with `[InlineData]` for boundary values; for the dense check use `[Theory] [MemberData]` enumerating every 100th value. The codec must satisfy `Decode16(Encode16(v)) == v`.
`BcdTagMapBuilderTests` (≥ 10 tests):
1. `Build_EmptyGlobal_EmptyOverride_ReturnsEmptyMap`
2. `Build_GlobalOnly_PopulatesMap`
3. `Build_PerPlcAdd_AppendsToGlobal`
4. `Build_PerPlcRemove_DropsFromGlobal`
5. `Build_AddOverrideSameAddressAsGlobal_AddWidthWins`
6. `Build_DuplicateAddressInGlobal_ReturnsDuplicateAddressError`
7. `Build_32BitHighRegOverlaps16BitGlobal_ReturnsOverlappingHighRegisterError`
8. `Build_Remove_OfNonExistentAddress_ReturnsWarning_NotError`
9. `Build_InvalidWidth_ReturnsInvalidWidthError`
10. `Map_TryGetForRange_ReturnsAllHits_InOrder` — covers full overlap, partial overlap (low only, high only), and no overlap.
### E2E (Category = E2E)
None. The codec is pure logic.
## Phase gate
- [ ] Zero-warnings build.
- [ ] `dotnet test --filter Category=Unit` — all green, ≥ 26 new tests.
- [ ] `BcdCodec` is `internal`; nothing outside `Mbproxy.Bcd` calls it directly.
- [ ] `BcdTagMap` has zero allocations on `TryGet` and on the hot `TryGetForRange` path (verify via a microbench note in the test file's docstring; no benchmark project added).
- [ ] [`../design.md`](../design.md) → "BCD tag shape" matches the public record exactly; if the spec drifted during implementation, update design.md in this PR.
## Out of scope
- Signed BCD. Design explicitly excludes it.
- Half-byte / "BCD with sign nibble" variants used by some DL-family math instructions. Not in the design's tag shape.
- The actual PDU-byte-level rewriting (FC parsing, MBAP framing). That's phase 04.
- Telemetry counters. The codec exposes nothing to counters; phase 04 instruments the rewrite pipeline that USES the codec.
## Notes for the subagent
- The DirectLOGIC CDAB digit layout is the most-likely-to-confuse part of this phase. Re-read [`../design.md`](../design.md) → "BCD tag shape" and [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Word Order" before implementing `Encode32`/`Decode32`. The seeded marker in `dl205.json` for the float32 case (`HR[1056]=0x0000, HR[1057]=0x3FC0` for IEEE 1.5) confirms low-word-first; the BCD-32 case is the same word order with BCD nibble semantics inside each word.
- `BcdTagMapBuilder` is single-shot — given inputs, produce a map. There is NO `IObservable<BcdTagMap>` here. Phase 06 owns reload-driven rebuilds and just calls `Build` again.
- `TryGetForRange` is on the hot path for FC03/04 responses. Implementation should pre-bucket BCD tags by 256-register window if it makes the lookup faster, but only if a microbench shows a real win. Don't preoptimise.
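The range intersection behind `TryGetForRange` can be sketched without any bucketing. Assumptions to note: `RangeLookup` and `FindHits` are hypothetical names, `BcdTag` and `RangeHit` are redeclared locally, and a 32-bit tag that starts one register before the window is reported with `OffsetWords = -1` (the plan does not say how a "high only" overlap should be reported). The sketch allocates a list per call, so it ignores the zero-allocation gate and only shows the lookup logic.

```csharp
using System;
using System.Collections.Generic;

// Local stand-ins for the types declared in this phase.
public sealed record BcdTag(ushort Address, byte Width);
public readonly record struct RangeHit(int OffsetWords, BcdTag Tag);

// Hypothetical lookup: one dictionary probe per register in the window.
public static class RangeLookup
{
    public static List<RangeHit> FindHits(IReadOnlyDictionary<ushort, BcdTag> map, ushort startAddress, ushort qty)
    {
        var hits = new List<RangeHit>();
        if (qty == 0) return hits;
        int end = startAddress + qty; // exclusive bound; int avoids ushort overflow

        // A 32-bit tag starting one register early still overlaps via its high word.
        if (startAddress > 0
            && map.TryGetValue((ushort)(startAddress - 1), out var before)
            && before.Width == 32)
        {
            hits.Add(new RangeHit(-1, before));
        }

        // Tags that start inside the window, in ascending address order.
        for (int address = startAddress; address < end && address <= ushort.MaxValue; address++)
        {
            if (map.TryGetValue((ushort)address, out var tag))
                hits.Add(new RangeHit(address - startAddress, tag));
        }
        return hits;
    }
}
```

The cost is proportional to `qty`, which Modbus caps at 125 registers per FC03/04 read, so a plain dictionary probe may already be fast enough before any windowed bucketing is considered.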
+129
@@ -0,0 +1,129 @@
# Phase 03 — Proxy plumbing
The minimum-viable proxy: one `TcpListener` per configured PLC, 1:1 upstream-client ↔ backend-socket, byte-for-byte forwarding both directions, transparent MBAP TxId / unit ID. No BCD rewriting yet — that's phase 04. No supervisor / auto-recovery — that's phase 05.
**Depends on:** Phase 00 (host, options).
**Parallel-safe with:** Phase 02 (BCD codec lives under `src/Mbproxy/Bcd/`; this phase lives under `src/Mbproxy/Proxy/`).
## Goal
Stand up the listener-and-forwarder pair so an e2e test can:
1. Configure the proxy with `Plcs: [{ Host: "127.0.0.1", Port: <simPort>, ListenPort: <proxyPort> }]`.
2. Start the host.
3. Drive NModbus against `127.0.0.1:<proxyPort>` and see the SAME bytes the simulator would return on a direct connection.
The proxy is transparent in this phase. The BCD rewrite hook point is reserved but not wired.
## Outputs
```
src/Mbproxy/Proxy/PlcListener.cs # owns one TcpListener; accepts loop
src/Mbproxy/Proxy/PlcConnectionPair.cs # one upstream socket + one backend socket; forwarder
src/Mbproxy/Proxy/IPduPipeline.cs # the rewrite hook contract (no-op impl in this phase)
src/Mbproxy/Proxy/NoopPduPipeline.cs # the no-op impl
src/Mbproxy/Proxy/ProxyWorker.cs # BackgroundService that owns all PlcListeners
src/Mbproxy/Proxy/MbapFrame.cs # MBAP header parse helpers (length, txid, unit)
tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs # e2e against the simulator
tests/Mbproxy.Tests/Proxy/MbapFrameTests.cs # unit tests for the MBAP parser
```
Modifications:
- `src/Mbproxy/Program.cs` — register `ProxyWorker` as a hosted service. The `HeartbeatWorker` from phase 00 is DELETED in this phase (its job is replaced by ProxyWorker logging `mbproxy.startup.ready` after all listeners are bound).
- `src/Mbproxy/Workers/HeartbeatWorker.cs` — DELETED.
## Tasks
1. **`MbapFrame.cs`** — pure helpers, no allocations. Static methods:
- `static bool TryParseHeader(ReadOnlySpan<byte> buffer, out ushort txId, out ushort protocolId, out ushort length, out byte unitId)` — returns false if buffer.Length < 7.
- `static int TotalFrameLength(ushort lengthField)` returns `lengthField + 6` (7 header bytes minus the 1-byte unit ID, which is already counted in the length field).
2. **`IPduPipeline.cs`** — the rewrite hook. Single method:
```csharp
void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context);
```
`MbapDirection` is `RequestToBackend` or `ResponseToClient`. `PduContext` carries the per-pair state (counters, PLC name, configured tag map). In phase 03, the only implementation is `NoopPduPipeline` which does nothing.
3. **`NoopPduPipeline.cs`** — empty `Process` method. Registered as the default `IPduPipeline` in DI for this phase. Phase 04 replaces it with the real rewriter.
4. **`PlcConnectionPair.cs`** — owns the upstream `Socket` (or `TcpClient`) handed to it by `PlcListener.Accept`, opens a fresh backend socket to the configured PLC, and runs two `Task`s:
- **Upstream → backend**: read one full MBAP frame at a time (header → length → rest), call `pipeline.Process(RequestToBackend, header, pdu, ctx)`, write the frame to the backend.
- **Backend → upstream**: same shape, with `ResponseToClient`.
Either task ending (socket closed, exception, cancellation) tears down both sides cleanly. No retry loop; that's phase 05.
Backend connect is wrapped in a `try`/`catch` with the configured `BackendConnectTimeoutMs`. Connect failures close the upstream socket immediately and log `mbproxy.backend.failed`. Polly bounded retries on backend connect are **deferred to phase 05** to keep this phase scope tight — note the deferral in code with `// Phase 05: wrap in Polly pipeline`.
5. **`PlcListener.cs`** — owns one `TcpListener` for one PLC. `StartAsync` binds; on bind failure, throws (caller logs `mbproxy.startup.bind.failed` and decides what to do — phase 05 will introduce the supervisor that turns this into a recoverable state). On each accept, hands the socket to a fresh `PlcConnectionPair` and runs it on the thread-pool.
6. **`ProxyWorker.cs`** — `BackgroundService`. On start: enumerates `MbproxyOptions.Plcs`, instantiates one `PlcListener` per entry, starts them all. Each bind that succeeds logs `mbproxy.startup.bind`; each that fails logs `mbproxy.startup.bind.failed` and continues to the next PLC (matching the design's "eager, continue on per-port failure" posture). After all bind attempts, logs `mbproxy.startup.ready` with `{ ListenersBound, PlcsConfigured }`. On stop: cancels and disposes all listeners and their open pairs.
7. **`Program.cs`** — remove the HeartbeatWorker registration; register `ProxyWorker`. Also register `IPduPipeline` as a singleton `NoopPduPipeline` in DI.
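The `MbapFrame` helpers from task 1 can be sketched roughly as follows. This is a minimal sketch, assuming the standard Modbus TCP layout with big-endian header fields; the class name `MbapFrameSketch` and the `HeaderLength` constant are hypothetical, and only the two method signatures come from the task list.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical sketch of the MbapFrame helpers; not the shipped implementation.
internal static class MbapFrameSketch
{
    // TxId(2) + ProtocolId(2) + Length(2) + UnitId(1)
    public const int HeaderLength = 7;

    public static bool TryParseHeader(
        ReadOnlySpan<byte> buffer,
        out ushort txId, out ushort protocolId, out ushort length, out byte unitId)
    {
        txId = 0; protocolId = 0; length = 0; unitId = 0;
        if (buffer.Length < HeaderLength) return false;

        // Multi-byte MBAP fields are big-endian on the wire.
        txId = BinaryPrimitives.ReadUInt16BigEndian(buffer);
        protocolId = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(2));
        length = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(4));
        unitId = buffer[6];
        return true;
    }

    // The length field counts the unit ID plus the PDU, so the full frame is
    // the 6 bytes that precede the unit ID plus lengthField.
    public static int TotalFrameLength(ushort lengthField) => lengthField + 6;
}
```

Note that the sketch does not reject a non-zero protocol ID or a `length` below 2, matching the unit-test expectations listed for `MbapFrameTests`.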
## Public surface declared in this phase
All `internal sealed class` — the proxy types are not consumed outside this assembly. The only public-shaped surfaces are the `IPduPipeline` interface and the `MbapDirection` enum (so phase 04 can implement its own pipeline cleanly).
```csharp
namespace Mbproxy.Proxy;
public interface IPduPipeline {
void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context);
}
public enum MbapDirection { RequestToBackend, ResponseToClient }
public sealed class PduContext {
public string PlcName { get; init; } = "";
// Phase 04 adds: BcdTagMap, counters, logger
}
internal sealed class NoopPduPipeline : IPduPipeline { /* no-op */ }
internal sealed class MbapFrame { /* static helpers */ }
internal sealed class PlcListener : IAsyncDisposable { /* ... */ }
internal sealed class PlcConnectionPair : IAsyncDisposable { /* ... */ }
internal sealed class ProxyWorker : BackgroundService { /* ... */ }
```
## Tests required
### Unit (`Category = Unit`)
`MbapFrameTests` (≥ 8 tests):
1. `TryParseHeader_TooShort_ReturnsFalse`
2. `TryParseHeader_ValidFrame_ParsesAllFields`
3. `TryParseHeader_ProtocolId_NotZero_StillParses` — we don't reject non-zero protocol IDs; that's the PLC's job.
4. `TotalFrameLength_LengthField7_Returns13`
5. `TotalFrameLength_LengthFieldMax_Returns_LengthFieldPlus6`
6. Round-trip: parse a known good FC03 frame and assert each field.
7. Round-trip: parse a known good FC16 write-multiple frame.
8. Negative: a frame with `length < 2` still parses and returns the parsed value; rejecting it is the caller's responsibility. Document this in a test.
### E2E (`Category = E2E`)
`ProxyForwardingTests` (≥ 6 tests, `[Collection(nameof(DL205SimulatorCollection))]`):
1. `Forward_FC03_HR0_Returns_SimulatorRawValue_0xCAFE` — proxy is transparent; client sees the raw simulator value.
2. `Forward_FC03_HR1072_Returns_RawBCD_0x1234` — the BCD register is NOT rewritten in phase 03 (NoopPduPipeline). This test will be REPLACED in phase 04 with one that asserts `1234` instead. Document the planned replacement in a comment so phase 04's agent knows what to update.
3. `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips` — proves the write path forwards correctly.
4. `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`.
5. `MbapTxId_IsPreservedEndToEnd` — issue 20 back-to-back FC03 reads with monotonically increasing TxIds; assert every response carries the matching TxId.
6. `BackendConnectFailure_ClosesUpstreamCleanly` — point the proxy at an unreachable backend (`127.0.0.1:1`), assert the client's socket is closed within `BackendConnectTimeoutMs + 200ms`.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00, 02 tests still green.
- [ ] All new unit tests green (≥ 8 in MbapFrameTests).
- [ ] All new e2e tests green when the simulator is available; skip cleanly when it isn't.
- [ ] `dotnet run --project src/Mbproxy` with an appsettings.json pointing at the simulator: NModbus can read/write through the proxy and gets the simulator's raw values.
- [ ] On startup with one bad and one good PLC config, the good one binds and the bad one logs `mbproxy.startup.bind.failed`, and the service does NOT abort. (Hand the supervisor work to phase 05; this phase only proves the "continue on per-port failure" posture.)
- [ ] `mbproxy.startup.ready` is now logged by `ProxyWorker`, not by a heartbeat worker. The heartbeat worker file is deleted.
## Out of scope
- BCD rewriting (phase 04 replaces `NoopPduPipeline`).
- Polly retries on backend connect (phase 05 supervisor wraps this).
- Auto-recovery for failed listener binds (phase 05).
- Counter tracking / per-PLC telemetry (phase 04 starts adding counters via `PduContext`).
- Half-MBAP-frame handling (split TCP packets): no dedicated machinery. A read can return fewer bytes than requested, so loop to fill the header (7 bytes) and then loop to fill the body (`length - 1` more bytes). Test 5 above verifies this stays correct over 20 back-to-back requests.
## Notes for the subagent
- `Socket` vs `TcpClient`: prefer `Socket` directly so framing reads can receive into `Memory<byte>` without `NetworkStream` allocation overhead. The performance difference is small but the byte-precise API matches what the rewriter in phase 04 will need.
- Frame reads use a per-pair pooled buffer of 260 bytes (MBAP header 7 + max PDU 253). Don't allocate per-frame.
- The "Phase 04 will replace test 2" pattern is intentional. Leave breadcrumbs so the next phase's agent knows exactly which test to update; do NOT silently make the test pass against a future rewriter.
- Both forwarder tasks run with the same `CancellationTokenSource`. Cancellation propagates from listener stop → pair stop → both task ends → socket dispose.
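The fill-the-header-then-fill-the-body read described in this phase can be sketched as below. This is a hedged illustration only: it reads from a `Stream` so it can be exercised without a socket (the same loop shape would apply to `Socket.ReceiveAsync`), and the names `FrameReaderSketch`, `FillAsync`, and `ReadFrameAsync` are hypothetical. Returning 0 for both "peer closed" and "malformed length" is a simplification.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a framed MBAP read over a reusable buffer.
internal static class FrameReaderSketch
{
    private const int HeaderLength = 7;

    // Keeps reading until count bytes have arrived; false if the peer closed first.
    private static async Task<bool> FillAsync(
        Stream stream, byte[] buffer, int offset, int count, CancellationToken ct)
    {
        while (count > 0)
        {
            int read = await stream.ReadAsync(buffer.AsMemory(offset, count), ct);
            if (read == 0) return false; // orderly close mid-frame
            offset += read;
            count -= read;
        }
        return true;
    }

    // Reads one frame into the caller's buffer (260 bytes covers header + max PDU).
    // Returns the total frame length, or 0 on close or an unusable length field.
    public static async Task<int> ReadFrameAsync(Stream stream, byte[] buffer, CancellationToken ct)
    {
        if (!await FillAsync(stream, buffer, 0, HeaderLength, ct)) return 0;

        int lengthField = (buffer[4] << 8) | buffer[5]; // big-endian length
        int bodyLength = lengthField - 1;               // unit ID was read with the header
        if (bodyLength < 1 || HeaderLength + bodyLength > buffer.Length) return 0;

        if (!await FillAsync(stream, buffer, HeaderLength, bodyLength, ct)) return 0;
        return HeaderLength + bodyLength;
    }
}
```

Because the buffer is passed in by the caller, one 260-byte array per pair can be reused for every frame, in line with the no-per-frame-allocation note above.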
@@ -0,0 +1,146 @@
# Phase 04 — Rewriter integration
Replace `NoopPduPipeline` with the real BCD rewriter. After this phase, FC03/FC04 responses have their configured BCD slots decoded to binary integers on the way to the client, and FC06/FC16 requests have their configured BCD slots encoded to nibbles on the way to the PLC. Counters and warnings come online here.
**Depends on:** Phase 02 (codec + tag map), Phase 03 (plumbing + `IPduPipeline`).
**Parallel-safe with:** nothing (it integrates two prior phases' outputs).
## Goal
Wire `BcdTagMap` + `BcdCodec` into the proxy at the single hook point `IPduPipeline.Process(...)`. The rewriter is responsible for:
- FC03 / FC04 responses: re-encode every covered slot from raw nibbles into a binary integer.
- FC06 / FC16 requests: re-encode every covered slot from binary integer into raw BCD nibbles.
- Partial-overlap of 32-bit pairs: pass through raw, emit `mbproxy.rewrite.partial_bcd` warning, increment partial-overlap counter.
- Bad BCD nibbles in a PLC response: pass through raw, emit `mbproxy.rewrite.invalid_bcd` (new event in this phase) at Warning, increment invalid-bcd counter. NEVER throw out of the pipeline.
- Increment per-pair counters for `pdus.forwarded`, `pdus.byFc`, `pdus.rewrittenSlots`, `pdus.partialBcdWarnings`, `pdus.invalidBcdWarnings`.
The transparency contract holds: MBAP header bytes are untouched, length field is unchanged (re-encoded slots are the same byte width), TxId / unit ID flow through.
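To make the same-byte-width claim concrete, here is a minimal sketch of a 16-bit BCD round trip. It is not the phase 02 `BcdCodec` (whose real signatures are not shown in this section, and which reports bad nibbles via exceptions according to the tasks in this phase); `Bcd16Sketch` and its `Try*` methods are hypothetical stand-ins.

```csharp
using System;

// Hypothetical stand-in for a 16-bit BCD codec: one register in, one register out.
internal static class Bcd16Sketch
{
    // Decodes four BCD nibbles (for example 0x1234) into the integer 1234.
    // Returns false if any nibble is above 9.
    public static bool TryDecode(ushort raw, out ushort value)
    {
        int result = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) { value = 0; return false; }
            result = result * 10 + nibble;
        }
        value = (ushort)result;
        return true;
    }

    // Encodes an integer in [0, 9999] as four BCD nibbles.
    public static bool TryEncode(ushort value, out ushort raw)
    {
        raw = 0;
        if (value > 9999) return false;
        int result = 0;
        int remaining = value;
        for (int shift = 0; shift <= 12; shift += 4)
        {
            result |= (remaining % 10) << shift; // least significant digit first
            remaining /= 10;
        }
        raw = (ushort)result;
        return true;
    }
}
```

Both directions map one `ushort` to one `ushort`, which is why a rewritten slot never changes the PDU size or the MBAP length field.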
## Outputs
```
src/Mbproxy/Proxy/BcdPduPipeline.cs # replaces NoopPduPipeline
src/Mbproxy/Proxy/PerPlcContext.cs # the per-PLC context (BcdTagMap + counters + logger)
src/Mbproxy/Proxy/ProxyCounters.cs # System.Threading.Interlocked counters
src/Mbproxy/Proxy/RewriterLogEvents.cs # [LoggerMessage] static partial methods
tests/Mbproxy.Tests/Proxy/BcdPduPipelineTests.cs # unit tests against synthetic PDU bytes
tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs # e2e against the simulator
```
Modifications:
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — replace `PduContext` (placeholder from phase 03) with `PerPlcContext`. Counters increment inline. The pipeline call site is unchanged in shape; only the context type and pipeline registration differ.
- `src/Mbproxy/Proxy/ProxyWorker.cs` — build one `PerPlcContext` per configured PLC at startup (calls `BcdTagMapBuilder.Build` and wraps the resulting map + a fresh `ProxyCounters` + a per-PLC logger). Stash the contexts in a `Dictionary<string, PerPlcContext>` keyed by PLC name.
- `src/Mbproxy/Program.cs` — register `BcdPduPipeline` as the `IPduPipeline` singleton; remove the `NoopPduPipeline` registration. The phase 03 `NoopPduPipeline.cs` file stays (it's useful in tests as a baseline) but is no longer wired in production.
- `tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs` — update the test `Forward_FC03_HR1072_Returns_RawBCD_0x1234` (which was a phase-03 baseline) to a new test `Forward_FC03_HR1072_Returns_Decoded_1234` that asserts `1234`. The original raw-passthrough behaviour is preserved by configuring a PLC with NO BCD tags.
## Tasks
1. **`ProxyCounters.cs`** — `internal sealed class` holding `long` fields accessed via `Interlocked.Increment` / `Interlocked.Read`. Fields cover the per-PLC counter list from [`../design.md`](../design.md) → Status page → Per-PLC fields. Methods:
- `void IncrementPdusForwarded()`, `void IncrementFcCount(byte fc)`, `void AddRewrittenSlots(int n)`, `void IncrementPartialBcd()`, `void IncrementInvalidBcd()`, `void IncrementBackendException(byte code)`, `void AddBytes(long up, long down)`.
- `CounterSnapshot Snapshot()` — returns an immutable record with all the values; consumed by phase 07's status page.
2. **`PerPlcContext.cs`** — `internal sealed class` holding `string PlcName`, `BcdTagMap TagMap`, `ProxyCounters Counters`, `ILogger Logger`. Constructed once per PLC at startup; lifetime = lifetime of the listener.
3. **`BcdPduPipeline.cs`** — implements `IPduPipeline`. Behaviour per direction:
- **`RequestToBackend`**: inspect the PDU's function code byte (`pdu[0]`):
- FC06: read `(address, value)` from `pdu[1..]`. If `TagMap.TryGet(address)` and Width=16, replace value bytes with `BcdCodec.Encode16(value)`. If Width=32 and this is the LOW address, it's a single-register write to half a 32-bit tag — pass through raw + warn (the design's partial-overlap policy). If `address` is the HIGH register of a 32-bit pair, same partial-pass-through + warn. The PDU length is unchanged.
- FC16: `TryGetForRange(start, qty)`; for each hit, re-encode the relevant register-pair-or-singleton. Partial-overlap warnings emitted per offending slot.
- All other FCs: no-op.
- **`ResponseToClient`**: inspect `pdu[0]`:
- FC03 / FC04: `TryGetForRange(echoedStart, byteCount/2)`. The start address isn't in the response (Modbus FC03 response = `[fc, byteCount, ...data]`), so the rewriter needs the matching request — see Task 4.
- All other FCs: no-op.
- Exceptions from `BcdCodec.Decode*` are caught and turned into `mbproxy.rewrite.invalid_bcd` warnings; the affected slot's bytes are passed through unchanged.
4. **Request → response correlation.** The rewriter on a response needs the original request's start-address and quantity. Since the proxy is 1:1 per-client (no multiplexing), `PlcConnectionPair` keeps the last-issued request's `(fc, address, quantity)` in a per-pair slot. When the response arrives, the rewriter is invoked with that slot's contents as part of `PerPlcContext`. (We do NOT support pipelined multi-PDU requests on one socket in this phase; if a client tries, the slot is overwritten and the second response could mis-decode. Document the limitation; phase 08 may revisit if real clients pipeline.)
5. **`RewriterLogEvents.cs`** — `[LoggerMessage]` source-generated definitions:
- `mbproxy.rewrite.partial_bcd` — Warning, params: PlcName, Address, ClientStart, ClientQty.
- `mbproxy.rewrite.invalid_bcd` — Warning, params: PlcName, Address, RawValue, Direction.
- `mbproxy.exception.passthrough` — Information, params: PlcName, Fc, ExceptionCode. (Moved here from a phase-03 TODO.)
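The request-to-response correlation in task 4 can be illustrated with a deliberately reduced sketch: 16-bit slots only, no 32-bit pairs, no logging or counters. Everything named here (`Fc03RewriteSketch`, the `ISet<ushort>` standing in for `BcdTagMap`, the local decode helper) is a hypothetical assumption, not the real pipeline.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical sketch: rewrite 16-bit BCD slots in an FC03/FC04 response using
// the start address and quantity remembered from the matching request.
internal static class Fc03RewriteSketch
{
    // pdu is [fc, byteCount, data...]; returns the number of slots rewritten.
    public static int RewriteResponse(
        Span<byte> pdu, ushort requestStart, ushort requestQty, ISet<ushort> bcd16Addresses)
    {
        // Exception responses carry fc | 0x80 and fall through untouched here.
        if (pdu.Length < 2 || (pdu[0] != 0x03 && pdu[0] != 0x04)) return 0;

        int byteCount = pdu[1];
        if (byteCount != requestQty * 2 || pdu.Length < 2 + byteCount) return 0;

        int rewritten = 0;
        for (int i = 0; i < requestQty; i++)
        {
            ushort address = (ushort)(requestStart + i);
            if (!bcd16Addresses.Contains(address)) continue;

            Span<byte> slot = pdu.Slice(2 + i * 2, 2);
            ushort raw = BinaryPrimitives.ReadUInt16BigEndian(slot);
            if (!TryDecodeBcd16(raw, out ushort value)) continue; // invalid BCD: leave raw

            BinaryPrimitives.WriteUInt16BigEndian(slot, value);
            rewritten++;
        }
        return rewritten;
    }

    private static bool TryDecodeBcd16(ushort raw, out ushort value)
    {
        int result = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) { value = 0; return false; }
            result = result * 10 + nibble;
        }
        value = (ushort)result;
        return true;
    }
}
```

The rewrite happens in place on the `Span<byte>`, so the PDU length is untouched; a real implementation would also emit the warnings and counter increments this phase specifies.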
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy;
internal sealed class BcdPduPipeline : IPduPipeline { /* full impl */ }
internal sealed class PerPlcContext { public string PlcName; public BcdTagMap TagMap; public ProxyCounters Counters; public ILogger Logger; }
internal sealed class ProxyCounters {
public void IncrementPdusForwarded();
public void IncrementFcCount(byte fc);
public void AddRewrittenSlots(int n);
public void IncrementPartialBcd();
public void IncrementInvalidBcd();
public void IncrementBackendException(byte code);
public void AddBytes(long up, long down);
public CounterSnapshot Snapshot();
}
public sealed record CounterSnapshot(/* mirrors design.md per-PLC status fields */);
```
Nothing else becomes public.
## Tests required
### Unit (`Category = Unit`)
`BcdPduPipelineTests` (≥ 20 tests). Each test builds a synthetic PDU byte array + a `PerPlcContext` with a hand-rolled `BcdTagMap`, calls `pipeline.Process`, and asserts the resulting bytes.
Coverage matrix:
| FC | Tag scenario | Expected | Counter delta |
|----|--------------|----------|---------------|
| 03 response | single 16-bit BCD at the read address | bytes replaced with binary-encoded value | `RewrittenSlots += 1` |
| 03 response | full 32-bit BCD pair within read range | both register-bytes replaced with binary-encoded 32-bit value | `RewrittenSlots += 2` |
| 03 response | partial 32-bit (low only, qty=1 at low addr) | bytes unchanged | `PartialBcd += 1` |
| 03 response | partial 32-bit (high only, qty=1 at high addr) | bytes unchanged | `PartialBcd += 1` |
| 03 response | mixed: 16-bit + non-BCD in same read | only the 16-bit slot rewritten | `RewrittenSlots += 1` |
| 03 response | bad nibble (0x12A4) at a 16-bit BCD slot | bytes unchanged | `InvalidBcd += 1` |
| 04 response | 16-bit BCD at the read address | same as FC03 | `RewrittenSlots += 1` |
| 06 request | write to 16-bit BCD address | binary integer in payload → BCD nibbles | `RewrittenSlots += 1` |
| 06 request | write to the LOW addr of a 32-bit pair (qty=1) | bytes unchanged (partial) | `PartialBcd += 1` |
| 06 request | write to the HIGH addr of a 32-bit pair | bytes unchanged (partial) | `PartialBcd += 1` |
| 06 request | write value outside `[0,9999]` for 16-bit | `mbproxy.rewrite.invalid_bcd` Warning; bytes unchanged | `InvalidBcd += 1` |
| 16 request | write multi covering one 16-bit BCD + 3 non-BCD | only the 16-bit slot re-encoded | `RewrittenSlots += 1` |
| 16 request | write multi covering one full 32-bit pair | both registers re-encoded as the CDAB pair | `RewrittenSlots += 2` |
| 16 request | write multi crossing into one half of a 32-bit pair | partial slot passed through; warn | `PartialBcd += 1` |
| 01 / 02 / 05 / 15 | any | no-op | none |
| 03 exception response | exception 02 returned by PLC | bytes unchanged, no rewriting attempted | `BackendExceptions[2] += 1`, `mbproxy.exception.passthrough` logged |
Additional:
- Counter snapshot reflects increments exactly (no off-by-one).
- Empty `BcdTagMap` produces zero rewrites for any FC.
### E2E (`Category = E2E`, `[Collection(nameof(DL205SimulatorCollection))]`)
`RewriterE2ETests` (≥ 6 tests, all against the dl205.json simulator profile):
1. `Read_HR1072_AsBcd_ReturnsDecoded_1234` — configure the BCD tag at addr 1072 width 16; assert `1234`.
2. `Read_HR1072_AsRaw_WhenNotConfigured_Returns_0x1234` — no BCD tags configured; assert raw `4660`. (Verifies the pipeline is opt-in per tag.)
3. `Write_HR200_AsBcd_StoresEncoded_0x9876` — configure addr 200 width 16. Write decimal 9876 through proxy; read raw from sim, expect `0x9876` (39030).
4. `Read_HR1056_HR1057_AsBcd32_ReturnsDecoded_From_CDAB` — seed an alternate profile (or write via proxy first if the default profile's float32 markers aren't suitable BCD32 fixtures). Verify the CDAB layout end-to-end.
5. `Partial_FC03_OnHighRegisterOf_32BitPair_PassesThroughRaw_AndLogsWarning` — use the in-memory Serilog sink to verify `mbproxy.rewrite.partial_bcd` was logged.
6. `MbapTxId_StillPreserved_AfterRewriting_20Consecutive` — same as phase 03's test 5, but with BCD rewrite in the path. Proves rewriting doesn't tamper with the MBAP header.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00–03 tests still green (with the phase-03 placeholder test renamed/repurposed as described).
- [ ] All new unit tests green (≥ 16 in BcdPduPipelineTests + counter snapshot tests).
- [ ] All new e2e tests green when simulator is available.
- [ ] PDU rewriting NEVER changes the MBAP `length` field; verify in a unit test that re-encoded PDUs are exactly the same byte length as the originals.
- [ ] `ProxyCounters` is allocation-free per increment on the hot path. The `Snapshot()` call may allocate (it's used only by the status page, off the hot path).
- [ ] Log event names match [`../design.md`](../design.md) → Logging table exactly (including the new `mbproxy.rewrite.invalid_bcd` event added here — update design.md in this PR to add the row).
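The allocation-free counter requirement in the phase gate can be met with plain `long` fields and `Interlocked`, roughly as sketched here. Only a subset of the counters is shown; `CountersSketch` and `SnapshotSketch` are hypothetical names, and the real `CounterSnapshot` mirrors the design.md status fields rather than these three values.

```csharp
using System.Threading;

// Hypothetical reduced version of ProxyCounters: no allocation on increment.
internal sealed class CountersSketch
{
    private long _pdusForwarded;
    private long _rewrittenSlots;
    private readonly long[] _byFc = new long[256]; // allocated once, indexed by function code

    public void IncrementPdusForwarded() => Interlocked.Increment(ref _pdusForwarded);
    public void AddRewrittenSlots(int n) => Interlocked.Add(ref _rewrittenSlots, n);
    public void IncrementFcCount(byte fc) => Interlocked.Increment(ref _byFc[fc]);

    // Allocates one record; acceptable because it runs off the hot path.
    public SnapshotSketch Snapshot() => new SnapshotSketch(
        Interlocked.Read(ref _pdusForwarded),
        Interlocked.Read(ref _rewrittenSlots),
        Interlocked.Read(ref _byFc[0x03]));
}

internal sealed record SnapshotSketch(long PdusForwarded, long RewrittenSlots, long Fc03Count);
```

Each field is read atomically, but the snapshot as a whole is not a single atomic cut across counters; whether that matters for the status page is a design decision this sketch does not settle.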
## Out of scope
- Auto-recovery of failed listener binds (phase 05).
- Backend-connect retry pipeline (phase 05).
- Counter exposure via HTTP (phase 07).
- Hot-reload of the per-PLC `BcdTagMap` (phase 06).
- Pipelined / multi-PDU-in-flight on a single client socket. The proxy serialises by the design's 1:1 model; if a real client pipelines, document as a known limitation.
## Notes for the subagent
- The Modbus FC03/04 response does NOT carry the start address — only the byte count and the register data. You must remember the last request's `(startAddress, quantity)` per `PlcConnectionPair`. This is fine because the proxy is 1:1 and one client = one in-flight request at a time.
- For FC16 requests, the wire format is `[fc, startHi, startLo, qtyHi, qtyLo, byteCount, ...data]`. The PDU passed to the pipeline starts at `fc`. Compute each slot's register address as `startAddress + (offsetInData / 2)`, where `offsetInData` is the byte offset within `...data`.
- Update [`../design.md`](../design.md) → Logging events table to add the new `mbproxy.rewrite.invalid_bcd` event. Do this in the same PR; the doc and the code stay in sync.
- The `mbproxy.exception.passthrough` event was specified in design.md but not wired in phase 03. This phase wires it. If during phase 03 it was already wired by mistake, leave it and remove the TODO comment.
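The FC16 layout arithmetic from the notes above can be sketched as a few small helpers. This is an illustration under the stated wire format only; `Fc16LayoutSketch` and its method names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helpers for locating register slots inside an FC16 request PDU.
internal static class Fc16LayoutSketch
{
    private const int DataOffset = 6; // fc(1) + start(2) + qty(2) + byteCount(1)

    // Reads start address and quantity; false if the PDU is not a well-formed FC16 request.
    public static bool TryReadHeader(ReadOnlySpan<byte> pdu, out ushort start, out ushort qty)
    {
        start = 0; qty = 0;
        if (pdu.Length < DataOffset || pdu[0] != 0x10) return false;

        start = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(1));
        qty = BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(3));
        int byteCount = pdu[5];
        return byteCount == qty * 2 && pdu.Length >= DataOffset + byteCount;
    }

    // Register address of the slot that begins at a byte offset within the data area.
    public static ushort AddressAt(ushort start, int offsetInData) =>
        (ushort)(start + offsetInData / 2);

    // Inverse mapping: index within the PDU of the first byte of a register's slot.
    public static int PduIndexOf(ushort start, ushort address) =>
        DataOffset + (address - start) * 2;
}
```

For a write starting at register 201 with three registers, byte offset 4 in the data area maps to register 203, whose slot begins at PDU index 10.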
@@ -0,0 +1,125 @@
# Phase 05 — Listener supervisor + auto-recovery
Wrap each `PlcListener` in a Polly-backed supervisor task. Failed binds (at startup or runtime) are retried per the design's recovery profile. Backend-connect Polly retries that were deferred from phase 03 land here too.
**Depends on:** Phase 03 (PlcListener, PlcConnectionPair).
**Parallel-safe with:** nothing (changes ProxyWorker, listener lifecycle, and connection-pair connect path simultaneously).
## Goal
Eliminate "startup race lost a port, service degraded for hours" as a real failure mode. After this phase, a port temporarily in use at boot will bind once it frees; a backend connect transient failure retries within a tight budget instead of immediately dropping the upstream client.
State per listener: `bound` / `recovering` / `stopped`. Reported on the status page (phase 07) via counters and a state field.
## Outputs
```
src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs # owns one PlcListener; retry pipeline
src/Mbproxy/Proxy/Supervision/SupervisorState.cs # enum + state-snapshot record
src/Mbproxy/Proxy/Supervision/PolicyFactory.cs # builds Polly ResiliencePipelines from ResilienceOptions
tests/Mbproxy.Tests/Proxy/Supervision/SupervisorTests.cs # port-conflict recovery, runtime-fault recovery
tests/Mbproxy.Tests/Proxy/Supervision/BackendConnectRetryTests.cs # Polly retry on backend connect
tests/Mbproxy.Tests/Proxy/Supervision/PolicyFactoryTests.cs # unit
```
Modifications:
- `src/Mbproxy/Proxy/ProxyWorker.cs` — owns a `Dictionary<string, PlcListenerSupervisor>` instead of raw `PlcListener` instances. Stop/start of an individual listener now flows through the supervisor.
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — backend connect now goes through a Polly pipeline built from `ResilienceOptions.BackendConnect`. Remove the `// Phase 05: wrap in Polly` TODO from phase 03.
- `src/Mbproxy/Proxy/ProxyCounters.cs` — add `RecoveryAttempts` counter and `LastBindError` (last failure message, up to 256 chars). Update `CounterSnapshot` to include them.
- `src/Mbproxy/Proxy/RewriterLogEvents.cs` (or a sibling `SupervisorLogEvents.cs`) — add `[LoggerMessage]` definitions for `mbproxy.listener.recovered` (Info, `Plc`, `Port`, `AttemptCount`) and `mbproxy.backend.failed` (Warning, `Plc`, `Reason`). The latter event name already exists in design.md.
## Tasks
1. **`PolicyFactory.cs`** — converts `ResilienceOptions.BackendConnect` and `ResilienceOptions.ListenerRecovery` into `Polly.ResiliencePipeline` instances. Pipelines use `RetryStrategyOptions<T>` with `DelayGenerator` reading from the configured `BackoffMs` arrays. Listener recovery uses a 5-step initial backoff then steady-state at `SteadyStateMs` indefinitely (model as a custom delay generator that returns the steady-state value once the attempt index exceeds the initial array length).
2. **`SupervisorState.cs`** — `enum SupervisorState { Bound, Recovering, Stopped }` and a `record SupervisorSnapshot(SupervisorState State, string? LastBindError, int RecoveryAttempts)`.
3. **`PlcListenerSupervisor.cs`** —
- Constructor: takes a `PlcOptions`, a `PerPlcContext`, the recovery `ResiliencePipeline`, and an `IPduPipeline`. Internally instantiates `PlcListener` lazily inside the retry loop.
- `StartAsync(CancellationToken)`: launches a supervisor task. Inside the task: call `_listener.StartAsync()`. On success, transition to `Bound`, log `mbproxy.startup.bind` (first attempt) or `mbproxy.listener.recovered` (subsequent), and `await _listener.RunAsync(ct)`, which returns when the listener's accept loop ends.
- On exception or normal-but-faulted return from the listener: transition to `Recovering`, log `mbproxy.startup.bind.failed`, increment `RecoveryAttempts`, dispose the failed listener, await Polly's next delay, retry.
- `StopAsync`: transition to `Stopped`, cancel the supervisor token, await the supervisor task.
- `Snapshot()`: returns `SupervisorSnapshot` for the status page.
4. **`PlcConnectionPair.cs` backend-connect retry** — wrap `Socket.ConnectAsync(host, port, ct)` in a `ResiliencePipeline.ExecuteAsync` built from `ResilienceOptions.BackendConnect`. After all attempts exhausted, close the upstream socket (as before) and log `mbproxy.backend.failed`. Crucial: backend-connect retries happen ONCE per upstream client connection (not per request); a connect failure terminates the pair.
5. **`ProxyWorker.cs`** — change to owning supervisors instead of raw listeners. Startup creates one supervisor per `PlcOptions`, starts them all in parallel (`await Task.WhenAll(...)` of their start tasks). The "ready" log event now fires after every supervisor has either reached `Bound` or entered `Recovering`. Shutdown stops all supervisors in parallel; clamp the total shutdown time at 5 s.
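The listener-recovery schedule from task 1 (initial backoff array, then a steady-state delay indefinitely) reduces to a small pure function. The sketch below is independent of Polly so it can be tested in isolation; `RecoveryDelaySketch.DelayFor` is a hypothetical name, and wiring it into a Polly v8 `DelayGenerator` is left as an assumption about how `PolicyFactory` would use it.

```csharp
using System;

// Hypothetical delay schedule: configured initial steps, then steady state forever.
internal static class RecoveryDelaySketch
{
    // attemptIndex is zero-based: 0 selects the delay before the first retry.
    public static TimeSpan DelayFor(int attemptIndex, int[] initialBackoffMs, int steadyStateMs)
    {
        if (attemptIndex < 0) throw new ArgumentOutOfRangeException(nameof(attemptIndex));

        int ms = attemptIndex < initialBackoffMs.Length
            ? initialBackoffMs[attemptIndex]
            : steadyStateMs; // past the initial array: steady state
        return TimeSpan.FromMilliseconds(ms);
    }
}
```

Keeping the schedule as a pure function makes the `BuildListenerRecovery_InitialBackoffFollowedBySteadyState` style of test straightforward: drive ten attempt indices and compare the returned delays against the configured values.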
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Supervision;
internal sealed class PlcListenerSupervisor : IAsyncDisposable {
public string PlcName { get; }
public Task StartAsync(CancellationToken ct);
public Task StopAsync(CancellationToken ct);
public SupervisorSnapshot Snapshot();
}
public sealed record SupervisorSnapshot(SupervisorState State, string? LastBindError, int RecoveryAttempts);
public enum SupervisorState { Bound, Recovering, Stopped }
internal static class PolicyFactory {
public static ResiliencePipeline BuildBackendConnect(RetryProfile profile, ILogger logger);
public static ResiliencePipeline BuildListenerRecovery(RecoveryProfile profile, ILogger logger);
}
```
`SupervisorSnapshot` is `public` because phase 07 (status page) consumes it. Everything else stays `internal`.
## Tests required
### Unit (`Category = Unit`)
`PolicyFactoryTests` (≥ 4 tests):
1. `BuildBackendConnect_ProducesPipeline_With3Attempts_Default`
2. `BuildBackendConnect_Backoff_MatchesConfig` — fake `TimeProvider`, assert delay sequence.
3. `BuildListenerRecovery_InitialBackoffFollowedBySteadyState` — drive 10 attempts, assert delays match.
4. `BuildBackendConnect_NoRetry_OnNonTransientException`: `SocketException` with WSAECONNREFUSED is retried; `ArgumentException` is not.
### Integration (`Category = Unit`; uses real sockets but no simulator)
`SupervisorTests` (≥ 5 tests):
1. `Supervisor_StartsListener_AndTransitionsToBound`
2. `Supervisor_StartFails_WhenPortInUse_TransitionsToRecovering` — bind a `TcpListener` on a free port first, then start the supervisor on the same port; assert `State == Recovering` and `LastBindError` is populated within 100 ms.
3. `Supervisor_Recovers_WhenPortFrees` — same setup as test 2, then dispose the blocking listener; assert the supervisor transitions to `Bound` and emits `mbproxy.listener.recovered` within `InitialBackoffMs[0] + 500ms`. Use an in-memory Serilog sink to verify the log event.
4. `Supervisor_RuntimeFault_TriggersRecovery` — replace the listener implementation with a faulting fake (or use reflection to force `_listener` to be one) and assert recovery kicks in.
5. `Supervisor_Stop_CleanlyTransitionsTo_Stopped_AndCancelsRetry` — supervisor in `Recovering` state, call `StopAsync`, assert it returns within 1 s without waiting out the next backoff window.
`BackendConnectRetryTests` (≥ 3 tests):
1. `BackendConnect_RetriesPerPipeline_OnConnectionRefused` — point a `PlcConnectionPair` at `127.0.0.1:1`, assert it sees exactly 3 connect attempts with the configured delays.
2. `BackendConnect_Succeeds_OnSecondAttempt_WhenBackendBecomesReachable` — start the pair against a closed port, open a listener on that port mid-backoff, assert connect succeeds and the pair runs.
3. `BackendConnect_AllAttemptsFail_ClosesUpstream` — pair gets a fresh upstream socket, never reaches a backend, the upstream socket is closed within `BackoffMs.Sum() + tolerance`.
### E2E (`Category = E2E`)
`SupervisorE2ETests` (≥ 2 tests, against the simulator):
1. `E2E_Recovery_When_BlockingListenerReleasesPort` — same shape as the unit recovery test, but with the simulator on the backend; confirms the supervisor doesn't disrupt the simulator-facing path during recovery.
2. `E2E_RecoveryAttempts_CounterIncrements_Visible_OnSnapshot` — drives the supervisor into recovery and back, then asserts `counters.RecoveryAttempts > 0`. Phase 07 will surface this on the HTTP endpoint; here we just verify the counter snapshot.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00–04 tests still green.
- [ ] All new unit + integration tests green.
- [ ] E2E recovery test green when simulator is available.
- [ ] `mbproxy.listener.recovered` event log includes `AttemptCount` field.
- [ ] No deadlocks under StopAsync while supervisor is mid-backoff (verified by `Supervisor_Stop_CleanlyTransitionsTo_Stopped_AndCancelsRetry`).
- [ ] Backend-connect failures from phase 03 are now wrapped in Polly; the TODO comment from phase 03 is gone.
- [ ] [`../design.md`](../design.md) → "Listener auto-recovery" matches implementation. If during implementation the backoff arrays needed tweaking, update design.md in this PR.
## Out of scope
- Hot-reload-driven add/remove of supervisors (phase 06 owns reconcile).
- HTTP exposure of supervisor state (phase 07).
- Restart-from-crash diagnostics, Windows EventLog integration (phase 08).
- Adaptive backoff (e.g., jitter, exponential beyond the configured array). Stick to the configured schedule.
## Notes for the subagent
- Polly v8 (`Polly.Core`) is the target — `ResiliencePipeline` and `RetryStrategyOptions<T>`, not the v7 `Policy.Handle<>()` fluent API. If the package version pinned in phase 00 turns out to be v7, bump it in this phase and note the bump in the csproj comment.
- The supervisor task uses one `CancellationTokenSource` per supervisor instance. Cancelling it must cancel both the Polly delay AND the inner `_listener.RunAsync` cleanly. Polly's `ResiliencePipeline.ExecuteAsync(ct)` honours the token; double-check the listener does too.
- Do not introduce a generic "task supervisor" abstraction. `PlcListenerSupervisor` is the only thing supervising in this codebase; YAGNI on the framework.
- The supervisor must NOT swallow exceptions from `_listener.RunAsync` other than `OperationCanceledException`. Log them at Warning with the exception, then enter the recovery loop. Operators reading logs need to see WHY a listener died, not just that it was restarted.
@@ -0,0 +1,158 @@
# Phase 06 — Configuration hot-reload
Subscribe to `IOptionsMonitor<MbproxyOptions>.OnChange` and reconcile the running supervisors + per-PLC tag maps + connection settings against the new config — without restarting the host.
**Depends on:** Phase 05 (supervisor lifecycle).
**Parallel-safe with:** nothing (touches the widest cross-cut: supervisors + tag maps + counters + DI options).
## Goal
An `appsettings.json` save propagates per the design's reconcile table:
| Change | Action |
|--------|--------|
| `BcdTags.Global` add/remove/width | Rebuild every PLC's `BcdTagMap`, swap atomically. Next PDU sees it. |
| `Plcs[i].BcdTags.{Add,Remove}` | Rebuild that PLC's `BcdTagMap` only. |
| New `Plcs[i]` | Create supervisor + context, start it. |
| Removed `Plcs[i]` | Stop supervisor, close all client connections to it. |
| Changed `ListenPort` / `Host` | Stop + start the supervisor (remove + add semantics). |
| `Connection.Backend*TimeoutMs` | Take effect on the next backend connect / request. |
| Invalid reload | Reject as a whole; keep current state; log `mbproxy.config.reload.rejected`. |
Validation runs FIRST. A reload that would produce duplicate `ListenPort` values, or a `BcdTagMapBuilder.Build` error for any PLC, is rejected atomically before any state mutates.
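A validate-everything-before-mutating check of this kind might look like the sketch below. It uses a hypothetical `PlcSketch` record in place of `PlcOptions`, passes the admin port as a plain `int`, and omits the `BcdTagMapBuilder.Build` check because that type is not shown in this section; the error message texts are invented for illustration.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical reduced PLC entry; the real PlcOptions carries more fields.
internal sealed record PlcSketch(string Name, int ListenPort);

internal static class ReloadValidatorSketch
{
    // Collects every problem instead of stopping at the first, so a rejected
    // reload can be logged with the full error list.
    public static bool Validate(
        IReadOnlyList<PlcSketch> plcs, int adminPort, out IReadOnlyList<string> errors)
    {
        var found = new List<string>();

        foreach (var plc in plcs)
        {
            if (string.IsNullOrWhiteSpace(plc.Name))
                found.Add("PLC name is empty.");
            if (plc.ListenPort < 1 || plc.ListenPort > 65535)
                found.Add($"PLC '{plc.Name}': ListenPort {plc.ListenPort} is out of range.");
            if (plc.ListenPort == adminPort)
                found.Add($"PLC '{plc.Name}': ListenPort collides with AdminPort {adminPort}.");
        }

        foreach (var g in plcs.GroupBy(p => p.Name).Where(g => g.Count() > 1))
            found.Add($"Duplicate PLC name '{g.Key}'.");
        foreach (var g in plcs.GroupBy(p => p.ListenPort).Where(g => g.Count() > 1))
            found.Add($"Duplicate ListenPort {g.Key}.");

        errors = found;
        return found.Count == 0;
    }
}
```

Because the function only reads its inputs, a failed validation leaves the running supervisors and tag maps exactly as they were.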
## Outputs
```
src/Mbproxy/Configuration/ConfigReconciler.cs # OnChange handler; orchestrates the apply
src/Mbproxy/Configuration/ReloadValidator.cs # cross-PLC validation (duplicate ports, etc.)
src/Mbproxy/Configuration/ReloadPlan.cs # immutable diff record between current and new
tests/Mbproxy.Tests/Configuration/ReloadValidatorTests.cs
tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs
tests/Mbproxy.Tests/Configuration/HotReloadE2ETests.cs # real appsettings.json mutation, real host
```
Modifications:
- `src/Mbproxy/Proxy/ProxyWorker.cs` — accept a `ConfigReconciler` and forward `IOptionsMonitor.OnChange` to it; on startup, also seed the reconciler with the initial snapshot.
- `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs` — expose a `Task ReplaceContextAsync(PerPlcContext newCtx, CancellationToken ct)` that atomically swaps the BCD tag map and counters without restarting the listener. Old in-flight connections finish on the old map; new connections use the new map. (Document the brief transition window in comments.)
- Add `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` `[LoggerMessage]` events.
- `src/Mbproxy/Options/MbproxyOptions.cs` — wire `IValidateOptions<MbproxyOptions>` to call the schema-level validator only. Cross-PLC validation (duplicate ports, etc.) is handled by `ReloadValidator` because it requires inspecting multiple `Plcs[i]` together, which `IValidateOptions` doesn't naturally express.
## Tasks
1. **`ReloadPlan.cs`** — immutable record describing the diff:
```csharp
public sealed record ReloadPlan(
IReadOnlyList<PlcOptions> ToAdd,
IReadOnlyList<string> ToRemove, // PLC names
IReadOnlyList<(string Name, PlcOptions New)> ToRestart, // port or host changed
IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat, // tag map changed
ConnectionOptions Connection);
```
Computed by a pure function `ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next)`; PLC identity is keyed on `Name` (NOT on `ListenPort`, which is mutable).
2. **`ReloadValidator.cs`** — single static method `Validate(MbproxyOptions next, out IReadOnlyList<string> errors)`:
- PLC names are unique and non-empty.
- `ListenPort` values are unique.
- For each PLC, `BcdTagMapBuilder.Build(global, perPlc).Errors` is empty.
- `AdminPort` doesn't collide with any `Plcs[i].ListenPort`.
- All ports are in `[1, 65535]`.
3. **`ConfigReconciler.cs`** — subscribes via constructor-injected `IOptionsMonitor<MbproxyOptions>.OnChange`. On change:
- Snapshot the new options.
- Run `ReloadValidator.Validate`. On failure: log `mbproxy.config.reload.rejected` with the error list; do nothing else.
- Compute `ReloadPlan` against the current snapshot.
- Apply the plan in order:
1. Stop supervisors in `ToRemove` (concurrently).
2. Stop+restart supervisors in `ToRestart` (concurrently).
3. Build new `PerPlcContext` for each `ToReseat` entry and call `supervisor.ReplaceContextAsync(newCtx)`.
4. Build supervisors for `ToAdd`, start them.
- On success: log `mbproxy.config.reload.applied` with summary (`PlcsAdded`, `PlcsRemoved`, `PlcsReseated`, `TagListDelta`). Record `lastReloadUtc` and bump `reloadCount` on a service-wide counter (consumed by phase 07).
- On any step throwing: best-effort log the partial-apply state at Error, then continue. The host stays up. (The validator should have caught most failure modes; a runtime failure here is a true bug.)
4. **`ProxyWorker.cs`** updates — register the reconciler with the host and wire startup to use it for the initial snapshot.
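
The diff described for `ReloadPlan.Compute` can be sketched as follows. This is a minimal illustration, not the shipped implementation: `SketchPlc`, `SketchPlan`, and `SketchReloadPlan` are hypothetical stand-ins, and `TagSignature` is an assumed placeholder for whatever makes two BCD tag maps compare equal. It also assumes that a restart supersedes a reseat, since a restarted supervisor picks up the new map anyway.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down stand-in for PlcOptions.
public sealed record SketchPlc(string Name, string Host, int ListenPort, string TagSignature);

// Hypothetical stand-in for ReloadPlan (ConnectionOptions omitted).
public sealed record SketchPlan(
    IReadOnlyList<SketchPlc> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<SketchPlc> ToRestart,
    IReadOnlyList<SketchPlc> ToReseat);

public static class SketchReloadPlan
{
    public static SketchPlan Compute(IEnumerable<SketchPlc> current, IEnumerable<SketchPlc> next)
    {
        // Identity is the PLC name, never the (mutable) listen port.
        var old = current.ToDictionary(p => p.Name, StringComparer.Ordinal);
        var seen = new HashSet<string>(StringComparer.Ordinal);
        var toAdd = new List<SketchPlc>();
        var toRestart = new List<SketchPlc>();
        var toReseat = new List<SketchPlc>();

        foreach (var plc in next)
        {
            seen.Add(plc.Name);
            if (!old.TryGetValue(plc.Name, out var prev))
                toAdd.Add(plc);                      // new name: start a supervisor
            else if (prev.Host != plc.Host || prev.ListenPort != plc.ListenPort)
                toRestart.Add(plc);                  // endpoint changed: restart wins over reseat
            else if (prev.TagSignature != plc.TagSignature)
                toReseat.Add(plc);                   // only the tag map changed: swap in place
        }

        // Names present before but absent now are removed.
        var toRemove = old.Keys.Where(n => !seen.Contains(n))
                               .OrderBy(n => n, StringComparer.Ordinal)
                               .ToList();
        return new SketchPlan(toAdd, toRemove, toRestart, toReseat);
    }
}
```

Because the function is pure, every `ReloadPlanTests` case reduces to building two option snapshots and asserting on the four lists.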
## Public surface declared in this phase
```csharp
namespace Mbproxy.Configuration;
internal sealed class ConfigReconciler : IDisposable {
    public ConfigReconciler(IOptionsMonitor<MbproxyOptions> monitor, /* dependencies */);
    public Task ApplyAsync(MbproxyOptions next, CancellationToken ct); // exposed for tests
    public void Dispose();
}
public sealed record ReloadPlan(
    IReadOnlyList<PlcOptions> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<(string Name, PlcOptions New)> ToRestart,
    IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat,
    ConnectionOptions Connection) {
    public static ReloadPlan Compute(MbproxyOptions current, MbproxyOptions next);
}
internal static class ReloadValidator {
    public static bool Validate(MbproxyOptions next, out IReadOnlyList<string> errors);
}
```
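
The cross-PLC checks behind `ReloadValidator.Validate` might look like the sketch below. It is an illustration under assumptions: `SketchOptions` and `SketchPlcOptions` are hypothetical minimal shapes, names are compared case-sensitively, and the per-PLC `BcdTagMapBuilder.Build(...).Errors` check is left out because the builder is not reproduced here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical minimal option shapes; the real options carry more fields.
public sealed record SketchPlcOptions(string Name, int ListenPort);
public sealed record SketchOptions(int AdminPort, IReadOnlyList<SketchPlcOptions> Plcs);

public static class SketchReloadValidator
{
    public static bool Validate(SketchOptions next, out IReadOnlyList<string> errors)
    {
        var found = new List<string>();
        var names = new HashSet<string>(StringComparer.Ordinal);
        var ports = new HashSet<int>();

        if (next.AdminPort is < 1 or > 65535)
            found.Add($"AdminPort {next.AdminPort} is outside [1, 65535].");

        foreach (var plc in next.Plcs)
        {
            // Names must be non-empty and unique across the fleet.
            if (string.IsNullOrWhiteSpace(plc.Name))
                found.Add("A PLC entry has an empty name.");
            else if (!names.Add(plc.Name))
                found.Add($"Duplicate PLC name '{plc.Name}'.");

            // Ports must be in range and unique across the fleet.
            if (plc.ListenPort is < 1 or > 65535)
                found.Add($"ListenPort {plc.ListenPort} is outside [1, 65535].");
            else if (!ports.Add(plc.ListenPort))
                found.Add($"Duplicate ListenPort {plc.ListenPort}.");

            if (plc.ListenPort == next.AdminPort)
                found.Add($"ListenPort {plc.ListenPort} collides with AdminPort.");
        }

        errors = found;
        return found.Count == 0;
    }
}
```

Collecting every error before returning, rather than failing on the first, lets the `mbproxy.config.reload.rejected` event carry the complete list in one log line.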
## Tests required
### Unit (`Category = Unit`)
`ReloadValidatorTests` (≥ 6 tests):
1. `Validate_DuplicatePlcName_Fails`
2. `Validate_DuplicateListenPort_Fails`
3. `Validate_AdminPortCollidesWith_PlcListenPort_Fails`
4. `Validate_PerPlc_BcdMapBuildError_Fails`
5. `Validate_PortOutOfRange_Fails`
6. `Validate_HappyPath_Passes`
`ReloadPlanTests` (≥ 5 tests):
1. `Compute_AddOnePlc_OnlyToAddPopulated`
2. `Compute_RemoveOnePlc_OnlyToRemovePopulated`
3. `Compute_ChangePort_GoesToToRestart_NotToReseat`
4. `Compute_ChangePerPlcTagOverride_GoesToToReseat`
5. `Compute_ChangeGlobalTagList_AllPlcsReseat_NoRestart`
`ConfigReconcilerTests` (≥ 4 tests, using a fake `IOptionsMonitor` + fake supervisor factory):
1. `Apply_HappyPath_StartsAndStopsSupervisors_PerPlan`
2. `Apply_ValidationFails_NoMutationOccurs_AndLogsRejected`
3. `Apply_ReseatTagMap_DoesNotRestartSupervisor`
4. `Apply_ConcurrentReloads_Are_Serialised` — two rapid changes get processed in order, no interleaving.
### E2E (`Category = E2E`)
`HotReloadE2ETests` (≥ 4 tests, using a real `Host.CreateApplicationBuilder` + temp appsettings.json file):
1. `E2E_AddPlcAtRuntime_NewListenerBinds_AndIsReachable` — start the host with one PLC, write a new appsettings adding a second PLC pointing at the simulator on a fresh listen port, drive NModbus against the new proxy port within 2 s.
2. `E2E_RemovePlcAtRuntime_ClosesUpstreamConnections` — start with two PLCs and a connected client, write appsettings removing one; client's socket closes within 1 s.
3. `E2E_ChangeGlobalBcdTagList_RewriteReflectsImmediately` — start with addr 1072 NOT in BCD list, read raw 0x1234. Write appsettings adding it. Read again, get decoded 1234.
4. `E2E_InvalidReload_DoesNotMutateRunningState` — start happy, write a broken appsettings (duplicate ListenPort), assert the host keeps running with the OLD config and `mbproxy.config.reload.rejected` is logged.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-05 tests still green.
- [ ] All new unit tests green.
- [ ] All e2e hot-reload tests green when the simulator is available.
- [ ] `mbproxy.config.reload.applied` / `.rejected` events match the design's properties list.
- [ ] A misconfigured reload (duplicate ports) is rejected atomically — the assertion in test E2E_4 verifies no partial mutation.
- [ ] The reconciler serializes concurrent `OnChange` notifications (`SemaphoreSlim` or equivalent) so two file saves in quick succession don't race.
- [ ] Counters `service.config.reloadCount` and `service.config.reloadRejectedCount` are bumped correctly.
## Out of scope
- Watching for files OTHER than `appsettings.json` (env files, dotnet user-secrets, etc.). The default config source set established in phase 00 is the contract.
- Reloading Serilog log levels at runtime. Possible but not in this phase.
- A reload audit log file. The accept/reject events are sufficient.
- Online schema migrations (e.g., renaming a key in an older config to a new one). Reject-the-whole-thing is the simpler contract.
## Notes for the subagent
- `IOptionsMonitor.OnChange` can fire MULTIPLE times for a single file save on some platforms (text editors saving via rename-and-replace can trigger 2-3 events). Debounce inside the reconciler — a 250 ms quiescent window after the last `OnChange` before computing the plan. Document the choice in code.
- The reconciler must NOT block the `OnChange` callback thread for I/O (`StopAsync` etc.). Use `Channel<ReloadRequest>` or a `Task.Run`-style hand-off so the callback returns immediately.
- When a supervisor restart is in progress (e.g., port changed), reject further reloads briefly with a queued "retry after current applies" — OR just serialise everything via a single semaphore and accept that a backed-up reload queue gets all changes eventually. Pick the simpler option (semaphore); document it.
- `BcdTagMapBuilder.Build` is the validator for tag-list well-formedness; do not duplicate that validation in `ReloadValidator`. The validator just calls `Build` and checks the `Errors` list.
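
The debounce and hand-off notes above can be sketched with a `Channel` and a single consumer loop. This is an assumption-laden illustration: `DebouncedReloadQueue` and its members are hypothetical names, and it relies on a single consumer to keep applies from interleaving, which gives the same guarantee as the semaphore option but is a substitution made for brevity.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class DebouncedReloadQueue : IAsyncDisposable
{
    private readonly Channel<int> _signals = Channel.CreateUnbounded<int>();
    private readonly CancellationTokenSource _cts = new();
    private readonly Task _pump;

    public DebouncedReloadQueue(Func<CancellationToken, Task> apply, TimeSpan quietWindow)
    {
        _pump = Task.Run(() => PumpAsync(apply, quietWindow, _cts.Token));
    }

    // Called from the OnChange callback: returns immediately, performs no I/O.
    public void Signal() => _signals.Writer.TryWrite(0);

    private async Task PumpAsync(Func<CancellationToken, Task> apply, TimeSpan quiet, CancellationToken ct)
    {
        try
        {
            while (await _signals.Reader.WaitToReadAsync(ct))
            {
                // Repeat until one full quiet window passes with no new signal.
                do
                {
                    while (_signals.Reader.TryRead(out _)) { }
                    await Task.Delay(quiet, ct);
                } while (_signals.Reader.TryRead(out _));

                // Single consumer: a second reload cannot start until this one returns.
                await apply(ct);
            }
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown path.
        }
    }

    public async ValueTask DisposeAsync()
    {
        _cts.Cancel();
        await _pump;
        _cts.Dispose();
    }
}
```

In this sketch an exception thrown by `apply` would fault the pump; a real reconciler would catch and log inside the loop so the host stays up, as the Tasks section requires.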
# Phase 07 — Status page
Stand up the read-only Kestrel-hosted admin endpoint on `Mbproxy.AdminPort`. Two routes — `GET /` (self-contained HTML, meta-refresh 5 s) and `GET /status.json` (the same data as JSON). No admin actions, no auth.
**Depends on:** Phase 05 (supervisor snapshots), Phase 06 (config reload counters).
**Parallel-safe with:** nothing (touches DI registration + needs counters from both 05 and 06).
## Goal
A single port that an operator can open in a browser and see, at a glance:
- Service uptime, version, last-reload timestamp + counts.
- Every configured PLC's listener state (`bound` / `recovering` / `stopped`), last bind error, currently connected clients and their per-client PDU counts, PDU counts by function code, BCD slots rewritten, partial-overlap warnings, backend exception counts by code, last round-trip ms, bytes upstream/downstream.
Same data is exposed as `/status.json` for scraping (Prometheus textfile, custom Nagios check, etc.).
## Outputs
```
src/Mbproxy/Admin/AdminEndpointHost.cs # owns the Kestrel server lifecycle
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # composes per-PLC + service-wide snapshots
src/Mbproxy/Admin/StatusDto.cs # the wire DTOs for /status.json
src/Mbproxy/Admin/StatusHtmlRenderer.cs # builds the single-page HTML
src/Mbproxy/Admin/AssemblyVersionAccessor.cs # cached version string
tests/Mbproxy.Tests/Admin/StatusSnapshotBuilderTests.cs
tests/Mbproxy.Tests/Admin/AdminEndpointTests.cs # HTTP-level; live Kestrel + HttpClient
```
Modifications:
- `src/Mbproxy/Mbproxy.csproj` — add `Microsoft.AspNetCore.App` framework reference (the Worker SDK doesn't include ASP.NET Core by default).
- `src/Mbproxy/Program.cs` — register `AdminEndpointHost` as a hosted service; wire it through DI alongside the proxy worker. AdminPort comes from `IOptionsMonitor<MbproxyOptions>`.
- `src/Mbproxy/Proxy/ProxyCounters.cs` — extend with per-client counters: `IReadOnlyList<ClientCounterSnapshot> Snapshot()` includes connected clients with `Remote`, `ConnectedAtUtc`, `PdusForwarded`, `LastRoundTripMs`.
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — record connect time, expose `RemoteEndpoint`, track round-trip time per request (EWMA via `LastRoundTripMs` field).
- Service-wide counters introduced here: `ServiceCounters` with `UptimeStartedAtUtc`, `LastReloadUtc`, `ReloadCount`, `ReloadRejectedCount`. Wired into `ConfigReconciler` (bump on apply / reject) and the service start path (set started-at).
## Tasks
1. **`StatusDto.cs`** — record types matching the design's per-PLC + service-wide field tables verbatim. Use `System.Text.Json` source generation (`JsonSerializerContext`) to keep the response allocation-light:
```csharp
[JsonSerializable(typeof(StatusResponse))]
internal partial class StatusJsonContext : JsonSerializerContext;
```
2. **`StatusSnapshotBuilder.cs`** — pulls from injected `ProxyWorker` (or a slim view of it), `ConfigReconciler`, `ServiceCounters`, and each `PlcListenerSupervisor`. Builds a `StatusResponse` record. Pure logic; no I/O. The builder is a `sealed` class constructed once via DI; calling `Build()` is the only operation.
3. **`StatusHtmlRenderer.cs`** — pure function `string Render(StatusResponse status)`. Produces a single HTML document with:
- `<meta http-equiv="refresh" content="5">` for auto-refresh.
- A header line with service version + uptime + last-reload info.
- A table per PLC. Columns match the per-PLC field set; `listener.state` is colour-coded inline (CSS in a `<style>` block — no external assets).
- Total page weight under 50 KB for typical fleets; the design's 54-PLC count puts the table at ~54 rows.
4. **`AssemblyVersionAccessor.cs`** — reads `AssemblyInformationalVersionAttribute` once at startup, caches it as a string. Used for the `service.version` field.
5. **`AdminEndpointHost.cs`** — `IHostedService` that:
- On start: builds a `WebApplication` (Kestrel) configured to listen on `AdminPort`. Maps `GET /` to a handler that calls `StatusSnapshotBuilder.Build()` then `StatusHtmlRenderer.Render()`, returning `text/html`. Maps `GET /status.json` to a handler returning `JsonSerializer.Serialize(snapshot, StatusJsonContext.Default.StatusResponse)`. NO other routes.
- If `AdminPort` is in use at startup: log `mbproxy.admin.bind.failed` (new event) at Error, do not throw. The proxy listeners continue to run; only the admin endpoint is missing. Operators see this in logs.
- On hot-reload of `AdminPort`: stop and restart the Kestrel server bound to the new port.
- On stop: call `StopAsync` on the Kestrel app so it shuts down gracefully, with a 2 s deadline.
6. **`ServiceCounters.cs`** (under `src/Mbproxy/`) — a singleton DI service holding the service-wide counters. `Initialize(DateTimeOffset startedAtUtc)`; `RecordReloadApplied(DateTimeOffset)`; `RecordReloadRejected()`. Snapshot returns an immutable record.
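
A minimal sketch of the `ServiceCounters` shape described in task 6, assuming lock-free updates via `Interlocked`. `SketchServiceCounters`, `ServiceCountersSnapshot`, and the "ticks below zero means never reloaded" encoding are illustrative choices, not taken from the design.

```csharp
using System;
using System.Threading;

public sealed record ServiceCountersSnapshot(
    DateTimeOffset UptimeStartedAtUtc,
    DateTimeOffset? LastReloadUtc,
    int ReloadCount,
    int ReloadRejectedCount);

public sealed class SketchServiceCounters
{
    private long _startedTicks;          // UTC ticks of service start
    private long _lastReloadTicks = -1;  // -1 encodes "no reload applied yet"
    private int _reloadCount;
    private int _rejectedCount;

    public void Initialize(DateTimeOffset startedAtUtc) =>
        Interlocked.Exchange(ref _startedTicks, startedAtUtc.UtcTicks);

    public void RecordReloadApplied(DateTimeOffset atUtc)
    {
        Interlocked.Exchange(ref _lastReloadTicks, atUtc.UtcTicks);
        Interlocked.Increment(ref _reloadCount);
    }

    public void RecordReloadRejected() => Interlocked.Increment(ref _rejectedCount);

    // The only allocating call; intended for the cold status-page path.
    public ServiceCountersSnapshot Snapshot()
    {
        long last = Interlocked.Read(ref _lastReloadTicks);
        return new ServiceCountersSnapshot(
            new DateTimeOffset(Interlocked.Read(ref _startedTicks), TimeSpan.Zero),
            last < 0 ? (DateTimeOffset?)null : new DateTimeOffset(last, TimeSpan.Zero),
            Volatile.Read(ref _reloadCount),
            Volatile.Read(ref _rejectedCount));
    }
}
```

The snapshot is not atomic across fields: a reload landing mid-`Snapshot()` could show the new timestamp with the old count. For a page that refreshes every 5 s that is likely acceptable.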
## Public surface declared in this phase
```csharp
namespace Mbproxy.Admin;
internal sealed class AdminEndpointHost : IHostedService { /* ... */ }
public sealed record StatusResponse(
    ServiceFields Service,
    ListenersAggregate Listeners,
    IReadOnlyList<PlcStatus> Plcs);
public sealed record ServiceFields(
    long UptimeSeconds, string Version,
    DateTimeOffset? ConfigLastReloadUtc, int ConfigReloadCount, int ConfigReloadRejectedCount);
public sealed record ListenersAggregate(int Bound, int Configured);
public sealed record PlcStatus(
    string Name, string Host, int ListenPort,
    PlcListenerStatus Listener,
    PlcClientsStatus Clients,
    PlcPdusStatus Pdus,
    PlcBackendStatus Backend,
    PlcBytesStatus Bytes);
public sealed record PlcListenerStatus(string State, string? LastBindError, int RecoveryAttempts);
public sealed record PlcClientsStatus(int Connected, IReadOnlyList<ClientSnapshot> RemoteEndpoints);
public sealed record ClientSnapshot(string Remote, DateTimeOffset ConnectedAtUtc, long PdusForwarded);
public sealed record PlcPdusStatus(long Forwarded, FcCounts ByFc, long RewrittenSlots, long PartialBcdWarnings);
public sealed record FcCounts(long Fc03, long Fc04, long Fc06, long Fc16, long Other);
public sealed record PlcBackendStatus(long ConnectsSuccess, long ConnectsFailed, ExceptionCounts ExceptionsByCode, double LastRoundTripMs);
public sealed record ExceptionCounts(long Code01, long Code02, long Code03, long Code04);
public sealed record PlcBytesStatus(long UpstreamIn, long UpstreamOut);
```
## Tests required
### Unit (`Category = Unit`)
`StatusSnapshotBuilderTests` (≥ 6 tests):
1. `Build_NoPlcsConfigured_ReturnsEmptyPlcList`
2. `Build_OnePlcBound_PopulatesListenerState_Bound`
3. `Build_PlcRecovering_PopulatesLastBindError_AndAttempts`
4. `Build_AggregatesListenersBoundAndConfigured`
5. `Build_PerClientSnapshot_Includes_RemoteAndConnectedAt_AndPduCount`
6. `Build_ServiceFields_IncludeUptime_Version_AndLastReload`
`StatusHtmlRendererTests` (≥ 3 tests):
1. `Render_OnePlc_ProducesValidHtml_WithMetaRefresh`
2. `Render_RecoveringPlc_HighlightsState`
3. `Render_PageWeightUnder50KB_For54Plcs` — assert character length.
### E2E (`Category = E2E`)
`AdminEndpointTests` (≥ 5 tests, against a live in-process Kestrel + simulator):
1. `Get_StatusJson_ReturnsValidShape`
2. `Get_StatusJson_AfterReadFC03_ShowsPduCountIncreased`
3. `Get_StatusJson_AfterPartialBcdWrite_ShowsPartialBcdWarning`
4. `Get_Root_ReturnsHtml_WithMetaRefresh`
5. `AdminPort_BindFailure_ServiceStaysUp_AndLogsBindFailed` — pre-bind the AdminPort, start the service, assert proxy listeners come up and the admin endpoint logs the failure.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-06 tests still green.
- [ ] All new unit + e2e tests green.
- [ ] `/status.json` shape matches the field tables in [`../design.md`](../design.md) → "Status page" exactly (field names, casing, nesting).
- [ ] Counters on the read path (`PdusForwarded`, etc.) remain allocation-free; `Snapshot()` is the only allocating call and it's on the cold path.
- [ ] AdminPort collision is logged but does NOT take down the proxy.
- [ ] Hot-reload of `AdminPort` works (verified by adding a test in this phase or extending one of phase 06's e2e tests).
## Out of scope
- Authentication / authorisation on the admin port. Design explicitly defers to network-layer trust.
- Prometheus exposition format. The `/status.json` shape is the contract; downstream tools can transform.
- WebSocket push of counters. Meta-refresh is good enough at 54 PLCs.
- Historical counter retention (rolling windows, time series). Counters are cumulative since process start; restart resets.
- Per-tag-level telemetry (which BCD addresses got rewritten how often). The per-PLC `RewrittenSlots` total is enough; finer granularity goes in a future phase if needed.
## Notes for the subagent
- Use the minimal-API style for the two endpoints; no controllers. The whole admin endpoint is ~50 lines of map / handler code.
- `System.Text.Json` source generation needs `[JsonSerializable]` on the DTO chain. Don't use reflection-based serialization in this codebase — it adds AOT-unsafety and is slower for the simple shape.
- For the HTML page, embed CSS in a `<style>` block. Do not link external stylesheets — the admin endpoint must work over a firewalled network with no internet egress.
- Test 3 of `AdminEndpointTests` requires triggering a partial-BCD warning, which means configuring a 32-bit BCD tag and reading only one half of it through the proxy. This is the same scenario phase 04's e2e test 5 exercised; reuse the setup.
- The admin port collision test is important: an operator misconfiguration must not take down the proxy itself. Log Error, continue running.
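
A rough sketch of the self-contained page the renderer notes describe: meta-refresh, inline `<style>`, one row per PLC. `SketchPlcRow` and `SketchStatusHtml` are hypothetical, and only three columns are shown; the real renderer takes a full `StatusResponse`.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical three-column slice of the per-PLC field set.
public sealed record SketchPlcRow(string Name, string ListenerState, int ConnectedClients);

public static class SketchStatusHtml
{
    public static string Render(string version, IReadOnlyList<SketchPlcRow> plcs)
    {
        var sb = new StringBuilder();
        sb.Append("<!DOCTYPE html><html><head><meta charset=\"utf-8\">");
        sb.Append("<meta http-equiv=\"refresh\" content=\"5\">");
        // Inline CSS only: the page must render with no internet egress.
        sb.Append("<style>.bound{color:green}.recovering{color:#c60}.stopped{color:red}</style>");
        sb.Append("</head><body><h1>mbproxy ").Append(WebUtility.HtmlEncode(version)).Append("</h1>");
        sb.Append("<table><tr><th>PLC</th><th>Listener</th><th>Clients</th></tr>");

        foreach (var plc in plcs)
        {
            // Only the three known states are allowed through as a CSS class.
            string css = plc.ListenerState is "bound" or "recovering" or "stopped"
                ? plc.ListenerState
                : "";
            sb.Append("<tr><td>").Append(WebUtility.HtmlEncode(plc.Name)).Append("</td>")
              .Append("<td class=\"").Append(css).Append("\">")
              .Append(WebUtility.HtmlEncode(plc.ListenerState)).Append("</td>")
              .Append("<td>").Append(plc.ConnectedClients).Append("</td></tr>");
        }

        sb.Append("</table></body></html>");
        return sb.ToString();
    }
}
```

HTML-encoding the PLC name and the version string matters even on a trusted network, because both come from configuration that an operator can edit.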
# Phase 08 — Windows service hardening
Install / uninstall scripts, graceful shutdown, Windows Event Log integration, and the public-facing `README.md` that the root `wwtools/CLAUDE.md` index points at. This is the "ship it" phase.
**Depends on:** Phase 04 (rewriter), Phase 07 (status page).
**Parallel-safe with:** nothing.
## Goal
After this phase, an operator can:
1. `dotnet publish` the service into a self-contained folder.
2. Run `install.ps1` to register it as a Windows service.
3. See it appear in `services.msc` running as `Local System` (default — overridable to a managed service account).
4. Stop it cleanly via `sc.exe stop mbproxy`; the service finishes all in-flight PDUs and exits within 10 s.
5. Read crash reasons from the Windows Event Log alongside the Serilog rolling-file output.
6. Read [`../../mbproxy/README.md`](../../mbproxy/README.md) to figure all of this out without needing to talk to a developer.
## Outputs
```
mbproxy/README.md # tool-level human entry point (per DOCS-GUIDE Layer 2)
mbproxy/install/install.ps1 # registers the service
mbproxy/install/uninstall.ps1 # removes it
mbproxy/install/mbproxy.config.template.json # commented appsettings.json for ops
mbproxy/docs/operations.md # ops runbook (install, upgrade, troubleshooting)
src/Mbproxy/Diagnostics/ShutdownCoordinator.cs # graceful-shutdown helper
src/Mbproxy/Diagnostics/EventLogBridge.cs # logs critical events to Windows Event Log
tests/Mbproxy.Tests/Diagnostics/ShutdownCoordinatorTests.cs
```
Modifications:
- `src/Mbproxy/Program.cs` — wire `ShutdownCoordinator` into the host-stop signal. Wire `EventLogBridge` as a Serilog sub-sink for events at Error and above when running under Windows Service (`WindowsServiceHelpers.IsWindowsService()` true).
- `mbproxy/Mbproxy.csproj` — `<PublishSingleFile>true</PublishSingleFile>` and `<SelfContained>true</SelfContained>` for the publish profile.
- `../CLAUDE.md` (the root `wwtools/CLAUDE.md`) — update the `mbproxy` index row to point at the new `mbproxy/README.md` (per the maintenance note in `mbproxy/CLAUDE.md`).
- `mbproxy/CLAUDE.md` — update the "Current state" section to reflect the post-implementation state (no longer "no code yet"), and the Maintenance section to note that the README is now the canonical human entry point.
## Tasks
1. **`mbproxy/README.md`** — follows the DOCS-GUIDE Layer-2 template exactly. Required sections in order: one-sentence identification, hard constraints / prerequisites, layout, resource index, build & run, install. Cross-link to `docs/design.md`, `docs/plan/README.md`, `docs/operations.md`, `CLAUDE.md`. No deep prose tutorials; the README routes.
2. **`mbproxy/install/install.ps1`** — parameters: `-InstallPath <path>` (default `C:\Program Files\Mbproxy`), `-ServiceName <name>` (default `mbproxy`), `-DisplayName <text>`, `-Account <managed-service-account>` (default `LocalSystem`). Behaviour:
- Verifies admin rights; fails with a clear message if not elevated.
- Copies the publish output (passed via `-PublishOutput <path>`) to `InstallPath`.
- Runs `sc.exe create <ServiceName> binPath= "<InstallPath>\Mbproxy.exe" start= auto displayName= "<DisplayName>" obj= <Account>`.
- Sets the failure-action policy: restart after 60 s on first/second failure, no restart on subsequent (`sc.exe failure ...`).
- Creates `%ProgramData%\mbproxy\logs\` with appropriate ACLs.
- Copies `mbproxy.config.template.json` to `%ProgramData%\mbproxy\appsettings.json` if no config exists.
- Optionally starts the service if `-Start` flag is passed.
3. **`mbproxy/install/uninstall.ps1`** — stops the service if running, `sc.exe delete <ServiceName>`, removes `InstallPath` (with `-KeepConfig` flag to preserve `%ProgramData%\mbproxy\appsettings.json`).
4. **`mbproxy/install/mbproxy.config.template.json`** — a fully commented `appsettings.json` showing the full schema with example values and inline `//` comments describing every field. (Use `appsettings.jsonc` semantics; .NET's configuration loader tolerates `//` comments when configured to.)
5. **`ShutdownCoordinator.cs`** — orchestrates graceful shutdown on `IHostApplicationLifetime.ApplicationStopping`:
- Stop accepting new upstream connections on all `PlcListenerSupervisor`s.
- Wait for in-flight PDUs to complete with a `10 s` deadline (configurable via `Connection.GracefulShutdownTimeoutMs`, default 10000).
- Stop the admin endpoint.
- Cancel all remaining work. Log `mbproxy.shutdown.complete` with `InFlightAtCancel` count.
6. **`EventLogBridge.cs`** — adds a Serilog sub-sink that writes events with level >= Error to the Windows Event Log under source `mbproxy`. Only enabled when running as a Windows Service. The install script creates the event source.
7. **`mbproxy/docs/operations.md`** — operations runbook:
- Install / uninstall steps (mirror to `README.md`).
- Upgrade procedure (stop service, copy new binaries, start).
- Where logs live, how to roll them, retention defaults.
- Common failure modes (port already in use, PLC unreachable, BCD validation reject) with the relevant log event names and what to check.
- The `services.msc` / `sc.exe` / `Get-Service` commands operators will actually use.
- How to safely edit `appsettings.json` for hot-reload (with the rejection-keeps-old-config promise).
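
The ordering in task 5 (stop accepting, drain with a deadline, stop the admin endpoint, cancel what is left) can be sketched as below. The delegate-based `SketchShutdownCoordinator` is a hypothetical shape chosen so the ordering is testable in isolation; the real class takes supervisors and the admin host instead.

```csharp
using System;
using System.Threading.Tasks;

public sealed class SketchShutdownCoordinator
{
    private readonly Func<Task> _stopAccepting;   // stop listeners accepting new upstream sockets
    private readonly Func<Task> _drainInFlight;   // completes when in-flight PDUs reach zero
    private readonly Func<Task> _stopAdmin;       // stop the admin endpoint
    private readonly Func<int> _cancelRemaining;  // cancel leftovers, return how many were in flight

    public SketchShutdownCoordinator(
        Func<Task> stopAccepting, Func<Task> drainInFlight,
        Func<Task> stopAdmin, Func<int> cancelRemaining)
    {
        _stopAccepting = stopAccepting;
        _drainInFlight = drainInFlight;
        _stopAdmin = stopAdmin;
        _cancelRemaining = cancelRemaining;
    }

    // Returns the in-flight count at cancel time (the InFlightAtCancel log property).
    public async Task<int> ShutdownAsync(int timeoutMs)
    {
        await _stopAccepting();

        // Wait for the drain, but never longer than the deadline.
        Task drain = _drainInFlight();
        await Task.WhenAny(drain, Task.Delay(timeoutMs));

        await _stopAdmin();
        return _cancelRemaining();
    }
}
```

Keeping the admin endpoint alive until after the drain means an operator can still watch the status page while in-flight PDUs finish.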
## Public surface declared in this phase
```csharp
namespace Mbproxy.Diagnostics;
internal sealed class ShutdownCoordinator {
public Task ShutdownAsync(int timeoutMs, CancellationToken hostCt);
}
internal sealed class EventLogBridge { /* Serilog sub-sink */ }
```
No additional public types are needed; all surfaces from previous phases remain stable.
## Tests required
### Unit (`Category = Unit`)
`ShutdownCoordinatorTests` (≥ 4 tests):
1. `Shutdown_NoActiveConnections_CompletesImmediately`
2. `Shutdown_OneActiveConnection_WaitsForCompletion`
3. `Shutdown_TimeoutExceeded_CancelsRemainingWork_AndReportsCount`
4. `Shutdown_AdminEndpointStopped_AfterListenersStopped` — ordering test.
### E2E (`Category = E2E`)
`ShutdownE2ETests` (≥ 2 tests, against simulator):
1. `E2E_StopHost_WithConnectedClient_DrainsCleanlyWithin10s` — start host, connect NModbus, issue 5 back-to-back FC03 reads, signal host stop, assert all 5 complete and the client's TCP socket is closed cleanly.
2. `E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout` — same but with a `Connection.BackendRequestTimeoutMs` that exceeds the shutdown deadline; assert shutdown completes within the deadline and the in-flight request was cancelled.
### Manual / smoke
- Install the service via `install.ps1` on a clean test VM; confirm it appears in `services.msc` with `Local System` identity.
- `sc.exe start mbproxy` — service starts, admin endpoint at `http://localhost:8080/` shows the proxy is up.
- Send `sc.exe stop mbproxy` — service stops within 10 s.
- Trigger a crash by killing the process with Task Manager, then confirm an entry appears in Windows Event Log under source `mbproxy`. (Corrupting `appsettings.json` while running is not a usable crash trigger: the reload is rejected gracefully.)
- `uninstall.ps1` — service removed cleanly; `%ProgramData%\mbproxy\` is preserved only when `-KeepConfig` is passed.
The manual smoke results go into `docs/operations.md` as a "first install" verification checklist.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-07 tests still green.
- [ ] All new unit tests green.
- [ ] All e2e shutdown tests green.
- [ ] `mbproxy/README.md` exists, follows the DOCS-GUIDE Layer-2 template, and routes into deep docs without duplicating their content.
- [ ] Root `wwtools/CLAUDE.md` index row for `mbproxy` points at `mbproxy/README.md` (was previously pointing into the design plan or the bare folder).
- [ ] `install.ps1` and `uninstall.ps1` are idempotent — re-running install when the service already exists is a clean no-op or update, not a hard error.
- [ ] Windows Event Log source is created during install and removed during uninstall.
- [ ] `dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true /p:PublishSingleFile=true` produces a single executable under 50 MB.
- [ ] Manual smoke checklist in `docs/operations.md` has been executed on at least one fresh VM and the result documented.
## Out of scope
- Linux / Docker packaging. The design fixes Windows Service as the deployment target.
- Centralised log aggregation (Splunk forwarder config, Elastic agent, etc.). Document where the logs are; let ops integrate.
- A signed installer (MSI / setup.exe). PowerShell-driven install is the contract; an MSI can be added later if procurement demands it.
- Metric exposition for Prometheus / OpenTelemetry. The status page's `/status.json` is sufficient for the operational needs declared in the design.
## Notes for the subagent
- The Windows Event Log source creation requires admin rights — that's already a precondition for `install.ps1`. Do not try to create the source at runtime from the service itself (it would fail when the service runs as a non-admin account).
- Single-file publish makes `Assembly.GetExecutingAssembly().Location` empty. If `AssemblyVersionAccessor` (phase 07) used that, swap to `Assembly.GetExecutingAssembly().GetCustomAttribute<AssemblyInformationalVersionAttribute>()`.
- The `mbproxy/README.md` is what an operator reads first. Be ruthless about length — aim for under 100 lines. The DOCS-GUIDE says routes, not tutorials.
- After this phase merges, the project is feature-complete against [`../design.md`](../design.md). Any further work belongs in a NEW design revision (dated, in the same doc) and a new phase plan.
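
The single-file-safe version lookup from the note above might be sketched like this. `SketchVersionAccessor` is a hypothetical name, and the `"unknown"` fallback is an assumption for builds that carry no informational version.

```csharp
using System;
using System.Reflection;

public static class SketchVersionAccessor
{
    // Read from assembly metadata, not from Assembly.Location,
    // which is empty under single-file publish. Evaluated once
    // when the type is first touched, then served from the property.
    public static string Version { get; } =
        typeof(SketchVersionAccessor).Assembly
            .GetCustomAttribute<AssemblyInformationalVersionAttribute>()
            ?.InformationalVersion
        ?? "unknown";
}
```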
# Phase 09 — MBAP TxId multiplexing (single backend connection per PLC)
Replace the 1:1 upstream-client ↔ backend-socket model with a **single backend connection per PLC**, multiplexed across all upstream clients via MBAP transaction-ID rewriting and a correlation map. After this phase the H2-ECOM100's 4-simultaneous-TCP-client cap is no longer an operational ceiling — the proxy holds exactly one slot per PLC regardless of how many upstream clients are connected.
**Status:** shipped 2026-05-14. Phases 00-08 shipped the production-ready 1:1 model; this phase swapped connection management without changing the transparent-rewrite contract.
## Implementation clarifications discovered during 2026-05-14 ship
These notes capture decisions and surprises that surfaced during the actual implementation. They supplement (not replace) the Tasks section below.
1. **A per-request timeout watchdog is part of Phase 9, not deferred.** The 1:1 model collapsed missing-response handling onto the dedicated backend socket dying. The multiplexed model needs an explicit timer because a single lost or mis-routed response would otherwise leak a correlation entry forever and hang the upstream pipe indefinitely. The watchdog ticks at quarter-`BackendRequestTimeoutMs` (min 100 ms), scans the correlation map, and times out stale requests with **Modbus exception 0x0B (Gateway Target Device Failed To Respond)** delivered to the upstream party with the original TxId restored. Log event `mbproxy.multiplex.request.timeout` (Warning).
2. **PlcListener constructs a multiplexer unconditionally.** The Phase-9 draft had `PlcListener` conditionally construct the multiplexer only when a `PerPlcContext` was supplied; the no-context fallback dropped accepted upstream sockets. Tests (and any pre-Phase-6 startup path that lacked a context) hit a regression. The fix is to construct a minimal default `PerPlcContext` from the `PlcOptions` if the caller didn't supply one, and require `_multiplexer` to be non-null when `RunAsync` runs.
3. **Backend connect is now lazy, which changes `BackendConnectFailure_ClosesUpstreamCleanly`.** The 1:1 model attempted a backend connect at upstream-accept time, so simply opening a TCP connection to a proxy with a bad backend triggered the close. The multiplexed model connects to the backend on the *first upstream frame*, so the test has to send a Modbus request before the proxy attempts the (failing) backend connect that causes the upstream close. Updated in-place.
4. **pymodbus 3.13.0 simulator is broken under multiplexed concurrent requests.** Its `ServerRequestHandler` keeps a single `last_pdu` per connection and schedules `handle_later` via `asyncio.call_soon`; two MBAP frames in one recv buffer overwrite `last_pdu` before the first handler runs, and both responses carry the later TxId. The real DL260 ECOM properly echoes per-request TxIds. Consequence for tests:
- **Mux correctness under truly concurrent backend traffic is proven against the stub backend in `PlcMultiplexerTests`**, which models the DL260's correct TxId-echo behaviour.
- **`MultiplexerE2ETests` paces requests** so pymodbus only ever sees one MBAP frame at a time on the shared backend connection. The headline test (`E2E_FiveSimultaneousClients_AllReadHR1072_AllGetDecoded_1234`) verifies the connection ceiling lift (5 simultaneous upstream connections, where Phase-08's 1:1 model would have refused the 5th) — *not* the under-concurrency multiplexing behaviour.
- **The watchdog is the production defence** if any real backend (or future simulator) ever mis-echoes a TxId: stale entries time out cleanly with exception 0x0B rather than hanging upstream clients.
5. **E2E timeouts.** Per `docs/plan/README.md`'s Test discipline, all E2E tests are 5 s by default. Hot-reload tests that genuinely need 5 s + 3 s of propagation windows carry a 10 s timeout with a one-line comment; `E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` carries 8 s for its sequential connects + Polly-paced reconnect path.
6. **`AsyncHostDispose` deadlock note.** Test fixtures that hold `IHost` via `await using` were originally written with a 5 s shutdown timeout; under Phase 9's drained-channel cleanup that occasionally exceeded the test's own `Timeout = 5000`. Reduced to 2-3 s where it doesn't materially affect the test's drain semantics.
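
To make item 1 concrete, here is a sketch of the watchdog's tick interval and of the 9-byte exception frame it would send upstream. `SketchWatchdog` and its method names are hypothetical; the frame layout follows the standard MBAP header (transaction ID, protocol ID 0, length, unit ID) followed by the function code with its high bit set and the exception code.

```csharp
using System;
using System.Buffers.Binary;

public static class SketchWatchdog
{
    // Quarter of the request timeout, floored at 100 ms.
    public static int TickIntervalMs(int backendRequestTimeoutMs) =>
        Math.Max(100, backendRequestTimeoutMs / 4);

    // Builds a Modbus TCP exception response carrying code 0x0B
    // (Gateway Target Device Failed To Respond) with the upstream
    // client's original TxId restored.
    public static byte[] BuildGatewayTimeout(ushort originalTxId, byte unitId, byte functionCode)
    {
        var frame = new byte[9];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId); // transaction ID
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);            // protocol ID (Modbus)
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 3);            // length: unit + FC + code
        frame[6] = unitId;
        frame[7] = (byte)(functionCode | 0x80);                                   // exception flag on the FC
        frame[8] = 0x0B;
        return frame;
    }
}
```

For an FC03 request with upstream TxId `0x1234` on unit 1, the frame is `12 34 00 00 00 03 01 83 0B`.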
**Depends on:** Phase 04 (rewriter), Phase 05 (supervisor + Polly), Phase 07 (status page DTO surface).
**Parallel-safe with:** nothing. **Hard rule.** This phase deletes `PlcConnectionPair` and rewires the supervisor + rewriter correlation path simultaneously; the cross-cut is too broad for safe parallel work. The optional intra-phase slicing (below) is the closest thing to parallel work.
## Goal
The H2-ECOM100 accepts 4 concurrent TCP clients per PLC; today's 1:1 model means the 5th upstream client to the same proxy port fails at backend connect. This phase eliminates that ceiling by making **one persistent backend socket per PLC**, with the proxy serving as a connection multiplexer that rewrites MBAP transaction IDs to keep concurrent in-flight requests from different upstream clients distinguishable on the single wire.
The wire-rate ceiling does not change — the H2-ECOM100 internally serializes requests (one per PLC scan, ~2-10 ms scan time) regardless of how many TCP connections it has. We're shifting where serialization happens (proxy outbound queue vs PLC accept queue), not adding throughput. The dashboard pay-off is that "PLC clients connected" can rise into the dozens without the proxy degrading.
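
The TxId rewriting idea can be sketched as a small allocator plus correlation table. This is a simplified, single-threaded illustration: `SketchTxIdMux` and `SketchInFlight` are hypothetical and fold `TxIdAllocator` and `CorrelationMap` into one class, and the real `InFlightRequest` holds a list of interested parties where this sketch holds a single client.

```csharp
#nullable enable
using System;
using System.Collections.Generic;

// Who asked, and which TxId they used on their own connection.
public sealed record SketchInFlight(int UpstreamClientId, ushort OriginalTxId);

public sealed class SketchTxIdMux
{
    private readonly Dictionary<ushort, SketchInFlight> _inFlight = new();
    private ushort _next;

    // Number of times the 16-bit counter has rolled over to zero.
    public long Wraps { get; private set; }

    // Reserves a free proxy TxId for the backend frame and records the requester.
    public ushort Register(int upstreamClientId, ushort originalTxId)
    {
        if (_inFlight.Count > ushort.MaxValue)
            throw new InvalidOperationException("All 65536 TxIds are in flight.");

        while (_inFlight.ContainsKey(_next)) Advance(); // skip IDs still awaiting a response
        ushort proxyTxId = _next;
        _inFlight[proxyTxId] = new SketchInFlight(upstreamClientId, originalTxId);
        Advance();
        return proxyTxId;
    }

    // Called when a backend response arrives: yields the client and TxId to restore.
    public bool TryComplete(ushort proxyTxId, out SketchInFlight? request) =>
        _inFlight.Remove(proxyTxId, out request);

    private void Advance()
    {
        _next = unchecked((ushort)(_next + 1));
        if (_next == 0) Wraps++;
    }
}
```

Two upstream clients that both send TxId 7 get distinct proxy TxIds on the shared backend socket, and each response is routed back with 7 restored. A response whose TxId is not in the table makes `TryComplete` return `false`, which is the case the per-request watchdog covers.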
## Intra-phase slicing (the closest thing to parallel-safe within this phase)
The phase is one merge but can be implemented as five small commits in this order:
| Slice | Output | Files touched | Hours | Parallelizable? |
|-------|--------|---------------|-------|-----------------|
| 9.1 | Pure data types (TxIdAllocator, CorrelationMap, InFlightRequest) + their unit tests | new files under `src/Mbproxy/Proxy/Multiplexing/` and `tests/...` | ~5 | Yes — pure logic, disjoint from rest. A second agent can write the E2E test scaffolding (slice 9.5) in parallel. |
| 9.2 | `PlcMultiplexer` + `UpstreamPipe` skeleton with backend reader/writer loops | new files in `Multiplexing/` | ~10 | No — depends on 9.1's data types. |
| 9.3 | Refactor `PlcListener` to own the multiplexer; delete `PlcConnectionPair`; rewire supervisor | modifies existing Proxy + Supervision files | ~8 | No — depends on 9.2. |
| 9.4 | Update `BcdPduPipeline` to use correlation entries (drop `PerPlcContextWithRequest`); counter additions; status DTO + HTML updates | modifies pipeline + admin files | ~6 | No — depends on 9.3. |
| 9.5 | Full E2E test suite + design.md + CLAUDE.md doc updates | new test file + doc edits | ~6 | Test-writing yes (slice 9.5 skeleton can land in parallel with 9.1); the doc edits at the end are sequential after 9.3. |
**Total:** ~35 hours. With one parallel agent producing slice 9.1's data types and another sketching the e2e test fixtures during slice 9.5-prep, calendar time can compress to ~28 hours.
## Outputs (new files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # single backend conn owner; mux logic
src/Mbproxy/Proxy/Multiplexing/UpstreamPipe.cs # per-upstream-client reader/writer
src/Mbproxy/Proxy/Multiplexing/TxIdAllocator.cs # 16-bit allocator with wrap tracking
src/Mbproxy/Proxy/Multiplexing/CorrelationMap.cs # proxyTxId → InFlightRequest
src/Mbproxy/Proxy/Multiplexing/InFlightRequest.cs # the correlation record
src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Multiplexing/TxIdAllocatorTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/CorrelationMapTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/PlcMultiplexerTests.cs # integration, real sockets
tests/Mbproxy.Tests/Proxy/Multiplexing/RewriterCorrelationTests.cs # rewriter w/ multiplexed paths
tests/Mbproxy.Tests/Proxy/Multiplexing/MultiplexerE2ETests.cs # against pymodbus sim
```
## Files modified (existing files in this phase)
```
src/Mbproxy/Proxy/PlcListener.cs # owns PlcMultiplexer; accept loop hands sockets to it
src/Mbproxy/Proxy/PlcConnectionPair.cs # DELETED — replaced by UpstreamPipe + Multiplexer
src/Mbproxy/Proxy/IPduPipeline.cs # PduContext gains in-flight correlation entry
src/Mbproxy/Proxy/PerPlcContext.cs # delete PerPlcContextWithRequest; replaced by InFlightRequest passed per-call
src/Mbproxy/Proxy/BcdPduPipeline.cs # FC03/04 response decodes via InFlightRequest, not last-request slot
src/Mbproxy/Proxy/ProxyCounters.cs # new fields: InFlightCount, MaxInFlight, TxIdWraps, BackendDisconnectCascades, BackendQueueDepth
src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs # supervises mux lifecycle alongside listener
src/Mbproxy/Admin/StatusDto.cs # PlcBackendStatus gains the new mux fields
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate mux fields from counters
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show inFlight/max-in-flight in the per-PLC row
docs/design.md # rewrite Connection model + Failure modes for multiplexed reality
mbproxy/CLAUDE.md # flip Architecture summary's connection-model bullet
docs/kpi.md # update operational notes referring to 4-client cap
```
## Tasks
### 9.1 Data types (pure logic)
1. **`TxIdAllocator`** — `internal sealed class TxIdAllocator`. State: `_inUse` (`bool[65536]` for O(1) lookup; ~64 KB), `_next` (`ushort`), `_inFlightCount` (long), `_wrapCount` (long). Methods:
- `bool TryAllocate(out ushort id)` — atomic via `lock` (the allocator is per-PLC, contention is low). Scans forward from `_next` for the next free slot; sets `_inUse[id] = true`; bumps `_next`. Returns `false` if `_inFlightCount == 65536` (saturated; emit `mbproxy.multiplex.saturated` Error and let caller decide to drop or queue).
- `void Release(ushort id)` — clears `_inUse[id]`; decrements `_inFlightCount`.
- `int InFlightCount { get; }`, `long WrapCount { get; }` — for telemetry.
- **Wrap counter:** increment whenever `_next` rolls over `0xFFFF → 0x0000`.
2. **`InFlightRequest` + `InterestedParty`** — `InterestedParty` is `internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId)`. `InFlightRequest` is `internal sealed record InFlightRequest(byte UnitId, byte Fc, ushort StartAddress, ushort Qty, IReadOnlyList<InterestedParty> InterestedParties, DateTimeOffset SentAtUtc)`. Carries enough state for: (a) restoring each party's original TxId on the way back, (b) the FC03/04 correlation the rewriter needs (start/qty), (c) routing the response to each interested upstream socket, (d) round-trip-time measurement.
**In Phase 9 `InterestedParties` always contains exactly one element.** The list shape is forward-compat with [Phase 10 — read coalescing](10-read-coalescing.md), which extends the same record to fan-out responses to multiple upstream clients without further refactor of the multiplexer's data model. Resist any reviewer suggestion to simplify it back to a single `UpstreamPipe Upstream` field — the list shape is the load-bearing foundation for Phase 10.
3. **`CorrelationMap`**: wraps a `ConcurrentDictionary<ushort, InFlightRequest>`. Methods: `bool TryAdd(ushort, InFlightRequest)`, `bool TryRemove(ushort, out InFlightRequest)`, `int Count { get; }`, `IReadOnlyCollection<InFlightRequest> Snapshot()` (for diagnostics; allocates a list). Adds come from the per-upstream read tasks (one per attached client) and removes from the single backend reader task, so a plain `Dictionary` would not be safe; `ConcurrentDictionary` covers that pattern and stays safe if/when we add upstream-side cancellation.
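A minimal sketch of the allocator from task 1, assuming a plain `lock` and an `int` in-flight counter (the task text lists `long` for the field; either works at this size). The `_gate` field and the private `Advance` helper are illustrative names, not part of the declared surface, and logging of the saturation event is left out.

```csharp
using System;

// Sketch only: follows the task description; not the project's actual implementation.
internal sealed class TxIdAllocator
{
    private const int Capacity = 65536;                 // one slot per 16-bit TxId
    private readonly bool[] _inUse = new bool[Capacity];
    private readonly object _gate = new();
    private ushort _next;
    private int _inFlightCount;
    private long _wrapCount;

    public int InFlightCount { get { lock (_gate) return _inFlightCount; } }
    public long WrapCount { get { lock (_gate) return _wrapCount; } }

    public bool TryAllocate(out ushort id)
    {
        lock (_gate)
        {
            id = 0;
            if (_inFlightCount == Capacity) return false;   // saturated; caller decides what to do

            // At least one slot is free, so this scan ends within Capacity steps.
            while (_inUse[_next])
                Advance();

            id = _next;
            _inUse[id] = true;
            _inFlightCount++;
            Advance();
            return true;
        }
    }

    public void Release(ushort id)
    {
        lock (_gate)
        {
            if (!_inUse[id]) return;                        // releasing a free id is a no-op
            _inUse[id] = false;
            _inFlightCount--;
        }
    }

    // Moves the cursor forward one slot and counts each 0xFFFF -> 0x0000 rollover.
    private void Advance()
    {
        if (_next == ushort.MaxValue) { _next = 0; _wrapCount++; }
        else _next++;
    }
}
```

Because the scan always moves forward from `_next`, a released id is handed out again only after the cursor comes back around to it, which keeps recently used TxIds out of circulation for as long as possible.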
### 9.2 Multiplexer + UpstreamPipe
4. **`UpstreamPipe`** — `internal sealed class UpstreamPipe : IAsyncDisposable`. One instance per accepted upstream socket. Fields: `Socket _upstream`, `Guid _id`, `IPEndPoint _remoteEp`, `DateTimeOffset _connectedAtUtc`, `volatile bool _alive`, `Channel<byte[]> _responseChannel` (capacity 16). Two tasks:
- **Read task**: pumps inbound MBAP frames from `_upstream` to a per-pipe `OnFrame` callback (registered by the multiplexer).
- **Write task**: drains `_responseChannel` and writes each frame back to `_upstream`.
On fault: sets `_alive = false` and closes the socket; the multiplexer notices on its next correlation lookup and drops responses bound for this pipe.
5. **`PlcMultiplexer`** — `internal sealed class PlcMultiplexer : IAsyncDisposable`. One instance per PLC. Fields: backend `Socket`, `TxIdAllocator`, `CorrelationMap`, `Channel<byte[]> _outboundChannel` (cap 256), `PerPlcContext _ctx` (tag map + counters + logger), list of attached `UpstreamPipe`s. Two backend tasks plus a fan-in:
- **Backend writer task**: drains `_outboundChannel` → writes to backend socket. Single writer; no synchronization on the socket needed.
- **Backend reader task**: reads MBAP frames from backend → looks up `proxyTxId` in `CorrelationMap` → calls `pipeline.Process(ResponseToClient, header, pdu, ctx with InFlight)` → for each `InterestedParty` in `InFlightRequest.InterestedParties` (always exactly one in Phase 9; list-of-N once Phase 10 ships): writes a copy of the frame with that party's `OriginalTxId` restored in the MBAP header to the party's `UpstreamPipe._responseChannel` (or drops silently for that party if its pipe is `_alive = false`) → `CorrelationMap.TryRemove(proxyTxId)` + `TxIdAllocator.Release(proxyTxId)`.
- **Per-upstream `OnFrame`**: invoked by each `UpstreamPipe`'s read task. Steps:
   1. Parse MBAP: original TxId, length, unitId, PDU.
   2. `TryAllocate` a proxyTxId. If saturated, write a Modbus exception response (Slave Device Failure, code 04) back to upstream and continue.
   3. Build `InFlightRequest` (parse FC/start/qty from PDU if FC03/04; needed for FC06 too if we want the symmetric correlation later).
   4. `TryAdd` to correlation map.
   5. Call `pipeline.Process(RequestToBackend, ...)` to apply BCD rewriting.
   6. Overwrite MBAP TxId bytes with proxyTxId.
   7. Enqueue the modified frame into `_outboundChannel`.
6. **Backend disconnect handling** — when the backend reader/writer task throws (socket closed, network reset, etc.):
- Stop both tasks; close the backend socket.
- Walk the correlation map; for each entry, close the `UpstreamPipe` of each `InterestedParty` (cascade). Increment `BackendDisconnectCascades` by the number of upstream pipes closed.
- Clear correlation map and TxIdAllocator.
- The supervisor's Polly pipeline takes over for backend reconnect — when the next upstream request arrives, the multiplexer attempts a fresh backend connection through the Polly pipeline.
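The TxId handling in the `OnFrame` steps and in the backend reader's restore step comes down to a few byte operations on the 7-byte MBAP header. The sketch below shows them under the assumption of a hypothetical `MbapFrame` helper class; the class and its method names are illustrative, not existing project APIs.

```csharp
using System;
using System.Buffers.Binary;

// Sketch only: MbapFrame is a hypothetical helper, not a type from the project.
internal static class MbapFrame
{
    // MBAP header layout: TxId(2) + ProtocolId(2) + Length(2) + UnitId(1), all big-endian.
    public const int HeaderLength = 7;

    // Reads the transaction id from bytes 0-1 of the frame.
    public static ushort ReadTxId(ReadOnlySpan<byte> frame) =>
        BinaryPrimitives.ReadUInt16BigEndian(frame);

    // Returns a copy of the frame carrying a different transaction id; the input is untouched.
    public static byte[] WithTxId(ReadOnlySpan<byte> frame, ushort txId)
    {
        byte[] copy = frame.ToArray();
        BinaryPrimitives.WriteUInt16BigEndian(copy, txId);
        return copy;
    }

    // Builds a Modbus exception response: function code with the high bit set,
    // followed by the exception code (for example 0x04, Slave Device Failure).
    public static byte[] BuildException(ushort txId, byte unitId, byte fc, byte exceptionCode)
    {
        var frame = new byte[HeaderLength + 2];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), txId);
        // Bytes 2-3 stay zero: protocol id 0 means Modbus.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 3);   // length = unitId + fc + code
        frame[6] = unitId;
        frame[7] = (byte)(fc | 0x80);
        frame[8] = exceptionCode;
        return frame;
    }
}
```

On the request path the proxy would call `WithTxId(frame, proxyTxId)` before enqueueing; on the response path it would call `WithTxId(response, party.OriginalTxId)` once per interested party, so no client ever sees a proxy-side id.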
### 9.3 Listener + supervisor refactor
7. **`PlcListener.RunAsync`** — accept loop changes:
- One `PlcMultiplexer` per listener (constructed in `PlcListenerSupervisor` and handed in).
- On accept: wrap the socket in `UpstreamPipe`, register with the multiplexer via `mux.Attach(pipe)`.
- On listener stop: dispose the multiplexer (which closes the backend + all attached pipes).
- `ActivePairs` property → renamed `ActiveUpstreams` returning the multiplexer's list of attached `UpstreamPipe`s. Status page consumes this.
8. **Delete `PlcConnectionPair.cs`** — entire file. The replacement is `UpstreamPipe` + `PlcMultiplexer`. No backwards-compat shims; we're moving cleanly.
9. **`PlcListenerSupervisor`** — gains ownership of `PlcMultiplexer` alongside the listener. The Polly listener-recovery pipeline is unchanged; the multiplexer has its own internal Polly backend-connect pipeline (same `ResilienceOptions.BackendConnect` shape as today, just owned by the mux instead of the pair).
### 9.4 Rewriter + counters + status page
10. **`BcdPduPipeline`** — the FC03/04 response path stops reading `PerPlcContextWithRequest.LastRequestStart/Qty`. Instead, the multiplexer attaches an `InFlightRequest` to the `PduContext` for each response call:
```csharp
public sealed class PerPlcContext : PduContext {
    public BcdTagMap TagMap { get; init; }
    public ProxyCounters Counters { get; init; }
    public ILogger Logger { get; init; }
    public InFlightRequest? CurrentRequest { get; init; }   // NEW: non-null on response, null on request
}
```
Concurrency: each backend response is handled on the backend reader task; the request path is handled by the per-upstream read task. Different `InFlightRequest` instances → no contention.
11. **Drop `PerPlcContextWithRequest`** entirely. The last-request-slot pattern was a 1:1-model workaround; the correlation map subsumes it.
12. **`ProxyCounters` additions:**
- `InFlightCount` (`long` snapshot of `CorrelationMap.Count`)
- `MaxInFlight` (`long`, peak observed; updated atomically, e.g. with an `Interlocked.CompareExchange` loop)
- `TxIdWraps` (`long` from `TxIdAllocator.WrapCount`)
- `BackendDisconnectCascades` (`long`)
- `BackendQueueDepth` (snapshot of `_outboundChannel.Reader.Count`)
13. **Status page** — `StatusDto.PlcBackendStatus` gains `InFlight`, `MaxInFlight`, `TxIdWraps`, `DisconnectCascades`, `QueueDepth`. `StatusSnapshotBuilder` populates them. `StatusHtmlRenderer` adds a column or compact `[3/256]` indicator per PLC row. The JSON field names land in camelCase per the existing source-gen convention.
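A peak counter such as `MaxInFlight` needs an atomic "store if larger" operation. A minimal sketch, assuming no built-in atomic max is available on the target runtime and using a compare-and-swap loop instead; `PeakGauge` is a hypothetical helper name, and `ProxyCounters` could equally hold the field directly.

```csharp
using System.Threading;

// Sketch only: PeakGauge is a hypothetical helper, not a type from the project.
internal sealed class PeakGauge
{
    private long _peak;

    public long Peak => Interlocked.Read(ref _peak);

    // Raises the stored peak to `observed` if it is larger; safe under concurrent callers.
    public void Observe(long observed)
    {
        long current = Interlocked.Read(ref _peak);
        while (observed > current)
        {
            long previous = Interlocked.CompareExchange(ref _peak, observed, current);
            if (previous == current) return;   // our value was stored
            current = previous;                // lost the race; re-check against the newer peak
        }
    }
}
```

The multiplexer would call `Observe(correlationMap.Count)` after each successful `TryAdd`; the loop exits immediately whenever another thread has already published an equal or higher peak.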
### 9.5 Tests + docs
14. **Unit + integration test suites** (see Tests required below).
15. **`docs/design.md` updates:**
- **Connection model** section: rewrite. The diagram changes from "many clients → many backend sockets" to "many clients → one backend socket per PLC, multiplexed by proxy TxId rewriting." The operational consequence warning flips: instead of "5th client fails," it becomes "if backend disconnects, all attached upstream clients are cascaded closed; they reconnect on their own next request."
- **Failure modes** section: amend to describe the cascade behaviour.
- **Rewriter** section: amend to note the rewriter consumes `InFlightRequest` for response correlation (no architectural change, just an update to the description of how correlation flows).
16. **`mbproxy/CLAUDE.md`** Architecture summary: first bullet flips from "1:1 upstream-client ↔ backend-socket" to "single backend socket per PLC, multiplexed via MBAP TxId rewriting."
17. **`docs/kpi.md`** — the "Tier 2 → Connection-cap saturation warning" KPI loses its meaning (4-client cap no longer relevant on the upstream side). Either remove it or repurpose to track in-flight saturation against the 16-bit TxId space (which never realistically saturates but is the new equivalent ceiling).
## Public surface declared in this phase
All `internal sealed` — the multiplexer types are not consumed outside the assembly.
```csharp
namespace Mbproxy.Proxy.Multiplexing;
internal sealed class TxIdAllocator {
    public bool TryAllocate(out ushort id);
    public void Release(ushort id);
    public int InFlightCount { get; }
    public long WrapCount { get; }
}

internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);

internal sealed record InFlightRequest(
    byte UnitId, byte Fc,
    ushort StartAddress, ushort Qty,
    IReadOnlyList<InterestedParty> InterestedParties,
    DateTimeOffset SentAtUtc);
// Phase 9: InterestedParties.Count is always 1.
// Phase 10 (read coalescing): the same record fans out to N parties without further refactor.

internal sealed class CorrelationMap {
    public bool TryAdd(ushort proxyTxId, InFlightRequest req);
    public bool TryRemove(ushort proxyTxId, out InFlightRequest req);
    public int Count { get; }
    public IReadOnlyCollection<InFlightRequest> Snapshot();
}

internal sealed class UpstreamPipe : IAsyncDisposable {
    public Guid Id { get; }
    public IPEndPoint RemoteEp { get; }
    public DateTimeOffset ConnectedAtUtc { get; }
    public long PdusForwardedCount { get; }
    public bool IsAlive { get; }
    public Task RunReadLoopAsync(Func<byte[], Task> onFrame, CancellationToken ct);
    public ValueTask SendResponseAsync(byte[] frame, CancellationToken ct);
    public ValueTask DisposeAsync();
}

internal sealed class PlcMultiplexer : IAsyncDisposable {
    public void Attach(UpstreamPipe pipe);
    public IReadOnlyCollection<UpstreamPipe> AttachedPipes { get; }
    public Task RunAsync(CancellationToken ct);
    public ValueTask DisposeAsync();
}
```
`PerPlcContext` gains a nullable `CurrentRequest` property. `PerPlcContextWithRequest` is removed (along with its `LastRequest*` slots).
## Tests required
### Unit (`Category = Unit`)
**`TxIdAllocatorTests`** (≥ 8 tests):
1. `Allocate_FromEmpty_Returns_NextSequential`
2. `Allocate_AfterRelease_Reuses_FreedId`
3. `Allocate_AllocatesEveryUshort_BeforeWrapping`
4. `Allocate_WrapsCorrectly_After0xFFFF`
5. `Allocate_WhenSaturated_ReturnsFalse_DoesNotThrow`
6. `Release_OfNonAllocated_IsNoOp`
7. `Concurrent_AllocateRelease_NoDuplicateIds_Under_Parallel_Stress` (100 tasks, 1000 ops each)
8. `WrapCount_IncrementsOnEachFullWrap`
**`CorrelationMapTests`** (≥ 5 tests):
1. `TryAdd_Then_TryRemove_RoundTrips`
2. `TryAdd_DuplicateKey_Fails`
3. `TryRemove_OfMissing_ReturnsFalse`
4. `Snapshot_ReflectsCurrentState`
5. `Concurrent_AddRemove_NoDataLoss_Under_Parallel_Stress`
**`PlcMultiplexerTests`** (≥ 7 tests, real sockets, no simulator):
1. `SingleUpstream_RoundTripsFC03_Through_Multiplexer`
2. `SingleUpstream_RoundTripsFC06_Through_Multiplexer`
3. `TwoUpstreams_ConcurrentFC03_BothGetCorrectResponses` — proves TxId rewriting works end-to-end against a stub backend
4. `TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire` — sniff the backend socket; verify per-request TxIds are unique even when upstream TxIds collide
5. `UpstreamDisconnect_DoesNotAffectOtherUpstreams` — drop one client mid-flight; other client's response still arrives
6. `BackendDisconnect_CascadesToAllUpstreams` — kill backend; verify all upstream sockets close within 500 ms, `BackendDisconnectCascades` increments by N
7. `BackendReconnect_AfterCascade_NextUpstreamRequest_Succeeds`
**`RewriterCorrelationTests`** (≥ 4 tests):
1. `FC03Response_DecodedViaInFlightRequest_NotPerPairSlot`
2. `ConcurrentFC03_FromTwoUpstreams_DecodeCorrectly_NoCrossTalk` — set up two `InFlightRequest`s with different start addresses, deliver responses out of order; verify each decodes against its own request
3. `ConcurrentFC06_FromTwoUpstreams_EncodeCorrectly`
4. `ResponseForDeadUpstream_IsDropped_NoExceptionPropagates`
### Integration (`Category = Unit`, no simulator)
These use real `TcpListener` + `Socket` against a stub backend (a `TcpListener` that just echoes or canned-responds). They live in `PlcMultiplexerTests`.
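A stub backend of the "just echoes" kind can be very small. The sketch below is one possible shape, assuming a hypothetical `EchoStubBackend` test helper (not an existing project type) that listens on a loopback port chosen by the OS and writes back whatever bytes it receives; a canned-response variant would replace the echo loop.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: EchoStubBackend is a hypothetical test helper.
internal sealed class EchoStubBackend : IAsyncDisposable
{
    private readonly TcpListener _listener = new(IPAddress.Loopback, 0);   // port 0: OS picks a free port
    private readonly CancellationTokenSource _cts = new();
    private readonly CancellationToken _token;
    private readonly Task _acceptLoop;

    public EchoStubBackend()
    {
        _token = _cts.Token;
        _listener.Start();
        _acceptLoop = Task.Run(AcceptLoopAsync);
    }

    public int Port => ((IPEndPoint)_listener.LocalEndpoint).Port;

    private async Task AcceptLoopAsync()
    {
        try
        {
            while (true)
            {
                Socket client = await _listener.AcceptSocketAsync(_token);
                _ = EchoAsync(client);               // one echo loop per accepted connection
            }
        }
        catch (OperationCanceledException) { }       // normal shutdown
        catch (SocketException) { }                  // listener stopped underneath us
        catch (ObjectDisposedException) { }
    }

    private async Task EchoAsync(Socket client)
    {
        var buffer = new byte[260];                  // largest Modbus TCP frame: 7-byte MBAP + 253-byte PDU
        try
        {
            using (client)
            {
                int read;
                while ((read = await client.ReceiveAsync(buffer.AsMemory(), SocketFlags.None, _token)) > 0)
                {
                    int sent = 0;
                    while (sent < read)
                        sent += await client.SendAsync(buffer.AsMemory(sent, read - sent), SocketFlags.None, _token);
                }
            }
        }
        catch (OperationCanceledException) { }
        catch (SocketException) { }
    }

    public async ValueTask DisposeAsync()
    {
        _cts.Cancel();
        _listener.Stop();
        await _acceptLoop;
        _cts.Dispose();
    }
}
```

A test would point the multiplexer's backend endpoint at `IPAddress.Loopback` and `stub.Port`. The per-connection echo tasks are not awaited on dispose here, which is acceptable for a short-lived test fixture but worth tightening if the stub grows.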
### E2E (`Category = E2E`)
**`MultiplexerE2ETests`** (≥ 5 tests, against pymodbus simulator):
1. `E2E_FiveConcurrentClients_AllReadHR1072_AllGetDecoded_1234` — the headline test. Five NModbus clients connected to the proxy in parallel; pymodbus sim has the BCD register at 1072. All five get `1234`. With Phase 08's 1:1 model, the 5th client would fail at backend connect.
2. `E2E_TwentyConcurrent_FC03_Requests_AcrossThreeClients_AllSucceed`
3. `E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` — kill the sim mid-flight (simulate by closing on its side); verify upstream clients see clean socket close; relaunch sim; new upstream connection succeeds.
4. `E2E_RewriterStillWorks_UnderMultiplexedThreeClients` — three clients each writing different decimal values to different BCD-configured addresses via FC06; verify sim's register state.
5. `E2E_StatusPage_Shows_InFlightAndMaxInFlight` — drive 4 concurrent reads, verify `/status.json` reports `inFlight >= 1` during the burst and `maxInFlight >= 4`.
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All 271 prior tests still green. Specifically: `Forward_FC03_HR1072_Returns_Decoded_1234`, `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`, `MbapTxId_IsPreservedEndToEnd`, and `MbapTxId_StillPreserved_AfterRewriting_20Consecutive` continue to pass against the multiplexed implementation. The MBAP-TxId-preserved tests are the **critical regression guard** — if multiplexing leaks proxy TxIds back to the client, these fail.
- [ ] All new unit tests pass (≥ 24 new across `TxIdAllocatorTests`, `CorrelationMapTests`, `PlcMultiplexerTests`, and `RewriterCorrelationTests`).
- [ ] All new E2E tests pass (≥ 5).
- [ ] `Forward_FC03_HR1072_Returns_Decoded_1234` PASSES with 5 concurrent NModbus clients connected to the same proxy port. **This is THE phase test.**
- [ ] `PlcConnectionPair.cs` is gone. Grep for the type name across the solution returns zero hits.
- [ ] `PerPlcContextWithRequest` is gone. Grep returns zero hits.
- [ ] `docs/design.md` "Connection model" section is rewritten; the 1:1 model description is gone or moved into a "Historical: pre-Phase-09 model" footnote.
- [ ] `mbproxy/CLAUDE.md` Architecture summary's connection-model bullet is updated.
- [ ] Backend disconnect with N upstream clients in-flight: all N close within 500 ms; counter `BackendDisconnectCascades += N`.
- [ ] `mbproxy.multiplex.saturated` Error event fires if TxId allocator hits 65,536 in-flight. (Stress-test acceptable; manufacture by holding 65,536 pending responses against a stub backend.)
- [ ] Shutdown semantics still work: `ShutdownCoordinator` drains in-flight requests (now visible via `InFlightCount`, not `IsProcessing`).
- [ ] Status page renders the new fields; HTML page weight remains under 50 KB for 54 PLCs.
- [ ] CounterSnapshot's existing field set is preserved — only **added** fields, no renames or removals. Backwards-compat per the policy in `docs/kpi.md`.
## Out of scope
- **Foundation for future caching, not caching itself.** This phase establishes the chokepoint where any future caching or coalescing layer plugs in, but implements no caching of any kind. `InFlightRequest.InterestedParties` is shaped as a list specifically to make [Phase 10 — read coalescing](10-read-coalescing.md) additive without refactor; do not infer caching behavior from the list shape alone. Tier C-2 (short-TTL response cache) and Tier C-3 (periodic poll + cache) remain explicitly out of scope until their own design discussions and `design.md` updates land.
- **Per-tag read coalescing** — if two clients read the same register at the same time, Phase 9's multiplexer sends both requests. Coalescing them into one backend round-trip is the explicit goal of [Phase 10](10-read-coalescing.md), which plugs into the `InterestedParties` seam created here.
- **Backend keepalive / heartbeat** — the design's current "no keepalive" position stands. An idle backend with no upstream activity will die after middlebox timeouts; the next upstream request triggers a fresh connect via Polly. Multiplexing doesn't change this.
- **TxId fairness scheduling** — FIFO order in the `_outboundChannel` is the contract. No round-robin per upstream, no priority. If a single upstream client floods the channel, others queue behind. This is a stated trade-off and matches the ECOM's internal serialization anyway.
- **Pipelined multi-PDU-in-flight per single upstream client** — still unsupported. One in-flight request per upstream pipe at a time. Multiplexing across DIFFERENT upstream clients works fully; multiplexing across multiple in-flight requests from the SAME upstream client does not. Document the constraint.
- **Linux / cross-platform packaging** — still Windows Service only.
## Subagent briefing
If you're the agent picking up this phase, here's the executive summary you need in your head:
1. **You are deleting `PlcConnectionPair`.** Everything that file did is now split between `UpstreamPipe` (the per-client half) and `PlcMultiplexer` (the per-PLC half). Read `PlcConnectionPair.cs` once before you delete it — every behavior in there has a destination in one of the two new classes.
2. **Single-writer / single-reader on the backend socket.** Two tasks share the backend socket: one writes (drained from `_outboundChannel`), one reads (decodes MBAP frames). No third task touches the socket. This invariant is what makes the channel + dictionary design correct without locks.
3. **The rewriter doesn't know about MBAP framing or correlation.** It still receives `(direction, mbapHeader span, pdu span, PerPlcContext ctx)`. The only addition is `ctx.CurrentRequest` (nullable, non-null on response). The rewriter is otherwise unchanged. Resist refactoring it.
4. **`InFlightRequest.SentAtUtc` powers `lastRoundTripMs` correctly across multiplexed clients.** Today's EWMA is per-pair; under multiplexing, the timestamp moves to per-request. The status counter stays the same.
5. **Cascade-on-backend-disconnect is the most subtle behavior.** Get the test for it right early (`BackendDisconnect_CascadesToAllUpstreams`). It's the difference between "graceful failure" and "leaked upstream sockets that hold connections open until OS timeout."
6. **TxId allocator saturation is a real-world impossibility but a stress-test reality.** Hold 65,536 responses in a stub backend; the allocator must refuse the 65,537th cleanly with an exception response code 04, not crash.
7. **Update the docs in the SAME PR as the code.** `design.md` Connection model, `mbproxy/CLAUDE.md` Architecture summary, and `docs/kpi.md` connection-cap KPI either get rewritten or removed. Doc drift is a gate fail.
8. **Do NOT parallelize the middle slices.** The cross-cut is too broad. If you have spare agent budget, slice 9.1 (data types + their unit tests) can run alongside the slice 9.5 E2E test scaffolding (written against the unchanged outer-shape contract), but slices 9.2 through 9.4 are sequential.
9. **The 4 critical regression tests** that must stay green:
- `Forward_FC03_HR1072_Returns_Decoded_1234`
- `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`
- `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`
- `MbapTxId_IsPreservedEndToEnd` ← THIS is the one that proves multiplexing is transparent.
10. **When in doubt, re-read `BcdPduPipeline.ProcessResponse`.** The FC03/04 correlation logic there is the most subtle existing code that you're touching. Walk through it with one upstream client in mind first, then mentally replay with two; both must work without code change to the pipeline (only the way `PerPlcContext.CurrentRequest` gets populated changes).
## Cross-references
- Today's 1:1 model: [`../design.md`](../design.md) → "Connection model" (will be rewritten by this phase).
- DL260 4-client cap source: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Behavioral Oddities".
- Existing rewriter request→response correlation: `src/Mbproxy/Proxy/BcdPduPipeline.cs` `ProcessResponse` (lines reading `PerPlcContextWithRequest.LastRequest*`).
- Polly pipelines this phase reuses without modification: `src/Mbproxy/Proxy/Supervision/PolicyFactory.cs`.
- Counter-snapshot backwards-compat policy: [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
@@ -0,0 +1,308 @@
# Phase 10 — Read coalescing (in-flight only, zero staleness)
When two or more upstream clients send the same FC03/FC04 request to the same PLC while a matching request is already in flight, attach the late arrivals to the existing in-flight entry and fan out the single backend response to all attached clients. Operates entirely within the in-flight window (microseconds to ~10 ms typical) — no post-response caching, no TTL, no staleness contract change.
**Status:** post-1.0 follow-on, depends on Phase 9.
**Depends on:** Phase 09 (multiplexer + `InFlightRequest` with `InterestedParties` list shape).
**Parallel-safe with:** nothing. The phase modifies `PlcMultiplexer.OnFrame` and the backend reader fan-out path; both are tightly coupled.
## Goal
Phase 9's multiplexer routes every upstream request individually, even when two upstream clients are asking for identical data. In a fleet of 54 PLCs where the HMI, historian, and engineering workstation all poll the same screen tags every second, an overlapping read can cost up to 3× the backend traffic it needs, and the H2-ECOM100's single-request-per-scan internal serialization means that redundant traffic compounds into measurable backend latency.
Phase 10 detects same-key reads within the in-flight window and serves them from a single backend response. Coalescing operates entirely between "first request sent to backend" and "response received from backend." Once the response is fanned out, the coalescing entry dies. No values are held past the response arrival; no invalidation logic; no design-doc change to the "not a polling/cache layer" stance.
## Why this is safe — the zero-staleness argument
A coalesced response is a value the backend was going to return to the first request anyway. By the time the second client's request arrives, the first request is already on the wire to the PLC. The PLC's response represents the register values at the moment the PLC serviced the request. Even if the second request had been sent separately on its own backend round-trip, the H2-ECOM100's internal serialization would have queued it behind the first, returning the same value or one sampled at most one PLC scan (≈ 2-10 ms) later.
In other words: the only thing Phase 10 changes is whether the proxy sends one or two requests to the PLC. The answer the upstream clients see is the same to within one PLC scan, and the second client receives it sooner than in the "two requests" alternative, since it doesn't wait for a second backend round-trip.
## Outputs (new files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs # readonly record struct
src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs # ConcurrentDictionary wrapper with atomic attach-or-create
src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Multiplexing/CoalescingKeyTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/InFlightByKeyMapTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingE2ETests.cs
```
## Files modified (existing files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # OnFrame learns coalescing path; reader fans out
src/Mbproxy/Proxy/ProxyCounters.cs # new: CoalescedHitCount, CoalescedMissCount, CoalescedResponseToDeadUpstream
src/Mbproxy/Options/ResilienceOptions.cs # new: ReadCoalescing sub-options
src/Mbproxy/Admin/StatusDto.cs # PlcBackendStatus gains coalescing fields
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate new fields
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show coalescing ratio in per-PLC row
docs/design.md # Rewriter section: note FC03/04 may be coalesced before reaching backend
docs/kpi.md # graduate "coalescing ratio" KPI from future to supported
install/mbproxy.config.template.json # add the new Resilience.ReadCoalescing section with comments
```
`InFlightRequest.cs` does **not** change — the `InterestedParties` list shape was specifically introduced in Phase 9 to make this phase additive.
## Tasks
### 10.1 Data types
1. **`CoalescingKey`** — `readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty)`. Hash key for the in-flight-by-key map. Auto-generated record-struct equality. Verify hashcode distribution is reasonable for typical V-memory address ranges (smoke-test in unit tests).
2. **`InFlightByKeyMap`** — wraps `ConcurrentDictionary<CoalescingKey, InFlightRequest>` plus a small lock for atomic attach-or-create. Methods:
- `bool TryAttachOrCreate(CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory, int maxParties, out InFlightRequest req, out bool wasNew)` — atomic: if the key exists and `req.InterestedParties.Count < maxParties`, append the party to a freshly-built `IReadOnlyList<InterestedParty>` (since the record is immutable, we substitute a new `InFlightRequest` with the extended list in the map) and return `(wasNew=false)`; else call factory to build a new entry, store it, return `(wasNew=true)`.
- `bool TryRemove(CoalescingKey key, out InFlightRequest req)` — called by the backend reader after fan-out completes.
- The "attach to existing" path is the load-bearing concurrency primitive of this phase. The simpler implementation: small `lock` around the attach branch. The lock-free implementation uses `AddOrUpdate` with a comparand check. Pick the simpler one; document the choice in code.
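The "simpler implementation" of attach-or-create could look roughly like the sketch below. It assumes a plain `Dictionary` guarded by a single lock (collapsing the `ConcurrentDictionary` plus small lock described above into one structure), uses minimal stand-in types so it compiles on its own, and always returns `true` because the task text does not say when the method should return `false`. The `[MaybeNullWhen(false)]` annotation is an addition for nullable-reference-type hygiene.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;

// Stand-ins so the sketch is self-contained; the real types come from Phase 9.
internal sealed class UpstreamPipe { }
internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);
internal sealed record InFlightRequest(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty,
    IReadOnlyList<InterestedParty> InterestedParties, DateTimeOffset SentAtUtc);
internal readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

// Sketch only: lock-based variant, not the project's actual implementation.
internal sealed class InFlightByKeyMap
{
    private readonly Dictionary<CoalescingKey, InFlightRequest> _map = new();
    private readonly object _gate = new();

    public int Count { get { lock (_gate) return _map.Count; } }

    public bool TryAttachOrCreate(
        CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory,
        int maxParties, out InFlightRequest req, out bool wasNew)
    {
        lock (_gate)
        {
            if (_map.TryGetValue(key, out var existing)
                && existing.InterestedParties.Count < maxParties)
            {
                // The record is immutable: build an extended list and swap in a new record.
                var parties = new List<InterestedParty>(existing.InterestedParties) { party };
                req = existing with { InterestedParties = parties };
                _map[key] = req;
                wasNew = false;
                return true;
            }

            // No entry yet, or the existing one is full: start a fresh backend request.
            req = factory();
            _map[key] = req;
            wasNew = true;
            return true;
        }
    }

    public bool TryRemove(CoalescingKey key, [MaybeNullWhen(false)] out InFlightRequest req)
    {
        lock (_gate) return _map.Remove(key, out req);
    }
}
```

Holding the lock while `factory` runs keeps attach and create strictly ordered for a given key, at the cost of serializing request setup per PLC. That trade seems reasonable while the factory only allocates a TxId and enqueues a frame, but it is a design choice to revisit if the factory ever blocks.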
### 10.2 Multiplexer integration
3. **Request path** in `PlcMultiplexer.OnFrame`:
```csharp
bool coalesceCandidate = (fc is 0x03 or 0x04)
    && resilienceOptions.CurrentValue.ReadCoalescing.Enabled;
if (coalesceCandidate)
{
    var key = new CoalescingKey(unitId, fc, startAddr, qty);
    var party = new InterestedParty(upstreamPipe, originalTxId);
    InFlightRequest? req;
    bool wasNew;
    inFlightByKey.TryAttachOrCreate(
        key, party,
        factory: () => BuildAndRegisterNew(unitId, fc, startAddr, qty, party),
        maxParties: resilienceOptions.CurrentValue.ReadCoalescing.MaxParties,
        out req, out wasNew);
    if (!wasNew)
    {
        counters.IncrementCoalescedHit();
        return; // do NOT send to backend; the first request's response is fanned out to this party
    }
    counters.IncrementCoalescedMiss();
    // Nothing more to do: the factory already allocated the proxyTxId, added the correlation entry, and sent.
    return;
}
// FC06/FC16 or coalescing disabled: existing Phase 9 path (allocate, register, send).
```
The factory closure does the existing Phase 9 work (TxId allocate, correlation map add, MBAP rewrite, send to outbound channel). The new code only adds the "is this already in-flight?" check before that work.
4. **Response fan-out** in the backend reader task — already shaped correctly by Phase 9; this phase just makes sure the `CoalescingKey` matching the response is also removed from `InFlightByKeyMap` alongside the `CorrelationMap` removal:
```csharp
if (correlationMap.TryRemove(proxyTxId, out var req))
{
    txIdAllocator.Release(proxyTxId);
    // Also clear the coalescing key so a new identical request after this point starts fresh.
    var key = new CoalescingKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
    inFlightByKey.TryRemove(key, out _);
    // Phase 9's fan-out loop; already iterates InterestedParties.
    // For late arrivals to be served, req must be the record carrying the extended party list,
    // so the attach path has to keep the CorrelationMap entry in sync with InFlightByKeyMap.
    foreach (var party in req.InterestedParties)
    {
        if (!party.Pipe.IsAlive)
        {
            counters.IncrementCoalescedResponseToDeadUpstream();
            continue;
        }
        var partyFrame = WithTxId(responseFrame, party.OriginalTxId);
        await party.Pipe.SendResponseAsync(partyFrame, ct);   // ct: the reader task's token (name assumed)
    }
}
```
### 10.3 Configuration
5. **Extend `ResilienceOptions`:**
```csharp
public sealed class ReadCoalescingOptions
{
    public bool Enabled { get; init; } = true;
    public int MaxParties { get; init; } = 32;
}

public sealed class ResilienceOptions
{
    public RetryProfile BackendConnect { get; init; } = new();
    public RecoveryProfile ListenerRecovery { get; init; } = new();
    public ReadCoalescingOptions ReadCoalescing { get; init; } = new();   // ← new
}
```
Hot-reloadable via the existing `IOptionsMonitor<MbproxyOptions>` wiring. Disabling `Enabled` at runtime means new requests take the non-coalescing path; existing in-flight coalesced entries drain naturally.
6. **`mbproxy.config.template.json` update** — add a commented `ReadCoalescing` block to the install template under `Resilience` with the two new keys, default values, and a one-paragraph explanation.
### 10.4 Counters and status surfacing
7. **`ProxyCounters` additions:**
```csharp
public void IncrementCoalescedHit();
public void IncrementCoalescedMiss();
public void IncrementCoalescedResponseToDeadUpstream();
```
`CounterSnapshot` gains `CoalescedHitCount`, `CoalescedMissCount`, `CoalescedResponseToDeadUpstream` — all `long`, all Interlocked. The status page derives `coalescingRatio = Hit / (Hit + Miss)` for display; the raw counts are exposed in JSON for downstream tooling.
8. **`/status.json` per-PLC fields** — extend `PlcBackendStatus`:
```csharp
public sealed record PlcBackendStatus(
    long ConnectsSuccess, long ConnectsFailed,
    ExceptionCounts ExceptionsByCode,
    double LastRoundTripMs,
    long CoalescedHitCount,                 // ← new
    long CoalescedMissCount,                // ← new
    long CoalescedResponseToDeadUpstream);  // ← new
```
9. **HTML page** — extend the per-PLC row with a compact `Coal: 73%` cell (`hit / (hit+miss) * 100`, rounded). Page-weight assertion (under 50 KB for 54 PLCs) must continue to pass.
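The ratio shown on the status page and in the `Coal:` cell has one edge case the formula above leaves implicit: a PLC that has served no FC03/FC04 reads yet has `Hit + Miss = 0`. A minimal sketch of the derivation follows; `CoalescingMath` and its method names are hypothetical helpers, not part of the declared surface, and returning 0 for the empty case is an assumption.

```csharp
using System;

// Hypothetical helper for illustration; not part of the plan's public surface.
internal static class CoalescingMath
{
    // Hit / (Hit + Miss) as a fraction in [0, 1].
    // Assumption: report 0 when nothing has been counted, to avoid dividing by zero.
    public static double Ratio(long hit, long miss)
    {
        long total = hit + miss;
        return total == 0 ? 0.0 : (double)hit / total;
    }

    // Whole-percent form for a compact display cell such as "Coal: 73%".
    public static int Percent(long hit, long miss) =>
        (int)Math.Round(Ratio(hit, miss) * 100.0, MidpointRounding.AwayFromZero);
}
```

Both inputs would come from one `CounterSnapshot`, so the two counts are read together rather than at different moments.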
### 10.5 Documentation
10. **`docs/design.md` Rewriter section:** add a paragraph clarifying that FC03/FC04 requests may be coalesced with other in-flight requests of the same `(unitId, fc, start, qty)` before reaching the backend. Emphasize that the transparency contract holds — each client sees its own original TxId restored on the response, and the response value is identical to what an uncoalesced request would have returned (within the PLC's scan-time precision).
11. **`docs/kpi.md` Tier 1:** the new `coalescedHitCount`, `coalescedMissCount`, and derived `coalescingRatio` graduate from "future" to "supported" Tier 1 fields. Mention the `coalescedResponseToDeadUpstream` counter as a low-priority Tier 2 informational metric.
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Multiplexing;
internal readonly record struct CoalescingKey(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal sealed class InFlightByKeyMap
{
    public bool TryAttachOrCreate(
        CoalescingKey key,
        InterestedParty party,
        Func<InFlightRequest> factory,
        int maxParties,
        out InFlightRequest req,
        out bool wasNew);
    public bool TryRemove(CoalescingKey key, out InFlightRequest req);
    public int Count { get; }
}
```
```csharp
namespace Mbproxy.Options;
public sealed class ReadCoalescingOptions
{
    public bool Enabled { get; init; } = true;
    public int MaxParties { get; init; } = 32;
}

// Added field on existing ResilienceOptions:
public ReadCoalescingOptions ReadCoalescing { get; init; } = new();
```
`ProxyCounters` and `CounterSnapshot` gain three new `long` fields. No public-surface removals, no renames.
## Tests required
### Unit (`Category = Unit`)
**`CoalescingKeyTests`** (≥ 4 tests):
1. `Equality_OnIdenticalKeys_ReturnsTrue`
2. `Equality_OnDifferentFc_ReturnsFalse` — FC03 vs FC04 with same start/qty/unit are NOT equal (different Modbus tables).
3. `Equality_OnDifferentUnitId_ReturnsFalse`
4. `HashCode_DistributionSanity` — build 10,000 randomly-generated keys, bucket by `Key.GetHashCode() & 0xFF`, assert no bucket has > 5 % of total (rough uniformity check).
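The bucketing step behind `HashCode_DistributionSanity` could be sketched as below. `HashDistributionCheck` and its members are hypothetical names, the key is redeclared only to keep the sketch self-contained, and the value ranges (unit IDs 1 to 247, quantities 1 to 125) are an assumption based on the usual Modbus limits. Whether the compiler-generated hash for a record struct actually stays under the 5 % bound is what the real test would establish; the sketch does not claim a result.

```csharp
using System;

internal readonly record struct CoalescingKey(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

// Hypothetical test helper for illustration.
internal static class HashDistributionCheck
{
    // Buckets `sampleCount` random keys by the low 8 bits of their hash code.
    public static int[] BucketCounts(int sampleCount, int seed)
    {
        var rng = new Random(seed); // fixed seed keeps the run repeatable
        var buckets = new int[256];
        for (int i = 0; i < sampleCount; i++)
        {
            var key = new CoalescingKey(
                UnitId: (byte)rng.Next(1, 248),              // 1..247
                Fc: (byte)(rng.Next(2) == 0 ? 0x03 : 0x04),  // FC03 or FC04
                StartAddress: (ushort)rng.Next(0, 65536),
                Qty: (ushort)rng.Next(1, 126));              // 1..125
            buckets[key.GetHashCode() & 0xFF]++;
        }
        return buckets;
    }

    // True when no bucket holds more than `maxShare` of all samples.
    public static bool IsRoughlyUniform(int[] buckets, double maxShare)
    {
        long total = 0;
        int max = 0;
        foreach (int count in buckets)
        {
            total += count;
            if (count > max) max = count;
        }
        return max <= total * maxShare;
    }
}
```

A test body would then assert `IsRoughlyUniform(BucketCounts(10_000, seed), 0.05)` for some fixed seed.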
**`InFlightByKeyMapTests`** (≥ 6 tests):
1. `TryAttachOrCreate_NewKey_CallsFactory_ReturnsTrue_WasNewTrue`
2. `TryAttachOrCreate_ExistingKey_AppendsParty_ReturnsTrue_WasNewFalse`
3. `TryAttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend` — refuses to fan out beyond the cap; preserves backend-load-shedding guarantee.
4. `TryRemove_AfterAttach_AllPartiesPresent_InRetrievedEntry`
5. `TryRemove_OfMissing_ReturnsFalse`
6. `Concurrent_AttachOrCreate_From_Two_Threads_NoLostParties_AndNoDuplicateEntries` — 100 tasks × 1000 ops each.
**`ReadCoalescingTests`** (≥ 7 tests, real sockets, stub backend):
1. `TwoClients_SameRequest_OnlyOneBackendRoundTrip` — stub backend counts received requests; assert 1.
2. `TwoClients_DifferentRequests_BothHitBackend` — different start addresses; assert 2.
3. `FiveClients_SameRequest_OneBackendRoundTrip_FiveResponses` — fan-out works correctly with 5 attached parties.
4. `FC03_And_FC04_SameAddress_NOT_Coalesced` — different tables.
5. `FC06_Write_NeverCoalesced` — writes always allocate their own TxId.
6. `OneClient_DisconnectsMidFlight_OthersStillGetResponse_AndDeadUpstreamCounterIncrements`
7. `AtMaxParties_NextRequest_StartsFreshBackendRoundTrip` — verify the cap behaviour: when `MaxParties = 2` and 3 simultaneous clients send the same request, the third opens a new in-flight entry rather than joining the first.
### E2E (`Category = E2E`)
**`ReadCoalescingE2ETests`** (≥ 5 tests, against pymodbus simulator, `[Collection(nameof(DL205SimulatorCollection))]`):
1. `E2E_FiveConcurrentClients_SameReadHR1072_CoalescedHitCount_AtLeast_3` — five NModbus clients connect to the proxy, simultaneously read HR1072 (BCD-configured). Assert `coalescedHitCount >= 3` (race wiggle room — perfect coalescing would give 4 hits, but the racy first-arrivals can both miss).
2. `E2E_RewriterStillWorks_ForAllCoalescedParties` — same setup, but with BCD tag at 1072. All five clients receive decoded `1234`. Proves the rewriter sees a coalesced response correctly and the TxId restoration doesn't perturb the BCD bytes.
3. `E2E_DifferentRegisters_NotCoalesced_CoalescedHitCount_Zero` — five clients reading five different addresses; assert no coalescing happened.
4. `E2E_StatusPage_Shows_CoalescingRatio` — `/status.json` for the test PLC has populated `coalescedHitCount` and `coalescedMissCount` after the burst.
5. `E2E_DisableViaHotReload_RevertToPhase9Behaviour` — write a temp appsettings with `ReadCoalescing.Enabled = false`, hot-reload, verify subsequent identical reads each hit the backend separately (counter doesn't increment).
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All prior tests still green — specifically the **4 critical Phase-9 regression guards**:
- `Forward_FC03_HR1072_Returns_Decoded_1234`
- `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`
- `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`
- `MbapTxId_IsPreservedEndToEnd`
- [ ] All new unit + e2e tests pass (≥ 17 new).
- [ ] **Headline assertion:** 5 concurrent FC03 reads of the same register through the proxy produce **at most 2** backend round-trips (allowing one race for the initial pair). Verifiable via stub-backend's request counter in `ReadCoalescingTests`.
- [ ] FC04 reads of the same address as a coexisting FC03 stream do NOT coalesce together. Verified by an explicit test.
- [ ] FC06 / FC16 writes are NEVER on the coalescing path. Verified by setting `MaxParties = 1` and confirming write throughput is unaffected.
- [ ] Coalescing-ratio counter ≥ 50 % under the headline stress test (5 simultaneous identical reads).
- [ ] Disabling coalescing via `Mbproxy.Resilience.ReadCoalescing.Enabled = false` hot-reloads cleanly; running coalesced entries drain naturally without errors.
- [ ] `docs/design.md` Rewriter section mentions the coalescing path; `docs/kpi.md` Tier 1 includes the new fields; `install/mbproxy.config.template.json` includes the new commented `Resilience.ReadCoalescing` block.
- [ ] HTML page weight under 50 KB for 54 PLCs (verify with the existing renderer test).
## Out of scope
- **Post-response caching** — no TTL, no staleness window beyond "while the request is in flight." This phase is strictly in-flight. A response-cache phase would be a separate plan (Phase 11+) and would require the design.md "not a cache layer" stance to be revisited and rewritten.
- **Range-overlap coalescing** — request A reading [100..110], request B reading [105..115]. Different keys; no coalescing. Range-overlap detection is a separate optimisation with its own algorithmic complexity (interval trees, etc.) and its own staleness questions (a merged backend read would carry registers 100..104 for A, but those were never part of B's request).
- **Cross-PLC coalescing** — each PLC's multiplexer has its own key map. No optimization across PLCs (their backend connections are independent anyway).
- **Write coalescing / batching** — different problem with non-idempotency concerns. The design doc's "no mid-request retry on writes" principle extends to "no write coalescing."
- **Predictive batching** — combining a single client's likely-next read into the current request. Out of scope; speculative reads are a different optimization category.
- **Adaptive `MaxParties`** — `MaxParties` stays at the configured value. Auto-tuning is interesting but speculative.
## Subagent briefing
If you're the agent picking up this phase:
1. **Phase 9's `InterestedParties` list is the seam.** This phase only adds the "look up the key, attach a new party to an existing entry" logic. The fan-out side already iterates the list correctly. If you find yourself rewriting Phase 9's response path, you've drifted out of scope.
2. **`CoalescingKey` includes `UnitId`.** DL260 fleets typically use unit 1, but we don't assume — different unit IDs are different PLC personalities behind the same TCP socket and must not coalesce.
3. **FC03 and FC04 are different tables.** Same register address space in DL series, but Modbus treats them separately. Different `CoalescingKey` for the same address; no coalescing across them.
4. **Coalescing is best-effort under races.** Two simultaneous identical requests can both miss the map and create separate entries — counter just shows a lower ratio. Not a bug; documented behaviour. Do not over-engineer with double-checked locking.
5. **`MaxParties` is the load-shedding safety valve.** If a thousand HMI panels all attach to one in-flight request, the response fan-out cost goes linear with attachment count and stalls the backend reader task. Cap at 32 by default. Past the cap, route through a fresh entry — fan-out cost per entry is bounded.
6. **The attach-or-create operation MUST be atomic per key.** Two simultaneous arrivals must not both create new entries for the same key (would defeat coalescing). The simpler implementation: `lock(map.SyncRoot)` around the attach branch. The lock-free implementation uses `AddOrUpdate` with the updateFactory checking the count cap. Pick whichever you can write correctly in 30 minutes; document the choice.
7. **Response fan-out must check `Pipe.IsAlive` per party.** An upstream client that disconnects between attaching and the response arriving — count it as `CoalescedResponseToDeadUpstream` and continue with the others. Do not throw, do not log per-occurrence at Information (would be too noisy under client churn).
8. **Hot-reload of `Enabled` doesn't disrupt in-flight entries.** Disabling the feature mid-flight just means subsequent requests take the non-coalescing path. Existing coalesced entries drain when their response arrives. Don't try to "flush" them on the reload event.
9. **`CoalescedHit + CoalescedMiss = total FC03+FC04 requests`.** The math has to balance per snapshot. Use `Interlocked.Increment` exclusively. Disabling coalescing means every FC03/04 request becomes a Miss (which is fine — the metric still tracks total reads).
10. **Update `design.md` AND `kpi.md` AND the install template in the same PR as the code.** Doc drift is a gate failure. The coalescing-ratio KPI specifically graduates from "future" to "Tier 1 supported" — make that promotion explicit in `kpi.md`.
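Briefing item 6 offers a lock-based and a lock-free route for attach-or-create. A minimal sketch of the lock-based one follows, using a private lock object in place of `map.SyncRoot`. `InterestedParty` and `InFlightRequest` here are simplified stand-ins for the Phase 9 types (their real shapes live in `09-txid-multiplexing.md` and may differ), and attaching the arriving party after calling `factory` is an assumption about how the two cooperate.

```csharp
using System;
using System.Collections.Generic;

// Stand-ins for the Phase 9 types; only what the sketch needs is modelled.
internal sealed record InterestedParty(int PipeId, ushort OriginalTxId);

internal sealed class InFlightRequest
{
    private readonly List<InterestedParty> _parties = new();
    public IReadOnlyList<InterestedParty> InterestedParties => _parties;
    internal void Attach(InterestedParty party) => _parties.Add(party);
}

internal readonly record struct CoalescingKey(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal sealed class InFlightByKeyMap
{
    private readonly object _sync = new();
    private readonly Dictionary<CoalescingKey, InFlightRequest> _map = new();

    public bool TryAttachOrCreate(
        CoalescingKey key, InterestedParty party,
        Func<InFlightRequest> factory, int maxParties,
        out InFlightRequest req, out bool wasNew)
    {
        lock (_sync) // lookup, cap check, and attach happen as one step per map
        {
            if (_map.TryGetValue(key, out var existing)
                && existing.InterestedParties.Count < maxParties)
            {
                existing.Attach(party);
                req = existing;
                wasNew = false;
                return true;
            }

            // New key, or the existing entry is at the cap: start a fresh entry.
            req = factory();
            req.Attach(party);
            _map[key] = req;
            wasNew = true;
            return true;
        }
    }

    public bool TryRemove(CoalescingKey key, out InFlightRequest req)
    {
        lock (_sync) { return _map.Remove(key, out req!); }
    }

    public int Count
    {
        get { lock (_sync) { return _map.Count; } }
    }
}
```

One thing this sketch leaves open: when an entry at the cap is displaced by a fresh one, the older entry is still awaiting its response, and removing by key alone when that response arrives would drop the newer entry. A real implementation would presumably compare the stored entry against the one being completed before removing it.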
## Cross-references
- Phase 9's multiplexer is the foundation. The `InterestedParty` and `InterestedParties` types live there: [`09-txid-multiplexing.md`](09-txid-multiplexing.md).
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (rates / percentiles / availability — coalescing-ratio joins this tier).
- Modbus unit-ID semantics that make coalescing-key uniqueness load-bearing: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Function Code Support" and "Coils and Discrete Inputs".
- Counter snapshot backwards-compat policy that this phase respects (additive only): [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
+374
@@ -0,0 +1,374 @@
# Phase 11 — Short-TTL response cache (bounded staleness)
Cache FC03/FC04 responses with a per-tag TTL. Subsequent same-key reads within the TTL window are served from the cache without backend traffic. FC06/FC16 writes invalidate overlapping cache entries on the response side. **This phase is a deliberate design-contract change** — the proxy gains an opt-in cache layer with explicit bounded staleness.
**Status:** post-1.0 follow-on, depends on Phase 10. **Architectural pivot — read the "Design pivot" section below before scoping.**
**Depends on:** Phase 09 (multiplexer chokepoint), Phase 10 (`CoalescingKey` is reused as `CacheKey` — same shape).
**Parallel-safe with:** nothing.
## Design pivot — do NOT skip this section
Phases 09 and 10 were additive performance optimisations that preserved the design's "transparent inline proxy" contract. **Phase 11 is different.** It changes the load-bearing claim in `docs/design.md`:
- **Today's contract** (lines 12-20 of `design.md`): *"The service is not a polling/cache layer. It is a transparent Modbus TCP proxy whose job is to rewrite the configured BCD tags in real time, in both directions, while proxying every other byte of the MBTCP connection untouched."*
- **Post-Phase-11 contract:** the proxy is *optionally* a cache layer within a bounded TTL. The TTL is per-tag, default 0 (no caching), opt-in by operator action.
Implication: **Task 1 of this phase is rewriting the relevant `design.md` sections.** The contract update is a code commit too — review, land first, then build the implementation against the new contract. Shipping cache code while design.md still says "not a cache layer" is a gate failure, not a merge-it-and-fix-later situation.
The cache is **OFF by default**. A fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. The opt-in shape (per-tag `CacheTtlMs` configuration) means a deployment can adopt Phase 11 without changing semantics until an operator explicitly opts a tag in.
## Goal
Reduce backend Modbus traffic for the common SCADA case where many clients poll the same registers at near-identical cadences. Phase 10 already coalesces within the in-flight window (~10 ms). Phase 11 extends the "served without backend traffic" window from those in-flight milliseconds to operator-configurable seconds.
Concretely: with `CacheTtlMs = 1000` on a frequently-read BCD tag, the backend sees at most one read of that tag per second per PLC regardless of how many upstream clients are polling.
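The "at most one read per TTL window" claim can be checked with a small deterministic simulation. The sketch below is illustrative only: `CacheTtlSimulation` is a hypothetical name, backend latency is treated as zero, and it models a single cache key whose entry is a hit only while its expiry is strictly later than the current time.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical simulation helper for illustration.
internal static class CacheTtlSimulation
{
    // Counts backend reads for one cache key, given read arrival times in
    // milliseconds (ascending). A TTL of 0 means caching is off.
    public static int CountBackendReads(int cacheTtlMs, IEnumerable<long> readTimesMs)
    {
        int backendReads = 0;
        long expiresAtMs = long.MinValue; // nothing cached yet

        foreach (long now in readTimesMs)
        {
            bool hit = cacheTtlMs > 0 && expiresAtMs > now;
            if (hit) continue;            // served from cache, no backend traffic

            backendReads++;               // miss: go to the backend
            if (cacheTtlMs > 0) expiresAtMs = now + cacheTtlMs;
        }
        return backendReads;
    }
}
```

With `cacheTtlMs = 1000` and ten reads at 0, 100, ..., 900 ms, only the first read misses, so the function returns 1; with `cacheTtlMs = 0` every read reaches the backend.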
## What it does NOT do
- **No active polling.** Cache entries are populated on demand by upstream reads, not by proactive polling. (Active polling is Tier C-3 from the conversation history — a separate phase if ever wanted.)
- **No predictive prefetching.**
- **No SCADA-style subscription/notification model.**
- **No write-back caching.** Writes always go straight through to the backend; cache invalidation happens on the write-response side, not by intercepting the write.
- **No cross-PLC caching.** Each PLC's cache is independent.
- **No persistence.** Process restart wipes the cache. Cache survives backend disconnects (the cached data was fresh when stored; disconnects don't retroactively invalidate it).
## Outputs (new files)
```
src/Mbproxy/Proxy/Cache/CacheKey.cs # reuses CoalescingKey shape; type-aliased or reflected
src/Mbproxy/Proxy/Cache/CacheEntry.cs # response bytes + expiry + lastFetched
src/Mbproxy/Proxy/Cache/ResponseCache.cs # the cache itself; TTL-based eviction, LRU under cap
src/Mbproxy/Proxy/Cache/CacheInvalidator.cs # address-range-overlap matcher for write invalidation
src/Mbproxy/Proxy/Cache/CacheLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Cache/CacheKeyTests.cs
tests/Mbproxy.Tests/Proxy/Cache/CacheEntryTests.cs
tests/Mbproxy.Tests/Proxy/Cache/ResponseCacheTests.cs
tests/Mbproxy.Tests/Proxy/Cache/CacheInvalidatorTests.cs
tests/Mbproxy.Tests/Proxy/Cache/ResponseCacheE2ETests.cs
```
## Files modified
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # OnFrame: cache check BEFORE coalescing; OnResponse: cache store + write invalidation
src/Mbproxy/Options/BcdTagOptions.cs # add CacheTtlMs (default 0 = no caching)
src/Mbproxy/Options/PlcOptions.cs # add DefaultCacheTtlMs
src/Mbproxy/Options/MbproxyOptions.cs # add Cache section (AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs)
src/Mbproxy/Bcd/BcdTag.cs # carry CacheTtlMs on the record
src/Mbproxy/Bcd/BcdTagMapBuilder.cs # resolve per-tag TTL with per-PLC default fallback
src/Mbproxy/Proxy/ProxyCounters.cs # new: CacheHit, CacheMiss, CacheInvalidations, CacheEntryCount, CacheBytes
src/Mbproxy/Admin/StatusDto.cs # surface cache KPIs in PlcBackendStatus
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show cache-hit ratio per PLC row
src/Mbproxy/Configuration/ReloadValidator.cs # validate CacheTtlMs bounds; require AllowLongTtl=true for > 60s
docs/design.md # SUBSTANTIAL — see Task 1
docs/kpi.md # graduate cache KPIs from future to Tier 1
install/mbproxy.config.template.json # add CacheTtlMs examples + staleness commentary
mbproxy/CLAUDE.md # Architecture summary: add the cache-layer bullet
```
## Tasks
### 11.1 Design contract update — **DO THIS FIRST**
1. **`docs/design.md` updates** (review and land before writing implementation code):
**a. "What this is" section** — add the cache disclosure paragraph:
> As of Phase 11, the proxy gains an *optional* per-tag response cache with a bounded staleness window (`CacheTtlMs`). The cache is OFF by default (`CacheTtlMs = 0`) and must be opt-in per tag. With caching enabled, the proxy is no longer purely transparent — upstream reads may return a value up to `CacheTtlMs` milliseconds old. The 1:1 read-to-backend-request guarantee no longer holds; operators opting tags into caching MUST acknowledge the staleness bound.
**b. New section "Cache contract"** between "Rewriter" and "Failure modes":
- Cache populates on demand only. No polling.
- Cache entries carry their TTL with them. Entries older than their TTL are evicted on access and treated as misses.
- FC06/FC16 successful responses invalidate cache entries whose address range overlaps the write.
- Cache survives backend disconnects (cached data was valid at cache time).
- Cache does NOT survive process restart.
- Multi-tag read range: effective TTL is the minimum of all configured tags in the range. Any tag with TTL = 0 in the range disables caching for the whole read.
- Cache stores POST-rewriter bytes (BCD already decoded). Hits bypass the rewriter entirely.
**c. "Failure modes" section** — add bullet on cache behaviour during backend recovery:
- Cache hits remain valid during a `recovering` listener state. Data was fresh when cached; recovery only affects future requests.
- Invalidations during recovery: writes that arrive cannot reach the backend, so the invalidation never happens. This is consistent — the write didn't take effect either. Cache entries remain valid until their TTL expires.
**d. "Rewriter" section** — clarify that the rewriter runs on the cache-miss path (decode on store), and that cache hits return pre-decoded bytes without re-invoking the rewriter.
Treat (a)-(d) as one atomic change. Get them reviewed, land them, then implement against the new contract.
### 11.2 Cache key
2. **`CacheKey`** — same shape as Phase 10's `CoalescingKey`: `readonly record struct CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty)`. If Phase 10 is already merged, prefer **a `using CacheKey = CoalescingKey;` alias** over a redefinition — same data, same hashing, single source of truth. If the two phases land together (Phase 10 + 11 in a coordinated release), consider renaming `CoalescingKey` to `ReadKey` to make the shared use site neutral.
### 11.3 Cache entry and storage
3. **`CacheEntry`** — `internal sealed record CacheEntry(byte[] PduBytes, DateTimeOffset CachedAtUtc, DateTimeOffset ExpiresAtUtc, int Length, ushort LastUsedTick)`. `LastUsedTick` is a monotonic counter for LRU ordering (avoids `DateTimeOffset.UtcNow` calls on every cache access).
4. **`ResponseCache`** — `internal sealed class ResponseCache : IDisposable`. Methods:
- `bool TryGet(CacheKey key, out CacheEntry entry)` — returns true ONLY if entry exists and `entry.ExpiresAtUtc > DateTimeOffset.UtcNow`. Updates `LastUsedTick` on hit. Expired entries removed lazily.
- `void Set(CacheKey key, CacheEntry entry)` — replaces any existing entry. If `Count >= MaxEntriesPerPlc`, evict the LRU entry first.
- `int Invalidate(byte unitId, ushort startAddress, ushort qty)` — delegates to `CacheInvalidator`. Returns count invalidated.
- `int Count { get; }`, `long ApproximateBytes { get; }`
- Background eviction loop (started in constructor, stopped in `Dispose`): every `EvictionIntervalMs` (default 5000), scans the map and removes entries past `ExpiresAtUtc`.
5. **`CacheInvalidator`** — pure logic: `static IEnumerable<CacheKey> FindOverlapping(IReadOnlyCollection<CacheKey> haystack, byte unitId, ushort writeStart, ushort writeQty)`. Returns keys whose range `[StartAddress, StartAddress + Qty)` intersects `[writeStart, writeStart + writeQty)`. Limit scope to keys matching `unitId` and `Fc in {3, 4}` (we never cache writes; invalidation only applies to read entries).
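The overlap matcher in item 5 could look roughly like the sketch below. It follows the declared `FindOverlapping` signature and half-open interval semantics; `CacheKey` is redeclared only so the block is self-contained, and widening to `int` before adding is a choice made here so that `StartAddress + Qty` cannot wrap past 65535.

```csharp
using System.Collections.Generic;

internal readonly record struct CacheKey(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal static class CacheInvalidator
{
    // Yields cached read keys whose range [StartAddress, StartAddress + Qty)
    // intersects the write range [writeStart, writeStart + writeQty).
    public static IEnumerable<CacheKey> FindOverlapping(
        IReadOnlyCollection<CacheKey> haystack,
        byte unitId, ushort writeStart, ushort writeQty)
    {
        // ushort + ushort is evaluated as int in C#, so no wrap-around here.
        int writeEnd = writeStart + writeQty;

        foreach (var key in haystack)
        {
            if (key.UnitId != unitId) continue;          // other unit: never touched
            if (key.Fc is not (0x03 or 0x04)) continue;  // only read entries are cached

            int entryEnd = key.StartAddress + key.Qty;
            if (writeStart < entryEnd && key.StartAddress < writeEnd)
                yield return key;
        }
    }
}
```

Because the method is lazy, a caller that removes entries while enumerating would want to materialise the result first (for example into a list) before mutating the underlying map.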
### 11.4 Multiplexer integration
6. **Cache lookup in `PlcMultiplexer.OnFrame`** — for FC03/04 requests when the read range has a non-zero resolved TTL:
```csharp
if (fc is 0x03 or 0x04 && resolvedTtlMs > 0) {
    var key = new CacheKey(unitId, fc, startAddr, qty);
    if (cache.TryGet(key, out var entry)) {
        counters.IncrementCacheHit();
        // Build a fresh MBAP wrapper for this client and send.
        var hitFrame = BuildResponseFrame(entry.PduBytes, originalTxId, unitId);
        upstreamPipe.SendResponse(hitFrame);
        return; // no coalescing check, no backend round-trip
    }
    counters.IncrementCacheMiss();
}
// Fall through to Phase 10 coalescing path → Phase 9 send path
```
**Order matters:** cache check FIRST, then coalescing. A cache hit short-circuits everything; only on a miss do we engage Phase 10's coalescing logic.
7. **Cache store on response** — in the backend reader fan-out path, AFTER the rewriter has run on the response:
```csharp
if (req.Fc is 0x03 or 0x04 && req.ResolvedCacheTtlMs > 0) {
    var key = new CacheKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
    var now = DateTimeOffset.UtcNow;
    var entry = new CacheEntry(
        PduBytes: rewrittenPduBytes.ToArray(), // defensive copy
        CachedAtUtc: now,
        ExpiresAtUtc: now.AddMilliseconds(req.ResolvedCacheTtlMs),
        Length: rewrittenPduBytes.Length,
        LastUsedTick: NextLruTick());
    cache.Set(key, entry);
}
```
Note: `req.ResolvedCacheTtlMs` is computed at request-receive time by walking the BcdTagMap for tags in `[StartAddress, StartAddress + Qty)` and taking `min(CacheTtlMs)`. If any tag has TTL = 0, `ResolvedCacheTtlMs = 0` and the whole read is uncached.
8. **Cache invalidation on write response** — FC06 / FC16 successful response (NOT exception response):
```csharp
// `fc` is the function code read from the response PDU; bit 0x80 set marks a Modbus exception response.
if (req.Fc is 0x06 or 0x10 && (fc & 0x80) == 0) {
    int invalidated = cache.Invalidate(req.UnitId, req.StartAddress, req.Qty);
    if (invalidated > 0) {
        counters.AddCacheInvalidations(invalidated);
        CacheLogEvents.WriteInvalidatedEntries(logger, req.UnitId,
            req.StartAddress, req.Qty, invalidated);
    }
}
```
Invalidation is by ADDRESS RANGE OVERLAP, not by exact key match. A write to register 105 invalidates a cached read of [100..110] and a cached read of [105..115] but NOT a cached read of [200..210].
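The cache-hit path in item 6 calls `BuildResponseFrame` to wrap cached PDU bytes for the requesting client. A minimal sketch of such a helper follows; the containing class name `MbapFrames` and the `ReadOnlySpan<byte>` parameter type are assumptions, while the header layout is the standard 7-byte MBAP header (transaction ID, protocol ID 0, length covering unit ID plus PDU, unit ID).

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical container class; only the method name appears in the plan.
internal static class MbapFrames
{
    // Wraps a PDU (function code + data) in a fresh MBAP header carrying the
    // requesting client's own transaction ID.
    public static byte[] BuildResponseFrame(
        ReadOnlySpan<byte> pdu, ushort originalTxId, byte unitId)
    {
        var frame = new byte[7 + pdu.Length];

        // Bytes 0-1: transaction ID, big-endian.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId);
        // Bytes 2-3: protocol ID, always 0 for Modbus.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);
        // Bytes 4-5: count of the bytes that follow (unit ID + PDU).
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(pdu.Length + 1));
        // Byte 6: unit ID, then the PDU itself.
        frame[6] = unitId;
        pdu.CopyTo(frame.AsSpan(7));
        return frame;
    }
}
```

A `byte[]` such as `entry.PduBytes` converts implicitly to `ReadOnlySpan<byte>`, so the call shape used in item 6 works unchanged. Since the frame is a new array per call, the cached bytes are never shared with a client's send buffer.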
### 11.5 Per-tag TTL configuration
9. **`BcdTagOptions` extension:**
```csharp
public sealed class BcdTagOptions {
    public ushort Address { get; init; }
    public byte Width { get; init; }
    // null = unset (falls back to PlcOptions.DefaultCacheTtlMs); 0 = explicitly no caching
    public int? CacheTtlMs { get; init; }
}
```
10. **`PlcOptions.DefaultCacheTtlMs`** — applies to any tag whose explicit `CacheTtlMs` was not set (use a nullable `int?` on `BcdTagOptions` instead of `int = 0` to distinguish "explicitly zero" from "unset"). Default for the PLC default itself is 0.
11. **`MbproxyOptions.Cache` section:**
```csharp
public sealed class CacheOptions {
    public bool AllowLongTtl { get; init; } = false;   // gate for TTL > 60_000
    public int MaxEntriesPerPlc { get; init; } = 1000;
    public int EvictionIntervalMs { get; init; } = 5000;
}
```
12. **Validation** in `ReloadValidator`: `CacheTtlMs >= 0` always; `CacheTtlMs > 60_000` requires `Cache.AllowLongTtl = true`. Reject reloads that violate. Prevents "left at 1 hour by accident" deployments.
13. **`BcdTagMapBuilder.Build` resolution**: returns each `BcdTag` with `CacheTtlMs` resolved per fallback rules: explicit per-tag → per-PLC default → 0.
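The fallback, range-minimum, and validation rules from this section can be written as three small pure functions. This is a sketch under stated assumptions: `CacheTtlResolution` and its method names are hypothetical, and treating a read that covers no configured tags as uncached (TTL 0) is an assumption the plan does not spell out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of the declared surface.
internal static class CacheTtlResolution
{
    // Per-tag fallback: explicit per-tag value, else the per-PLC default
    // (which itself defaults to 0).
    public static int ResolveTagTtl(int? tagCacheTtlMs, int plcDefaultCacheTtlMs) =>
        tagCacheTtlMs ?? plcDefaultCacheTtlMs;

    // Effective TTL for a read spanning several configured tags: the minimum
    // of their resolved TTLs, so any tag at 0 disables caching for the read.
    // Assumption: a range with no configured tags resolves to 0 (uncached).
    public static int ResolveReadTtl(IEnumerable<int> resolvedTagTtlsInRange)
    {
        int? min = null;
        foreach (int ttl in resolvedTagTtlsInRange)
            min = min is null ? ttl : Math.Min(min.Value, ttl);
        return min ?? 0;
    }

    // Validation rule: never negative; above 60 s only with AllowLongTtl.
    public static bool IsValidTtl(int cacheTtlMs, bool allowLongTtl) =>
        cacheTtlMs >= 0 && (cacheTtlMs <= 60_000 || allowLongTtl);
}
```

Keeping the nullable check in one place means an explicit `CacheTtlMs: 0` on a tag still wins over a non-zero `DefaultCacheTtlMs`, which is the reason for distinguishing "unset" from "explicitly zero".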
### 11.6 Counters and status surfacing
14. **`ProxyCounters` additions:**
- `CacheHitCount` (Interlocked long)
- `CacheMissCount` (Interlocked long)
- `CacheInvalidations` (Interlocked long)
- `CacheEntryCount` (snapshot from `ResponseCache.Count` — read-time)
- `CacheBytes` (snapshot from `ResponseCache.ApproximateBytes` — read-time)
15. **`StatusDto.PlcBackendStatus` extension:**
```csharp
public sealed record PlcBackendStatus(
    long ConnectsSuccess, long ConnectsFailed,
    ExceptionCounts ExceptionsByCode,
    double LastRoundTripMs,
    long CoalescedHitCount, long CoalescedMissCount, long CoalescedResponseToDeadUpstream, // Phase 10
    long CacheHitCount, long CacheMissCount,                                               // Phase 11
    long CacheInvalidations, long CacheEntryCount, long CacheBytes);                       // Phase 11
```
16. **HTML page** — add a compact `Cache: 73%` cell per PLC row. Page-weight assertion (under 50 KB for 54 PLCs) must continue to pass.
### 11.7 Documentation and template
17. **`docs/kpi.md`** — graduate cache-hit-ratio KPIs from "deferred / future" to Tier 1 supported. Add `cacheEntryCount` and `cacheBytes` as Tier 2 memory-watch KPIs.
18. **`install/mbproxy.config.template.json`** — add a fully-commented `Mbproxy.Cache` section showing `AllowLongTtl`, `MaxEntriesPerPlc`, `EvictionIntervalMs`. Show example per-tag `CacheTtlMs: 1000` and per-PLC `DefaultCacheTtlMs: 500` entries. Include a prominent comment explaining the staleness contract: "**clients reading these tags will see values up to `CacheTtlMs` milliseconds old**".
19. **`mbproxy/CLAUDE.md` Architecture summary** — add a bullet:
> - **Optional response cache** with per-tag TTL (default 0 = off). Cached FC03/04 responses serve subsequent same-key reads without backend traffic; FC06/FC16 write responses invalidate overlapping entries by address range.
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Cache;
internal readonly record struct CacheKey(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal sealed record CacheEntry(
    byte[] PduBytes,
    DateTimeOffset CachedAtUtc, DateTimeOffset ExpiresAtUtc,
    int Length, ushort LastUsedTick);

internal sealed class ResponseCache : IDisposable {
    public bool TryGet(CacheKey key, out CacheEntry entry);
    public void Set(CacheKey key, CacheEntry entry);
    public int Invalidate(byte unitId, ushort startAddress, ushort qty);
    public int Count { get; }
    public long ApproximateBytes { get; }
    public void Dispose();
}

internal static class CacheInvalidator {
    public static IEnumerable<CacheKey> FindOverlapping(
        IReadOnlyCollection<CacheKey> haystack,
        byte unitId, ushort writeStart, ushort writeQty);
}
```
```csharp
namespace Mbproxy.Options;
public sealed class CacheOptions {
    public bool AllowLongTtl { get; init; } = false;
    public int MaxEntriesPerPlc { get; init; } = 1000;
    public int EvictionIntervalMs { get; init; } = 5000;
}

// Added field on MbproxyOptions:
public CacheOptions Cache { get; init; } = new();

// Added field on BcdTagOptions (nullable to distinguish "unset" from "explicitly 0"):
public int? CacheTtlMs { get; init; }

// Added field on PlcOptions:
public int DefaultCacheTtlMs { get; init; } = 0;
```
`ProxyCounters` and `CounterSnapshot` gain 5 new long fields. No public-surface removals or renames.
## Tests required
### Unit (`Category = Unit`)
**`CacheKeyTests`** (≥ 3 tests): equality across identical keys; FC03 vs FC04 differs; UnitId differs.
**`CacheEntryTests`** (≥ 3 tests): expired detection at boundary; immutability of `PduBytes`; LRU tick monotonicity.
**`CacheInvalidatorTests`** (≥ 6 tests, range-overlap math):
1. `FullOverlap_WriteCoversEntryRange_Invalidates`
2. `PartialOverlap_WriteStartsBeforeEntry_Invalidates`
3. `PartialOverlap_WriteEndsAfterEntry_Invalidates`
4. `Adjacent_NotOverlapping_DoesNotInvalidate` — write to `[10..15]` does NOT invalidate cached `[15..20]` (half-open intervals: `15` is not in the write's range).
5. `NoOverlap_DoesNotInvalidate`
6. `DifferentUnitId_DoesNotInvalidate`
**`ResponseCacheTests`** (≥ 8 tests):
1. `SetThenGet_RoundTrips`
2. `GetExpiredEntry_ReturnsFalse_AndRemoves` — uses a small TTL + `Task.Delay`
3. `Invalidate_OverlappingRange_RemovesMatching` — set 3 entries, invalidate a range overlapping 2 of them, verify Count drops by 2
4. `Invalidate_OnlyAffectsFc03Fc04_KeysWithFcOther_NotTouched` — there shouldn't be FC06/FC16 entries in cache, but a defensive test
5. `Set_AtMaxEntries_EvictsLRU`
6. `LRU_TracksAccessOrder_Across_Get_And_Set`
7. `Concurrent_GetSet_NoDataRace` — 100 tasks, 1000 ops each
8. `Dispose_StopsEvictionLoop`
### E2E (`Category = E2E`)
**`ResponseCacheE2ETests`** (≥ 7 tests, against pymodbus simulator):
1. `E2E_CacheHit_AfterFirstRead_NoBackendTraffic` — configure tag at HR1072 with `CacheTtlMs = 5000`; first read goes to backend; second read within 5s hits cache. Verify via the simulator's HTTP introspection or by timing (cache hits return ~ms; backend reads return ~10ms).
2. `E2E_CacheExpires_AfterTtl_NextReadHitsBackend` — short TTL (e.g., 200 ms); after delay, second read goes to backend.
3. `E2E_WriteInvalidatesOverlappingCacheEntries` — read HR1072 (cache it), write to HR1072 with FC06, next read MUST miss cache and re-fetch.
4. `E2E_NonOverlappingWrite_DoesNotInvalidate` — read HR1072 (cache it), write to HR1080, next read of HR1072 still hits cache.
5. `E2E_BcdDecodedBytesAreCached_NotRawBcd` — cache hit returns the decoded `1234`, not `0x1234`. Proves the cache stores post-rewriter bytes.
6. `E2E_DisablingCache_ViaHotReload_FlushesEntries` — set `CacheTtlMs = 1000` on a tag, do a read (cached), hot-reload with `CacheTtlMs = 0`, next read must hit the backend even though the old entry is still within its TTL window.
7. `E2E_MultiTagRead_RangeWithZeroTtlTag_DisablesCaching` — read [100..110] where one tag in the range has `CacheTtlMs = 0`; verify no caching of the whole read.
## Phase gate
- [ ] **`docs/design.md` updates from Task 1 are merged FIRST** (or in the same PR). The contract change is not optional and not deferrable. Gate fail otherwise.
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All prior tests still green — the **4 critical Phase-9 regression guards** + **Phase 10's coalescing tests**.
- [ ] All new unit + e2e tests pass (≥ 25 new).
- [ ] **Default TTL = 0 → no observable behavior change vs Phase 10.** Verify: run the full Phase 10 test suite with the Phase 11 build; everything green.
- [ ] **Headline assertion (E2E):** configure `CacheTtlMs = 1000` on HR1072; issue 10 reads at 100 ms intervals; backend (stub or sim with introspection) sees exactly 1 backend round-trip.
- [ ] Write invalidation correctly handles all 6 range-overlap cases (full, two partial, adjacent, none, different-unit-id).
- [ ] Memory cap enforced: with `MaxEntriesPerPlc = 5`, 6 distinct cache inserts produce 5 entries (one LRU eviction observed).
- [ ] Validation rejects `CacheTtlMs > 60_000` unless `Cache.AllowLongTtl = true`.
- [ ] Hot-reload of `CacheTtlMs` flushes entries for the affected tag (or, simpler: flushes the entire cache for the PLC). Pick the simpler option (PLC-wide flush) and document.
- [ ] HTML page weight under 50 KB for 54 PLCs (verify with the existing renderer test).
- [ ] `docs/kpi.md` Tier 1 includes cache-hit-ratio.
- [ ] `install/mbproxy.config.template.json` includes the new `Mbproxy.Cache` block with the staleness commentary.
## Out of scope
- **Active polling** — cache populates on demand only. No background poll loop.
- **Predictive prefetching** — no speculative reads.
- **Range-overlap coalescing of cache entries** — if reads `[100..110]` and `[105..115]` are both cached, no attempt to merge them into one `[100..115]` entry. Same-key only.
- **Cross-PLC caching** — each PLC's cache is independent. No optimisation across PLCs.
- **Persistence** — process restart wipes the cache. No file/Redis backing store.
- **Cache warming** — no pre-populating the cache from a snapshot, last-known-good file, etc.
- **TTL > 60 seconds without explicit `AllowLongTtl` opt-in** — refused at validation.
- **Adaptive TTL** — operator-configured only. No auto-tuning.
## Subagent briefing
If you're the agent picking up this phase:
1. **Task 1 is design.md, not code.** The contract update is the gate. Do not write the cache code until the design changes have been reviewed and merged (or are in the same PR with explicit reviewer attention). A reviewer who lands the code without the design update has failed the gate, and so have you.
2. **Default TTL = 0 means default behavior = Phase 10 unchanged.** Critical for backwards-compat. Every existing test that doesn't set `CacheTtlMs` must continue to pass without modification.
3. **Cache stores POST-rewriter bytes.** The rewriter runs once on the cache-miss path; subsequent hits return cached decoded bytes directly. Do not re-invoke the rewriter on hits — wastes CPU and changes nothing.
4. **Write-invalidation is by ADDRESS RANGE OVERLAP, not by exact key match.** A write to register 105 invalidates a cached read of `[100..110]`. Use half-open interval math: write `[w, w+q)` overlaps entry `[s, s+n)` iff `w < s+n && s < w+q`.
5. **Multi-tag read range: effective TTL is `min(TTLs)`.** If any tag in the read range has TTL = 0, the whole read is uncached. Conservative-by-design.
6. **Cache lookup happens BEFORE coalescing.** Order: cache check → cache miss → coalescing check (Phase 10) → backend send (Phase 9). A cache hit short-circuits everything.
7. **`CacheKey` is structurally identical to `CoalescingKey`.** Prefer aliasing over redefinition. If the two phases land together, rename the shared type to `ReadKey` to make the joint use site neutral.
8. **MBAP TxId restoration on cache-hit responses.** The cache stores the PDU bytes (post-rewriter); on hit, build a fresh MBAP wrapper with the requesting client's `OriginalTxId`. There's no cached MBAP — the per-request TxId is supplied by the upstream pipe's request.
9. **Hot-reload of `CacheTtlMs`: flush the whole PLC cache on any tag-list change.** Tag-level granularity is technically possible but complicates the reload code path. The simple correctness move is "any tag-list change to this PLC → drop all cached entries for this PLC and let them re-populate." Document the choice.
10. **Eviction loop: `PeriodicTimer` + cancellation token.** Not `System.Timers.Timer`. The cache is `IDisposable`; the loop honours `Dispose`.
11. **Update `docs/design.md` AND `docs/kpi.md` AND `mbproxy/CLAUDE.md` AND `install/mbproxy.config.template.json` IN THE SAME PR AS THE CODE.** Doc drift is a gate fail. The architectural pivot must be visible across all reader-facing surfaces.
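
Item 8 above (rebuilding the MBAP wrapper on a cache hit) might look roughly like the following. This is a sketch under assumptions: `MbapFraming` and `WrapPdu` are hypothetical names, and it relies only on the standard Modbus TCP header layout (2-byte transaction id, 2-byte protocol id of 0, 2-byte length covering unit id plus PDU, 1-byte unit id).

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper for illustration; not an existing mbproxy type.
public static class MbapFraming
{
    private const int HeaderLength = 7;

    // Wraps a cached (post-rewriter) PDU in a fresh MBAP header that carries
    // the requesting client's original transaction id.
    public static byte[] WrapPdu(ushort originalTxId, byte unitId, ReadOnlySpan<byte> pdu)
    {
        var frame = new byte[HeaderLength + pdu.Length];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId);
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0); // protocol id: 0 = Modbus
        // Length field counts the unit id byte plus the PDU bytes.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(pdu.Length + 1));
        frame[6] = unitId;
        pdu.CopyTo(frame.AsSpan(HeaderLength));
        return frame;
    }
}
```

Because only the PDU is cached, two clients hitting the same entry each receive a frame stamped with their own transaction id, and no per-client state needs to live in the cache entry.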
## Cross-references
- Phase 9's multiplexer is the chokepoint that hosts the cache check: [`09-txid-multiplexing.md`](09-txid-multiplexing.md).
- Phase 10's `CoalescingKey` is the same shape as Phase 11's `CacheKey`: [`10-read-coalescing.md`](10-read-coalescing.md).
- The "not a polling/cache layer" stance that this phase pivots away from: [`../design.md`](../design.md) → "What this is" + "Purpose".
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (cache-hit-ratio joins this tier).
- Resolution rules for per-tag `CacheTtlMs` (Global Add Remove fallback + per-PLC default): [`../design.md`](../design.md) → "Hybrid tag resolution".
+107
@@ -0,0 +1,107 @@
# mbproxy — implementation plan
Phase-by-phase implementation plan for the `mbproxy` service. Each phase is a self-contained work spec with explicit deliverables, tests, and a gate checklist that must be green before the next phase begins. Settled against the design plan in [`../design.md`](../design.md) on 2026-05-13.
**Briefing a subagent for a phase:** hand it exactly three documents — the phase doc, [`../design.md`](../design.md), and [`../../DL260/dl205.md`](../../DL260/dl205.md). Tell it not to read other phase docs unless its own doc lists them under "Cross-references". The phase doc IS the contract.
## Phase graph
| # | Phase | Depends on | Parallel-safe with |
|---|-------|------------|--------------------|
| 00 | [Bootstrap](00-bootstrap.md) — host + DI + Serilog + options POCOs | — | (must run first, alone) |
| 01 | [Simulator harness](01-simulator-harness.md) — pymodbus xUnit fixture | 00 | 02, 03 |
| 02 | [BCD codec](02-bcd-codec.md) — pure encode/decode logic | 00 | 01, 03 |
| 03 | [Proxy plumbing](03-proxy-plumbing.md) — TcpListener + 1:1 byte forwarder | 00 | 01, 02 |
| 04 | [Rewriter integration](04-rewriter-integration.md) — wire codec into proxy | 02, 03 | — |
| 05 | [Listener supervisor](05-listener-supervisor.md) — Polly auto-recovery | 03 | — |
| 06 | [Hot-reload](06-hot-reload.md) — `IOptionsMonitor` reconcile | 05 | — |
| 07 | [Status page](07-status-page.md) — Kestrel admin endpoint | 05, 06 | — |
| 08 | [Service hardening](08-service-hardening.md) — Windows service + shutdown | 04, 07 | — |
| 09 | [TxId multiplexing](09-txid-multiplexing.md) — single backend connection per PLC (post-1.0 follow-on) | 04, 05, 07 | — |
| 10 | [Read coalescing](10-read-coalescing.md) — in-flight FC03/04 dedup (post-1.0 follow-on) | 09 | — |
| 11 | [Response cache](11-response-cache.md) — short-TTL post-response cache, bounded staleness (post-1.0; **design-contract pivot**) | 10 | — |
```
        ┌── 01 (sim) ────┐
00 ─────┼── 02 (codec) ──┼──── 04 ────────────────┐
        └── 03 (plumbing)┴── 05 ─── 06 ─── 07 ─── 08
                                           └─────────────────→ 09 ───→ 10 ───→ 11 (post-1.0)
```
**Phases 09, 10, and 11 are post-1.0 follow-ons**, not part of the initial 1.0 release.
- **Phase 09** rewires the connection layer to lift the H2-ECOM100's 4-concurrent-client cap as an operational ceiling. Pick it up only after Phase 08 has shipped and field experience confirms the 4-client cap is a real production problem (not just a theoretical one).
- **Phase 10** plugs into Phase 09's `InterestedParties` seam to coalesce same-key FC03/04 reads within the in-flight window. Zero post-response staleness. Worth doing only if field telemetry shows meaningful read overlap (≥ 2× duplicate-read traffic from concurrent HMIs / historians).
- **Phase 11** extends the "served without backend traffic" window from in-flight microseconds (Phase 10) to operator-configurable seconds via a per-tag TTL response cache. **This is a deliberate design-contract pivot** — the proxy stops being purely transparent and becomes an opt-in cache layer with bounded staleness. The cache is OFF by default; opting tags in is the operator's explicit acknowledgement of the staleness window. Pick up only if Phase 10's coalescing-ratio under real load reveals enough cross-poll overlap to justify staleness as a trade.
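
The opt-in, bounded-staleness behaviour described for Phase 11 can be sketched as a small TTL cache. Everything here is an assumption made for illustration: the `TtlResponseCache` name, the string key, and the injected millisecond clock are hypothetical, and the "minimum TTL across tags, with 0 meaning uncached" rule is taken from the Phase 11 briefing rather than from real code.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch; not the mbproxy cache. Not thread-safe.
public sealed class TtlResponseCache
{
    private readonly Dictionary<string, (byte[] Pdu, long ExpiresAtMs)> _entries = new();
    private readonly Func<long> _nowMs; // injected clock keeps the sketch testable

    public TtlResponseCache(Func<long> nowMs) => _nowMs = nowMs;

    // A read spanning several tags uses the smallest TTL; no tags means 0 (uncached).
    public static int EffectiveTtlMs(IEnumerable<int> tagTtlsMs) =>
        tagTtlsMs.DefaultIfEmpty(0).Min();

    public void Store(string key, byte[] pdu, int ttlMs)
    {
        if (ttlMs <= 0) return; // TTL 0 is the default: caching stays off
        _entries[key] = (pdu, _nowMs() + ttlMs);
    }

    public bool TryGet(string key, out byte[]? pdu)
    {
        if (_entries.TryGetValue(key, out var entry) && _nowMs() < entry.ExpiresAtMs)
        {
            pdu = entry.Pdu;
            return true;
        }
        _entries.Remove(key); // drop an expired entry; harmless if the key is absent
        pdu = null;
        return false;
    }
}
```

With every tag left at TTL 0, `Store` never inserts and every lookup misses, which matches the "OFF by default" stance: staleness only exists where an operator has asked for it.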
## Working with subagents
### Default: one subagent per phase, sequential
Spawn one Agent (Sonnet or Opus) per phase in order. Each agent reads exactly:
- Its own phase doc (under this directory).
- [`../design.md`](../design.md) — architecture, the source of truth.
- [`../../DL260/dl205.md`](../../DL260/dl205.md) — device quirks.
That is sufficient context. The agent must NOT invent scope beyond the phase doc's "Outputs" section. If it discovers a design-affecting issue, it must STOP and surface the issue rather than improvise — designs change in [`../design.md`](../design.md), not silently in code.
### Advanced: parallel subagents within a single phase boundary
Two phases marked "Parallel-safe with" each other can be picked up by independent subagents at the same time. The only safe parallel windows in this plan are:
- **Phase 01 ∥ Phase 02** (sim harness lives in `tests/sim/`, codec lives in `src/Mbproxy/Bcd/` — fully disjoint).
- **Phase 02 ∥ Phase 03** (codec is pure logic in `src/Mbproxy/Bcd/`; plumbing is in `src/Mbproxy/Proxy/` — disjoint).
- **Phase 01 ∥ Phase 02 ∥ Phase 03**: running all three at once is also safe, because each phase touches a different directory.
**Required pattern:**
1. Spawn each parallel agent with `isolation: "worktree"` (Agent tool's worktree mode creates an isolated git checkout).
2. Each agent gets ONE phase doc + design.md + dl205.md.
3. Each agent runs its phase gate locally before its worktree is committed.
4. Merge order: lower phase number first. Resolve conflicts manually if the agents drifted outside their declared output scope (which they shouldn't).
5. After merge, re-run the phase 00 smoke test plus both merged phases' tests to confirm no integration regression.
**Hard rules — anti-patterns that break parallel work:**
- ❌ Any two phases editing the same `.csproj` PackageReference list at the same time. Phase 00 owns the initial csproj; later phases append PackageReferences atomically and a parallel pair must coordinate via separate `<ItemGroup>` blocks or sequential merges.
- ❌ Running phase 04 in parallel with anything (it integrates two prior phases — by definition it touches their outputs).
- ❌ Running phase 06 in parallel with anything (the hot-reload reconcile inspects state from listener supervisor + rewriter + counters; it has the widest cross-cut).
- ❌ Spawning more than 3 concurrent worktree agents (review/merge overhead grows superlinearly and the value disappears).
## Phase gate template
Every phase MUST be green on all of these before its branch is merged:
1. **Build is clean.** `dotnet build src/Mbproxy/Mbproxy.csproj -c Debug` with **zero warnings**. `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` is set in phase 00 and stays set forever.
2. **All unit tests pass.** `dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter "Category!=E2E&Category!=Stress"` is green (Stress tests are opt-in and excluded from this step).
3. **E2E tests pass when the simulator is available.** `dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category=E2E --blame-hang-timeout 2m` is green on a machine with Python + pymodbus installed. The `--blame-hang-timeout` is mandatory — never run E2E without it. Skipped tests (due to missing simulator) don't count as failures, but ANY test added in this phase must NOT skip when the sim IS available, and every E2E test MUST carry a `[Fact(Timeout = …)]` per the Test discipline rules below.
4. **No regressions in any prior phase's tests.** The full suite stays green.
5. **No new public types beyond what the phase doc declares.** Scope creep is a gate fail. If a needed type is missing from the doc, update the doc first.
6. **No `TODO` / `FIXME` / `HACK` comments committed.** Either resolve or file in the [Deferred](#deferred) section below.
7. **Design / docs are in sync.** If a design decision changed during the phase, [`../design.md`](../design.md) is updated in the same PR. Mirror the change to the Architecture summary in [`../../CLAUDE.md`](../../CLAUDE.md) only if it shifts one of the headline bullets.
8. **Phase doc itself is updated** to reflect any clarifications discovered during implementation, so the next subagent picking up the project doesn't relearn what this one learned.
## Test discipline
- **Framework:** xUnit (v3 if available, v2 otherwise) + **Shouldly** for assertions. Never `Assert.Equal(x, y)` — always `y.ShouldBe(x)`. Never `Assert.True(p)` — always `p.ShouldBeTrue("reason")`.
- **Categories:** `[Trait("Category", "Unit")]` (default; no traits needed), `[Trait("Category", "E2E")]` (needs simulator), `[Trait("Category", "Stress")]` (slow / load-bearing — opt-in only).
- **No mocks for code we own.** Exercise our types directly. Mock only at the network/file/process boundary — and prefer a real local socket / real temp file over a mock when feasible.
- **Test naming:** `MethodOrScenario_Condition_ExpectedOutcome`. Example: `BcdCodec_Decode16_Returns1234_For0x1234`.
- **One assertion per test where reasonable.** Multi-assertion tests are acceptable when they assert facets of the same scenario; never when they're really separate tests glued together.
- **Every `[Trait("Category","E2E")]` test MUST declare a hard timeout** via `[Fact(Timeout = N)]` (xUnit v3, milliseconds). **Default: `5_000` ms.** Expand per-test only when the test genuinely needs longer (concurrent bursts > 100 ops, reload-propagation debounce, graceful-shutdown drain) — and add a one-line comment explaining why. Start tight; raise only when a real test fails with a non-deadlock reason. Reason this matters: the existing fixtures use synchronous NModbus calls and stub TCP servers that **do not honor `TestContext.Current.CancellationToken`** — without `[Fact(Timeout=…)]`, a deadlock in the proxy hangs the runner indefinitely. The same rule applies to `[Trait("Category","Stress")]`. Unit tests are exempt unless they touch real sockets or processes.
- **Run E2E with a hang backstop.** The phase gate's E2E command is `dotnet test ... --filter Category=E2E --blame-hang-timeout 2m`. The `--blame-hang-timeout` is a process-level safety net in case a test's individual `Timeout` somehow doesn't fire (e.g. an unmanaged thread blocking finalization).
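
The rules above could translate into tests shaped like the following. Treat this as a sketch: it assumes the xUnit v3 and Shouldly packages are referenced, `BcdStandIn` is a hypothetical stand-in for the real codec, and the loopback socket test stands in for real proxy and simulator traffic.

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;
using Shouldly;
using Xunit;

// Stand-in for illustration only; the real codec lives in src/Mbproxy/Bcd/.
public static class BcdStandIn
{
    public static int Decode16(ushort raw) =>
        ((raw >> 12) & 0xF) * 1000 + ((raw >> 8) & 0xF) * 100 +
        ((raw >> 4) & 0xF) * 10 + (raw & 0xF);
}

public class TestDisciplineExamples
{
    [Fact] // no Category trait: unit tests are the default
    public void BcdStandIn_Decode16_Returns1234_For0x1234()
    {
        BcdStandIn.Decode16(0x1234).ShouldBe(1234);
    }

    [Fact(Timeout = 5_000)] // default E2E budget, in milliseconds
    [Trait("Category", "E2E")]
    public async Task Loopback_WriteFrame_ArrivesUnchanged()
    {
        var ct = TestContext.Current.CancellationToken;
        var listener = new TcpListener(IPAddress.Loopback, 0); // real local socket, no mock
        listener.Start();
        try
        {
            int port = ((IPEndPoint)listener.LocalEndpoint).Port;
            using var client = new TcpClient();
            var connecting = client.ConnectAsync(IPAddress.Loopback, port, ct).AsTask();
            using var server = await listener.AcceptTcpClientAsync(ct);
            await connecting;

            // FC03 read request bytes used purely as sample payload.
            byte[] sent = { 0x00, 0x01, 0x00, 0x00, 0x00, 0x06, 0x01, 0x03, 0x04, 0x30, 0x00, 0x01 };
            await client.GetStream().WriteAsync(sent, ct);

            var received = new byte[sent.Length];
            await server.GetStream().ReadExactlyAsync(received, ct);
            received.ShouldBe(sent);
        }
        finally
        {
            listener.Stop();
        }
    }
}
```

The E2E example is written as an `async Task` method and passes the test context's cancellation token to every awaited call, so the per-test timeout has something to cancel; synchronous blocking calls would not observe it.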
## Deferred
A running list of things explicitly NOT done in any current phase. When a phase reveals one, add it here so it isn't forgotten and so the deferral is visible at review time:
- *(none yet)*
## Cross-references
- Architecture and load-bearing decisions: [`../design.md`](../design.md)
- Device quirks the proxy must respect: [`../../DL260/dl205.md`](../../DL260/dl205.md)
- pymodbus simulator profile that backs e2e tests: [`../../DL260/dl205.json`](../../DL260/dl205.json)
- As-deployed PLC parameters (port 502, BCD-by-default, swap bytes, etc.): [`../../DL260/mbtcp_settings.JPG`](../../DL260/mbtcp_settings.JPG)