Adds the mbproxy service end-to-end.

Phases 00-08 implement the production-ready single-listener / 1:1-backend transparent Modbus TCP proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260 fleet. Phase 9 replaces the connection layer with a single backend socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's 4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing for Phase 10 read coalescing; do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream on BackendRequestTimeoutMs, defending against lost responses, dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps / disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly concurrent backend traffic is proved against a stub backend in PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus 3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbproxy — implementation plan
Phase-by-phase implementation plan for the mbproxy service. Each phase is a self-contained work spec with explicit deliverables, tests, and a gate checklist that must be green before the next phase begins. Settled against the design plan in ../design.md on 2026-05-13.
Briefing a subagent for a phase: hand it exactly three documents — the phase doc, ../design.md, and ../../DL260/dl205.md. Tell it not to read other phase docs unless its own doc lists them under "Cross-references". The phase doc IS the contract.
Phase graph
| # | Phase | Depends on | Parallel-safe with |
|---|---|---|---|
| 00 | Bootstrap — host + DI + Serilog + options POCOs | — | (must run first, alone) |
| 01 | Simulator harness — pymodbus xUnit fixture | 00 | 02 |
| 02 | BCD codec — pure encode/decode logic | 00 | 01, 03 |
| 03 | Proxy plumbing — TcpListener + 1:1 byte forwarder | 00 | 02 |
| 04 | Rewriter integration — wire codec into proxy | 02, 03 | — |
| 05 | Listener supervisor — Polly auto-recovery | 03 | — |
| 06 | Hot-reload — IOptionsMonitor reconcile | 05 | — |
| 07 | Status page — Kestrel admin endpoint | 05, 06 | — |
| 08 | Service hardening — Windows service + shutdown | 04, 07 | — |
| 09 | TxId multiplexing — single backend connection per PLC (post-1.0 follow-on) | 04, 05, 07 | — |
| 10 | Read coalescing — in-flight FC03/04 dedup (post-1.0 follow-on) | 09 | — |
| 11 | Response cache — short-TTL post-response cache, bounded staleness (post-1.0; design-contract pivot) | 10 | — |
        ┌── 01 (sim) ──┐
00 ─────┼── 02 (codec) ─┼──── 04 ───┐
        └── 03 (plumbing)┴── 05 ─── 06 ─── 07 ─── 08
                                                  │
                                                  └─────────────────→ 09 ───→ 10 ───→ 11 (post-1.0)
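The dependency column of the table can be checked mechanically. The following is a minimal sketch, written in Python purely for illustration (the service itself is .NET); `PHASE_DEPS` and `phase_waves` are hypothetical names, and the data is copied from the "Depends on" column above. It groups phases into waves in which every dependency is already complete. Note that dependency readiness is necessary but not sufficient: the "Parallel-safe with" column is stricter and still governs which phases may actually run together.

```python
from graphlib import TopologicalSorter

# Copied from the "Depends on" column: phase -> phases it depends on.
PHASE_DEPS = {
    "00": [],
    "01": ["00"],
    "02": ["00"],
    "03": ["00"],
    "04": ["02", "03"],
    "05": ["03"],
    "06": ["05"],
    "07": ["05", "06"],
    "08": ["04", "07"],
    "09": ["04", "05", "07"],
    "10": ["09"],
    "11": ["10"],
}


def phase_waves(deps):
    """Group phases into waves; each phase's dependencies sit in earlier waves."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError if the table contains a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        waves.append(ready)
        sorter.done(*ready)
    return waves


WAVES = phase_waves(PHASE_DEPS)
```

A cycle introduced by a future edit to the table would surface as a `CycleError` rather than as a confusing merge later on.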
Phases 09, 10, and 11 are post-1.0 follow-ons, not part of the initial 1.0 release.
- Phase 09 rewires the connection layer to lift the H2-ECOM100's 4-concurrent-client cap as an operational ceiling. Pick it up only after Phase 08 has shipped and field experience confirms the 4-client cap is a real production problem (not just a theoretical one).
- Phase 10 plugs into Phase 09's InterestedParties seam to coalesce same-key FC03/04 reads within the in-flight window. Zero post-response staleness. Worth doing only if field telemetry shows meaningful read overlap (≥ 2× duplicate-read traffic from concurrent HMIs / historians).
- Phase 11 extends the "served without backend traffic" window from in-flight microseconds (Phase 10) to operator-configurable seconds via a per-tag TTL response cache. This is a deliberate design-contract pivot: the proxy stops being purely transparent and becomes an opt-in cache layer with bounded staleness. The cache is OFF by default; opting tags in is the operator's explicit acknowledgement of the staleness window. Pick up only if Phase 10's coalescing ratio under real load reveals enough cross-poll overlap to justify staleness as a trade.
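To make the Phase 09 and Phase 10 ideas concrete, here is a minimal Python sketch of TxId rewriting over a single backend connection, with same-key FC03/04 reads coalesced while a request is in flight. It is an illustration under stated assumptions, not the mbproxy implementation: the `Multiplexer` and `InFlight` names, the choice of coalescing key, and the drop-on-unknown-response policy are all hypothetical. Only the 7-byte MBAP header layout (TxId, protocol id, length, unit id) comes from the Modbus TCP framing itself.

```python
import struct
from dataclasses import dataclass, field

MBAP = struct.Struct(">HHHB")  # TxId, protocol id, length, unit id (7 bytes)
READ_FCS = (0x03, 0x04)        # only these function codes are coalesced


@dataclass
class InFlight:
    key: tuple                                   # (unit, function code, PDU body)
    parties: list = field(default_factory=list)  # [(client, original TxId)]


class Multiplexer:
    """Hypothetical single-socket multiplexer; all names are illustrative."""

    def __init__(self):
        self._next = 0
        self._by_txid = {}  # backend TxId -> InFlight
        self._by_key = {}   # coalescing key -> backend TxId

    def _allocate(self):
        # Walk the 16-bit space, skipping ids that are still in flight.
        for _ in range(0x10000):
            txid = self._next
            self._next = (self._next + 1) & 0xFFFF
            if txid not in self._by_txid:
                return txid
        raise RuntimeError("all 65536 transaction ids are in flight")

    def on_request(self, client, frame):
        """Return the frame to send to the PLC, or None if it was coalesced."""
        txid, proto, length, unit = MBAP.unpack_from(frame)
        pdu = frame[MBAP.size:]
        key = (unit, pdu[0], pdu[1:])
        if pdu[0] in READ_FCS and key in self._by_key:
            # Same read already in flight: register interest, send nothing.
            self._by_txid[self._by_key[key]].parties.append((client, txid))
            return None
        backend_txid = self._allocate()
        self._by_txid[backend_txid] = InFlight(key, [(client, txid)])
        if pdu[0] in READ_FCS:
            self._by_key[key] = backend_txid
        return MBAP.pack(backend_txid, proto, length, unit) + pdu

    def on_response(self, frame):
        """Return [(client, frame)] with each party's original TxId restored."""
        backend_txid, proto, length, unit = MBAP.unpack_from(frame)
        entry = self._by_txid.pop(backend_txid, None)
        if entry is None:
            return []  # late or unknown response; this sketch drops it
        self._by_key.pop(entry.key, None)
        pdu = frame[MBAP.size:]
        return [(c, MBAP.pack(t, proto, length, unit) + pdu)
                for c, t in entry.parties]
```

Keeping a list of interested parties per in-flight request, rather than a single client, is what lets one backend response fan out to every caller that asked for the same registers. Because the entry is removed as soon as the response arrives, nothing is served after the response, which matches the zero post-response staleness property described for Phase 10.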
Working with subagents
Default: one subagent per phase, sequential
Spawn one Agent (Sonnet or Opus) per phase in order. Each agent reads exactly:
- Its own phase doc (under this directory).
- ../design.md: architecture, the source of truth.
- ../../DL260/dl205.md: device quirks.
That is sufficient context. The agent must NOT invent scope beyond the phase doc's "Outputs" section. If it discovers a design-affecting issue, it must STOP and surface the issue rather than improvise — designs change in ../design.md, not silently in code.
Advanced: parallel subagents within a single phase boundary
Two phases marked "Parallel-safe with" each other can be picked up by independent subagents at the same time. The only safe parallel windows in this plan are:
- Phase 01 ∥ Phase 02 (sim harness lives in tests/sim/, codec lives in src/Mbproxy/Bcd/; fully disjoint).
- Phase 02 ∥ Phase 03 (codec is pure logic in src/Mbproxy/Bcd/; plumbing is in src/Mbproxy/Proxy/; disjoint).
- Phase 01 + Phase 02 + Phase 03 all three at once is also safe (all touch different directories).
Required pattern:
- Spawn each parallel agent with isolation: "worktree" (Agent tool's worktree mode creates an isolated git checkout).
- Each agent gets ONE phase doc + design.md + dl205.md.
- Each agent runs its phase gate locally before its worktree is committed.
- Merge order: lower phase number first. Resolve conflicts manually if the agents drifted outside their declared output scope (which they shouldn't).
- After merge, re-run the phase 00 smoke test plus both merged phases' tests to confirm no integration regression.
Hard rules — anti-patterns that break parallel work:
- ❌ Any two phases editing the same .csproj PackageReference list at the same time. Phase 00 owns the initial csproj; later phases append PackageReferences atomically and a parallel pair must coordinate via separate <ItemGroup> blocks or sequential merges.
- ❌ Running phase 04 in parallel with anything (it integrates two prior phases, so by definition it touches their outputs).
- ❌ Running phase 06 in parallel with anything (the hot-reload reconcile inspects state from listener supervisor + rewriter + counters; it has the widest cross-cut).
- ❌ Spawning more than 3 concurrent worktree agents (review/merge overhead grows superlinearly and the value disappears).
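The parallel-work rules above are simple enough to encode as a pre-flight check before spawning worktree agents. The sketch below is a hypothetical helper in Python, not part of the repository: `check_batch` and `merge_order` are illustrative names, and the rule data is transcribed from this section (the 01/02/03 safe window, phases that must run alone, and the cap of three agents).

```python
PARALLEL_SAFE = frozenset({"01", "02", "03"})   # the only safe window in the plan
NEVER_PARALLEL = frozenset({"00", "04", "06"})  # called out as run-alone phases
MAX_WORKTREE_AGENTS = 3


def check_batch(phases):
    """Return a list of rule violations for a proposed concurrent batch."""
    batch = set(phases)
    problems = []
    if len(batch) > MAX_WORKTREE_AGENTS:
        problems.append(f"more than {MAX_WORKTREE_AGENTS} concurrent agents")
    if len(batch) > 1:
        for phase in sorted(batch & NEVER_PARALLEL):
            problems.append(f"phase {phase} must run alone")
        for phase in sorted(batch - PARALLEL_SAFE - NEVER_PARALLEL):
            problems.append(f"phase {phase} is not listed as parallel-safe")
    return problems


def merge_order(phases):
    """Merge order for finished worktrees: lower phase number first."""
    return sorted(phases, key=int)
```

A single-phase batch always passes, since the sequential default needs no coordination. The .csproj rule is not modeled here because it depends on what each agent actually edits, not on which phases are selected.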
Phase gate template
Every phase MUST be green on all of these before its branch is merged:
- Build is clean. dotnet build src/Mbproxy/Mbproxy.csproj -c Debug with zero warnings. <TreatWarningsAsErrors>true</TreatWarningsAsErrors> is set in phase 00 and stays set forever.
- All unit tests pass. dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category!=E2E is green.
- E2E tests pass when the simulator is available. dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category=E2E --blame-hang-timeout 2m is green on a machine with Python + pymodbus installed. The --blame-hang-timeout is mandatory; never run E2E without it. Skipped tests (due to missing simulator) don't count as failures, but ANY test added in this phase must NOT skip when the sim IS available, and every E2E test MUST carry a [Fact(Timeout = …)] per the Test discipline rules below.
- No regressions in any prior phase's tests. The full suite stays green.
- No new public types beyond what the phase doc declares. Scope creep is a gate fail. If a needed type is missing from the doc, update the doc first.
- No TODO / FIXME / HACK comments committed. Either resolve or file in the Deferred section below.
- Design / docs are in sync. If a design decision changed during the phase, ../design.md is updated in the same PR, and the change is mirrored to ../../CLAUDE.md's Architecture summary only if it shifts one of the headline bullets.
- Phase doc itself is updated to reflect any clarifications discovered during implementation, so the next subagent picking up the project doesn't relearn what this one learned.
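The "no TODO / FIXME / HACK" gate item lends itself to an automated check. The following is a minimal Python sketch under assumptions: `find_markers` is a hypothetical helper, and it takes file contents as an in-memory mapping so the example stays self-contained. It flags the marker words anywhere on a line, not only inside comments, which is deliberately stricter than the gate requires.

```python
import re

# Whole-word match so identifiers such as "TodoItem" are not flagged.
MARKER = re.compile(r"\b(TODO|FIXME|HACK)\b")


def find_markers(files):
    """files: mapping of path -> text. Returns [(path, line_no, marker)]."""
    hits = []
    for path, text in sorted(files.items()):
        for line_no, line in enumerate(text.splitlines(), start=1):
            for match in MARKER.finditer(line):
                hits.append((path, line_no, match.group(1)))
    return hits
```

An empty result means the gate item passes. Any hit should either be resolved or moved to the Deferred section.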
Test discipline
- Framework: xUnit (v3 if available, v2 otherwise) + Shouldly for assertions. Never Assert.Equal(x, y); always y.ShouldBe(x). Never Assert.True(p); always p.ShouldBeTrue("reason").
- Categories: [Trait("Category", "Unit")] (default; no traits needed), [Trait("Category", "E2E")] (needs simulator), [Trait("Category", "Stress")] (slow / load-bearing; opt-in only).
- No mocks for code we own. Exercise our types directly. Mock only at the network/file/process boundary, and prefer a real local socket / real temp file over a mock when feasible.
- Test naming: MethodOrScenario_Condition_ExpectedOutcome. Example: BcdCodec_Decode16_Returns1234_For0x1234.
- One assertion per test where reasonable. Multi-assertion tests are acceptable when they assert facets of the same scenario; never when they're really separate tests glued together.
- Every [Trait("Category","E2E")] test MUST declare a hard timeout via [Fact(Timeout = N)] (xUnit v3, milliseconds). Default: 5_000 ms. Expand per-test only when the test genuinely needs longer (concurrent bursts > 100 ops, reload-propagation debounce, graceful-shutdown drain), and add a one-line comment explaining why. Start tight; raise only when a real test fails with a non-deadlock reason. Reason this matters: the existing fixtures use synchronous NModbus calls and stub TCP servers that do not honor TestContext.Current.CancellationToken; without [Fact(Timeout=…)], a deadlock in the proxy hangs the runner indefinitely. The same rule applies to [Trait("Category","Stress")]. Unit tests are exempt unless they touch real sockets or processes.
- Run E2E with a hang backstop. The phase gate's E2E command is dotnet test ... --filter Category=E2E --blame-hang-timeout 2m. The --blame-hang-timeout is a process-level safety net in case a test's individual Timeout somehow doesn't fire (e.g. an unmanaged thread blocking finalization).
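The naming example above, BcdCodec_Decode16_Returns1234_For0x1234, implies the behaviour under test: the 16-bit register value 0x1234 holds four decimal nibbles and decodes to 1234. A minimal Python sketch of that encode/decode pair follows. The function names are hypothetical, and rejecting out-of-range nibbles with an exception is an assumption made for the sketch; how the real codec handles invalid BCD is defined in the phase 02 doc and ../design.md, not here.

```python
def bcd_decode16(raw):
    """Decode a 16-bit BCD register (four decimal nibbles) to an int 0..9999."""
    if not 0 <= raw <= 0xFFFF:
        raise ValueError(f"{raw!r} does not fit in 16 bits")
    value = 0
    for shift in (12, 8, 4, 0):  # most significant nibble first
        digit = (raw >> shift) & 0xF
        if digit > 9:
            # Assumption: this sketch rejects non-decimal nibbles outright.
            raise ValueError(f"nibble {digit:#x} is not a BCD digit")
        value = value * 10 + digit
    return value


def bcd_encode16(value):
    """Encode an int 0..9999 as a 16-bit BCD register value."""
    if not 0 <= value <= 9999:
        raise ValueError(f"{value!r} is outside 0..9999")
    raw = 0
    for shift in (0, 4, 8, 12):  # least significant digit first
        value, digit = divmod(value, 10)
        raw |= digit << shift
    return raw
```

In the project's own test style, the matching assertion would read as a single-scenario check: decoding 0x1234 should be 1234, and encoding 1234 should give 0x1234 back.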
Deferred
A running list of things explicitly NOT done in any current phase. When a phase reveals one, add it here so it isn't forgotten and so the deferral is visible at review time:
- (none yet)
Cross-references
- Architecture and load-bearing decisions: ../design.md
- Device quirks the proxy must respect: ../../DL260/dl205.md
- pymodbus simulator profile that backs e2e tests: ../../DL260/dl205.json
- As-deployed PLC parameters (port 502, BCD-by-default, swap bytes, etc.): ../../DL260/mbtcp_settings.JPG