Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00

mbproxy — implementation plan

Phase-by-phase implementation plan for the mbproxy service. Each phase is a self-contained work spec with explicit deliverables, tests, and a gate checklist that must be green before the next phase begins. Settled against the design plan in ../design.md on 2026-05-13.

Briefing a subagent for a phase: hand it exactly three documents — the phase doc, ../design.md, and ../../DL260/dl205.md. Tell it not to read other phase docs unless its own doc lists them under "Cross-references". The phase doc IS the contract.

Phase graph

| #  | Phase | Depends on | Parallel-safe with |
|----|-------|------------|--------------------|
| 00 | Bootstrap — host + DI + Serilog + options POCOs | | (must run first, alone) |
| 01 | Simulator harness — pymodbus xUnit fixture | 00 | 02 |
| 02 | BCD codec — pure encode/decode logic | 00 | 01, 03 |
| 03 | Proxy plumbing — TcpListener + 1:1 byte forwarder | 00 | 02 |
| 04 | Rewriter integration — wire codec into proxy | 02, 03 | |
| 05 | Listener supervisor — Polly auto-recovery | 03 | |
| 06 | Hot-reload — IOptionsMonitor reconcile | 05 | |
| 07 | Status page — Kestrel admin endpoint | 05, 06 | |
| 08 | Service hardening — Windows service + shutdown | 04, 07 | |
| 09 | TxId multiplexing — single backend connection per PLC (post-1.0 follow-on) | 04, 05, 07 | |
| 10 | Read coalescing — in-flight FC03/04 dedup (post-1.0 follow-on) | 09 | |
| 11 | Response cache — short-TTL post-response cache, bounded staleness (post-1.0; design-contract pivot) | 10 | |

        ┌── 01 (sim) ──┐
00 ─────┼── 02 (codec) ─┼──── 04 ───┐
        └── 03 (plumbing)┴── 05 ─── 06 ─── 07 ─── 08
                              │
                              └─────────────────→ 09 ───→ 10 ───→ 11 (post-1.0)
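
The dependency column can be checked mechanically. The following is a minimal sketch in Python (not part of the .NET service) that transcribes the "Depends on" column into a dictionary and groups phases into waves whose dependencies are all satisfied. The names DEPENDS_ON, waves, and PHASE_WAVES are illustrative, not taken from the repository.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Phase -> phases it depends on, transcribed from the "Depends on" column.
DEPENDS_ON = {
    "00": [],
    "01": ["00"],
    "02": ["00"],
    "03": ["00"],
    "04": ["02", "03"],
    "05": ["03"],
    "06": ["05"],
    "07": ["05", "06"],
    "08": ["04", "07"],
    "09": ["04", "05", "07"],
    "10": ["09"],
    "11": ["10"],
}


def waves(depends_on):
    """Group phases into waves; each phase in a wave has all dependencies met."""
    sorter = TopologicalSorter(depends_on)
    sorter.prepare()  # raises graphlib.CycleError if the graph contains a cycle
    result = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        result.append(ready)
        sorter.done(*ready)
    return result


PHASE_WAVES = waves(DEPENDS_ON)
for index, wave in enumerate(PHASE_WAVES):
    print(index, wave)
```

A wave only says that dependencies are met. It does not make the phases in it parallel-safe: the plan still restricts parallel work to the windows listed under "Working with subagents" and holds Phase 09 until Phase 08 has shipped.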

Phases 09, 10, and 11 are post-1.0 follow-ons, not part of the initial 1.0 release.

  • Phase 09 rewires the connection layer to lift the H2-ECOM100's 4-concurrent-client cap as an operational ceiling. Pick it up only after Phase 08 has shipped and field experience confirms the 4-client cap is a real production problem (not just a theoretical one).
  • Phase 10 plugs into Phase 09's InterestedParties seam to coalesce same-key FC03/04 reads within the in-flight window. Zero post-response staleness. Worth doing only if field telemetry shows meaningful read overlap (≥ 2× duplicate-read traffic from concurrent HMIs / historians).
  • Phase 11 extends the "served without backend traffic" window from in-flight microseconds (Phase 10) to operator-configurable seconds via a per-tag TTL response cache. This is a deliberate design-contract pivot — the proxy stops being purely transparent and becomes an opt-in cache layer with bounded staleness. The cache is OFF by default; opting tags in is the operator's explicit acknowledgement of the staleness window. Pick up only if Phase 10's coalescing-ratio under real load reveals enough cross-poll overlap to justify staleness as a trade.
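
To make the Phase 09 and Phase 10 ideas concrete, here is a minimal Python sketch of MBAP TxId rewriting over one shared backend connection, with a list of interested parties per in-flight request so that identical FC03/04 reads can attach to a request already on the wire. It is an illustration under stated assumptions, not the service's C# implementation: the Multiplexer class, its method names, and the choice of (unit id, full PDU) as the coalescing key are all hypothetical. Frames are assumed to be well-formed.

```python
import struct

# MBAP header: transaction id, protocol id, length, unit id (7 bytes, big-endian).
MBAP = struct.Struct(">HHHB")


class Multiplexer:
    """Hypothetical sketch: many upstream clients share one backend socket."""

    def __init__(self):
        self._next_txid = 0
        self._in_flight = {}    # backend TxId -> (key, [(client, original TxId)])
        self._key_to_txid = {}  # coalescing key -> backend TxId

    def _allocate(self):
        if len(self._in_flight) == 0x10000:
            raise RuntimeError("all 65536 backend TxIds are in flight")
        # Skip ids that are still in flight; wrap at 16 bits like the MBAP field.
        while self._next_txid in self._in_flight:
            self._next_txid = (self._next_txid + 1) & 0xFFFF
        txid = self._next_txid
        self._next_txid = (txid + 1) & 0xFFFF
        return txid

    def on_request(self, client, frame):
        """Return the frame to send to the backend, or None if coalesced."""
        orig_txid, proto, length, unit = MBAP.unpack_from(frame)
        pdu = frame[MBAP.size:]
        # Assumption: only reads (FC03/FC04) are coalesced, keyed on unit + PDU.
        key = (unit, pdu) if pdu[0] in (0x03, 0x04) else None
        if key is not None and key in self._key_to_txid:
            self._in_flight[self._key_to_txid[key]][1].append((client, orig_txid))
            return None
        txid = self._allocate()
        self._in_flight[txid] = (key, [(client, orig_txid)])
        if key is not None:
            self._key_to_txid[key] = txid
        return MBAP.pack(txid, proto, length, unit) + pdu

    def on_response(self, frame):
        """Return [(client, frame)] with each party's original TxId restored."""
        txid, proto, length, unit = MBAP.unpack_from(frame)
        key, parties = self._in_flight.pop(txid, (None, []))
        self._key_to_txid.pop(key, None)
        pdu = frame[MBAP.size:]
        return [(c, MBAP.pack(orig, proto, length, unit) + pdu) for c, orig in parties]


mux = Multiplexer()
read_a = bytes([0x03, 0x00, 0x00, 0x00, 0x02])  # FC03, start 0, count 2
read_b = bytes([0x03, 0x00, 0x10, 0x00, 0x01])  # FC03, start 16, count 1
to_backend = [
    mux.on_request("hmi", MBAP.pack(7, 0, 6, 1) + read_a),
    mux.on_request("historian", MBAP.pack(9, 0, 6, 1) + read_a),  # same read
    mux.on_request("hmi", MBAP.pack(8, 0, 6, 1) + read_b),
]
reply = MBAP.pack(0, 0, 7, 1) + bytes([0x03, 0x04, 0x12, 0x34, 0x56, 0x78])
delivered = [(c, MBAP.unpack_from(f)[0]) for c, f in mux.on_response(reply)]
print(delivered)
```

The second request never reaches the backend. It is answered from the first request's response, with each client seeing its own original TxId. Keeping a list of parties rather than a single field is what lets coalescing attach to an existing in-flight request. The sketch omits the per-request watchdog and disconnect handling entirely.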

Working with subagents

Default: one subagent per phase, sequential

Spawn one Agent (Sonnet or Opus) per phase in order. Each agent reads exactly:

  1. The phase doc for its phase.
  2. ../design.md.
  3. ../../DL260/dl205.md.

That is sufficient context. The agent must NOT invent scope beyond the phase doc's "Outputs" section. If it discovers a design-affecting issue, it must STOP and surface the issue rather than improvise — designs change in ../design.md, not silently in code.

Advanced: parallel subagents within a single phase boundary

Two phases marked "Parallel-safe with" each other can be picked up by independent subagents at the same time. The only safe parallel windows in this plan are:

  • Phase 01 ∥ Phase 02 (sim harness lives in tests/sim/, codec lives in src/Mbproxy/Bcd/ — fully disjoint).
  • Phase 02 ∥ Phase 03 (codec is pure logic in src/Mbproxy/Bcd/; plumbing is in src/Mbproxy/Proxy/ — disjoint).
  • Phases 01, 02, and 03 can also run all three at once (each touches a different directory).

Required pattern:

  1. Spawn each parallel agent with isolation: "worktree" (Agent tool's worktree mode creates an isolated git checkout).
  2. Each agent gets ONE phase doc + design.md + dl205.md.
  3. Each agent runs its phase gate locally before its worktree is committed.
  4. Merge order: lower phase number first. Resolve conflicts manually if the agents drifted outside their declared output scope (which they shouldn't).
  5. After merge, re-run the phase 00 smoke test plus both merged phases' tests to confirm no integration regression.

Hard rules — anti-patterns that break parallel work:

  • Any two phases editing the same .csproj PackageReference list at the same time. Phase 00 owns the initial csproj; later phases append PackageReferences atomically and a parallel pair must coordinate via separate <ItemGroup> blocks or sequential merges.
  • Running phase 04 in parallel with anything (it integrates two prior phases — by definition it touches their outputs).
  • Running phase 06 in parallel with anything (the hot-reload reconcile inspects state from listener supervisor + rewriter + counters; it has the widest cross-cut).
  • Spawning more than 3 concurrent worktree agents (review/merge overhead grows superlinearly and the value disappears).

Phase gate template

Every phase MUST be green on all of these before its branch is merged:

  1. Build is clean. dotnet build src/Mbproxy/Mbproxy.csproj -c Debug with zero warnings. <TreatWarningsAsErrors>true</TreatWarningsAsErrors> is set in phase 00 and stays set forever.
  2. All unit tests pass. dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category!=E2E is green.
  3. E2E tests pass when the simulator is available. dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category=E2E --blame-hang-timeout 2m is green on a machine with Python + pymodbus installed. The --blame-hang-timeout is mandatory — never run E2E without it. Skipped tests (due to missing simulator) don't count as failures, but ANY test added in this phase must NOT skip when the sim IS available, and every E2E test MUST carry a [Fact(Timeout = …)] per the Test discipline rules below.
  4. No regressions in any prior phase's tests. The full suite stays green.
  5. No new public types beyond what the phase doc declares. Scope creep is a gate fail. If a needed type is missing from the doc, update the doc first.
  6. No TODO / FIXME / HACK comments committed. Either resolve or file in the Deferred section below.
  7. Design / docs are in sync. If a design decision changed during the phase, ../design.md is updated in the same PR. Mirror the change to ../../CLAUDE.md's Architecture summary only if it shifts one of the headline bullets.
  8. Phase doc itself is updated to reflect any clarifications discovered during implementation, so the next subagent picking up the project doesn't relearn what this one learned.
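
Gate item 6 is easy to check by script. The following is a rough Python sketch of such a check, assuming markers only matter inside `//`, `/* */`, or `#` comments and are written in upper case. The find_markers helper and the sample text are hypothetical, not tooling that exists in the repository.

```python
import re

# A comment opener followed, later on the same line, by a banned marker word.
MARKER = re.compile(r"(//|/\*|#).*\b(TODO|FIXME|HACK)\b")


def find_markers(text):
    """Return (line_number, marker) for each comment line carrying a marker."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        match = MARKER.search(line)
        if match:
            hits.append((number, match.group(2)))
    return hits


SAMPLE = """\
public int Decode(ushort raw) // TODO: validate nibbles
{
    var todoList = raw;   // plain identifier, not a marker
    return 0; /* HACK until phase 04 */
}
"""

SAMPLE_HITS = find_markers(SAMPLE)
print(SAMPLE_HITS)
```

A non-empty result would fail the gate. Each hit is then either resolved or moved to the Deferred section.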

Test discipline

  • Framework: xUnit (v3 if available, v2 otherwise) + Shouldly for assertions. Never Assert.Equal(x, y) — always y.ShouldBe(x). Never Assert.True(p) — always p.ShouldBeTrue("reason").
  • Categories: [Trait("Category", "Unit")] (default; no traits needed), [Trait("Category", "E2E")] (needs simulator), [Trait("Category", "Stress")] (slow / load-bearing — opt-in only).
  • No mocks for code we own. Exercise our types directly. Mock only at the network/file/process boundary — and prefer a real local socket / real temp file over a mock when feasible.
  • Test naming: MethodOrScenario_Condition_ExpectedOutcome. Example: BcdCodec_Decode16_Returns1234_For0x1234.
  • One assertion per test where reasonable. Multi-assertion tests are acceptable when they assert facets of the same scenario; never when they're really separate tests glued together.
  • Every [Trait("Category","E2E")] test MUST declare a hard timeout via [Fact(Timeout = N)] (xUnit v3, milliseconds). Default: 5_000 ms. Expand per-test only when the test genuinely needs longer (concurrent bursts > 100 ops, reload-propagation debounce, graceful-shutdown drain) — and add a one-line comment explaining why. Start tight; raise only when a real test fails with a non-deadlock reason. Reason this matters: the existing fixtures use synchronous NModbus calls and stub TCP servers that do not honor TestContext.Current.CancellationToken — without [Fact(Timeout=…)], a deadlock in the proxy hangs the runner indefinitely. The same rule applies to [Trait("Category","Stress")]. Unit tests are exempt unless they touch real sockets or processes.
  • Run E2E with a hang backstop. The phase gate's E2E command is dotnet test ... --filter Category=E2E --blame-hang-timeout 2m. The --blame-hang-timeout is a process-level safety net in case a test's individual Timeout somehow doesn't fire (e.g. an unmanaged thread blocking finalization).
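
For readers unfamiliar with BCD, the example test name BcdCodec_Decode16_Returns1234_For0x1234 asserts that the register value 0x1234 decodes to the decimal number 1234, one decimal digit per nibble. Below is a minimal Python sketch of that arithmetic. The function names are illustrative, and rejecting invalid nibbles with an exception is an assumption; the real codec in src/Mbproxy/Bcd/ may handle them differently.

```python
def bcd_decode16(raw):
    """Decode a 16-bit BCD register (four decimal nibbles) into an int."""
    value = 0
    for shift in (12, 8, 4, 0):  # most significant nibble first
        nibble = (raw >> shift) & 0xF
        if nibble > 9:
            # Assumption: a nibble of A-F is treated as an error.
            raise ValueError(f"invalid BCD nibble {nibble:#x} in {raw:#06x}")
        value = value * 10 + nibble
    return value


def bcd_encode16(value):
    """Encode an int in 0..9999 as a 16-bit BCD register."""
    if not 0 <= value <= 9999:
        raise ValueError(f"{value} does not fit in four BCD digits")
    raw = 0
    for shift in (0, 4, 8, 12):  # least significant digit first
        value, digit = divmod(value, 10)
        raw |= digit << shift
    return raw


print(bcd_decode16(0x1234))       # 1234
print(hex(bcd_encode16(1234)))    # 0x1234
```

Decoding and encoding are inverses on valid input, which is the property a bidirectional rewriter relies on.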

Deferred

A running list of things explicitly NOT done in any current phase. When a phase reveals one, add it here so it isn't forgotten and so the deferral is visible at review time:

  • (none yet)

Cross-references