wwtools/mbproxy/docs/plan
Joseph Doherty 1db900edef mbproxy: add opt-in response cache (Phase 11)
Layers a per-PLC, per-tag response cache on top of Phase 10's coalescing.
Cache is OFF by default per tag (CacheTtlMs = 0); a fresh deployment with no
TTL config behaves identically to Phase 10. Operators opt tags in by setting
CacheTtlMs > 0 on a BcdTagOptions entry (or DefaultCacheTtlMs > 0 on a
PlcOptions entry), explicitly acknowledging the staleness window.

Cache lookup order: cache -> coalesce -> backend. A cache hit short-circuits
both Phase 10's coalescing path and Phase 9's backend send. Cache stores
POST-rewriter PDU bytes so hits never re-invoke the BCD rewriter. FC06/FC16
write responses invalidate every cached entry whose address range overlaps
the write (half-open interval math).
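
The overlap rule can be sketched as follows. This is a minimal illustration rather than the project's CacheInvalidator: the names RangeOverlap and Overlaps are hypothetical, each range is modelled as a start address plus a register quantity, and the unit-ID and function-code filters mentioned in the tests are left out.

```csharp
using System;

// Hypothetical helper for illustration; not the real CacheInvalidator.
public static class RangeOverlap
{
    // Treats each range as the half-open interval [start, start + qty).
    // Two half-open intervals overlap iff each one starts before the other ends.
    public static bool Overlaps(int startA, int qtyA, int startB, int qtyB)
    {
        // Assumption: an empty (zero-quantity) range overlaps nothing.
        if (qtyA <= 0 || qtyB <= 0) return false;

        int endA = startA + qtyA; // exclusive end
        int endB = startB + qtyB; // exclusive end
        return startA < endB && startB < endA;
    }
}
```

Under this sketch, a cached read of 10 registers starting at address 100 overlaps a single-register write to address 109, but not a write to address 110, which is adjacent rather than overlapping.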

New types (Mbproxy.Proxy.Cache, all internal):
- CacheKey (record-struct, same shape as CoalescingKey but kept SEPARATE so
  the two phases evolve independently).
- CacheEntry, ResponseCache (IDisposable; LRU + PeriodicTimer eviction
  loop), CacheInvalidator (pure overlap matcher), CacheLogEvents (stable
  mbproxy.cache.* names).

Multi-tag range TTL = min(TTLs); any tag with TTL = 0 in the range disables
caching for the whole read (conservative-by-design).
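
A minimal sketch of the multi-tag rule, assuming each tag in the read range has already been resolved to a non-nullable TTL in milliseconds. RangeTtl and Effective are hypothetical names, not types from the codebase, and a return value of 0 stands for "do not cache this read".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration; not part of Mbproxy.Proxy.Cache.
public static class RangeTtl
{
    // Returns the TTL (ms) to apply to a read spanning the given tags,
    // or 0 when the read must not be cached.
    public static int Effective(IReadOnlyList<int> tagTtlsMs)
    {
        // Assumption: a read that resolves to no tags is not cached.
        if (tagTtlsMs.Count == 0) return 0;

        int min = int.MaxValue;
        foreach (int ttl in tagTtlsMs)
        {
            // Any tag that opted out (TTL = 0) disables caching for the whole read.
            if (ttl <= 0) return 0;
            if (ttl < min) min = ttl;
        }
        return min;
    }
}
```

For example, tags with TTLs of 1000, 250 and 5000 ms give an effective TTL of 250 ms, while 1000, 0 and 5000 ms give 0.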

Options surface:
- BcdTagOptions.CacheTtlMs (nullable int; null = fall through to PLC default)
- PlcOptions.DefaultCacheTtlMs
- MbproxyOptions.Cache.{AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs}
- TTL > 60_000 ms requires Cache.AllowLongTtl = true (reload validation).
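
The fall-through and long-TTL rules above can be sketched as below. CacheTtlRules, Resolve and IsValid are hypothetical names; the sketch assumes DefaultCacheTtlMs is a plain int and that negative TTLs are rejected, neither of which is stated in the commit.

```csharp
using System;

// Hypothetical helper for illustration; the real reload validation may differ.
public static class CacheTtlRules
{
    public const int LongTtlThresholdMs = 60_000;

    // A null per-tag TTL falls through to the PLC-level default.
    public static int Resolve(int? tagCacheTtlMs, int plcDefaultCacheTtlMs)
        => tagCacheTtlMs ?? plcDefaultCacheTtlMs;

    // TTLs above the threshold are valid only when AllowLongTtl is set.
    public static bool IsValid(int ttlMs, bool allowLongTtl)
    {
        if (ttlMs < 0) return false; // assumption: negative TTLs are rejected
        return ttlMs <= LongTtlThresholdMs || allowLongTtl;
    }
}
```

With these rules, a tag with no explicit TTL on a PLC whose default is 500 ms resolves to 500 ms, and a TTL of 120_000 ms passes validation only when Cache.AllowLongTtl is true.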

Admin counters (Tier 1.8 + Tier 2 cache-memory KPIs from docs/kpi.md):
- CacheHitCount, CacheMissCount, CacheInvalidations on ProxyCounters.
- CacheEntryCount, CacheBytes via a new ICacheStatsProvider snapshot path.
- /status.json and the HTML page surface a new Cache cell per PLC row.

Hot-reload: any tag-list change to a PLC reseats the per-PLC context with a
fresh cache; the old cache is disposed inside ReplaceContextAsync. Per-tag
flush granularity is intentionally not implemented in v1.

PLCs with no cache-eligible tags (every resolved tag has CacheTtlMs = 0)
get Cache = null on the context and skip the eviction timer entirely, so
the no-cache path is byte-identical to Phase 10.

Tests (32 new unit + 5 new E2E = 37 new; suite now 314 unit + 48 E2E):
- CacheKeyTests, CacheEntryTests (records + boundary semantics).
- CacheInvalidatorTests: full overlap, both partials, adjacent-not-
  overlapping, disjoint, different unit ID + auxiliary FC-filter / zero-qty.
- ResponseCacheTests: round-trip, lazy expiry, range invalidation,
  unit-id filter, LRU bound, LRU access tracking, concurrent get/set,
  dispose, clear, approximate-bytes accounting.
- ResponseCacheMultiplexerTests (stub-backend): hit short-circuits
  coalescing, BCD-decoded bytes are cached not raw, FC06 invalidates
  overlapping, non-overlapping write does not invalidate, multi-tag
  TTL=min rule, regression-cache-disabled-by-default-is-Phase-10, hit
  works even when backend unreachable.
- ResponseCacheE2ETests (pymodbus DL205 sim, sequential reads):
  * Headline: 10 reads with TTL=1000 ms -> 9 hits, 1 miss, 1 backend trip.
  * TTL expiry path with sleep > TTL.
  * Write invalidation through the proxy on a scratch register.
  * BCD-decoded bytes are cached, not raw BCD nibbles.
  * Regression: Cache disabled by default -> behaviour byte-identical to
    Phase 10.

Pre-existing flake hardened: BackendDisconnect_CascadesToAllUpstreams now
polls briefly for the cascade counter to absorb the inherent scheduling
gap between "upstream EOF observed" and "counter incremented inside
TearDownBackendAsync." Counter semantics unchanged.
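
The "poll briefly" pattern can be sketched as a small helper. This is an illustration under assumptions, not the code used in the test: Poll and UntilAsync are hypothetical names, and the timeout and interval values are left to the caller.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

// Hypothetical test helper for illustration; not taken from the test suite.
public static class Poll
{
    // Re-evaluates the condition until it holds or the timeout elapses.
    // Returns true if the condition was observed, false on timeout.
    public static async Task<bool> UntilAsync(
        Func<bool> condition, TimeSpan timeout, TimeSpan interval)
    {
        var elapsed = Stopwatch.StartNew();
        while (true)
        {
            if (condition()) return true;
            if (elapsed.Elapsed >= timeout) return false;
            await Task.Delay(interval);
        }
    }
}
```

A test would await the helper with a predicate over the counter and assert on the returned bool, so a counter that is incremented a few milliseconds after the upstream EOF is still observed without changing what the counter means.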

Phase doc updated with implementation clarifications discovered during
this work (CacheKey kept separate from CoalescingKey, LastUsedTick is
long, FC06/FC16 startAddr/qty parsing extension, cache-pre-connect
short-circuit, write-invalidation only on successful responses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:08:51 -04:00

mbproxy — implementation plan

Phase-by-phase implementation plan for the mbproxy service. Each phase is a self-contained work spec with explicit deliverables, tests, and a gate checklist that must be green before the next phase begins. Settled against the design plan in ../design.md on 2026-05-13.

Briefing a subagent for a phase: hand it exactly three documents — the phase doc, ../design.md, and ../../DL260/dl205.md. Tell it not to read other phase docs unless its own doc lists them under "Cross-references". The phase doc IS the contract.

Phase graph

| #  | Phase | Depends on | Parallel-safe with |
|----|-------|------------|--------------------|
| 00 | Bootstrap — host + DI + Serilog + options POCOs | | (must run first, alone) |
| 01 | Simulator harness — pymodbus xUnit fixture | 00 | 02 |
| 02 | BCD codec — pure encode/decode logic | 00 | 01, 03 |
| 03 | Proxy plumbing — TcpListener + 1:1 byte forwarder | 00 | 02 |
| 04 | Rewriter integration — wire codec into proxy | 02, 03 | |
| 05 | Listener supervisor — Polly auto-recovery | 03 | |
| 06 | Hot-reload — IOptionsMonitor reconcile | 05 | |
| 07 | Status page — Kestrel admin endpoint | 05, 06 | |
| 08 | Service hardening — Windows service + shutdown | 04, 07 | |
| 09 | TxId multiplexing — single backend connection per PLC (post-1.0 follow-on) | 04, 05, 07 | |
| 10 | Read coalescing — in-flight FC03/04 dedup (post-1.0 follow-on) | 09 | |
| 11 | Response cache — short-TTL post-response cache, bounded staleness (post-1.0; design-contract pivot) | 10 | |

        ┌── 01 (sim) ──┐
00 ─────┼── 02 (codec) ─┼──── 04 ───┐
        └── 03 (plumbing)┴── 05 ─── 06 ─── 07 ─── 08
                              │
                              └─────────────────→ 09 ───→ 10 ───→ 11 (post-1.0)

Phases 09, 10, and 11 are post-1.0 follow-ons, not part of the initial 1.0 release.

  • Phase 09 rewires the connection layer to lift the H2-ECOM100's 4-concurrent-client cap as an operational ceiling. Pick it up only after Phase 08 has shipped and field experience confirms the 4-client cap is a real production problem (not just a theoretical one).
  • Phase 10 plugs into Phase 09's InterestedParties seam to coalesce same-key FC03/04 reads within the in-flight window. Zero post-response staleness. Worth doing only if field telemetry shows meaningful read overlap (≥ 2× duplicate-read traffic from concurrent HMIs / historians).
  • Phase 11 extends the "served without backend traffic" window from in-flight microseconds (Phase 10) to operator-configurable seconds via a per-tag TTL response cache. This is a deliberate design-contract pivot — the proxy stops being purely transparent and becomes an opt-in cache layer with bounded staleness. The cache is OFF by default; opting tags in is the operator's explicit acknowledgement of the staleness window. Pick up only if Phase 10's coalescing-ratio under real load reveals enough cross-poll overlap to justify staleness as a trade.

Working with subagents

Default: one subagent per phase, sequential

Spawn one Agent (Sonnet or Opus) per phase in order. Each agent reads exactly three documents: its phase doc, ../design.md, and ../../DL260/dl205.md.

That is sufficient context. The agent must NOT invent scope beyond the phase doc's "Outputs" section. If it discovers a design-affecting issue, it must STOP and surface the issue rather than improvise — designs change in ../design.md, not silently in code.

Advanced: parallel subagents within a single phase boundary

Two phases marked "Parallel-safe with" each other can be picked up by independent subagents at the same time. The only safe parallel windows in this plan are:

  • Phase 01 ∥ Phase 02 (sim harness lives in tests/sim/, codec lives in src/Mbproxy/Bcd/ — fully disjoint).
  • Phase 02 ∥ Phase 03 (codec is pure logic in src/Mbproxy/Bcd/; plumbing is in src/Mbproxy/Proxy/ — disjoint).
  • Phases 01, 02, and 03 can also all run at once (each touches a different directory).

Required pattern:

  1. Spawn each parallel agent with isolation: "worktree" (Agent tool's worktree mode creates an isolated git checkout).
  2. Each agent gets ONE phase doc + design.md + dl205.md.
  3. Each agent runs its phase gate locally before its worktree is committed.
  4. Merge order: lower phase number first. Resolve conflicts manually if the agents drifted outside their declared output scope (which they shouldn't).
  5. After merge, re-run the phase 00 smoke test plus both merged phases' tests to confirm no integration regression.

Hard rules — anti-patterns that break parallel work:

  • Any two phases editing the same .csproj PackageReference list at the same time. Phase 00 owns the initial csproj; later phases append PackageReferences atomically and a parallel pair must coordinate via separate <ItemGroup> blocks or sequential merges.
  • Running phase 04 in parallel with anything (it integrates two prior phases — by definition it touches their outputs).
  • Running phase 06 in parallel with anything (the hot-reload reconcile inspects state from listener supervisor + rewriter + counters; it has the widest cross-cut).
  • Spawning more than 3 concurrent worktree agents (review/merge overhead grows superlinearly and the value disappears).

Phase gate template

Every phase MUST be green on all of these before its branch is merged:

  1. Build is clean. dotnet build src/Mbproxy/Mbproxy.csproj -c Debug with zero warnings. <TreatWarningsAsErrors>true</TreatWarningsAsErrors> is set in phase 00 and stays set forever.
  2. All unit tests pass. dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category!=E2E is green.
  3. E2E tests pass when the simulator is available. dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category=E2E --blame-hang-timeout 2m is green on a machine with Python + pymodbus installed. The --blame-hang-timeout is mandatory — never run E2E without it. Skipped tests (due to missing simulator) don't count as failures, but ANY test added in this phase must NOT skip when the sim IS available, and every E2E test MUST carry a [Fact(Timeout = …)] per the Test discipline rules below.
  4. No regressions in any prior phase's tests. The full suite stays green.
  5. No new public types beyond what the phase doc declares. Scope creep is a gate fail. If a needed type is missing from the doc, update the doc first.
  6. No TODO / FIXME / HACK comments committed. Either resolve or file in the Deferred section below.
  7. Design / docs are in sync. If a design decision changed during the phase, ../design.md is updated in the same PR — and only mirror to ../../CLAUDE.md's Architecture summary if the change shifts one of the headline bullets.
  8. Phase doc itself is updated to reflect any clarifications discovered during implementation, so the next subagent picking up the project doesn't relearn what this one learned.

Test discipline

  • Framework: xUnit (v3 if available, v2 otherwise) + Shouldly for assertions. Never Assert.Equal(x, y) — always y.ShouldBe(x). Never Assert.True(p) — always p.ShouldBeTrue("reason").
  • Categories: [Trait("Category", "Unit")] (default; no traits needed), [Trait("Category", "E2E")] (needs simulator), [Trait("Category", "Stress")] (slow / load-bearing — opt-in only).
  • No mocks for code we own. Exercise our types directly. Mock only at the network/file/process boundary — and prefer a real local socket / real temp file over a mock when feasible.
  • Test naming: MethodOrScenario_Condition_ExpectedOutcome. Example: BcdCodec_Decode16_Returns1234_For0x1234.
  • One assertion per test where reasonable. Multi-assertion tests are acceptable when they assert facets of the same scenario; never when they're really separate tests glued together.
  • Every [Trait("Category","E2E")] test MUST declare a hard timeout via [Fact(Timeout = N)] (xUnit v3, milliseconds). Default: 5_000 ms. Expand per-test only when the test genuinely needs longer (concurrent bursts > 100 ops, reload-propagation debounce, graceful-shutdown drain) — and add a one-line comment explaining why. Start tight; raise only when a real test fails with a non-deadlock reason. Reason this matters: the existing fixtures use synchronous NModbus calls and stub TCP servers that do not honor TestContext.Current.CancellationToken — without [Fact(Timeout=…)], a deadlock in the proxy hangs the runner indefinitely. The same rule applies to [Trait("Category","Stress")]. Unit tests are exempt unless they touch real sockets or processes.
  • Run E2E with a hang backstop. The phase gate's E2E command is dotnet test ... --filter Category=E2E --blame-hang-timeout 2m. The --blame-hang-timeout is a process-level safety net in case a test's individual Timeout somehow doesn't fire (e.g. an unmanaged thread blocking finalization).
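
The rules above can be illustrated with a small test that touches a real loopback socket and therefore carries a hard timeout. This is a sketch under assumptions: the class and test names are hypothetical, it exercises a throwaway echo server rather than the proxy, and it assumes the xUnit and Shouldly packages are referenced. The test is async because, as far as I know, xUnit applies Timeout to async tests only.

```csharp
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;
using Shouldly;
using Xunit;

// Hypothetical example; not a test from the Mbproxy suite.
public class LoopbackSocketTests
{
    // Touches a real socket, so it declares a hard timeout (default 5_000 ms).
    [Fact(Timeout = 5_000)]
    public async Task LoopbackEcho_SingleByte_ReturnsSameByte()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0); // port 0 = OS picks a free port
        listener.Start();
        try
        {
            int port = ((IPEndPoint)listener.LocalEndpoint).Port;

            // Minimal echo server: read one byte, write it back.
            Task server = Task.Run(async () =>
            {
                using TcpClient peer = await listener.AcceptTcpClientAsync();
                NetworkStream s = peer.GetStream();
                var buf = new byte[1];
                int n = await s.ReadAsync(buf, 0, 1);
                await s.WriteAsync(buf, 0, n);
            });

            using var client = new TcpClient();
            await client.ConnectAsync(IPAddress.Loopback, port);
            NetworkStream stream = client.GetStream();
            await stream.WriteAsync(new byte[] { 0x42 }, 0, 1);

            var reply = new byte[1];
            int read = await stream.ReadAsync(reply, 0, 1);
            await server;

            // Two facets of the same scenario: one byte came back, and it is the byte sent.
            read.ShouldBe(1);
            reply[0].ShouldBe((byte)0x42);
        }
        finally
        {
            listener.Stop();
        }
    }
}
```

A test that needs the simulator would add [Trait("Category", "E2E")] on top of the same shape, and any timeout above the default would carry a one-line comment giving the reason.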

Deferred

A running list of things explicitly NOT done in any current phase. When a phase reveals one, add it here so it isn't forgotten and so the deferral is visible at review time:

  • (none yet)

Cross-references