mbproxy: add in-flight read coalescing (Phase 10)

When two or more upstream clients send the same FC03/FC04 read while a
matching request is already in flight on the same PLC's multiplexed
backend socket, attach the late arrivals to the existing InFlightRequest
.InterestedParties list instead of opening a second backend round-trip.
The single backend response fans out to every attached party with each
party's original MBAP TxId restored individually. Zero post-response
staleness — coalescing operates entirely within the in-flight window
(microseconds to ~10 ms typical); the proxy is NOT a cache layer.

Headline mechanism:

- New record struct CoalescingKey(UnitId, Fc, StartAddress, Qty) keys
  the per-PLC InFlightByKeyMap. FC03 and FC04 are separate Modbus
  tables and never share a key; different unit IDs never coalesce;
  writes (FC06/FC16) bypass the coalescing path entirely.
- InFlightByKeyMap uses a simple lock around a Dictionary; atomic
  TryAttachOrCreate either appends a new party to the in-flight
  request's mutable List<InterestedParty> or invokes a factory to
  build a fresh entry. Per-entry MaxParties cap (default 32) bounds
  fan-out cost; past the cap, the next arrival opens a new entry.
- PlcMultiplexer.OnUpstreamFrameAsync takes the coalescing path for
  FC03/FC04 when Mbproxy.Resilience.ReadCoalescing.Enabled. The
  factory closure does the Phase-9 work (allocate TxId, add to
  CorrelationMap); the channel send happens AFTER returning from
  TryAttachOrCreate so the map lock is not held across the async send.
- Response fan-out in RunBackendReaderAsync removes the entry from
  InFlightByKeyMap before iterating InterestedParties, ensuring no
  concurrent attach can mutate the list during iteration.
- Cascade + watchdog paths also drain the key map so a stale entry
  cannot outlive its backend round-trip.

Counter accounting balance (per snapshot): CoalescedHitCount +
CoalescedMissCount equals total FC03 + FC04 requests since startup.
Even with coalescing disabled, every read still bumps Miss so dashboard
math stays balanced.

New surface (additive only):
- src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs
- src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs
- src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs
- ReadCoalescingOptions on ResilienceOptions
- CoalescedHitCount / CoalescedMissCount /
  CoalescedResponseToDeadUpstream counters surfaced on /status.json
  per PLC and as a compact "Coal" cell on the HTML status page.

Phase 9 test patch: TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire
previously read the same register from both clients (which now
coalesces). Patched to read two different addresses so the test still
proves distinct backend TxIds without violating the coalescing
contract.

Tests added: 24 new (19 unit + 5 E2E):
- CoalescingKeyTests (5)
- InFlightByKeyMapTests (6, includes concurrent stress)
- ReadCoalescingTests (8, stub-backend with deterministic delay)
- ReadCoalescingE2ETests (5, pymodbus simulator; coalescing-active
  during overlap is proven against the stub, not the sim, due to
  pymodbus 3.13's known concurrent-frame bug)

Total: 325 tests passing (282 unit + 43 E2E).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-14 02:26:06 -04:00
parent 56eee3c563
commit a2dba4bd07
25 changed files with 1888 additions and 52 deletions
+19 -1
View File
@@ -2,7 +2,7 @@
When two or more upstream clients send the same FC03/FC04 request to the same PLC while a matching request is already in flight, attach the late arrivals to the existing in-flight entry and fan out the single backend response to all attached clients. Operates entirely within the in-flight window (microseconds to ~10 ms typical) — no post-response caching, no TTL, no staleness contract change.
**Status:** post-1.0 follow-on, depends on Phase 9.
**Status:** shipped (2026-05-14). All gate items green.
**Depends on:** Phase 09 (multiplexer + `InFlightRequest` with `InterestedParties` list shape).
**Parallel-safe with:** nothing. The phase modifies `PlcMultiplexer.OnFrame` and the backend reader fan-out path; both are tightly coupled.
@@ -306,3 +306,21 @@ If you're the agent picking up this phase:
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (rates / percentiles / availability — coalescing-ratio joins this tier).
- Modbus unit-ID semantics that make coalescing-key uniqueness load-bearing: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Function Code Support" and "Coils and Discrete Inputs".
- Counter snapshot backwards-compat policy that this phase respects (additive only): [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
## Clarifications discovered during implementation
These are the implementation details that the original phase doc did not pin down; recorded here so the next reader doesn't relearn them.
1. **`InterestedParties` is a `List<InterestedParty>` cast to `IReadOnlyList`.** Phase 9 typed the field as `IReadOnlyList<InterestedParty>` to leave room for any implementation; Phase 10 specifically requires a mutable list so the map can append parties under its lock. The list is mutated only under `InFlightByKeyMap`'s lock, and the reader's fan-out iterates the list ONLY after the entry has been removed from the map — by that point no further appends are possible. There is no separate snapshot copy.
2. **The factory closure performs the Phase-9 work (allocate TxId + add to CorrelationMap) but does NOT enqueue to the outbound channel.** The channel send happens AFTER returning from `TryAttachOrCreate` so the InFlightByKey lock is not held across a potentially-async send. The factory communicates its allocated proxy TxId and InFlightRequest back to the caller through closure-captured locals. If the allocator is saturated, the factory returns a "stub" InFlightRequest with no CorrelationMap entry; the caller detects this and delivers a Modbus exception 04.
3. **`coalescedHitCount + coalescedMissCount` = total FC03/FC04 requests (always).** Even when coalescing is disabled, every FC03/04 request bumps `coalescedMissCount` from the non-coalescing path. This keeps the math balanced for dashboard consumers regardless of feature state. Writes (FC06/FC16) are NOT in this accounting — they never touch the coalescing path.
4. **Cascade and watchdog paths drain `InFlightByKeyMap` too.** On backend disconnect, `TearDownBackendAsync` calls `_inFlightByKey.DrainAll()` so a brand-new identical request through the freshly-reconnected backend is treated as a miss. On per-request watchdog timeout, `_inFlightByKey.TryRemove(key)` runs alongside the CorrelationMap removal so subsequent identical requests start fresh.
5. **Live config accessor, not `IOptionsMonitor`-by-value.** The multiplexer takes a `Func<ReadCoalescingOptions>` accessor that resolves to `optionsMonitor.CurrentValue.Resilience.ReadCoalescing` per PDU. This keeps the constructor surface lightweight (no DI on `IOptionsMonitor<MbproxyOptions>`) and gives tests a clean way to pin a fixed config. Hot-reload of `Enabled` propagates because the accessor is read on every incoming FC03/FC04 request.
6. **Phase 9's `TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire` test required a one-line edit.** It asserted ≥2 distinct backend TxIds from two identical FC03 reads — exactly the case Phase 10 now coalesces. The test was patched to use DIFFERENT start addresses so the two reads remain non-coalescable while still proving distinct proxy TxIds. The rest of Phase 9's tests are unaffected.
7. **pymodbus simulator and coalescing.** The simulator's `last_pdu`-overwrite bug (documented in design.md) means we cannot E2E-verify "five concurrent identical reads → 1 backend round-trip" against pymodbus. The headline-stress correctness claim is therefore proven against the stub backend in `ReadCoalescingTests` (real loopback sockets, deterministic 200400 ms response delay so the in-flight window is wide enough for racing requests to actually overlap). The E2E suite verifies counter accounting, status-page surfacing, and the rewriter integration on serialised reads — i.e. the integration boundary, not the concurrency proof.