mbproxy: add in-flight read coalescing (Phase 10)

When two or more upstream clients send the same FC03/FC04 read while a
matching request is already in flight on the same PLC's multiplexed backend
socket, attach the late arrivals to the existing
InFlightRequest.InterestedParties list instead of opening a second backend
round-trip. The single backend response fans out to every attached party
with each party's original MBAP TxId restored individually. Zero
post-response staleness: coalescing operates entirely within the in-flight
window (microseconds to ~10 ms typical); the proxy is NOT a cache layer.

Headline mechanism:
- New record struct CoalescingKey(UnitId, Fc, StartAddress, Qty) keys the
  per-PLC InFlightByKeyMap. FC03 and FC04 are separate Modbus tables and
  never share a key; different unit IDs never coalesce; writes (FC06/FC16)
  bypass the coalescing path entirely.
- InFlightByKeyMap uses a simple lock around a Dictionary; atomic
  TryAttachOrCreate either appends a new party to the in-flight request's
  mutable List<InterestedParty> or invokes a factory to build a fresh entry.
  Per-entry MaxParties cap (default 32) bounds fan-out cost; past the cap,
  the next arrival opens a new entry.
- PlcMultiplexer.OnUpstreamFrameAsync takes the coalescing path for
  FC03/FC04 when Mbproxy.Resilience.ReadCoalescing.Enabled. The factory
  closure does the Phase-9 work (allocate TxId, add to CorrelationMap); the
  channel send happens AFTER returning from TryAttachOrCreate so the map
  lock is not held across the async send.
- Response fan-out in RunBackendReaderAsync removes the entry from
  InFlightByKeyMap before iterating InterestedParties, ensuring no
  concurrent attach can mutate the list during iteration.
- Cascade + watchdog paths also drain the key map so a stale entry cannot
  outlive its backend round-trip.

Counter accounting balance (per snapshot): CoalescedHitCount +
CoalescedMissCount equals total FC03 + FC04 requests since startup. Even
with coalescing disabled, every read still bumps Miss so dashboard math
stays balanced.

New surface (additive only):
- src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs
- src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs
- src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs
- ReadCoalescingOptions on ResilienceOptions
- CoalescedHitCount / CoalescedMissCount / CoalescedResponseToDeadUpstream
  counters surfaced on /status.json per PLC and as a compact "Coal" cell on
  the HTML status page.

Phase 9 test patch: TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire
previously read the same register from both clients (which now coalesces).
Patched to read two different addresses so the test still proves distinct
backend TxIds without violating the coalescing contract.

Tests added: 24 new (19 unit + 5 E2E):
- CoalescingKeyTests (5)
- InFlightByKeyMapTests (6, includes concurrent stress)
- ReadCoalescingTests (8, stub-backend with deterministic delay)
- ReadCoalescingE2ETests (5, pymodbus simulator; coalescing-active during
  overlap is proven against the stub, not the sim, due to pymodbus 3.13's
  known concurrent-frame bug)

Total: 325 tests passing (282 unit + 43 E2E).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@@ -2,7 +2,7 @@
When two or more upstream clients send the same FC03/FC04 request to the same PLC while a matching request is already in flight, attach the late arrivals to the existing in-flight entry and fan out the single backend response to all attached clients. Operates entirely within the in-flight window (microseconds to ~10 ms typical) — no post-response caching, no TTL, no staleness contract change.
**Status:** post-1.0 follow-on, depends on Phase 9.
**Status:** shipped (2026-05-14). All gate items green.
**Depends on:** Phase 09 (multiplexer + `InFlightRequest` with `InterestedParties` list shape).
**Parallel-safe with:** nothing. The phase modifies `PlcMultiplexer.OnFrame` and the backend reader fan-out path; both are tightly coupled.
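
Fanning one backend response out to several clients means each copy must carry the TxId that client originally sent. The sketch below illustrates that step only and is not the project's code: `FanOut.WithTxId` is a hypothetical helper and the sample frame is invented. It relies on the standard MBAP layout, in which the transaction id occupies the first two bytes in big-endian order.

```csharp
using System;
using System.Buffers.Binary;

// One backend response, to be delivered to three attached parties.
byte[] backendFrame =
{
    0x12, 0x34,             // proxy-allocated TxId (big-endian)
    0x00, 0x00,             // protocol id (0 for Modbus)
    0x00, 0x05,             // length: unit id + PDU bytes that follow
    0x01,                   // unit id
    0x03, 0x02, 0x00, 0x2A  // FC03, byte count 2, register value 42
};

// Hypothetical original TxIds of the three upstream requests.
foreach (ushort txId in new ushort[] { 7, 8, 9 })
{
    byte[] copy = FanOut.WithTxId(backendFrame, txId);
    Console.WriteLine($"{txId}: {Convert.ToHexString(copy)}");
}

static class FanOut
{
    // Returns a copy of the frame with MBAP bytes 0-1 overwritten by the
    // party's original transaction id. The source frame is left untouched
    // so it can be reused for the next party.
    public static byte[] WithTxId(ReadOnlySpan<byte> frame, ushort originalTxId)
    {
        if (frame.Length < 7)
            throw new ArgumentException("Frame is shorter than a 7-byte MBAP header.", nameof(frame));

        byte[] copy = frame.ToArray();
        BinaryPrimitives.WriteUInt16BigEndian(copy, originalTxId);
        return copy;
    }
}
```

Copying per party keeps the restore step free of shared mutable state; a real implementation might instead patch a pooled buffer before each send.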
@@ -306,3 +306,21 @@ If you're the agent picking up this phase:
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (rates / percentiles / availability — coalescing-ratio joins this tier).
- Modbus unit-ID semantics that make coalescing-key uniqueness load-bearing: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Function Code Support" and "Coils and Discrete Inputs".
- Counter snapshot backwards-compat policy that this phase respects (additive only): [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
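
To make the key-uniqueness point concrete, here is a small sketch of value equality on the coalescing key. The four field names come from the Phase 10 commit message; the field types (`byte`, `ushort`) are assumptions, and the `HashSet` stands in for whatever structure actually tracks in-flight keys.

```csharp
using System;
using System.Collections.Generic;

var fc03 = new CoalescingKey(UnitId: 1, Fc: 0x03, StartAddress: 100, Qty: 2);
var same = new CoalescingKey(1, 0x03, 100, 2);
var fc04 = fc03 with { Fc = 0x04 };          // other Modbus table
var otherUnit = fc03 with { UnitId = 2 };    // other device behind the same socket

var inFlight = new HashSet<CoalescingKey> { fc03 };
Console.WriteLine(inFlight.Contains(same));      // True
Console.WriteLine(inFlight.Contains(fc04));      // False
Console.WriteLine(inFlight.Contains(otherUnit)); // False

// Field types are assumed; a record struct compares all four fields by value.
record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);
```

Because equality covers every field, a read can only attach to an in-flight request that targets the same unit, table, start address, and quantity.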
## Clarifications discovered during implementation
These are the implementation details that the original phase doc did not pin down; recorded here so the next reader doesn't relearn them.
1. **`InterestedParties` is a `List<InterestedParty>` cast to `IReadOnlyList`.** Phase 9 typed the field as `IReadOnlyList<InterestedParty>` to leave room for any implementation; Phase 10 specifically requires a mutable list so the map can append parties under its lock. The list is mutated only under `InFlightByKeyMap`'s lock, and the reader's fan-out iterates the list ONLY after the entry has been removed from the map — by that point no further appends are possible. There is no separate snapshot copy.
2. **The factory closure performs the Phase-9 work (allocate TxId + add to `CorrelationMap`) but does NOT enqueue to the outbound channel.** The channel send happens AFTER `TryAttachOrCreate` returns, so the `InFlightByKeyMap` lock is never held across a potentially asynchronous send. The factory passes its allocated proxy TxId and `InFlightRequest` back to the caller through closure-captured locals. If the TxId allocator is saturated, the factory returns a "stub" `InFlightRequest` with no `CorrelationMap` entry; the caller detects this and delivers Modbus exception 04 (Server Device Failure).
3. **`coalescedHitCount + coalescedMissCount` = total FC03/FC04 requests (always).** Even when coalescing is disabled, every FC03/04 request bumps `coalescedMissCount` from the non-coalescing path. This keeps the math balanced for dashboard consumers regardless of feature state. Writes (FC06/FC16) are NOT in this accounting — they never touch the coalescing path.
4. **Cascade and watchdog paths drain `InFlightByKeyMap` too.** On backend disconnect, `TearDownBackendAsync` calls `_inFlightByKey.DrainAll()` so a brand-new identical request through the freshly-reconnected backend is treated as a miss. On per-request watchdog timeout, `_inFlightByKey.TryRemove(key)` runs alongside the CorrelationMap removal so subsequent identical requests start fresh.
5. **Live config accessor, not `IOptionsMonitor`-by-value.** The multiplexer takes a `Func<ReadCoalescingOptions>` accessor that resolves to `optionsMonitor.CurrentValue.Resilience.ReadCoalescing` per PDU. This keeps the constructor surface lightweight (no DI on `IOptionsMonitor<MbproxyOptions>`) and gives tests a clean way to pin a fixed config. Hot-reload of `Enabled` propagates because the accessor is read on every incoming FC03/FC04 request.
6. **Phase 9's `TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire` test required a one-line edit.** It asserted ≥2 distinct backend TxIds from two identical FC03 reads — exactly the case Phase 10 now coalesces. The test was patched to use DIFFERENT start addresses so the two reads remain non-coalescable while still proving distinct proxy TxIds. The rest of Phase 9's tests are unaffected.
7. **pymodbus simulator and coalescing.** The simulator's `last_pdu`-overwrite bug (documented in design.md) means we cannot E2E-verify "five concurrent identical reads → 1 backend round-trip" against pymodbus. The headline-stress correctness claim is therefore proven against the stub backend in `ReadCoalescingTests` (real loopback sockets, deterministic 200–400 ms response delay so the in-flight window is wide enough for racing requests to actually overlap). The E2E suite verifies counter accounting, status-page surfacing, and the rewriter integration on serialised reads — i.e. the integration boundary, not the concurrency proof.
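
Several of the clarifications above (append under the map lock, remove before iterating, drain on disconnect, balanced hit/miss counting, per-request config read) can be seen together in one small sketch. It is a simplified illustration under stated assumptions, not the shipped code: parties are plain strings, the method signatures are guesses, the `MaxParties` cap and the TxId/`CorrelationMap` work are omitted, and `Func<bool>` stands in for the `Func<ReadCoalescingOptions>` accessor.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.CodeAnalysis;

var map = new InFlightByKeyMap();
var key = new CoalescingKey(UnitId: 1, Fc: 0x03, StartAddress: 100, Qty: 2);
long hits = 0, misses = 0;
bool enabled = true;
Func<bool> coalescingEnabled = () => enabled; // stand-in for the live options accessor

void OnRead(string client)
{
    // The accessor is consulted on every request, so a config flip applies at once.
    if (!coalescingEnabled()) { misses++; return; }

    bool created = map.TryAttachOrCreate(key, client, () => new InFlightRequest());
    if (created) misses++; else hits++;
    // A real proxy would enqueue the backend send here, after the lock is
    // released, and only when `created` is true.
}

OnRead("A"); OnRead("B"); OnRead("C");  // one miss, then two hits
enabled = false;
OnRead("D");                            // feature off: still counted as a miss
Console.WriteLine($"hits={hits} misses={misses}");  // hits=2 misses=2

// Response path: take the entry out of the map first, then iterate its list.
if (map.TryRemove(key, out var done))
    Console.WriteLine(string.Join(",", done.Parties));  // A,B,C

// Disconnect path: everything still tracked is dropped in one step.
enabled = true;
OnRead("E");
Console.WriteLine($"drained={map.DrainAll()}");  // drained=1

record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

sealed class InFlightRequest
{
    // Appended to only while InFlightByKeyMap holds its lock.
    public List<string> Parties { get; } = new();
}

sealed class InFlightByKeyMap
{
    private readonly object _gate = new();
    private readonly Dictionary<CoalescingKey, InFlightRequest> _entries = new();

    // Returns true when the factory ran (a miss) and false when the party
    // joined an entry that was already in flight (a hit).
    public bool TryAttachOrCreate(CoalescingKey key, string party, Func<InFlightRequest> factory)
    {
        lock (_gate)
        {
            if (_entries.TryGetValue(key, out var existing))
            {
                existing.Parties.Add(party);
                return false;
            }
            var created = factory();
            created.Parties.Add(party);
            _entries.Add(key, created);
            return true;
        }
    }

    // Once an entry is removed, no later attach can reach its party list.
    public bool TryRemove(CoalescingKey key, [NotNullWhen(true)] out InFlightRequest? request)
    {
        lock (_gate) return _entries.Remove(key, out request);
    }

    // Clears every entry and reports how many were dropped.
    public int DrainAll()
    {
        lock (_gate) { int n = _entries.Count; _entries.Clear(); return n; }
    }
}
```

The property that matters is ordering: because removal and append share one lock, the fan-out loop can read `Parties` without a snapshot copy once `TryRemove` has returned.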