wwtools/mbproxy/docs/plan/10-read-coalescing.md
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00


Phase 10 — Read coalescing (in-flight only, zero staleness)

When two or more upstream clients send the same FC03/FC04 request to the same PLC while a matching request is already in flight, attach the late arrivals to the existing in-flight entry and fan out the single backend response to all attached clients. Operates entirely within the in-flight window (microseconds to ~10 ms typical) — no post-response caching, no TTL, no staleness contract change.

Status: post-1.0 follow-on.
Depends on: Phase 09 (multiplexer + InFlightRequest with InterestedParties list shape).
Parallel-safe with: nothing. The phase modifies PlcMultiplexer.OnFrame and the backend reader fan-out path; both are tightly coupled.

Goal

Phase 9's multiplexer routes every upstream request individually, even when two upstream clients are asking for identical data. In a fleet of 54 PLCs where the HMI, historian, and engineering workstation all poll the same screen tags every second, an overlapping read can reach the backend up to three times when once would do. Because the H2-ECOM100 serializes internally at one request per scan, that redundant traffic compounds into measurable backend latency.

Phase 10 detects same-key reads within the in-flight window and serves them from a single backend response. Coalescing operates entirely between "first request sent to backend" and "response received from backend." Once the response is fanned out, the coalescing entry dies. No values are held past the response arrival; no invalidation logic; no design-doc change to the "not a polling/cache layer" stance.

Why this is safe — the zero-staleness argument

A coalesced response is a value the backend was going to return to the first request anyway. By the time the second client's request arrives, the first request is already on the wire to the PLC. The PLC's response represents the register values at the moment the PLC serviced the request. Even if the second request had been sent separately on its own backend round-trip, the H2-ECOM100's internal serialization would have queued it behind the first and returned either the same value or one sampled at most one extra PLC scan (≈ 2-10 ms) later.

In other words: the only thing Phase 10 changes is whether the proxy sends one or two requests to the PLC. The answer the upstream clients see is identical (or fresher than the "two requests" alternative, since coalescing means the second client doesn't wait for a second backend round-trip).

Outputs (new files in this phase)

src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs              # readonly record struct
src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs           # ConcurrentDictionary wrapper with atomic attach-or-create
src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs        # [LoggerMessage] vocab for this phase

tests/Mbproxy.Tests/Proxy/Multiplexing/CoalescingKeyTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/InFlightByKeyMapTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingE2ETests.cs

Files modified (existing files in this phase)

src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs       # OnFrame learns coalescing path; reader fans out
src/Mbproxy/Proxy/ProxyCounters.cs                     # new: CoalescedHitCount, CoalescedMissCount, CoalescedResponseToDeadUpstream
src/Mbproxy/Options/ResilienceOptions.cs               # new: ReadCoalescing sub-options
src/Mbproxy/Admin/StatusDto.cs                         # PlcBackendStatus gains coalescing fields
src/Mbproxy/Admin/StatusSnapshotBuilder.cs             # populate new fields
src/Mbproxy/Admin/StatusHtmlRenderer.cs                # show coalescing ratio in per-PLC row

docs/design.md                                         # Rewriter section: note FC03/04 may be coalesced before reaching backend
docs/kpi.md                                            # graduate "coalescing ratio" KPI from future to supported
install/mbproxy.config.template.json                   # add the new Resilience.ReadCoalescing section with comments

InFlightRequest.cs does not change — the InterestedParties list shape was specifically introduced in Phase 9 to make this phase additive.

Tasks

10.1 Data types

  1. CoalescingKey: readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty). Hash key for the in-flight-by-key map. Auto-generated record-struct equality. Verify hashcode distribution is reasonable for typical V-memory address ranges (smoke-test in unit tests).

  2. InFlightByKeyMap — wraps ConcurrentDictionary<CoalescingKey, InFlightRequest> plus a small lock for atomic attach-or-create. Methods:

    • bool TryAttachOrCreate(CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory, int maxParties, out InFlightRequest req, out bool wasNew) — atomic: if the key exists and req.InterestedParties.Count < maxParties, append the party to a freshly-built IReadOnlyList<InterestedParty> (since the record is immutable, we substitute a new InFlightRequest with the extended list in the map) and return (wasNew=false); else call factory to build a new entry, store it, return (wasNew=true).
    • bool TryRemove(CoalescingKey key, out InFlightRequest req) — called by the backend reader after fan-out completes.
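    • The "attach to existing" path is the load-bearing concurrency primitive of this phase. The simpler implementation is a small lock around the attach branch. A lock-free implementation would use AddOrUpdate with an updateFactory that checks the count cap, but ConcurrentDictionary may invoke that delegate more than once and outside its internal locks, so the delegate must stay free of side effects. Pick the simpler one; document the choice in code.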
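
The lock-based variant described above can be sketched as follows. This is a minimal illustration, not the Phase 10 implementation: InterestedParty and InFlightRequest are hypothetical stand-ins for the Phase 9 types (their real members are not shown in this plan), and the sketch uses a plain Dictionary guarded by one lock rather than the ConcurrentDictionary wrapper named above, so that TryRemove cannot interleave with an attach.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-ins for the Phase 9 types; real shapes live in 09-txid-multiplexing.md.
public sealed record InterestedParty(string PipeName, ushort OriginalTxId);
public sealed record InFlightRequest(ushort ProxyTxId, IReadOnlyList<InterestedParty> InterestedParties);

public readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

public sealed class InFlightByKeyMap
{
    private readonly object _gate = new();
    private readonly Dictionary<CoalescingKey, InFlightRequest> _byKey = new();

    public int Count
    {
        get { lock (_gate) { return _byKey.Count; } }
    }

    public bool TryAttachOrCreate(
        CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory,
        int maxParties, out InFlightRequest req, out bool wasNew)
    {
        lock (_gate)
        {
            if (_byKey.TryGetValue(key, out var existing)
                && existing.InterestedParties.Count < maxParties)
            {
                // The record is immutable: build an extended list and swap in a new record.
                var extended = new List<InterestedParty>(existing.InterestedParties) { party };
                req = existing with { InterestedParties = extended };
                _byKey[key] = req;
                wasNew = false;
                return true;
            }

            // No entry, or the existing one is at the cap: start a fresh entry for this key.
            req = factory();
            _byKey[key] = req;
            wasNew = true;
            return true;
        }
    }

    public bool TryRemove(CoalescingKey key, out InFlightRequest req)
    {
        lock (_gate) { return _byKey.Remove(key, out req!); }
    }
}
```

Two things to keep in mind with this shape: the factory runs while the lock is held, so it should stay short, and any other structure still holding the original InFlightRequest reference will not see parties attached later, because attaching replaces the record rather than mutating it.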

10.2 Multiplexer integration

  1. Request path in PlcMultiplexer.OnFrame:

    bool coalesceCandidate = (fc is 0x03 or 0x04)
                          && resilienceOptions.CurrentValue.ReadCoalescing.Enabled;
    if (coalesceCandidate)
    {
        var key   = new CoalescingKey(unitId, fc, startAddr, qty);
        var party = new InterestedParty(upstreamPipe, originalTxId);
    
        InFlightRequest? req;
        bool wasNew;
        inFlightByKey.TryAttachOrCreate(
            key, party,
            factory: () => BuildAndRegisterNew(unitId, fc, startAddr, qty, party),
            maxParties: resilienceOptions.CurrentValue.ReadCoalescing.MaxParties,
            out req, out wasNew);
    
        if (!wasNew)
        {
            counters.IncrementCoalescedHit();
            return;   // do NOT send to backend — first request will get the response
        }
        counters.IncrementCoalescedMiss();
        // New entry: the factory already allocated proxyTxId, added it to the correlation map, and sent the frame.
        return;
    }
    
    // FC06/FC16 or coalescing disabled: existing Phase 9 path (allocate, register, send).
    

    The factory closure does the existing Phase 9 work (TxId allocate, correlation map add, MBAP rewrite, send to outbound channel). The new code only adds the "is this already in-flight?" check before that work.

  2. Response fan-out in the backend reader task — already shaped correctly by Phase 9; this phase just makes sure the CoalescingKey matching the response is also removed from InFlightByKeyMap alongside the CorrelationMap removal:

    if (correlationMap.TryRemove(proxyTxId, out var req))
    {
        txIdAllocator.Release(proxyTxId);
    
        // Also clear the coalescing key so a new identical request after this point starts fresh.
        var key = new CoalescingKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
        inFlightByKey.TryRemove(key, out _);
    
        // Phase 9's fan-out loop — already iterates InterestedParties.
        foreach (var party in req.InterestedParties)
        {
            if (!party.Pipe.IsAlive)
            {
                counters.IncrementCoalescedResponseToDeadUpstream();
                continue;
            }
            var partyFrame = WithTxId(responseFrame, party.OriginalTxId);
            party.Pipe.SendResponse(partyFrame);
        }
    }
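
The two frame-level helpers that the request and response paths rely on can be sketched as below. The byte offsets follow the standard Modbus TCP layout (7-byte MBAP header, then the PDU); the class name MbapFrames, the method TryGetCoalescingKey, and the exact WithTxId signature are assumptions made for illustration and may differ from the Phase 9 code.

```csharp
using System;
using System.Buffers.Binary;

public readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

public static class MbapFrames
{
    // MBAP header: TxId(2) ProtocolId(2) Length(2) UnitId(1); read PDU: Fc(1) Start(2) Qty(2).
    private const int ReadRequestLength = 12;

    // Extracts the coalescing key from an FC03/FC04 request frame; returns false for anything else.
    public static bool TryGetCoalescingKey(ReadOnlySpan<byte> frame, out CoalescingKey key)
    {
        key = default;
        if (frame.Length != ReadRequestLength) return false;

        byte fc = frame[7];
        if (fc != 0x03 && fc != 0x04) return false;

        key = new CoalescingKey(
            UnitId: frame[6],
            Fc: fc,
            StartAddress: BinaryPrimitives.ReadUInt16BigEndian(frame.Slice(8, 2)),
            Qty: BinaryPrimitives.ReadUInt16BigEndian(frame.Slice(10, 2)));
        return true;
    }

    // Returns a copy of the response with the first two bytes set to the client's original TxId.
    // A fresh copy per party keeps one client's TxId from leaking into another client's frame.
    public static byte[] WithTxId(ReadOnlySpan<byte> responseFrame, ushort originalTxId)
    {
        if (responseFrame.Length < 2)
            throw new ArgumentException("Frame is too short to carry a TxId.", nameof(responseFrame));

        byte[] copy = responseFrame.ToArray();
        BinaryPrimitives.WriteUInt16BigEndian(copy.AsSpan(0, 2), originalTxId);
        return copy;
    }
}
```

Only the TxId bytes differ between the frames sent to each attached party; the payload bytes are copied untouched.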
    

10.3 Configuration

  1. Extend ResilienceOptions:

    public sealed class ReadCoalescingOptions
    {
        public bool Enabled    { get; init; } = true;
        public int  MaxParties { get; init; } = 32;
    }
    
    public sealed class ResilienceOptions
    {
        public RetryProfile          BackendConnect    { get; init; } = new();
        public RecoveryProfile       ListenerRecovery  { get; init; } = new();
        public ReadCoalescingOptions ReadCoalescing    { get; init; } = new();   // ← new
    }
    

    Hot-reloadable via the existing IOptionsMonitor<MbproxyOptions> wiring. Disabling Enabled at runtime means new requests take the non-coalescing path; existing in-flight coalesced entries drain naturally.

  2. mbproxy.config.template.json update — add a commented ReadCoalescing block to the install template under Resilience with the two new keys, default values, and a one-paragraph explanation.
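
As a rough guide, the template block could look like the fragment below. The nesting follows the Mbproxy.Resilience.ReadCoalescing path used elsewhere in this plan; the comment wording is illustrative, and the sibling keys already present in the real template are omitted.

```json
{
  "Mbproxy": {
    "Resilience": {
      // In-flight read coalescing (Phase 10). Identical FC03/FC04 reads that arrive
      // while a matching request is already in flight share one backend response.
      // Nothing is held after the response arrives.
      "ReadCoalescing": {
        "Enabled": true,     // false restores per-request routing; hot-reloadable
        "MaxParties": 32     // cap on clients attached to a single in-flight read
      }
    }
  }
}
```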

10.4 Counters and status surfacing

  1. ProxyCounters additions:

    public void IncrementCoalescedHit();
    public void IncrementCoalescedMiss();
    public void IncrementCoalescedResponseToDeadUpstream();
    

    CounterSnapshot gains CoalescedHitCount, CoalescedMissCount, CoalescedResponseToDeadUpstream — all long, all Interlocked. The status page derives coalescingRatio = Hit / (Hit + Miss) for display; the raw counts are exposed in JSON for downstream tooling.

  2. /status.json per-PLC fields — extend PlcBackendStatus:

    public sealed record PlcBackendStatus(
        long ConnectsSuccess, long ConnectsFailed,
        ExceptionCounts ExceptionsByCode,
        double LastRoundTripMs,
        long CoalescedHitCount,                          // ← new
        long CoalescedMissCount,                         // ← new
        long CoalescedResponseToDeadUpstream);           // ← new
    
  3. HTML page — extend the per-PLC row with a compact Coal: 73% cell (hit / (hit+miss) * 100, rounded). Page-weight assertion (under 50 KB for 54 PLCs) must continue to pass.
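
The counter and ratio arithmetic above can be sketched as follows. This is an illustration under assumptions: the class name CoalescingCounters, the Snapshot tuple, and the "n/a" placeholder for a PLC with no reads yet are hypothetical, since the plan does not say how a zero denominator is displayed.

```csharp
using System;
using System.Threading;

public sealed class CoalescingCounters
{
    private long _hit, _miss, _deadUpstream;

    public void IncrementCoalescedHit() => Interlocked.Increment(ref _hit);
    public void IncrementCoalescedMiss() => Interlocked.Increment(ref _miss);
    public void IncrementCoalescedResponseToDeadUpstream() => Interlocked.Increment(ref _deadUpstream);

    // Each value is read atomically; the three reads together are not one atomic snapshot.
    public (long Hit, long Miss, long DeadUpstream) Snapshot() =>
        (Interlocked.Read(ref _hit), Interlocked.Read(ref _miss), Interlocked.Read(ref _deadUpstream));

    // coalescingRatio = Hit / (Hit + Miss); null when no reads have been counted yet.
    public static double? CoalescingRatio(long hit, long miss)
    {
        long total = hit + miss;
        return total == 0 ? (double?)null : (double)hit / total;
    }

    // Compact per-PLC cell text, e.g. "Coal: 73%". The "n/a" placeholder is an assumption.
    public static string FormatCell(long hit, long miss)
    {
        double? ratio = CoalescingRatio(hit, miss);
        return ratio is null ? "Coal: n/a" : $"Coal: {Math.Round(ratio.Value * 100)}%";
    }
}
```

For the headline case of five identical reads that produce two backend round-trips, the counters end at 3 hits and 2 misses, so the cell reads "Coal: 60%".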

10.5 Documentation

  1. docs/design.md Rewriter section: add a paragraph clarifying that FC03/FC04 requests may be coalesced with other in-flight requests of the same (unitId, fc, start, qty) before reaching the backend. Emphasize that the transparency contract holds — each client sees its own original TxId restored on the response, and the response value is identical to what an uncoalesced request would have returned (within the PLC's scan-time precision).

  2. docs/kpi.md Tier 1: the new coalescedHitCount, coalescedMissCount, derived coalescingRatio graduate from "future" to "supported" Tier 1 fields. Mention the coalescedResponseToDeadUpstream counter as a low-priority Tier 2 informational metric.

Public surface declared in this phase

namespace Mbproxy.Proxy.Multiplexing;

internal readonly record struct CoalescingKey(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal sealed class InFlightByKeyMap
{
    public bool TryAttachOrCreate(
        CoalescingKey key,
        InterestedParty party,
        Func<InFlightRequest> factory,
        int maxParties,
        out InFlightRequest req,
        out bool wasNew);
    public bool TryRemove(CoalescingKey key, out InFlightRequest req);
    public int Count { get; }
}

namespace Mbproxy.Options;

public sealed class ReadCoalescingOptions
{
    public bool Enabled    { get; init; } = true;
    public int  MaxParties { get; init; } = 32;
}
// Added field on existing ResilienceOptions:
public ReadCoalescingOptions ReadCoalescing { get; init; } = new();

ProxyCounters and CounterSnapshot gain three new long fields. No public-surface removals, no renames.

Tests required

Unit (Category = Unit)

CoalescingKeyTests (≥ 4 tests):

  1. Equality_OnIdenticalKeys_ReturnsTrue
  2. Equality_OnDifferentFc_ReturnsFalse — FC03 vs FC04 with same start/qty/unit are NOT equal (different Modbus tables).
  3. Equality_OnDifferentUnitId_ReturnsFalse
  4. HashCode_DistributionSanity — build 10,000 randomly-generated keys, bucket by Key.GetHashCode() & 0xFF, assert no bucket has > 5 % of total (rough uniformity check).
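
The distribution check in test 4 could be computed along these lines. The sketch is framework-free so it stands alone; in the real suite the same loop would sit inside a test method. The helper name HashSanity.MaxBucketShare, the fixed seed, and the value ranges (unit IDs 1-247, quantities 1-125) are assumptions chosen to resemble valid Modbus read requests.

```csharp
using System;

public readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

public static class HashSanity
{
    // Builds `samples` pseudo-random keys, buckets them by the low 8 bits of GetHashCode(),
    // and returns the share of the fullest bucket (0.05 means 5 % of all keys).
    public static double MaxBucketShare(int samples, int seed)
    {
        var rng = new Random(seed);          // fixed seed keeps the check repeatable
        var buckets = new int[256];

        for (int i = 0; i < samples; i++)
        {
            var key = new CoalescingKey(
                UnitId: (byte)rng.Next(1, 248),
                Fc: (byte)(rng.Next(2) == 0 ? 0x03 : 0x04),
                StartAddress: (ushort)rng.Next(0, 65536),
                Qty: (ushort)rng.Next(1, 126));
            buckets[key.GetHashCode() & 0xFF]++;
        }

        int max = 0;
        foreach (int n in buckets) max = Math.Max(max, n);
        return (double)max / samples;
    }
}
```

The test would then assert that HashSanity.MaxBucketShare(10_000, seed) stays below 0.05. With 256 buckets an even spread puts roughly 0.4 % of keys in each, so the 5 % threshold only catches gross clustering.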

InFlightByKeyMapTests (≥ 6 tests):

  1. TryAttachOrCreate_NewKey_CallsFactory_ReturnsTrue_WasNewTrue
  2. TryAttachOrCreate_ExistingKey_AppendsParty_ReturnsTrue_WasNewFalse
  3. TryAttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend — refuses to fan out beyond the cap; preserves backend-load-shedding guarantee.
  4. TryRemove_AfterAttach_AllPartiesPresent_InRetrievedEntry
  5. TryRemove_OfMissing_ReturnsFalse
  6. Concurrent_AttachOrCreate_From_Two_Threads_NoLostParties_AndNoDuplicateEntries — 100 tasks × 1000 ops each.

ReadCoalescingTests (≥ 7 tests, real sockets, stub backend):

  1. TwoClients_SameRequest_OnlyOneBackendRoundTrip — stub backend counts received requests; assert 1.
  2. TwoClients_DifferentRequests_BothHitBackend — different start addresses; assert 2.
  3. FiveClients_SameRequest_OneBackendRoundTrip_FiveResponses — fan-out works correctly with 5 attached parties.
  4. FC03_And_FC04_SameAddress_NOT_Coalesced — different tables.
  5. FC06_Write_NeverCoalesced — writes always allocate their own TxId.
  6. OneClient_DisconnectsMidFlight_OthersStillGetResponse_AndDeadUpstreamCounterIncrements
  7. AtMaxParties_NextRequest_StartsFreshBackendRoundTrip — verify the cap behaviour: when MaxParties = 2 and 3 simultaneous clients send the same request, the third opens a new in-flight entry rather than joining the first.

E2E (Category = E2E)

ReadCoalescingE2ETests (≥ 5 tests, against pymodbus simulator, [Collection(nameof(DL205SimulatorCollection))]):

  1. E2E_FiveConcurrentClients_SameReadHR1072_CoalescedHitCount_AtLeast_3 — five NModbus clients connect to the proxy, simultaneously read HR1072 (BCD-configured). Assert coalescedHitCount >= 3 (race wiggle room — perfect coalescing would give 4 hits, but the racy first-arrivals can both miss).
  2. E2E_RewriterStillWorks_ForAllCoalescedParties — same setup, but with BCD tag at 1072. All five clients receive decoded 1234. Proves the rewriter sees a coalesced response correctly and the TxId restoration doesn't perturb the BCD bytes.
  3. E2E_DifferentRegisters_NotCoalesced_CoalescedHitCount_Zero — five clients reading five different addresses; assert no coalescing happened.
  4. E2E_StatusPage_Shows_CoalescingRatio: /status.json for the test PLC has populated coalescedHitCount and coalescedMissCount after the burst.
  5. E2E_DisableViaHotReload_RevertToPhase9Behaviour — write a temp appsettings with ReadCoalescing.Enabled = false, hot-reload, verify subsequent identical reads each hit the backend separately (counter doesn't increment).

Phase gate

  • dotnet build Mbproxy.slnx -c Debug — zero warnings, zero errors.
  • All prior tests still green — specifically the 4 critical Phase-9 regression guards:
    • Forward_FC03_HR1072_Returns_Decoded_1234
    • Forward_FC06_WriteHR200_ThenReadBack_RoundTrips
    • Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips
    • MbapTxId_IsPreservedEndToEnd
  • All new unit + e2e tests pass (≥ 22 new: 17 unit + 5 E2E).
  • Headline assertion: 5 concurrent FC03 reads of the same register through the proxy produce at most 2 backend round-trips (allowing one race for the initial pair). Verifiable via stub-backend's request counter in ReadCoalescingTests.
  • FC04 reads of the same address as a coexisting FC03 stream do NOT coalesce together. Verified by an explicit test.
  • FC06 / FC16 writes are NEVER on the coalescing path. Verified by setting MaxParties = 1 and confirming write throughput is unaffected.
  • Coalescing-ratio counter ≥ 50 % under the headline stress test (5 simultaneous identical reads).
  • Disabling coalescing via Mbproxy.Resilience.ReadCoalescing.Enabled = false hot-reloads cleanly; running coalesced entries drain naturally without errors.
  • docs/design.md Rewriter section mentions the coalescing path; docs/kpi.md Tier 1 includes the new fields; install/mbproxy.config.template.json includes the new commented Resilience.ReadCoalescing block.
  • HTML page weight under 50 KB for 54 PLCs (verify with the existing renderer test).

Out of scope

  • Post-response caching — no TTL, no staleness window beyond "while the request is in flight." This phase is strictly in-flight. A response-cache phase would be a separate plan (Phase 11+) and would require the design.md "not a cache layer" stance to be revisited and rewritten.
  • Range-overlap coalescing — request A reading [100..110], request B reading [105..115]. Different keys; no coalescing. Range-overlap detection is a separate optimisation with its own algorithmic complexity (interval trees, etc.) and its own staleness questions (neither response covers the other's range: A's lacks registers 111..115 and B's lacks 100..104, so serving both would need a wider merged read sliced per client).
  • Cross-PLC coalescing — each PLC's multiplexer has its own key map. No optimization across PLCs (their backend connections are independent anyway).
  • Write coalescing / batching — different problem with non-idempotency concerns. The design doc's "no mid-request retry on writes" principle extends to "no write coalescing."
  • Predictive batching — combining a single client's likely-next read into the current request. Out of scope; speculative reads are a different optimization category.
  • Adaptive MaxParties — staying at the configured value. Auto-tuning is interesting but speculative.

Subagent briefing

If you're the agent picking up this phase:

  1. Phase 9's InterestedParties list is the seam. This phase only adds the "look up the key, attach a new party to an existing entry" logic. The fan-out side already iterates the list correctly. If you find yourself rewriting Phase 9's response path, you've drifted out of scope.

  2. CoalescingKey includes UnitId. DL260 fleets typically use unit 1, but we don't assume — different unit IDs are different PLC personalities behind the same TCP socket and must not coalesce.

  3. FC03 and FC04 are different tables. Same register address space in DL series, but Modbus treats them separately. Different CoalescingKey for the same address; no coalescing across them.

  4. Coalescing is best-effort under races. Two simultaneous identical requests can both miss the map and create separate entries — counter just shows a lower ratio. Not a bug; documented behaviour. Do not over-engineer with double-checked locking.

  5. MaxParties is the load-shedding safety valve. If a thousand HMI panels all attach to one in-flight request, the response fan-out cost goes linear with attachment count and stalls the backend reader task. Cap at 32 by default. Past the cap, route through a fresh entry — fan-out cost per entry is bounded.

  6. The attach-or-create operation MUST be atomic per key. Two simultaneous arrivals must not both create new entries for the same key (would defeat coalescing). The simpler implementation: lock(map.SyncRoot) around the attach branch. The lock-free implementation uses AddOrUpdate with the updateFactory checking the count cap. Pick whichever you can write correctly in 30 minutes; document the choice.

  7. Response fan-out must check Pipe.IsAlive per party. An upstream client that disconnects between attaching and the response arriving — count it as CoalescedResponseToDeadUpstream and continue with the others. Do not throw, do not log per-occurrence at Information (would be too noisy under client churn).

  8. Hot-reload of Enabled doesn't disrupt in-flight entries. Disabling the feature mid-flight just means subsequent requests take the non-coalescing path. Existing coalesced entries drain when their response arrives. Don't try to "flush" them on the reload event.

  9. CoalescedHit + CoalescedMiss = total FC03+FC04 requests. The math has to balance per snapshot. Use Interlocked.Increment exclusively. Disabling coalescing means every FC03/04 request becomes a Miss (which is fine — the metric still tracks total reads).

  10. Update design.md AND kpi.md AND the install template in the same PR as the code. Doc drift is a gate failure. The coalescing-ratio KPI specifically graduates from "future" to "Tier 1 supported" — make that promotion explicit in kpi.md.

Cross-references

  • Phase 9's multiplexer is the foundation. The InterestedParty and InterestedParties types live there: 09-txid-multiplexing.md.
  • KPI graduation target: ../kpi.md → Tier 1 (rates / percentiles / availability — coalescing-ratio joins this tier).
  • Modbus unit-ID semantics that make coalescing-key uniqueness load-bearing: ../../DL260/dl205.md → "Function Code Support" and "Coils and Discrete Inputs".
  • Counter snapshot backwards-compat policy that this phase respects (additive only): ../kpi.md → "Backwards-compat policy".