wwtools/mbproxy/docs/Architecture/ResponseCache.md
Joseph Doherty f49e27e316 mbproxy/docs: split deep docs into focused PascalCase files per StyleGuide
Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.

Architecture/
  - Overview.md         (150 lines) — listener topology, request flow, per-PLC isolation
  - ConnectionModel.md  (247 lines) — TxId multiplexer, watchdog, disconnect cascade
  - ReadCoalescing.md   (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
  - ResponseCache.md    (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation

Features/
  - BcdRewriting.md     (252 lines) — codec, CDAB, FC scope, partial-overlap policy
  - HotReload.md        (189 lines) — IOptionsMonitor + per-change-kind reconcile rules

Operations/
  - Configuration.md    (422 lines) — every Mbproxy:* option + validation rules
  - StatusPage.md       (334 lines) — admin endpoint surface, every JSON field
  - Troubleshooting.md  (364 lines) — diagnosis playbook keyed to log events

Reference/
  - LogEvents.md        (499 lines) — 28 events across 7 categories, grep-verified

Testing/
  - Simulator.md        (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk

Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.

Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:44:34 -04:00

Response Cache

The response cache is an opt-in per-tag, bounded-staleness layer that serves FC03 and FC04 reads from in-process memory. It sits above read coalescing in the request path so a hit avoids both the coalescing entry and the backend round-trip entirely.

Cache Contract

The cache is off by default for every tag. With no TTL configured at either the tag or the PLC level, every BCD tag resolves to CacheTtlMs = 0, and a deployment that ships without any TTL configuration behaves identically to one compiled without the cache at all: no in-memory entries are created, every FC03/FC04 read falls through to the coalescing-then-backend path, and counters that track cache activity stay at zero.

Operators opt a tag in by giving it a positive CacheTtlMs, either on the tag itself or through the per-PLC default described under TTL Resolution Order. That positive value is the explicit acknowledgement of the staleness window: the operator is stating, "I am willing for upstream clients to see a value up to N milliseconds old in exchange for taking the read off the backend." There is no implicit cache enablement, and no global cache toggle turns caching on for previously-uncached tags. Every cached tag is one whose resolved TTL is positive in configuration.

This stance is the design-contract pivot the cache introduces: before it, the proxy is purely transparent except for BCD rewriting. With the cache, the proxy is transparent by default, with an opt-in cache layer the operator can engage tag-by-tag.

TTL Resolution Order

Each FC03/FC04 read range resolves to one effective TTL through three tiers:

  1. Explicit per-tag. BcdTagOptions.CacheTtlMs on the tag entry. A non-null value wins regardless of the per-PLC default. An explicit 0 here disables caching for that tag even when the PLC default is positive.
  2. Per-PLC default. PlcOptions.DefaultCacheTtlMs applies to any tag whose explicit CacheTtlMs is null (unset). A 0 default means "no caching by default at this PLC."
  3. Zero. With nothing set at either tier, the resolved TTL is 0 and the read is uncached.

BcdTagMap.ResolveCacheTtlMs(startAddress, qty) implements the per-read resolution. It enumerates the BCD tags whose register footprints intersect the requested range and returns the smallest TTL across the hits. It returns 0 if any hit is uncached (TTL of 0) or if the range covers no configured tags.

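public int ResolveCacheTtlMs(ushort startAddress, ushort qty)
{
    // No configured BCD tag intersects the range: the read is uncached.
    if (!TryGetForRange(startAddress, qty, out var hits) || hits.Count == 0)
        return 0;

    int min = int.MaxValue;
    foreach (var hit in hits)
    {
        int ttl = hit.Tag.CacheTtlMs;
        // Any uncached tag in the range disables caching for the whole read.
        if (ttl <= 0) return 0;
        if (ttl < min) min = ttl;
    }
    // Defensive fallback for the case where no hit lowered min.
    return min == int.MaxValue ? 0 : min;
}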

The hit.Tag.CacheTtlMs value resolved on each BcdTag already reflects the explicit-then-default order — the options binder resolves the per-tag override against the per-PLC default at config build time, so the runtime hot path sees a single integer per tag.
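The explicit-then-default order can be sketched as a single function. This is a minimal illustration, not the options binder itself: the TtlResolutionSketch.Resolve helper, its parameter names, the nullable type of the per-PLC default, and the clamping of negative values to 0 are all assumptions made for the example. Only CacheTtlMs and DefaultCacheTtlMs come from the configuration model described above.

```csharp
using System;

// Hypothetical helper mirroring the three-tier order:
// explicit per-tag value, then per-PLC default, then zero.
public static class TtlResolutionSketch
{
    public static int Resolve(int? explicitCacheTtlMs, int? plcDefaultCacheTtlMs)
    {
        // Tier 1: an explicit per-tag value wins, including an explicit 0.
        if (explicitCacheTtlMs.HasValue)
            return Math.Max(0, explicitCacheTtlMs.Value); // clamping is an assumption

        // Tier 2: fall back to the per-PLC default when the tag is unset.
        if (plcDefaultCacheTtlMs.HasValue)
            return Math.Max(0, plcDefaultCacheTtlMs.Value);

        // Tier 3: nothing set at either tier, so the tag is uncached.
        return 0;
    }
}
```

Under these assumptions, Resolve(0, 500) yields 0 (the explicit 0 disables caching despite the positive default), while Resolve(null, 500) yields 500.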

Multi-Tag Range TTL Rule

When a single FC03/FC04 read covers multiple configured BCD tags, the effective TTL is the minimum across them:

range covers tags { A:TTL=500, B:TTL=2000, C:TTL=100 } → effective TTL = 100
range covers tags { A:TTL=500, B:TTL=0 (uncached)    } → effective TTL = 0
range covers tags { A:TTL=500 }                        → effective TTL = 500
range covers no configured tags                        → effective TTL = 0

If any covered tag has CacheTtlMs = 0, the whole read is uncached. The rationale is conservative-by-design: a multi-tag read whose narrowest TTL is, for example, 100 ms cannot be served safely from an entry that was stored under a tag with TTL 2 s, because that entry's freshness was only guaranteed by the longer window. Rather than partition a range read across heterogeneous TTLs or invent inheritance rules that an operator would have to reason about per-deployment, the cache refuses to serve any multi-tag read whose narrowest covered TTL is zero. Operators who want a tag cached in isolation but uncached when read alongside an uncached neighbour get the expected behaviour by leaving the neighbour at CacheTtlMs = 0.

A read whose range covers no configured BCD tags also resolves to 0. There is nothing to be conservative about because the cache only serves ranges that contain rewriter-tracked tags — a read of plain non-BCD registers does not engage the cache regardless of any per-PLC default.

Lookup Order

The multiplexer's FC03/FC04 path consults three tiers in fixed order:

  1. Cache. When _ctx.Cache is wired and BcdTagMap.ResolveCacheTtlMs returns a positive TTL for the read range, ResponseCache.TryGet is called against a CacheKey(unitId, fc, startAddress, qty). A hit splices the cached payload onto a fresh MBAP header carrying the original upstream TxId, pushes the frame onto that pipe's response channel, and returns without engaging coalescing or the backend at all.
  2. Coalesce. On a cache miss (or when the resolved TTL is zero), the request is offered to InFlightByKeyMap.TryAttachOrCreate. A hit attaches the new party to a peer's in-flight request.
  3. Backend. On a coalescing miss, the request opens a proxy TxId, registers a CorrelationMap entry, runs the BCD rewriter on any FC06 or FC16 payload, and queues the frame onto the outbound channel.

The cache check happens before the multiplexer's EnsureBackendConnectedAsync call. A cache hit serves the upstream even when the backend socket is currently disconnected or recovering. This is not an accident — the cached payload's freshness is bounded by its TTL, not by the liveness of the backend socket. See ../Operations/Troubleshooting.md for the operator view of cache-served reads during a backend outage.
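The fixed ordering of the three tiers can be sketched with the cache and coalescing probes passed in as delegates. Everything named here (ServedBy, LookupOrderSketch.Route, the delegate parameters) is hypothetical and stands in for the multiplexer's real code path, which is not reproduced in this document.

```csharp
using System;

// Hypothetical labels for which tier ended up serving the read.
public enum ServedBy { Cache, Coalesced, Backend }

public static class LookupOrderSketch
{
    // tryCache stands in for ResponseCache.TryGet.
    // tryAttachToInFlight stands in for the "attached to a peer" outcome
    // of InFlightByKeyMap.TryAttachOrCreate.
    public static ServedBy Route(
        int resolvedTtlMs,
        Func<bool> tryCache,
        Func<bool> tryAttachToInFlight)
    {
        // Tier 1: the cache is consulted only when the resolved TTL is positive.
        if (resolvedTtlMs > 0 && tryCache())
            return ServedBy.Cache;

        // Tier 2: on a miss (or TTL of zero), try to join an in-flight peer.
        if (tryAttachToInFlight())
            return ServedBy.Coalesced;

        // Tier 3: nothing to share, so the request goes to the backend.
        return ServedBy.Backend;
    }
}
```

Note that with a resolved TTL of 0 the cache probe is never invoked, which matches the rule that uncached reads bypass the cache tier entirely.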

Storage Format: Post-Rewriter Bytes

CacheEntry.PduBytes holds the post-rewriter response PDU body — the function code byte, the byte count, and the rewriter-decoded register data, with no MBAP header. The backend reader task decodes the response through BcdPduPipeline first and only then hands the rewritten payload to ResponseCache.Set.

internal sealed record CacheEntry(
    byte[] PduBytes,
    DateTimeOffset CachedAtUtc,
    DateTimeOffset ExpiresAtUtc,
    int Length,
    long LastUsedTick);

Storing post-rewriter bytes is both a CPU optimisation and a correctness guarantee:

  • CPU. A cache hit returns ready-to-send bytes. The rewriter does not re-run per hit; only the MBAP header is regenerated to carry the upstream's original TxId.
  • Correctness. An entry decoded against an earlier rewriter version never gets retroactively re-transformed against a newer version. If the rewriter's behaviour changes mid-process (it does not today, but the guarantee is durable across future changes), entries already in the cache age out under their TTL and are replaced by fresh entries decoded through the new rewriter. A stored entry is never re-encoded or re-decoded in place.
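The header regeneration on a hit can be illustrated with standard Modbus TCP framing: a 7-byte MBAP header (transaction id, protocol id 0, length, unit id) followed by the PDU. The sketch below is an assumption about how such a splice could look; MbapFramingSketch.BuildFrame is a hypothetical name, not the proxy's actual helper.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: wraps a cached PDU in a fresh MBAP header.
public static class MbapFramingSketch
{
    public const int MbapHeaderLength = 7;

    public static byte[] BuildFrame(ushort upstreamTxId, byte unitId, ReadOnlySpan<byte> pdu)
    {
        var frame = new byte[MbapHeaderLength + pdu.Length];

        // Bytes 0-1: transaction id, taken from the upstream request.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), upstreamTxId);
        // Bytes 2-3: protocol id, always 0 for Modbus.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);
        // Bytes 4-5: length of everything after this field (unit id + PDU).
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(pdu.Length + 1));
        // Byte 6: unit id.
        frame[6] = unitId;

        // The cached post-rewriter PDU is copied through unchanged.
        pdu.CopyTo(frame.AsSpan(MbapHeaderLength));
        return frame;
    }
}
```

Because only the seven header bytes vary per request, the same stored PduBytes can serve any number of upstream clients, each with its own TxId.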

Write Invalidation by Address Range Overlap

A successful (non-exception) FC06 or FC16 response invalidates every cached FC03 or FC04 entry whose address range [StartAddress, StartAddress + Qty) overlaps the write range [writeStart, writeStart + writeQty). The pure overlap math lives in CacheInvalidator.FindOverlapping:

int writeEnd = writeStart + writeQty;   // half-open upper bound

foreach (var key in haystack)
{
    if (key.UnitId != unitId) continue;
    if (key.Fc != 0x03 && key.Fc != 0x04) continue;

    int keyEnd = key.StartAddress + key.Qty;
    // Overlap iff writeStart < keyEnd AND key.StartAddress < writeEnd.
    if (writeStart < keyEnd && key.StartAddress < writeEnd)
        hits.Add(key);
}

Worked examples on a single unit ID:

Write to register 105 (qty=1)
  └─ invalidates cached FC03 [100..110) — register 105 is inside the cached range
  └─ leaves    cached FC03 [200..210) untouched

Write to registers [10..15) (qty=5)
  └─ leaves    cached FC03 [15..20) untouched — half-open intervals, 15 is not in [10..15)

Write to registers [98..108) (qty=10)
  └─ invalidates cached FC03 [100..110) — ranges overlap on [100..108)

Three properties of the invalidator deserve calling out:

  • Exception responses do not invalidate. A Modbus exception (code 01, 02, 03, 04, or any other) means the write did not take effect on the PLC. The cached read is still consistent with the device, so the invalidator is not engaged.
  • Different unit IDs never invalidate each other. Multi-drop devices and gateway personalities behind a shared socket expose logically separate Modbus tables, and CacheKey.UnitId discriminates between them.
  • Only FC03 and FC04 entries are evicted. The cache never stores write responses, so the invalidator's function-code filter is defensive rather than load-bearing.
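The worked examples above can be checked against a self-contained version of the overlap test. This is a sketch under stated assumptions: the CacheKey record shape is inferred from the property names used in the excerpt, and OverlapSketch.FindOverlapping with its parameter list is hypothetical rather than the real CacheInvalidator signature.

```csharp
using System.Collections.Generic;

// Assumed shape, inferred from the property names in the excerpt above.
public sealed record CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

public static class OverlapSketch
{
    public static List<CacheKey> FindOverlapping(
        IEnumerable<CacheKey> haystack, byte unitId, ushort writeStart, ushort writeQty)
    {
        var hits = new List<CacheKey>();
        int writeEnd = writeStart + writeQty; // half-open upper bound

        foreach (var key in haystack)
        {
            if (key.UnitId != unitId) continue;               // other unit IDs are untouched
            if (key.Fc != 0x03 && key.Fc != 0x04) continue;   // only read entries are evicted

            int keyEnd = key.StartAddress + key.Qty;          // half-open upper bound
            // Two half-open intervals overlap iff each starts before the other ends.
            if (writeStart < keyEnd && key.StartAddress < writeEnd)
                hits.Add(key);
        }
        return hits;
    }
}
```

With cached keys for [100..110), [200..210), and [15..20) on unit 1, a write to register 105 selects only the first key, a write to [10..15) selects none, and a write to [98..108) again selects only [100..110).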

Bounded Capacity (LRU)

Each ResponseCache instance is capped at Cache.MaxEntriesPerPlc (default 1000). When the dictionary is at the cap and a fresh insert arrives, EvictLeastRecentlyUsed walks the entries and removes the one with the smallest CacheEntry.LastUsedTick. The linear scan is intentional — at 1000 entries the scan is cheaper than the network round-trip the cache is saving, and a sorted secondary structure would add complexity for no measurable win.

LastUsedTick is a monotonic 64-bit counter incremented on every hit and every fresh insert. Using the counter rather than DateTimeOffset.UtcNow keeps the hot path free of clock calls and survives wall-clock skew.

A background task drives proactive expiry. The constructor starts a PeriodicTimer at Cache.EvictionIntervalMs (default 5000 ms; values under 100 ms are clamped at 100 ms to prevent tight loops) and the eviction loop sweeps every entry whose ExpiresAtUtc has passed. The loop is the safety net that keeps abandoned entries — say, those for a PLC whose upstream clients have all dropped — from holding memory until process exit. Lazy expiry on TryGet still removes entries on demand when traffic is steady; the background loop only matters under low- or zero-traffic conditions.
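The capacity rule (linear scan for the smallest tick, with the tick bumped on every hit and every insert) can be sketched as follows. This is a single-threaded illustration only: LruSketch, its string keys, and its plain Dictionary are assumptions for the example, and TTL expiry and the background sweep are deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, single-threaded illustration of tick-based LRU eviction.
public sealed class LruSketch
{
    private readonly Dictionary<string, (byte[] Payload, long LastUsedTick)> _entries = new();
    private readonly int _maxEntries;
    private long _tick; // monotonic counter, no clock calls on the hot path

    public LruSketch(int maxEntries) => _maxEntries = maxEntries;

    public int Count => _entries.Count;

    public bool TryGet(string key, out byte[] payload)
    {
        if (_entries.TryGetValue(key, out var entry))
        {
            _entries[key] = (entry.Payload, ++_tick); // a hit refreshes recency
            payload = entry.Payload;
            return true;
        }
        payload = Array.Empty<byte>();
        return false;
    }

    public void Set(string key, byte[] payload)
    {
        // Evict only when a genuinely new key arrives at the cap.
        if (!_entries.ContainsKey(key) && _entries.Count >= _maxEntries)
            EvictLeastRecentlyUsed();
        _entries[key] = (payload, ++_tick);
    }

    private void EvictLeastRecentlyUsed()
    {
        string victim = "";
        bool found = false;
        long oldest = long.MaxValue;

        // Linear scan for the smallest tick.
        foreach (var pair in _entries)
        {
            if (pair.Value.LastUsedTick < oldest)
            {
                oldest = pair.Value.LastUsedTick;
                victim = pair.Key;
                found = true;
            }
        }
        if (found) _entries.Remove(victim);
    }
}
```

With a cap of 2, inserting "a" and "b", reading "a", and then inserting "c" evicts "b", since the read moved "a" ahead of it in recency.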

Long-TTL Safety Gate

MbproxyOptionsValidator.ValidateCacheTtl rejects any explicit CacheTtlMs > 60_000 unless Cache.AllowLongTtl = true. The same gate applies to PlcOptions.DefaultCacheTtlMs. The rejection runs at config bind / hot-reload time, so a misconfigured appsettings.json fails fast before the cache sees the value.

The gate exists to catch the "left at 1 hour by accident" mistake — a deployment where a developer set CacheTtlMs = 3_600_000 for a debugging session and the value survived into production. Operators who legitimately need long TTLs (slow-moving setpoints, configuration values that change once per shift) flip Cache.AllowLongTtl to true as the explicit acknowledgement that the long staleness window is intentional.
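The gate's decision rule is small enough to sketch directly. LongTtlGateSketch.IsAllowed is a hypothetical stand-in for the check inside MbproxyOptionsValidator.ValidateCacheTtl, whose real signature and error reporting are not shown in this document; only the 60_000 ms threshold and the AllowLongTtl override come from the text above.

```csharp
// Hypothetical stand-in for the long-TTL check; not the real validator.
public static class LongTtlGateSketch
{
    public const int LongTtlThresholdMs = 60_000;

    // ttlMs is either a tag's explicit CacheTtlMs or a PLC's DefaultCacheTtlMs;
    // null models an unset value, which the gate has nothing to say about.
    public static bool IsAllowed(int? ttlMs, bool allowLongTtl)
    {
        if (ttlMs is null) return true;                        // unset: nothing to validate
        if (ttlMs.Value <= LongTtlThresholdMs) return true;    // within the default ceiling
        return allowLongTtl;                                   // long TTLs need explicit opt-in
    }
}
```

Under this rule a TTL of exactly 60_000 passes without the override, and 3_600_000 passes only when allowLongTtl is true.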

Cache and the Rewriter

The BCD rewriter runs once on the cache-miss path: the backend reader task decodes the response through BcdPduPipeline and only then hands the decoded bytes to ResponseCache.Set. Cache hits return the stored post-rewriter bytes directly.

This division has two consequences worth restating:

  • The rewriter cost is amortised across hits. A high cache hit ratio on a tag-dense PLC drops the per-request rewriter cost from "every response" to "every cache-miss response," which on a hot register at TTL=500 ms is one-in-many.
  • The cached payload is decoupled from the rewriter implementation. An entry stored under one rewriter does not get re-transformed if the rewriter changes. Entries age out under TTL and are replaced by fresh entries decoded under the current rewriter — there is no in-place recomputation pass.

Hot-Reload Semantics

Configuration changes propagate through IOptionsMonitor<MbproxyOptions>. The cache reacts to five kinds of change:

| Change | Cache behaviour |
| --- | --- |
| Tag's CacheTtlMs changed (0 → N, N → 0, N → M) | Entire PLC cache is flushed via ResponseCache.Clear(); entries re-populate on demand under the new TTL. |
| New PLC added / removed | New PLC starts with an empty cache; removed PLC's ResponseCache is disposed with the multiplexer. |
| Cache.AllowLongTtl flipped | Validation runs on the next reload only; existing entries are unaffected. |
| Cache.MaxEntriesPerPlc changed | Existing entries are unaffected; the new cap applies to subsequent inserts. |
| Cache.EvictionIntervalMs changed | Existing eviction loop continues with its old period; subsequent loops use the new interval. |

Per-tag flush granularity is intentionally not implemented. The clean move is "any tag-list change to a PLC → drop every entry for that PLC and let the natural traffic re-populate." Tracking which keys correspond to which tag IDs adds bookkeeping for no operational win — a tag-list reload is already a once-in-a-while event, and the rebuild cost on the affected PLC's hot keys is one round-trip per key under traffic.

See ../Features/HotReload.md for the broader IOptionsMonitor propagation model.

Cache Survives Backend Disconnects

A cached entry's data was valid when stored. A subsequent backend disconnect does not retroactively invalidate it — the value the upstream client sees on a hit is the value the PLC reported within the TTL window, irrespective of whether the backend socket is up at the moment of the hit. This is the cache's most operationally visible property during PLC outages: upstream consumers that read hot tags within the cache window continue to receive responses while the listener supervisor is in recovering state.

The companion rule on the write side keeps the invariant consistent: invalidations during a recovering listener state are skipped. If the backend is down, an FC06 or FC16 write did not reach the PLC, so the cached read is still consistent with the device's actual state. Skipping the invalidation matches reality — the write did not take effect, so the read is not stale.

No Persistence

The cache is purely in-memory. Process restart wipes every entry. There is no file-backed snapshot, no Redis or other external store, and no last-known-good replay. A restarted service rebuilds its cache from fresh backend round-trips driven by upstream traffic, exactly as it would after a TTL-induced flush.

This is intentional, for two reasons. First, the staleness contract is bounded by CacheTtlMs measured from when the data was first read, and a persisted entry would re-emerge with an unknown wall-clock age: every invariant the cache offers would need a freshness field, freshness arithmetic on load, and recovery against a clock that may have jumped. Second, the operational model is that the proxy is a stateless transformer; treating its cache as durable state would change the deployment story for no measurable production benefit.

Counter Accounting

ProxyCounters exposes five cache counters per PLC, and the status page derives a sixth value, cacheHitRatio, from two of them. All six are surfaced on the status page as both per-PLC and fleet-aggregate values:

  • cacheHitCount — FC03/FC04 requests served from the cache. Bumped inside OnUpstreamFrameAsync when ResponseCache.TryGet returns true.
  • cacheMissCount — FC03/FC04 requests whose resolved TTL was positive but whose key was not in the cache (or whose entry had expired). The identity cacheHitCount + cacheMissCount = total cache-eligible FC03/FC04 requests holds — reads whose effective TTL is 0 (uncached) increment neither counter.
  • cacheHitRatio — derived on the status page snapshot as cacheHitCount / (cacheHitCount + cacheMissCount) when the denominator is non-zero.
  • cacheInvalidations — count of cache entries invalidated by successful FC06/FC16 write responses, summed across writes.
  • cacheEntryCount — point-in-time snapshot of ResponseCache.Count (Tier-2 memory-watch KPI).
  • cacheBytes — point-in-time approximation of cached PDU bytes, computed as the running sum of CacheEntry.Length across entries (Tier-2 memory-watch KPI).

The structured log events mbproxy.cache.hit, mbproxy.cache.miss, mbproxy.cache.store, mbproxy.cache.invalidated, and mbproxy.cache.flushed (defined in CacheLogEvents) mirror the counter increments at Debug level for incident-time diagnosis. Counters are the steady-state observability surface; the events are for tracing one request through the cache when something looks wrong. See ../Operations/StatusPage.md and ../Reference/LogEvents.md.
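The ratio derivation, including the zero-denominator case, can be sketched as below. CacheRatioSketch.HitRatio is a hypothetical helper, and returning null when no cache-eligible reads have been counted is an assumption: the document only states that the ratio is computed when the denominator is non-zero.

```csharp
// Hypothetical helper; the real status-page snapshot code is not shown here.
public static class CacheRatioSketch
{
    public static double? HitRatio(long cacheHitCount, long cacheMissCount)
    {
        // Only cache-eligible reads appear in either counter.
        long eligible = cacheHitCount + cacheMissCount;

        // Assumption: report "no value" rather than 0 or NaN before any eligible read.
        if (eligible == 0) return null;

        return (double)cacheHitCount / eligible;
    }
}
```

Three hits and one miss give a ratio of 0.75; reads whose effective TTL is 0 never reach either counter and so cannot dilute the figure.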

Design-Contract Note

The cache changes the proxy's posture from "purely transparent except for BCD rewriting" to "transparent by default, with an opt-in cache layer." The transition is deliberate and operator-driven: setting CacheTtlMs > 0 on a tag is the explicit consent to the staleness window, and a deployment that ships no positive TTLs is observationally indistinguishable from one compiled without the cache code path.

There is no global switch, no implicit warm-up, and no behavioural divergence from the transparent baseline until the operator opts in tag-by-tag. The cache is the only place in the proxy where an upstream read can resolve to a value that did not just round-trip the wire, and its engagement is gated entirely by the per-tag and per-PLC TTL configuration described above.