mbproxy: add opt-in response cache (Phase 11)

Layers a per-PLC, per-tag response cache on top of Phase 10's coalescing.
Cache is OFF by default per tag (CacheTtlMs = 0); a fresh deployment with no
TTL config behaves identically to Phase 10. Operators opt tags in by setting
CacheTtlMs > 0 on a BcdTagOptions entry (or DefaultCacheTtlMs > 0 on a
PlcOptions entry), explicitly acknowledging the staleness window.

Cache lookup order: cache -> coalesce -> backend. A cache hit short-circuits
both Phase 10's coalescing path and Phase 9's backend send. Cache stores
POST-rewriter PDU bytes so hits never re-invoke the BCD rewriter. FC06/FC16
write responses invalidate every cached entry whose address range overlaps
the write (half-open interval math).

New types (Mbproxy.Proxy.Cache, all internal):
- CacheKey (record-struct, same shape as CoalescingKey but kept SEPARATE so
  the two phases evolve independently).
- CacheEntry, ResponseCache (IDisposable; LRU + PeriodicTimer eviction
  loop), CacheInvalidator (pure overlap matcher), CacheLogEvents (stable
  mbproxy.cache.* names).

Multi-tag range TTL = min(TTLs); any tag with TTL = 0 in the range disables
caching for the whole read (conservative-by-design).

Options surface:
- BcdTagOptions.CacheTtlMs (nullable int; null = fall through to PLC default)
- PlcOptions.DefaultCacheTtlMs
- MbproxyOptions.Cache.{AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs}
- TTL > 60_000 ms requires Cache.AllowLongTtl = true (reload validation).

Admin counters (Tier 1.8 + Tier 2 cache-memory KPIs from docs/kpi.md):
- CacheHitCount, CacheMissCount, CacheInvalidations on ProxyCounters.
- CacheEntryCount, CacheBytes via a new ICacheStatsProvider snapshot path.
- /status.json and the HTML page surface a new Cache cell per PLC row.

Hot-reload: any tag-list change to a PLC reseats the per-PLC context with a
fresh cache; the old cache is disposed inside ReplaceContextAsync. Per-tag
flush granularity is intentionally not implemented in v1.

PLCs with no cache-eligible tags (every resolved tag has CacheTtlMs = 0)
get Cache = null on the context and skip the eviction timer entirely, so
the no-cache path is byte-identical to Phase 10.

Tests (32 new unit + 5 new E2E = 37 new; suite now 314 unit + 48 E2E):
- CacheKeyTests, CacheEntryTests (records + boundary semantics).
- CacheInvalidatorTests: full overlap, both partials, adjacent-not-
  overlapping, disjoint, different unit ID + auxiliary FC-filter / zero-qty.
- ResponseCacheTests: round-trip, lazy expiry, range invalidation,
  unit-id filter, LRU bound, LRU access tracking, concurrent get/set,
  dispose, clear, approximate-bytes accounting.
- ResponseCacheMultiplexerTests (stub-backend): hit short-circuits
  coalescing, BCD-decoded bytes are cached not raw, FC06 invalidates
  overlapping, non-overlapping write does not invalidate, multi-tag
  TTL=min rule, regression-cache-disabled-by-default-is-Phase-10, hit
  works even when backend unreachable.
- ResponseCacheE2ETests (pymodbus DL205 sim, sequential reads):
  * Headline: 10 reads with TTL=1000 ms -> 9 hits, 1 miss, 1 backend trip.
  * TTL expiry path with sleep > TTL.
  * Write invalidation through the proxy on a scratch register.
  * BCD-decoded bytes are cached, not raw BCD nibbles.
  * Regression: Cache disabled by default -> behaviour byte-identical to
    Phase 10.

Pre-existing flake hardened: BackendDisconnect_CascadesToAllUpstreams now
polls briefly for the cascade counter to absorb the inherent scheduling
gap between "upstream EOF observed" and "counter incremented inside
TearDownBackendAsync." Counter semantics unchanged.

Phase doc updated with implementation clarifications discovered during
this work (CacheKey kept separate from CoalescingKey, LastUsedTick is
long, FC06/FC16 startAddr/qty parsing extension, cache-pre-connect
short-circuit, write-invalidation only on successful responses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-14 03:08:51 -04:00
parent 892b10baf4
commit 1db900edef
33 changed files with 2407 additions and 38 deletions
+40
View File
@@ -365,6 +365,46 @@ If you're the agent picking up this phase:
11. **Update `docs/design.md` AND `docs/kpi.md` AND `mbproxy/CLAUDE.md` AND `install/mbproxy.config.template.json` IN THE SAME PR AS THE CODE.** Doc drift is a gate fail. The architectural pivot must be visible across all reader-facing surfaces.
## Implementation clarifications discovered during this phase
The following clarifications were resolved while implementing Phase 11 — recorded here so
the next agent doesn't re-derive them:
- **`CacheKey` vs `CoalescingKey` — kept SEPARATE (no aliasing).** The two records carry
the same dimensions but live in different namespaces (`Mbproxy.Proxy.Cache` vs
`Mbproxy.Proxy.Multiplexing`). Aliasing them would couple the two phases' evolution; a
duplicate 4-field record-struct is cheap enough to justify keeping them independent.
Per-key equality is record-struct value equality; the two types are never compared.
- **`CacheEntry.LastUsedTick` is a `long`, not `ushort`.** The phase doc proposed `ushort`
but the LRU comparison needs to survive >65K touches in a long-running process. The
signed-long ticker stamp suffices for the lifetime of any reasonable deployment.
- **No-cacheable-tag PLCs skip the cache entirely.** When a PLC's resolved tag map has no
entry with `CacheTtlMs > 0`, `ProxyWorker` (and `ConfigReconciler` on reseat/add)
builds the `PerPlcContext` with `Cache = null`. The multiplexer's cache check is a
no-op on a null cache, and no eviction timer is started. The "default OFF = byte-
identical to Phase 10" regression test (`Cache_DisabledByDefault_*`) lands on this code
path.
- **Cache check runs BEFORE `EnsureBackendConnectedAsync`.** A cache hit serves the
upstream client even when the backend is currently unreachable. This is intentional and
matches the design contract bullet "cache survives backend disconnects." Verified by the
unit-level `FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_*` test.
- **FC06 / FC16 invalidation requires startAddr/qty parsing.** The multiplexer's request
parser previously only extracted start/qty for FC03/FC04. Phase 11 extends it to
FC06 (qty = 1) and FC16 (qty from request) so the InFlightRequest carries the write
span; the response path then invalidates by overlap using those values.
- **Cache eviction loop uses `PeriodicTimer`.** Per the phase doc; clamps the interval
to a 100 ms floor (operator-configurable down to that) so a misconfigured
`EvictionIntervalMs = 0` doesn't become a tight loop.
- **Write invalidation only fires on SUCCESSFUL responses.** The post-rewriter check at
the backend reader inspects the response FC byte for the exception-bit (`& 0x80`). An
exception response on FC06 / FC16 (e.g. PLC in PROGRAM mode → code 04) does NOT
invalidate — consistent with "the write didn't take effect."
- **Pre-existing flake in `BackendDisconnect_CascadesToAllUpstreams`** hardened with a
poll loop. The race window between "upstream EOF observed" and "BackendDisconnectCascades
counter incremented in `TearDownBackendAsync`" is inherent to the multiplexer's
serial-pipe-dispose loop; the test now polls for up to 1 s for the counter to reach 3.
Behaviour is unchanged.
## Cross-references
- Phase 9's multiplexer is the chokepoint that hosts the cache check: [`09-txid-multiplexing.md`](09-txid-multiplexing.md).