mbproxy/docs: pivot design contract for Phase 11 response cache

Lands the design-contract pivot ahead of any cache implementation code so reviewers can evaluate the change to the "purely transparent proxy" stance independently of the Phase-11 code that depends on it. - docs/design.md: rewrite "What this is" / Read-coalescing / Failure-modes sections to acknowledge the opt-in cache; add new "Response cache (Phase 11)" section covering lookup order (cache -> coalesce -> backend), multi- tag range TTL = min, post-rewriter storage, address-range-overlap write invalidation, hot-reload PLC-wide flush, no-persistence, AllowLongTtl gate, and LRU-bounded capacity. Extend log event table with mbproxy.cache.* events. Extend per-PLC status field table with cacheHitCount / cacheMissCount / cacheInvalidations / cacheEntryCount / cacheBytes. Extend hot-reload propagation table with CacheTtlMs / Cache.* rows. - docs/kpi.md: graduate Tier 1.8 (response cache) from "requires Phase 11" to "shipped in Phase 11" and add Tier 2.4a cache-memory section. - CLAUDE.md (mbproxy): update Purpose paragraph and the Architecture headline bullets to reflect the transparent-by-default + opt-in-cache contract; flip "Implementation complete through Phase 10" to "through Phase 11". - install/mbproxy.config.template.json: add a fully-commented Mbproxy.Cache block and a CacheTtlMs example on a BcdTags.Global entry, with prominent staleness commentary documenting the design contract. No code changes in this commit - implementation lands in a follow-up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:35:35 -04:00
parent a2dba4bd07
commit 892b10baf4
4 changed files with 124 additions and 12 deletions
@@ -23,7 +23,7 @@ For context — every recommended addition below is *in addition to* this list.
 | Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` |
 | Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) |
 | Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` |
-| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream` |
+| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`, `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes` |
 | Per-PLC bytes | `upstreamIn`, `upstreamOut` |

 Counters are **cumulative since process start**. A restart resets them.
@@ -125,9 +125,9 @@ Same-key FC03/04 reads within the in-flight window attach to one another instead

 **Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.

-### 1.8 Response cache — **[requires Phase 11](plan/11-response-cache.md)**
+### 1.8 Response cache — **shipped in [Phase 11](plan/11-response-cache.md)**

-After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries.
+After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default — operators opt tags in by setting `CacheTtlMs > 0` on a `BcdTagOptions` entry (or `DefaultCacheTtlMs > 0` on a `PlcOptions` entry).

 | KPI | Definition | Source | Widget | Alert | Effort |
 |-----|------------|--------|--------|-------|--------|
@@ -174,6 +174,17 @@ Reach for these once Tier 1 is solid. They add depth for specific operational sc

 **Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.

+### 2.4a Response-cache memory — **shipped in [Phase 11](plan/11-response-cache.md)**
+
+When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC `MaxEntriesPerPlc` cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob.
+
+| KPI | Definition | Source | Widget | Alert | Effort |
+|-----|------------|--------|--------|-------|--------|
+| `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) |
+| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) |
+
+**Why this matters.** Cache entries are short-lived (TTLs are typically seconds, not minutes). A `cacheEntryCount` that sits at `MaxEntriesPerPlc` for long stretches says "the LRU is constantly evicting" — either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks. `cacheBytes` is the memory-side counter: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression.
+
 ### 2.4 Memory / process health

 | KPI | Definition | Source | Widget | Alert | Effort |