mbproxy/docs: pivot design contract for Phase 11 response cache

Lands the design-contract pivot ahead of any cache implementation code so reviewers can evaluate the change to the "purely transparent proxy" stance independently of the Phase-11 code that depends on it.

- docs/design.md: rewrite "What this is" / Read-coalescing / Failure-modes sections to acknowledge the opt-in cache; add new "Response cache (Phase 11)" section covering lookup order (cache -> coalesce -> backend), multi-tag range TTL = min, post-rewriter storage, address-range-overlap write invalidation, hot-reload PLC-wide flush, no-persistence, AllowLongTtl gate, and LRU-bounded capacity. Extend log event table with mbproxy.cache.* events. Extend per-PLC status field table with cacheHitCount / cacheMissCount / cacheInvalidations / cacheEntryCount / cacheBytes. Extend hot-reload propagation table with CacheTtlMs / Cache.* rows.
- docs/kpi.md: graduate Tier 1.8 (response cache) from "requires Phase 11" to "shipped in Phase 11" and add Tier 2.4a cache-memory section.
- CLAUDE.md (mbproxy): update Purpose paragraph and the Architecture headline bullets to reflect the transparent-by-default + opt-in-cache contract; flip "Implementation complete through Phase 10" to "through Phase 11".
- install/mbproxy.config.template.json: add a fully-commented Mbproxy.Cache block and a CacheTtlMs example on a BcdTags.Global entry, with prominent staleness commentary documenting the design contract.

No code changes in this commit - implementation lands in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
+62
-2
@@ -77,11 +77,18 @@ All configuration lives in one file, loaded via `Microsoft.Extensions.Configurat
    "Resilience": {
      "BackendConnect": { "MaxAttempts": 3, "BackoffMs": [100, 500, 2000] },
      "ListenerRecovery": { "InitialBackoffMs": [1000, 2000, 5000, 15000, 30000], "SteadyStateMs": 30000 }
    },
    "Cache": {
      "AllowLongTtl": false, // gate for any tag CacheTtlMs > 60_000
      "MaxEntriesPerPlc": 1000,
      "EvictionIntervalMs": 5000
    }
  }
}
```

A BCD tag may optionally carry `CacheTtlMs` (default 0 = off); a `PlcOptions` entry may optionally carry `DefaultCacheTtlMs` (default 0 = off). Resolution order: explicit per-tag → per-PLC default → 0.
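
The resolution order and the `AllowLongTtl` gate can be sketched as follows. This is an illustrative Python sketch, not the proxy's code: the function names are hypothetical, and modelling "not set" as `None` is an assumption (the text does not say whether an explicit per-tag `0` overrides a non-zero per-PLC default; this sketch assumes it does).

```python
from typing import Optional

# Threshold taken from the AllowLongTtl comment in the config block above.
LONG_TTL_THRESHOLD_MS = 60_000


def resolve_ttl_ms(tag_ttl_ms: Optional[int], plc_default_ttl_ms: Optional[int]) -> int:
    """Explicit per-tag value wins, then the per-PLC default, then 0 (cache off).

    Assumption: None means "not configured"; an explicit 0 on the tag is
    treated as a deliberate opt-out even when the PLC default is non-zero.
    """
    if tag_ttl_ms is not None:
        return tag_ttl_ms
    if plc_default_ttl_ms is not None:
        return plc_default_ttl_ms
    return 0


def validate_ttl_ms(ttl_ms: int, allow_long_ttl: bool) -> None:
    """Reject TTLs above 60 s unless the operator opted in via AllowLongTtl."""
    if ttl_ms > LONG_TTL_THRESHOLD_MS and not allow_long_ttl:
        raise ValueError(
            f"CacheTtlMs={ttl_ms} exceeds {LONG_TTL_THRESHOLD_MS} "
            "and Cache.AllowLongTtl is not true"
        )
```

In this sketch a tag with no TTL on a PLC whose default is 1000 ms resolves to 1000, and a resolved value of 60 001 ms fails validation unless `allow_long_ttl` is true.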

**Hybrid tag resolution.** For each PLC, the effective BCD tag list is `Global ∪ Add − Remove`. `Remove` matches by address; if the same address appears in both `Add` and `Global` the `Add` entry wins (this is how a width override is expressed). Validation at startup must:

- reject duplicate addresses within a single PLC's resolved list
@@ -100,7 +107,9 @@ All configuration lives in one file, loaded via `Microsoft.Extensions.Configurat
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream client connections for that PLC. |
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
| `Connection.Backend*TimeoutMs` | Next backend connect/request uses the new value. In-flight operations keep their already-applied timeout. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. |
| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` (Phase 11) | Tag-map reseat for the affected PLC drops the entire PLC cache; entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented in v1. |
| `Cache.AllowLongTtl`, `Cache.MaxEntriesPerPlc`, `Cache.EvictionIntervalMs` (Phase 11) | `AllowLongTtl` is enforced on next reload-validation; `MaxEntriesPerPlc` applies to subsequent inserts (existing entries not pruned); `EvictionIntervalMs` is read by each fresh eviction loop. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. |

Every accepted reload emits `mbproxy.config.reload.applied` at Information with a summary of which PLCs were added/removed and the size of the tag-list delta.
@@ -121,7 +130,7 @@ After Phase 10, FC03 / FC04 requests are additionally subject to **in-flight rea

Properties:

- **Zero post-response staleness.** Coalescing operates entirely between "first request sent to backend" and "response received from backend" (microseconds to ~10 ms typical). Once the response is fanned out, the coalescing entry dies. The proxy is NOT a cache layer — the value each upstream sees is the same value an uncoalesced request would have returned within the PLC's scan-time precision.
- **Zero post-response staleness.** Coalescing operates entirely between "first request sent to backend" and "response received from backend" (microseconds to ~10 ms typical). Once the response is fanned out, the coalescing entry dies. Coalescing alone is NOT a cache layer — the value each upstream sees is the same value an uncoalesced request would have returned within the PLC's scan-time precision. (Phase 11 layers an opt-in cache on top — see "Response cache" below.)
- **Only FC03 / FC04.** Writes (FC06 / FC16) are non-idempotent on BCD tags and never coalesced. Different function codes never share a `CoalescingKey` even at the same address (FC03 and FC04 read different Modbus tables). Different `unitId` bytes never coalesce (different PLC personalities behind a shared socket).
- **Bounded fan-out via `MaxParties`** (default 32 in `Mbproxy.Resilience.ReadCoalescing.MaxParties`). Once an entry has `MaxParties` interested clients, the next arrival opens a fresh entry — bounds the response-fanout cost per entry at O(MaxParties) and shields the backend reader task from pathological pile-on.
- **Hot-reloadable on/off.** `Mbproxy.Resilience.ReadCoalescing.Enabled` defaults to `true`. Flipping it to `false` at runtime leaves running coalesced entries to drain naturally; subsequent FC03/04 requests take the Phase-9 (one round-trip per upstream request) path.
@@ -129,6 +138,46 @@ Properties:

Counter accounting balance (per snapshot): `coalescedHitCount + coalescedMissCount` equals the total FC03 + FC04 requests seen since the multiplexer was constructed. Both counters increment regardless of whether the coalescing feature is enabled — `coalescedHitCount` is 0 when disabled, but every read still increments `coalescedMissCount`.

## Response cache (Phase 11) — opt-in bounded-staleness cache

**⚠ Design-contract pivot.** Through Phase 10 the proxy is *purely transparent* — every upstream read corresponds 1:1 to a recent backend round-trip (or, with Phase 10, to a peer's in-flight backend round-trip in the same microseconds-to-milliseconds window). Phase 11 changes that contract: the proxy gains an **opt-in per-tag response cache** that may serve upstream FC03/FC04 reads from in-process memory with bounded staleness up to the operator-configured `CacheTtlMs`. **The cache is OFF by default** (`CacheTtlMs = 0` on every BCD tag unless explicitly set); a fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. Operators opt tags in explicitly as their acknowledgement of the staleness window.

### Cache contract

- **Per-tag TTL.** Each BCD tag carries an optional `CacheTtlMs` (in `BcdTagOptions`). `CacheTtlMs = 0` (the default) disables caching for that tag. The TTL resolution order is **explicit per-tag → per-PLC `DefaultCacheTtlMs` → 0**.
- **Multi-tag read range: effective TTL = `min(TTLs)`.** When a single FC03/FC04 read covers multiple configured tags, the cache uses the smallest TTL among them. If any tag in the read range has `CacheTtlMs = 0`, the **whole read is uncached** — the conservative-by-design choice.
- **Lookup order: cache → coalesce → backend.** A cache hit short-circuits Phase 10's coalescing entirely. Only on a miss does the request engage coalescing (Phase 10) and then the Phase 9 backend send path.
- **Cache populates on demand only.** No polling, no predictive prefetch. Entries are created in the backend reader task **after** the BCD rewriter has run on the response — the cache stores **POST-rewriter bytes**, so hits never re-invoke the rewriter (CPU win + behaviour-stable).
- **Write invalidation by ADDRESS RANGE OVERLAP.** A successful FC06 / FC16 response (non-exception) invalidates every cached FC03/FC04 entry whose address range `[StartAddress, StartAddress + Qty)` overlaps the write range. A write to register 105 invalidates a cached `[100..110]` read but not a cached `[200..210]` read. Exception responses do not invalidate (the write didn't take effect).
- **Different unit IDs never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3,4})`.
- **Cache survives backend disconnects.** A cached entry's data was valid when stored; a disconnect does not retroactively invalidate it. Invalidations during a `recovering` listener state are skipped (the write never reached the backend, the cached read remains valid).
- **No persistence.** Process restart wipes the cache. No file/Redis backing store, no last-known-good snapshot.
- **Hot-reload flushes the entire PLC cache.** Any tag-list change to a PLC drops every cached entry for that PLC. Per-tag flush granularity is intentionally not done in v1 — the simple correctness move is "any tag-list reload → drop all entries for the affected PLC and let them re-populate."
- **TTL > 60 s requires `Cache.AllowLongTtl = true`.** Validation rejects reloads that set `CacheTtlMs > 60_000` without this opt-in. Prevents "left at 1 hour by accident" deployments.
- **LRU-bounded capacity.** Each PLC's cache is capped at `Cache.MaxEntriesPerPlc` (default 1000). When full, the next insert evicts the least-recently-used entry. A background eviction loop (interval `Cache.EvictionIntervalMs`, default 5000) also scans for expired entries.
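
Two of the rules above, the `min(TTLs)` rule for multi-tag reads and write invalidation by address-range overlap, reduce to a half-open interval test. The following is a minimal Python sketch under stated assumptions: `CacheKey`, `effective_ttl_ms`, and `keys_to_invalidate` are hypothetical names, a tag is modelled as `address -> (width_in_registers, resolved_ttl_ms)`, "covers" is read as "overlaps", and a read that touches no configured tag is treated as uncached (the text does not specify that case).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


def ranges_overlap(start_a: int, qty_a: int, start_b: int, qty_b: int) -> bool:
    """True when [start_a, start_a + qty_a) and [start_b, start_b + qty_b) intersect."""
    return start_a < start_b + qty_b and start_b < start_a + qty_a


def effective_ttl_ms(read_start: int, read_qty: int,
                     tags: Dict[int, Tuple[int, int]]) -> int:
    """Smallest TTL among the tags the read touches; 0 means "do not cache"."""
    ttls = [
        ttl
        for addr, (width, ttl) in tags.items()
        if ranges_overlap(read_start, read_qty, addr, width)
    ]
    # Assumption: no configured tag in range -> uncached.
    return min(ttls) if ttls else 0


@dataclass(frozen=True)
class CacheKey:
    unit_id: int
    fc: int      # 3 or 4
    start: int
    qty: int


def keys_to_invalidate(keys: Iterable[CacheKey], unit_id: int,
                       write_start: int, write_qty: int) -> List[CacheKey]:
    """Cached reads on the same unit ID whose range overlaps a successful write."""
    return [
        k for k in keys
        if k.unit_id == unit_id
        and k.fc in (3, 4)
        and ranges_overlap(k.start, k.qty, write_start, write_qty)
    ]
```

With a cached read keyed at `start=100, qty=10`, a single-register write to 105 on the same unit ID selects it for invalidation, while a cached read at `start=200, qty=10` or the same range on another unit ID is left alone.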

### Cache and the rewriter

The BCD rewriter runs once on the cache-miss path (the backend reader task decodes the response and stores the decoded bytes in the cache). Cache hits return pre-decoded bytes directly without re-invoking the rewriter — this is both a CPU optimisation and a correctness guarantee (any future rewriter change would not retroactively re-transform an entry that was decoded against an earlier rewriter version).
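
The read path described here (look in the cache, otherwise fetch, rewrite once, store the rewritten bytes) can be illustrated with a small LRU-plus-TTL sketch. Everything in it is hypothetical: `ResponseCacheSketch` and `serve_read` are illustrative names, `fetch_from_backend` stands in for the combined coalescing and backend path, and the injected `clock` exists only to make the example deterministic. The background eviction loop is left out; expiry is checked lazily on lookup.

```python
import time
from collections import OrderedDict
from typing import Callable, Optional, Tuple

Key = Tuple[int, int, int, int]  # (unit_id, fc, start, qty)


class ResponseCacheSketch:
    """Per-PLC cache: TTL expiry checked on lookup, LRU eviction on insert."""

    def __init__(self, max_entries: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._max = max_entries
        self._clock = clock
        self._entries: "OrderedDict[Key, Tuple[float, bytes]]" = OrderedDict()

    def get(self, key: Key) -> Optional[bytes]:
        item = self._entries.get(key)
        if item is None:
            return None
        expires_at, payload = item
        if self._clock() >= expires_at:
            del self._entries[key]          # expired: treat as a miss
            return None
        self._entries.move_to_end(key)      # mark as most recently used
        return payload

    def put(self, key: Key, payload: bytes, ttl_ms: int) -> None:
        if ttl_ms <= 0:
            return                          # TTL 0 means "never cache"
        self._entries[key] = (self._clock() + ttl_ms / 1000.0, payload)
        self._entries.move_to_end(key)
        while len(self._entries) > self._max:
            self._entries.popitem(last=False)   # evict least recently used

    def __len__(self) -> int:
        return len(self._entries)


def serve_read(cache: ResponseCacheSketch, key: Key, ttl_ms: int,
               fetch_from_backend: Callable[[], bytes],
               rewrite: Callable[[bytes], bytes]) -> bytes:
    """Cache first; on a miss, rewrite once and store the post-rewriter bytes."""
    if ttl_ms > 0:
        cached = cache.get(key)
        if cached is not None:
            return cached                   # hit: the rewriter is not invoked
    decoded = rewrite(fetch_from_backend())
    cache.put(key, decoded, ttl_ms)
    return decoded
```

Because the stored payload is already rewritten, a second `serve_read` for the same key inside the TTL window calls neither `fetch_from_backend` nor `rewrite`.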

### Hot-reload semantics

| Change | Cache behaviour |
|--------|----------------|
| Tag's `CacheTtlMs` changed (any direction, 0 → N, N → 0, N → M) | Entire PLC cache is flushed; entries re-populate on demand under the new TTL. |
| New PLC added / removed | New PLC starts with empty cache; removed PLC's cache is discarded with the multiplexer. |
| `Cache.AllowLongTtl` flipped | Validation runs on next reload; existing entries unaffected. |
| `Cache.MaxEntriesPerPlc` changed | Existing entries unaffected; cap applies to subsequent inserts. |
| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues until next dispose; subsequent loops use new interval. |

### Counter accounting

- `cacheHitCount` — FC03/FC04 requests served from the cache.
- `cacheMissCount` — FC03/FC04 requests that fell through to the coalescing/backend path. (Cache hit + Cache miss = total FC03/FC04 requests that were cache-eligible, i.e. whose resolved TTL was > 0; reads whose effective TTL is 0 increment neither.)
- `cacheInvalidations` — count of cache entries invalidated by FC06/FC16 write responses.
- `cacheEntryCount` — point-in-time snapshot of `ResponseCache.Count` (Tier-2 memory-watch KPI).
- `cacheBytes` — point-in-time approximation of cached PDU bytes (Tier-2 memory-watch KPI).

## Rewriter — function code scope

The rewriter inspects and rewrites payloads only for these function codes; every other FC (coils, discrete inputs, diagnostics, exception responses) passes through byte-for-byte:
@@ -151,6 +200,7 @@ The rewriter inspects and rewrites payloads only for these function codes; every
- **Backend request timeout** → the per-request watchdog times out any correlation entry older than `Connection.BackendRequestTimeoutMs`, delivers Modbus exception 0x0B (Gateway Target Device Failed To Respond) with the original TxId to the upstream party, and frees the proxy TxId. **No mid-request retries** — FC06 / FC16 are non-idempotent on BCD tags (a partial-applied multi-register write could leave a 32-bit BCD tag mid-transition), so every in-flight request is one-shot. The client interprets the 0x0B as a transport failure and reconnects through its normal path.
- **Partial-BCD overlap** → forward raw + warn (see Rewriter section).
- **One slow PLC does not stall the rest of the fleet.** Each PLC has its own `PlcMultiplexer`, with its own backend socket, correlation map, and outbound channel; per-PLC failures are local. A slow or dead backend on one PLC only impacts that PLC's clients.
- **Cache during backend recovery (Phase 11).** Cache hits remain valid during a `recovering` listener state — the data was fresh when cached, and recovery only affects future requests. Writes that arrive during recovery never reach the backend, so the invalidation never happens. This is consistent: the write also didn't take effect on the PLC. Cached entries simply remain until their TTL expires.

## Startup posture — eager, continue on per-port failure
@@ -193,6 +243,11 @@ Stable event names (keep these stable so log queries don't churn):
| `mbproxy.coalesce.hit` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty`, `PartyCount` |
| `mbproxy.coalesce.miss` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.coalesce.dead_upstream` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.cache.hit` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.cache.miss` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.cache.store` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty`, `TtlMs` |
| `mbproxy.cache.invalidated` | Debug | `Plc`, `UnitId`, `WriteStart`, `WriteQty`, `Count` |
| `mbproxy.cache.flushed` | Info | `Plc`, `Reason`, `Count` |

## Status page — read-only HTTP endpoint
@@ -234,6 +289,11 @@ Authentication is assumed to live at the network layer (trusted internal segment
| `backend.coalescedHitCount` | FC03/04 requests that attached to an already-in-flight peer (Phase 10) |
| `backend.coalescedMissCount` | FC03/04 requests that opened a fresh backend round-trip (Phase 10). `Hit + Miss` = total FC03/04 requests |
| `backend.coalescedResponseToDeadUpstream` | Coalesced fan-out responses skipped because the attached upstream had already disconnected (Phase 10) |
| `backend.cacheHitCount` | FC03/04 reads served from the response cache (Phase 11) |
| `backend.cacheMissCount` | FC03/04 reads that fell through to coalescing/backend after a cache miss (Phase 11) |
| `backend.cacheInvalidations` | Cache entries invalidated by overlapping FC06/FC16 write responses (Phase 11) |
| `backend.cacheEntryCount` | Point-in-time snapshot of the per-PLC cache's entry count (Phase 11, Tier-2 memory-watch) |
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC (Phase 11, Tier-2 memory-watch) |
| `bytes.upstreamIn` / `bytes.upstreamOut` | Bytes forwarded each direction |

Counters are `System.Threading.Interlocked` longs read atomically per request; no locking on the read path.
+14
-3
@@ -23,7 +23,7 @@ For context — every recommended addition below is *in addition to* this list.
| Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` |
| Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) |
| Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` |
| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream` |
| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`, `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes` |
| Per-PLC bytes | `upstreamIn`, `upstreamOut` |

Counters are **cumulative since process start**. A restart resets them.
@@ -125,9 +125,9 @@ Same-key FC03/04 reads within the in-flight window attach to one another instead

**Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.

### 1.8 Response cache — **[requires Phase 11](plan/11-response-cache.md)**
### 1.8 Response cache — **shipped in [Phase 11](plan/11-response-cache.md)**

After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries.
After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default — operators opt tags in by setting `CacheTtlMs > 0` on a `BcdTagOptions` entry (or `DefaultCacheTtlMs > 0` on a `PlcOptions` entry).

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
@@ -174,6 +174,17 @@ Reach for these once Tier 1 is solid. They add depth for specific operational sc

**Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.

### 2.4a Response-cache memory — **shipped in [Phase 11](plan/11-response-cache.md)**

When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC `MaxEntriesPerPlc` cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob.

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) |
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) |

**Why this matters.** Cache entries are short-lived (TTLs are typically seconds, not minutes). A `cacheEntryCount` that sits at `MaxEntriesPerPlc` for long stretches says "the LRU is constantly evicting" — either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks in. `cacheBytes` is the memory-side counter: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13.5 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression.
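
The fleet-wide estimate above is a straight product of the three figures quoted in the paragraph. As a quick check, here is a small Python sketch; the function name is hypothetical, and the estimate counts cached PDU payload only, ignoring per-entry overhead such as keys and expiry timestamps (an assumption of this sketch, in line with `cacheBytes` being described as an approximation).

```python
def estimated_cache_bytes(plc_count: int, max_entries_per_plc: int,
                          bytes_per_pdu: int) -> int:
    """Worst-case cached payload if every PLC's cache sits at its entry cap."""
    return plc_count * max_entries_per_plc * bytes_per_pdu


total = estimated_cache_bytes(plc_count=54, max_entries_per_plc=1000, bytes_per_pdu=250)
print(f"{total:,} bytes = {total / 1e6:.1f} MB ({total / 2**20:.1f} MiB)")
# prints: 13,500,000 bytes = 13.5 MB (12.9 MiB)
```

Raising `MaxEntriesPerPlc` scales this figure linearly, so doubling the cap to 2000 would put the same fleet at roughly 27 MB under the same assumptions.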

### 2.4 Memory / process health

| KPI | Definition | Source | Widget | Alert | Effort |