mbproxy/docs: pivot design contract for Phase 11 response cache
Lands the design-contract pivot ahead of any cache implementation code so
reviewers can evaluate the change to the "purely transparent proxy" stance
independently of the Phase-11 code that depends on it.

- docs/design.md: rewrite "What this is" / Read-coalescing / Failure-modes
  sections to acknowledge the opt-in cache; add new "Response cache
  (Phase 11)" section covering lookup order (cache -> coalesce -> backend),
  multi-tag range TTL = min, post-rewriter storage, address-range-overlap
  write invalidation, hot-reload PLC-wide flush, no-persistence,
  AllowLongTtl gate, and LRU-bounded capacity. Extend log event table with
  mbproxy.cache.* events. Extend per-PLC status field table with
  cacheHitCount / cacheMissCount / cacheInvalidations / cacheEntryCount /
  cacheBytes. Extend hot-reload propagation table with CacheTtlMs / Cache.*
  rows.
- docs/kpi.md: graduate Tier 1.8 (response cache) from "requires Phase 11"
  to "shipped in Phase 11" and add Tier 2.4a cache-memory section.
- CLAUDE.md (mbproxy): update Purpose paragraph and the Architecture
  headline bullets to reflect the transparent-by-default + opt-in-cache
  contract; flip "Implementation complete through Phase 10" to "through
  Phase 11".
- install/mbproxy.config.template.json: add a fully-commented Mbproxy.Cache
  block and a CacheTtlMs example on a BcdTags.Global entry, with prominent
  staleness commentary documenting the design contract.

No code changes in this commit - implementation lands in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
CLAUDE.md (mbproxy): +7 -6
@@ -11,7 +11,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
|
||||
|
||||
### Purpose: bidirectional BCD rewrite inline on the MBTCP stream
|
||||
|
||||
The service is **not** a polling/cache layer. It is a transparent Modbus TCP proxy whose job is to **rewrite the configured BCD tags in real time, in both directions**, while proxying every other byte of the MBTCP connection untouched:
|
||||
The service is a **transparent-by-default** Modbus TCP proxy whose job is to **rewrite the configured BCD tags in real time, in both directions**, while proxying every other byte of the MBTCP connection untouched. Since Phase 11 the proxy also exposes an **opt-in per-tag response cache** (default OFF, opt-in by setting `BcdTagOptions.CacheTtlMs > 0`); with caching enabled the proxy is no longer purely transparent — upstream reads may return a value up to `CacheTtlMs` milliseconds old.
|
||||
|
||||
- **Upstream read path (client → PLC → client).** When a client reads a register on the BCD tag list, the proxy intercepts the PLC's response and rewrites the raw BCD nibbles (`0x1234`) into the binary integer the client expects (`0x04D2` = decimal 1234) before forwarding. 32-bit BCD values that span the CDAB word pair are rewritten as a unit.
|
||||
- **Downstream write path (client → PLC).** When a client writes a register on the BCD tag list, the proxy intercepts the request and re-encodes the client's binary integer (`0x04D2`) into BCD nibbles (`0x1234`) before forwarding to the PLC, so the value the operator sees in ladder matches what the client wrote.
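The two rewrite directions described above can be illustrated with a minimal Python sketch. It is not the service's own code: the function names `bcd_to_int` and `int_to_bcd` are hypothetical, and the sketch handles only the 16-bit case, leaving out the 32-bit CDAB word-pair handling.

```python
def bcd_to_int(word: int) -> int:
    """Decode a 16-bit BCD register (four decimal nibbles) into a binary integer."""
    value = 0
    for shift in (12, 8, 4, 0):  # most-significant nibble first
        nibble = (word >> shift) & 0xF
        if nibble > 9:
            raise ValueError(f"invalid BCD nibble {nibble:#x} in {word:#06x}")
        value = value * 10 + nibble
    return value


def int_to_bcd(value: int) -> int:
    """Encode a binary integer in 0..9999 as a 16-bit BCD register."""
    if not 0 <= value <= 9999:
        raise ValueError(f"{value} does not fit in four BCD digits")
    word = 0
    for shift in (0, 4, 8, 12):  # least-significant decimal digit first
        value, digit = divmod(value, 10)
        word |= digit << shift
    return word
```

With these helpers, `bcd_to_int(0x1234)` yields `0x04D2` (decimal 1234) on the read path and `int_to_bcd(0x04D2)` yields `0x1234` on the write path, matching the example values in the bullets.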
|
||||
@@ -24,21 +24,22 @@ The integration win is that upstream consumers (Wonderware / Historian / OPC UA
|
||||
The full design plan is in **[`docs/design.md`](docs/design.md)** — settled 2026-05-13, updated for Phase 9 multiplexing on 2026-05-14. Headline choices the agent should keep in mind without opening that file:
|
||||
|
||||
- **One `TcpListener` per PLC** (54 distinct ports). Each PLC has **one shared backend socket** owned by a `PlcMultiplexer`; many upstream clients are multiplexed onto that single backend via MBAP TxId rewriting (Phase 9). The H2-ECOM100's 4-client cap no longer caps upstream connections.
|
||||
- **Transparent pass-through** of every byte except the MBAP TxId field (rewritten by the multiplexer on each request and restored on each response) and FC03/FC04 response payloads + FC06/FC16 request payloads at configured BCD addresses (re-encoded between BCD nibbles and binary integers).
|
||||
- **Transparent by default; opt-in cached** (Phase 11). Every byte passes through unchanged except the MBAP TxId field (rewritten by the multiplexer on each request and restored on each response) and FC03/FC04 response payloads + FC06/FC16 request payloads at configured BCD addresses (re-encoded between BCD nibbles and binary integers). With Phase 11, FC03/FC04 reads for tags whose `CacheTtlMs > 0` may be served from a per-PLC in-process cache without backend traffic; the cache is **OFF by default** per tag.
|
||||
- **In-flight FC03/FC04 read coalescing** (Phase 10): same-key reads arriving while a peer is in flight attach to the existing `InFlightRequest.InterestedParties` list; the single backend response fans out to every attached client with original TxIds restored. Zero post-response staleness — coalescing entries die when the response arrives. Hot-reload via `Mbproxy.Resilience.ReadCoalescing.Enabled`.
|
||||
- **Optional response cache** (Phase 11) with per-tag TTL (default 0 = off). Lookup order is **cache → coalesce → backend**: a cache hit short-circuits Phase 10's coalescing path entirely. Multi-tag read range: effective TTL = `min(TTLs)`; any tag with `CacheTtlMs = 0` in the range disables caching for the whole read. Successful FC06/FC16 write responses invalidate cached FC03/FC04 entries whose address range overlaps the write. Cache stores POST-rewriter bytes (hits never re-invoke the rewriter). No persistence — process restart wipes the cache. `Cache.AllowLongTtl = true` is required for any `CacheTtlMs > 60_000`.
|
||||
- **Polly-backed listener supervisor** auto-recovers any listener that fails to bind at startup or faults at runtime; the same code path also brings up newly-added PLCs from hot-reload and tears down removed ones.
|
||||
- **`appsettings.json` is hot-reloadable** via `IOptionsMonitor<MbproxyOptions>`; tag-list changes propagate per-PDU, PLC add/remove flows through the supervisor.
|
||||
- **`appsettings.json` is hot-reloadable** via `IOptionsMonitor<MbproxyOptions>`; tag-list changes propagate per-PDU, PLC add/remove flows through the supervisor. A tag-list reload flushes the affected PLC's response cache (per-tag granularity intentionally not done in v1).
|
||||
- **Polly bounded retries** on backend connect (3 attempts at 100ms / 500ms / 2000ms). No retries on mid-request failures (FC06/FC16 are non-idempotent on BCD tags). A per-request watchdog in the multiplexer surfaces Modbus exception 0x0B to the upstream client if a backend response never arrives within `BackendRequestTimeoutMs`.
|
||||
- **Backend disconnect cascades upstream**: when the shared backend socket dies, every attached upstream pipe is closed in the same cycle (counter `BackendDisconnectCascades`); clients reconnect on their next request.
|
||||
- **Read-only Kestrel admin port** (default 8080) exposes `GET /` (auto-refreshing HTML) and `GET /status.json` with service-wide and per-PLC counters (including Phase-9 mux fields `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth` and Phase-10 coalescing fields `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`).
|
||||
- **Read-only Kestrel admin port** (default 8080) exposes `GET /` (auto-refreshing HTML) and `GET /status.json` with service-wide and per-PLC counters (including Phase-9 mux fields, Phase-10 coalescing fields, and Phase-11 cache fields `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes`).
|
||||
|
||||
Anything beyond this short list — JSON schema, propagation table, stable log event names, status counter catalog, test plan — lives in `docs/design.md`. Open that doc before writing code; keep it in sync when decisions change.
|
||||
|
||||
## Current state
|
||||
|
||||
**Implementation complete through Phase 10.** Phases 00–08 shipped the production-ready 1:1-model service; Phase 9 swapped the connection layer for the TxId-multiplexed model without changing the transparent-rewrite contract; Phase 10 added in-flight read coalescing as an additive optimization on top of the multiplexer. The service is production-ready as a Windows Service:
|
||||
**Implementation complete through Phase 11.** Phases 00–08 shipped the production-ready 1:1-model service; Phase 9 swapped the connection layer for the TxId-multiplexed model; Phase 10 added in-flight read coalescing on top; Phase 11 added an opt-in per-tag response cache (bounded staleness, OFF by default — see "Response cache" in `docs/design.md`). The service is production-ready as a Windows Service:
|
||||
|
||||
- 325 tests passing: 282 unit tests + 43 E2E tests (against the pymodbus DL205 simulator + stub backends).
|
||||
- Test count grew through Phase 11 (see `tests/Mbproxy.Tests/` for the current suite; previous baseline was 325 = 282 unit + 43 E2E).
|
||||
- Single-file self-contained publish (`dotnet publish -c Release -r win-x64`).
|
||||
- PowerShell install/uninstall scripts under `install/`.
|
||||
- Graceful shutdown with configurable drain timeout (`Connection.GracefulShutdownTimeoutMs`, default 10 s).
|
||||
|
||||
docs/design.md: +62 -2
@@ -77,11 +77,18 @@ All configuration lives in one file, loaded via `Microsoft.Extensions.Configurat
|
||||
"Resilience": {
|
||||
"BackendConnect": { "MaxAttempts": 3, "BackoffMs": [100, 500, 2000] },
|
||||
"ListenerRecovery": { "InitialBackoffMs": [1000, 2000, 5000, 15000, 30000], "SteadyStateMs": 30000 }
|
||||
},
|
||||
"Cache": {
|
||||
"AllowLongTtl": false, // gate for any tag CacheTtlMs > 60_000
|
||||
"MaxEntriesPerPlc": 1000,
|
||||
"EvictionIntervalMs": 5000
|
||||
}
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
A BCD tag may optionally carry `CacheTtlMs` (default 0 = off); a `PlcOptions` entry may optionally carry `DefaultCacheTtlMs` (default 0 = off). Resolution order: explicit per-tag → per-PLC default → 0.
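The resolution order can be sketched in a few lines of Python. This is an illustration only: the function name is hypothetical, and treating `None` as "not configured" (so that an explicit per-tag `0` overrides a non-zero PLC default) is an assumption the text above does not settle.

```python
from typing import Optional


def resolve_cache_ttl_ms(tag_ttl_ms: Optional[int],
                         plc_default_ttl_ms: Optional[int]) -> int:
    """Resolve a tag's TTL: explicit per-tag, then per-PLC default, then 0 (off)."""
    if tag_ttl_ms is not None:          # explicit per-tag value wins
        return tag_ttl_ms
    if plc_default_ttl_ms is not None:  # fall back to the PLC-wide default
        return plc_default_ttl_ms
    return 0                            # nothing configured: cache disabled
```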
|
||||
|
||||
**Hybrid tag resolution.** For each PLC, the effective BCD tag list is `Global ∪ Add − Remove`. `Remove` matches by address; if the same address appears in both `Add` and `Global` the `Add` entry wins (this is how a width override is expressed). Validation at startup must:
|
||||
|
||||
- reject duplicate addresses within a single PLC's resolved list
|
||||
@@ -100,7 +107,9 @@ All configuration lives in one file, loaded via `Microsoft.Extensions.Configurat
|
||||
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream client connections for that PLC. |
|
||||
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
|
||||
| `Connection.Backend*TimeoutMs` | Next backend connect/request uses the new value. In-flight operations keep their already-applied timeout. |
|
||||
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. |
|
||||
| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` (Phase 11) | Tag-map reseat for the affected PLC drops the entire PLC cache; entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented in v1. |
|
||||
| `Cache.AllowLongTtl`, `Cache.MaxEntriesPerPlc`, `Cache.EvictionIntervalMs` (Phase 11) | `AllowLongTtl` is enforced on next reload-validation; `MaxEntriesPerPlc` applies to subsequent inserts (existing entries not pruned); `EvictionIntervalMs` is read by each fresh eviction loop. |
|
||||
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. |
|
||||
|
||||
Every accepted reload emits `mbproxy.config.reload.applied` at Information with a summary of which PLCs were added/removed and the size of the tag-list delta.
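As a rough illustration of the long-TTL gate in the table above, the following Python sketch returns one error per offending tag and accepts the reload only when the list is empty. The function name, the `{address: ttl_ms}` input shape, and the error wording are all hypothetical; only the 60 000 ms threshold and the `Cache.AllowLongTtl` name come from this document.

```python
from typing import Dict, List

LONG_TTL_THRESHOLD_MS = 60_000  # threshold named in the propagation table


def validate_cache_ttls(tag_ttls_ms: Dict[int, int],
                        allow_long_ttl: bool) -> List[str]:
    """Return validation errors; an empty list means the reload may proceed."""
    errors = []
    for address, ttl_ms in sorted(tag_ttls_ms.items()):
        if ttl_ms > LONG_TTL_THRESHOLD_MS and not allow_long_ttl:
            errors.append(
                f"tag {address}: CacheTtlMs={ttl_ms} exceeds "
                f"{LONG_TTL_THRESHOLD_MS} without Cache.AllowLongTtl = true"
            )
    return errors
```

Because a reload is rejected as a whole, a caller would keep the current in-memory config whenever this list is non-empty.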
|
||||
|
||||
@@ -121,7 +130,7 @@ After Phase 10, FC03 / FC04 requests are additionally subject to **in-flight rea
|
||||
|
||||
Properties:
|
||||
|
||||
- **Zero post-response staleness.** Coalescing operates entirely between "first request sent to backend" and "response received from backend" (microseconds to ~10 ms typical). Once the response is fanned out, the coalescing entry dies. The proxy is NOT a cache layer — the value each upstream sees is the same value an uncoalesced request would have returned within the PLC's scan-time precision.
|
||||
- **Zero post-response staleness.** Coalescing operates entirely between "first request sent to backend" and "response received from backend" (microseconds to ~10 ms typical). Once the response is fanned out, the coalescing entry dies. Coalescing alone is NOT a cache layer — the value each upstream sees is the same value an uncoalesced request would have returned within the PLC's scan-time precision. (Phase 11 layers an opt-in cache on top — see "Response cache" below.)
|
||||
- **Only FC03 / FC04.** Writes (FC06 / FC16) are non-idempotent on BCD tags and never coalesced. Different function codes never share a `CoalescingKey` even at the same address (FC03 and FC04 read different Modbus tables). Different `unitId` bytes never coalesce (different PLC personalities behind a shared socket).
|
||||
- **Bounded fan-out via `MaxParties`** (default 32 in `Mbproxy.Resilience.ReadCoalescing.MaxParties`). Once an entry has `MaxParties` interested clients, the next arrival opens a fresh entry — bounds the response-fanout cost per entry at O(MaxParties) and shields the backend reader task from pathological pile-on.
|
||||
- **Hot-reloadable on/off.** `Mbproxy.Resilience.ReadCoalescing.Enabled` defaults to `true`. Flipping it to `false` at runtime leaves running coalesced entries to drain naturally; subsequent FC03/04 requests take the Phase-9 (one round-trip per upstream request) path.
|
||||
@@ -129,6 +138,46 @@ Properties:
|
||||
|
||||
Counter accounting balance (per snapshot): `coalescedHitCount + coalescedMissCount` equals the total FC03 + FC04 requests seen since the multiplexer was constructed. Both counters increment regardless of whether the coalescing feature is enabled — `coalescedHitCount` is 0 when disabled, but every read still increments `coalescedMissCount`.
|
||||
|
||||
## Response cache (Phase 11) — opt-in bounded-staleness cache
|
||||
|
||||
**⚠ Design-contract pivot.** Through Phase 10 the proxy is *purely transparent* — every upstream read corresponds 1:1 to a recent backend round-trip (or, with Phase 10, to a peer's in-flight backend round-trip in the same microseconds-to-milliseconds window). Phase 11 changes that contract: the proxy gains an **opt-in per-tag response cache** that may serve upstream FC03/FC04 reads from in-process memory with bounded staleness up to the operator-configured `CacheTtlMs`. **The cache is OFF by default** (`CacheTtlMs = 0` on every BCD tag unless explicitly set); a fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. Operators opt tags in explicitly as their acknowledgement of the staleness window.
|
||||
|
||||
### Cache contract
|
||||
|
||||
- **Per-tag TTL.** Each BCD tag carries an optional `CacheTtlMs` (in `BcdTagOptions`). `CacheTtlMs = 0` (the default) disables caching for that tag. The TTL resolution order is **explicit per-tag → per-PLC `DefaultCacheTtlMs` → 0**.
|
||||
- **Multi-tag read range: effective TTL = `min(TTLs)`.** When a single FC03/FC04 read covers multiple configured tags, the cache uses the smallest TTL among them. If any tag in the read range has `CacheTtlMs = 0`, the **whole read is uncached** — the conservative-by-design choice.
|
||||
- **Lookup order: cache → coalesce → backend.** A cache hit short-circuits Phase 10's coalescing entirely. Only on a miss does the request engage coalescing (Phase 10) and then the Phase 9 backend send path.
|
||||
- **Cache populates on demand only.** No polling, no predictive prefetch. Entries are created in the backend reader task **after** the BCD rewriter has run on the response — the cache stores **POST-rewriter bytes**, so hits never re-invoke the rewriter (CPU win + behaviour-stable).
|
||||
- **Write invalidation by ADDRESS RANGE OVERLAP.** A successful FC06 / FC16 response (non-exception) invalidates every cached FC03/FC04 entry whose address range `[StartAddress, StartAddress + Qty)` overlaps the write range. A write to register 105 invalidates a cached `[100..110]` read but not a cached `[200..210]` read. Exception responses do not invalidate (the write didn't take effect).
|
||||
- **Different unit IDs never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3,4})`.
|
||||
- **Cache survives backend disconnects.** A cached entry's data was valid when stored; a disconnect does not retroactively invalidate it. Invalidations during a `recovering` listener state are skipped (the write never reached the backend, the cached read remains valid).
|
||||
- **No persistence.** Process restart wipes the cache. No file/Redis backing store, no last-known-good snapshot.
|
||||
- **Hot-reload flushes the entire PLC cache.** Any tag-list change to a PLC drops every cached entry for that PLC. Per-tag flush granularity is intentionally not done in v1 — the simple correctness move is "any tag-list reload → drop all entries for the affected PLC and let them re-populate."
|
||||
- **TTL > 60 s requires `Cache.AllowLongTtl = true`.** Validation rejects reloads that set `CacheTtlMs > 60_000` without this opt-in. Prevents "left at 1 hour by accident" deployments.
|
||||
- **LRU-bounded capacity.** Each PLC's cache is capped at `Cache.MaxEntriesPerPlc` (default 1000). When full, the next insert evicts the least-recently-used entry. A background eviction loop (interval `Cache.EvictionIntervalMs`, default 5000) also scans for expired entries.
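Two of the rules in this list, the `min(TTLs)` rule and overlap-based write invalidation, can be sketched in Python as follows. This is a simplified illustration rather than the planned implementation. All names are hypothetical, cache keys are assumed to be `(unit_id, fc, start, qty)` tuples, tag widths are ignored (every tag is treated as one register), and a read that covers no configured tag is assumed to be uncached.

```python
from typing import Dict, Tuple

Key = Tuple[int, int, int, int]  # (unit_id, fc, start, qty) -- assumed key shape


def ranges_overlap(start_a: int, qty_a: int, start_b: int, qty_b: int) -> bool:
    """True when half-open ranges [start, start + qty) share at least one register."""
    return start_a < start_b + qty_b and start_b < start_a + qty_a


def effective_ttl_ms(start: int, qty: int, tag_ttls_ms: Dict[int, int]) -> int:
    """Smallest resolved TTL among tags inside the read; 0 means 'do not cache'."""
    ttls = [ttl for addr, ttl in tag_ttls_ms.items() if start <= addr < start + qty]
    return min(ttls) if ttls else 0  # a TTL of 0 in range forces the minimum to 0


def invalidate_overlapping(cache: Dict[Key, bytes], unit_id: int,
                           write_start: int, write_qty: int) -> int:
    """Drop cached reads for this unit that overlap a successful write."""
    doomed = [key for key in cache
              if key[0] == unit_id
              and ranges_overlap(key[2], key[3], write_start, write_qty)]
    for key in doomed:
        del cache[key]
    return len(doomed)  # would feed a counter such as cacheInvalidations
```

In this sketch a write to register 105 removes a cached read starting at 100 with quantity 10, while a cached read starting at 200 and any entry under a different unit ID are left alone.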
|
||||
|
||||
### Cache and the rewriter
|
||||
|
||||
The BCD rewriter runs once on the cache-miss path (the backend reader task decodes the response and stores the decoded bytes in the cache). Cache hits return pre-decoded bytes directly without re-invoking the rewriter — this is both a CPU optimisation and a correctness guarantee (any future rewriter change would not retroactively re-transform an entry that was decoded against an earlier rewriter version).
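A compact Python sketch of the read path may help make the "decode once, serve many" behaviour concrete. It is illustrative only: the `ReadPath` class, its callables, and the injectable clock are hypothetical, and the coalescing stage and LRU cap are omitted, so a miss goes straight to the backend.

```python
import time
from typing import Callable, Dict, Tuple

Key = Tuple[int, int, int, int]  # (unit_id, fc, start, qty) -- assumed key shape


class ReadPath:
    """Cache lookup first; on a miss, fetch, rewrite once, then store the result."""

    def __init__(self, fetch_from_backend: Callable[[Key], bytes],
                 rewrite: Callable[[bytes], bytes],
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch = fetch_from_backend
        self._rewrite = rewrite
        self._clock = clock
        self._entries: Dict[Key, Tuple[float, bytes]] = {}  # key -> (expiry, bytes)

    def read(self, key: Key, ttl_ms: int) -> bytes:
        now = self._clock()
        if ttl_ms > 0:
            entry = self._entries.get(key)
            if entry is not None and entry[0] > now:
                return entry[1]  # hit: already-rewritten bytes, no rewriter call
        decoded = self._rewrite(self._fetch(key))  # miss: rewriter runs exactly once
        if ttl_ms > 0:
            self._entries[key] = (now + ttl_ms / 1000.0, decoded)  # post-rewriter
        return decoded
```

Passing the clock in as a parameter is a testing convenience in this sketch; it lets expiry be exercised without sleeping.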
|
||||
|
||||
### Hot-reload semantics
|
||||
|
||||
| Change | Cache behaviour |
|
||||
|--------|----------------|
|
||||
| Tag's `CacheTtlMs` changed (any direction, 0 → N, N → 0, N → M) | Entire PLC cache is flushed; entries re-populate on demand under the new TTL. |
|
||||
| New PLC added / removed | New PLC starts with empty cache; removed PLC's cache is discarded with the multiplexer. |
|
||||
| `Cache.AllowLongTtl` flipped | Validation runs on next reload; existing entries unaffected. |
|
||||
| `Cache.MaxEntriesPerPlc` changed | Existing entries unaffected; cap applies to subsequent inserts. |
|
||||
| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues until next dispose; subsequent loops use new interval. |
|
||||
|
||||
### Counter accounting
|
||||
|
||||
- `cacheHitCount` — FC03/FC04 requests served from the cache.
|
||||
- `cacheMissCount` — FC03/FC04 requests that fell through to the coalescing/backend path. (Cache hit + Cache miss = total FC03/FC04 requests that were cache-eligible, i.e. whose resolved TTL was > 0; reads whose effective TTL is 0 increment neither.)
|
||||
- `cacheInvalidations` — count of cache entries invalidated by FC06/FC16 write responses.
|
||||
- `cacheEntryCount` — point-in-time snapshot of `ResponseCache.Count` (Tier-2 memory-watch KPI).
|
||||
- `cacheBytes` — point-in-time approximation of cached PDU bytes (Tier-2 memory-watch KPI).
|
||||
|
||||
## Rewriter — function code scope
|
||||
|
||||
The rewriter inspects and rewrites payloads only for these function codes; every other FC (coils, discrete inputs, diagnostics, exception responses) passes through byte-for-byte:
|
||||
@@ -151,6 +200,7 @@ The rewriter inspects and rewrites payloads only for these function codes; every
|
||||
- **Backend request timeout** → the per-request watchdog times out any correlation entry older than `Connection.BackendRequestTimeoutMs`, delivers Modbus exception 0x0B (Gateway Target Device Failed To Respond) with the original TxId to the upstream party, and frees the proxy TxId. **No mid-request retries** — FC06 / FC16 are non-idempotent on BCD tags (a partial-applied multi-register write could leave a 32-bit BCD tag mid-transition), so every in-flight request is one-shot. The client interprets the 0x0B as a transport failure and reconnects through its normal path.
|
||||
- **Partial-BCD overlap** → forward raw + warn (see Rewriter section).
|
||||
- **One slow PLC does not stall the rest of the fleet.** Each PLC has its own `PlcMultiplexer`, with its own backend socket, correlation map, and outbound channel; per-PLC failures are local. A slow or dead backend on one PLC only impacts that PLC's clients.
|
||||
- **Cache during backend recovery (Phase 11).** Cache hits remain valid during a `recovering` listener state — the data was fresh when cached, and recovery only affects future requests. Writes that arrive during recovery never reach the backend, so the invalidation never happens. This is consistent: the write also didn't take effect on the PLC. Cached entries simply remain until their TTL expires.
|
||||
|
||||
## Startup posture — eager, continue on per-port failure
|
||||
|
||||
@@ -193,6 +243,11 @@ Stable event names (keep these stable so log queries don't churn):
|
||||
| `mbproxy.coalesce.hit` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty`, `PartyCount` |
|
||||
| `mbproxy.coalesce.miss` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
|
||||
| `mbproxy.coalesce.dead_upstream` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
|
||||
| `mbproxy.cache.hit` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
|
||||
| `mbproxy.cache.miss` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
|
||||
| `mbproxy.cache.store` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty`, `TtlMs` |
|
||||
| `mbproxy.cache.invalidated` | Debug | `Plc`, `UnitId`, `WriteStart`, `WriteQty`, `Count` |
|
||||
| `mbproxy.cache.flushed` | Info | `Plc`, `Reason`, `Count` |
|
||||
|
||||
## Status page — read-only HTTP endpoint
|
||||
|
||||
@@ -234,6 +289,11 @@ Authentication is assumed to live at the network layer (trusted internal segment
|
||||
| `backend.coalescedHitCount` | FC03/04 requests that attached to an already-in-flight peer (Phase 10) |
|
||||
| `backend.coalescedMissCount` | FC03/04 requests that opened a fresh backend round-trip (Phase 10). `Hit + Miss` = total FC03/04 requests |
|
||||
| `backend.coalescedResponseToDeadUpstream` | Coalesced fan-out responses skipped because the attached upstream had already disconnected (Phase 10) |
|
||||
| `backend.cacheHitCount` | FC03/04 reads served from the response cache (Phase 11) |
|
||||
| `backend.cacheMissCount` | FC03/04 reads that fell through to coalescing/backend after a cache miss (Phase 11) |
|
||||
| `backend.cacheInvalidations` | Cache entries invalidated by overlapping FC06/FC16 write responses (Phase 11) |
|
||||
| `backend.cacheEntryCount` | Point-in-time snapshot of the per-PLC cache's entry count (Phase 11, Tier-2 memory-watch) |
|
||||
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC (Phase 11, Tier-2 memory-watch) |
|
||||
| `bytes.upstreamIn` / `bytes.upstreamOut` | Bytes forwarded each direction |
|
||||
|
||||
Counters are `System.Threading.Interlocked` longs read atomically per request; no locking on the read path.
|
||||
|
||||
docs/kpi.md: +14 -3
@@ -23,7 +23,7 @@ For context — every recommended addition below is *in addition to* this list.
|
||||
| Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` |
|
||||
| Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) |
|
||||
| Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` |
|
||||
| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream` |
|
||||
| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`, `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes` |
|
||||
| Per-PLC bytes | `upstreamIn`, `upstreamOut` |
|
||||
|
||||
Counters are **cumulative since process start**. A restart resets them.
|
||||
@@ -125,9 +125,9 @@ Same-key FC03/04 reads within the in-flight window attach to one another instead
|
||||
|
||||
**Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.
|
||||
|
||||
### 1.8 Response cache — **[requires Phase 11](plan/11-response-cache.md)**
|
||||
### 1.8 Response cache — **shipped in [Phase 11](plan/11-response-cache.md)**
|
||||
|
||||
After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries.
|
||||
As of Phase 11, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits are served from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default: operators opt tags in by setting `CacheTtlMs > 0` on a `BcdTagOptions` entry (or `DefaultCacheTtlMs > 0` on a `PlcOptions` entry).
|
||||
|
||||
| KPI | Definition | Source | Widget | Alert | Effort |
|
||||
|-----|------------|--------|--------|-------|--------|
|
||||
@@ -174,6 +174,17 @@ Reach for these once Tier 1 is solid. They add depth for specific operational sc
|
||||
|
||||
**Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.
|
||||
|
||||
### 2.4a Response-cache memory — **shipped in [Phase 11](plan/11-response-cache.md)**
|
||||
|
||||
When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC `MaxEntriesPerPlc` cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob.
|
||||
|
||||
| KPI | Definition | Source | Widget | Alert | Effort |
|
||||
|-----|------------|--------|--------|-------|--------|
|
||||
| `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) |
|
||||
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) |
|
||||
|
||||
**Why this matters.** Cache entries are short-lived (TTLs are typically seconds, not minutes). A `cacheEntryCount` that sits at `MaxEntriesPerPlc` for long stretches says "the LRU is constantly evicting": either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks in. `cacheBytes` is the memory-side counter: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression.
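The fleet-wide estimate above is plain multiplication, written out here in Python so it can be re-run with other caps. The 250 bytes/PDU figure is the rough size used in the paragraph, not a measured value, and the helper name is hypothetical.

```python
PLC_COUNT = 54                # fleet size used in the estimate
MAX_ENTRIES_PER_PLC = 1000    # default Cache.MaxEntriesPerPlc
ASSUMED_BYTES_PER_PDU = 250   # rough per-entry size; an assumption, not a measurement


def worst_case_cache_bytes(plcs: int, entries_per_plc: int, bytes_per_entry: int) -> int:
    """Upper bound on cached PDU bytes when every PLC cache sits at its cap."""
    return plcs * entries_per_plc * bytes_per_entry


total = worst_case_cache_bytes(PLC_COUNT, MAX_ENTRIES_PER_PLC, ASSUMED_BYTES_PER_PDU)
print(f"{total / 1_000_000:.1f} MB")  # prints "13.5 MB"
```

Raising `MaxEntriesPerPlc` tenfold under the same assumptions would put the bound at 135 MB, which is the kind of check this KPI is meant to support.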
|
||||
|
||||
### 2.4 Memory / process health
|
||||
|
||||
| KPI | Definition | Source | Widget | Alert | Effort |
|
||||
|
||||
install/mbproxy.config.template.json
@@ -33,7 +33,13 @@
|
||||
{ "Address": 1056, "Width": 32 },
|
||||
|
||||
// V2100 (octal) = decimal address 1088. 16-bit BCD setpoint.
|
||||
{ "Address": 1088, "Width": 16 }
|
||||
//
|
||||
// Phase 11: CacheTtlMs (optional) opts this tag into the response cache. With
|
||||
// CacheTtlMs > 0 set, upstream clients reading this register will see values up
|
||||
// to CacheTtlMs MILLISECONDS OLD. Setting it is your explicit acknowledgement of
// the staleness window. Default (omitted or 0) = cache disabled
|
||||
// for this tag. The cache is OFF by default for every tag.
|
||||
{ "Address": 1088, "Width": 16 /* , "CacheTtlMs": 1000 */ }
|
||||
]
|
||||
},
|
||||
|
||||
@@ -143,6 +149,40 @@
|
||||
"Enabled": true,
|
||||
"MaxParties": 32
|
||||
}
|
||||
},
|
||||
|
||||
// ── Response cache (Phase 11) — opt-in bounded-staleness cache ──────────────────
|
||||
//
|
||||
// ⚠ DESIGN-CONTRACT PIVOT: with caching enabled the proxy is no longer purely
|
||||
// transparent. Upstream FC03/FC04 reads for cache-enabled tags may return values
|
||||
// up to CacheTtlMs MILLISECONDS OLD. Operators opt tags in by setting a non-zero
|
||||
// CacheTtlMs on a BcdTagOptions entry (or DefaultCacheTtlMs on a PlcOptions entry).
|
||||
//
|
||||
// The cache is OFF BY DEFAULT for every tag. A deployment with NO TTL config (this
|
||||
// section entirely absent and no BcdTags.*.CacheTtlMs / Plcs[i].DefaultCacheTtlMs)
|
||||
// behaves IDENTICALLY to a pre-Phase-11 deployment — no behaviour change.
|
||||
//
|
||||
// AllowLongTtl — gate for any CacheTtlMs > 60_000. Reload validation
|
||||
// rejects configs that exceed 60 s without this opt-in,
|
||||
// to prevent accidentally-stale-for-an-hour deployments.
|
||||
// MaxEntriesPerPlc — LRU cap per-PLC. Past this cap, the next insert evicts
|
||||
// the least-recently-used entry. Defaults to 1000.
|
||||
// EvictionIntervalMs — background eviction tick. Scans each PLC's cache and
|
||||
// removes entries past their TTL. Defaults to 5000.
|
||||
//
|
||||
// Properties (full text in docs/design.md → "Response cache"):
|
||||
// * Cache hits SHORT-CIRCUIT coalescing entirely (cache → coalesce → backend).
|
||||
// * Successful FC06/FC16 write responses invalidate every cached FC03/FC04 entry
|
||||
// whose address range OVERLAPS the write — not just exact-key match.
|
||||
// * Multi-tag read range: effective TTL = min(TTLs). Any tag with TTL=0 in the
|
||||
// range disables caching for the whole read.
|
||||
// * Cache stores POST-rewriter bytes; hits never re-invoke the BCD rewriter.
|
||||
// * Tag-list hot-reload flushes the affected PLC's whole cache.
|
||||
// * No persistence — process restart wipes the cache.
|
||||
"Cache": {
|
||||
"AllowLongTtl": false,
|
||||
"MaxEntriesPerPlc": 1000,
|
||||
"EvictionIntervalMs": 5000
|
||||
}
|
||||
},
|
||||
|
||||
|
||||