f49e27e316
Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.
Architecture/
- Overview.md (150 lines) — listener topology, request flow, per-PLC isolation
- ConnectionModel.md (247 lines) — TxId multiplexer, watchdog, disconnect cascade
- ReadCoalescing.md (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
- ResponseCache.md (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation
Features/
- BcdRewriting.md (252 lines) — codec, CDAB, FC scope, partial-overlap policy
- HotReload.md (189 lines) — IOptionsMonitor + per-change-kind reconcile rules
Operations/
- Configuration.md (422 lines) — every Mbproxy:* option + validation rules
- StatusPage.md (334 lines) — admin endpoint surface, every JSON field
- Troubleshooting.md (364 lines) — diagnosis playbook keyed to log events
Reference/
- LogEvents.md (499 lines) — 28 events across 7 categories, grep-verified
Testing/
- Simulator.md (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk
Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.
Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
399 lines
19 KiB
Markdown
# Response Cache

The response cache is an opt-in per-tag, bounded-staleness layer that serves
FC03 and FC04 reads from in-process memory. It sits above read coalescing in
the request path so a hit avoids both the coalescing entry and the backend
round-trip entirely.

## Cache Contract

The cache is **off by default for every tag**. A resolved `CacheTtlMs` of `0`
on every BCD tag is the default state, and a deployment that ships without
any TTL configuration behaves identically to one compiled without the cache
at all: no in-memory entries are created, every FC03/FC04 read falls through
to the coalescing-then-backend path, and counters that track cache activity
stay at zero.

Operators opt a tag in by setting a positive `CacheTtlMs`. That positive
value is the explicit acknowledgement of the staleness window: the operator
is stating, "I am willing for upstream clients to see a value up to N
milliseconds old in exchange for taking the read off the backend." There is
no implicit cache enablement. There is no global cache toggle that turns
caching on for previously-uncached tags. Every cached tag is one whose
resolved TTL is positive, either from its own `CacheTtlMs` or from a positive
`PlcOptions.DefaultCacheTtlMs` that the operator set for its PLC.

This stance is the design-contract pivot the cache introduces: before it,
the proxy was purely transparent except for BCD rewriting. With the cache,
the proxy is transparent **by default**, with an opt-in cache layer the
operator can engage tag-by-tag.

## TTL Resolution Order

Each FC03/FC04 read range resolves to one effective TTL through three
tiers:

1. **Explicit per-tag.** `BcdTagOptions.CacheTtlMs` on the tag entry. A
   non-null value wins regardless of the per-PLC default. An explicit `0`
   here disables caching for that tag even when the PLC default is
   positive.
2. **Per-PLC default.** `PlcOptions.DefaultCacheTtlMs` applies to any tag
   whose explicit `CacheTtlMs` is `null` (unset). A `0` default means "no
   caching by default at this PLC."
3. **Zero.** With nothing set at either tier, the resolved TTL is `0` and
   the read is uncached.

`BcdTagMap.ResolveCacheTtlMs(startAddress, qty)` implements the per-read
resolution. It enumerates the BCD tags whose register footprints intersect
the requested range and returns the smallest positive TTL across the hits.
It returns `0` if the range covers no configured tags or if any covered tag
resolves to a TTL of `0`.

```csharp
public int ResolveCacheTtlMs(ushort startAddress, ushort qty)
{
    // No configured tag intersects the range: the read is uncached.
    if (!TryGetForRange(startAddress, qty, out var hits) || hits.Count == 0)
        return 0;

    int min = int.MaxValue;
    foreach (var hit in hits)
    {
        int ttl = hit.Tag.CacheTtlMs;
        // Any uncached tag in the range makes the whole read uncached.
        if (ttl <= 0) return 0;
        if (ttl < min) min = ttl;
    }
    return min == int.MaxValue ? 0 : min;
}
```

The `hit.Tag.CacheTtlMs` value resolved on each `BcdTag` already reflects
the explicit-then-default order: the options binder resolves the per-tag
override against the per-PLC default at config build time, so the runtime
hot path sees a single integer per tag.

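The explicit-then-default order can be sketched as a small pure function.
This is a minimal illustration rather than the project's options binder:
`TtlResolution.Resolve` is a hypothetical name, and modelling
`PlcOptions.DefaultCacheTtlMs` as nullable is an assumption made so that the
third tier is visible.

```csharp
using System;

// Hypothetical helper: illustrates the three-tier rule only.
public static class TtlResolution
{
    // explicitTtlMs models BcdTagOptions.CacheTtlMs (null = unset).
    // plcDefaultTtlMs models PlcOptions.DefaultCacheTtlMs (assumed nullable here).
    public static int Resolve(int? explicitTtlMs, int? plcDefaultTtlMs)
    {
        // Tier 1: an explicit per-tag value wins, including an explicit 0.
        if (explicitTtlMs.HasValue) return explicitTtlMs.Value;
        // Tier 2: fall back to the per-PLC default when the tag is unset.
        if (plcDefaultTtlMs.HasValue) return plcDefaultTtlMs.Value;
        // Tier 3: nothing set at either tier, so the read is uncached.
        return 0;
    }
}
```

Under these assumptions, `Resolve(0, 500)` yields `0` (the explicit opt-out
beats a positive PLC default) and `Resolve(null, 500)` yields `500`.
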
## Multi-Tag Range TTL Rule

When a single FC03/FC04 read covers multiple configured BCD tags, the
effective TTL is the minimum across them:

```text
range covers tags { A:TTL=500, B:TTL=2000, C:TTL=100 }  → effective TTL = 100
range covers tags { A:TTL=500, B:TTL=0 (uncached) }     → effective TTL = 0
range covers tags { A:TTL=500 }                         → effective TTL = 500
range covers no configured tags                         → effective TTL = 0
```

If any covered tag has `CacheTtlMs = 0`, the whole read is uncached. The
rationale is conservative-by-design: a multi-tag read whose narrowest TTL
is, for example, 100 ms cannot be served safely from an entry that was
stored under a tag with TTL 2 s, because that entry's freshness was only
guaranteed by the longer window. Rather than partition a range read across
heterogeneous TTLs or invent inheritance rules that an operator would have
to reason about per-deployment, the cache applies the narrowest covered TTL
to the whole read and refuses to serve any multi-tag read whose narrowest
covered TTL is zero. Operators who want a tag cached in isolation but
uncached when read alongside an uncached neighbour get the expected
behaviour by leaving the neighbour at `CacheTtlMs = 0`.

A read whose range covers no configured BCD tags also resolves to `0`.
There is nothing to be conservative about because the cache only serves
ranges that contain rewriter-tracked tags; a read of plain non-BCD
registers does not engage the cache regardless of any per-PLC default.

## Lookup Order

The multiplexer's FC03/FC04 path consults three tiers in fixed order:

1. **Cache.** When `_ctx.Cache` is wired and `BcdTagMap.ResolveCacheTtlMs`
   returns a positive TTL for the read range, `ResponseCache.TryGet` is
   called against a `CacheKey(unitId, fc, startAddress, qty)`. A hit
   splices the cached payload onto a fresh MBAP header carrying the
   original upstream TxId, pushes the frame onto that pipe's response
   channel, and **returns without engaging coalescing or the backend at
   all**.
2. **Coalesce.** On a cache miss (or when the resolved TTL is zero), the
   request is offered to `InFlightByKeyMap.TryAttachOrCreate`. A hit
   attaches the new party to a peer's in-flight request.
3. **Backend.** On a coalescing miss, the request opens a proxy TxId,
   registers a `CorrelationMap` entry, runs the BCD rewriter on any FC06
   or FC16 payload, and queues the frame onto the outbound channel.

The cache check happens **before** the multiplexer's
`EnsureBackendConnectedAsync` call. A cache hit serves the upstream even
when the backend socket is currently disconnected or recovering. This is
not an accident: the cached payload's freshness is bounded by its TTL,
not by the liveness of the backend socket. See
[`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) for
the operator view of cache-served reads during a backend outage.

## Storage Format: Post-Rewriter Bytes

`CacheEntry.PduBytes` holds the **post-rewriter response PDU body**: the
function code byte, the byte count, and the rewriter-decoded register
data, with no MBAP header. The backend reader task decodes the response
through `BcdPduPipeline` first and only then hands the rewritten payload
to `ResponseCache.Set`.

```csharp
internal sealed record CacheEntry(
    byte[] PduBytes,
    DateTimeOffset CachedAtUtc,
    DateTimeOffset ExpiresAtUtc,
    int Length,
    long LastUsedTick);
```

Storing post-rewriter bytes is both a CPU optimisation and a correctness
guarantee:

- **CPU.** A cache hit returns ready-to-send bytes. The rewriter does not
  re-run per hit; only the MBAP header is regenerated to carry the
  upstream's original TxId.
- **Correctness.** An entry decoded against an earlier rewriter version
  never gets retroactively re-transformed against a newer version. If the
  rewriter's behaviour changes mid-process (it does not today, but the
  guarantee is durable across future changes), already-cached entries
  age out under their TTL and are replaced by fresh entries decoded
  through the new rewriter. A bidirectional re-encode never happens to an
  already-stored entry.

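As a rough illustration of the hit path, the sketch below prepends a fresh
MBAP header to a stored PDU. The 7-byte header layout (transaction id,
protocol id `0`, length, unit id) follows standard Modbus TCP framing;
`MbapFraming.BuildFrame` is a hypothetical helper, not a type taken from the
proxy's source.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper: shows header regeneration around an unchanged PDU.
public static class MbapFraming
{
    // Builds a Modbus TCP frame: 7-byte MBAP header followed by the PDU bytes.
    public static byte[] BuildFrame(ushort upstreamTxId, byte unitId, ReadOnlySpan<byte> pdu)
    {
        var frame = new byte[7 + pdu.Length];
        // Transaction id: the upstream's original TxId, big-endian.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), upstreamTxId);
        // Protocol id: always 0 for Modbus.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);
        // Length: number of bytes that follow this field (unit id + PDU).
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(pdu.Length + 1));
        frame[6] = unitId;
        pdu.CopyTo(frame.AsSpan(7));
        return frame;
    }
}
```

Because the stored bytes are already rewriter-decoded, a helper of this
shape only copies them; nothing in the PDU is touched per hit.
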
## Write Invalidation by Address Range Overlap

A successful (non-exception) FC06 or FC16 response invalidates every
cached FC03 or FC04 entry whose address range
`[StartAddress, StartAddress + Qty)` overlaps the write range
`[writeStart, writeStart + writeQty)`. The pure overlap math lives in
`CacheInvalidator.FindOverlapping`:

```csharp
int writeEnd = writeStart + writeQty;   // half-open upper bound

foreach (var key in haystack)
{
    if (key.UnitId != unitId) continue;
    if (key.Fc != 0x03 && key.Fc != 0x04) continue;

    int keyEnd = key.StartAddress + key.Qty;
    // Overlap iff writeStart < keyEnd AND key.StartAddress < writeEnd.
    if (writeStart < keyEnd && key.StartAddress < writeEnd)
        hits.Add(key);
}
```

Worked examples on a single unit ID:

```text
Write to register 105 (qty=1)
  └─ invalidates cached FC03 [100..110): register 105 is inside the cached range
  └─ leaves cached FC03 [200..210) untouched

Write to registers [10..15) (qty=5)
  └─ leaves cached FC03 [15..20) untouched: half-open intervals, 15 is not in [10..15)

Write to registers [98..108) (qty=10)
  └─ invalidates cached FC03 [100..110): ranges overlap on [100..108)
```

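The worked examples can be checked with a standalone version of the
predicate. This is a sketch under the same half-open convention;
`RangeOverlap.Overlaps` is a hypothetical helper and does not reproduce the
signature of `CacheInvalidator.FindOverlapping`.

```csharp
// Hypothetical helper: the half-open interval test in isolation.
public static class RangeOverlap
{
    // True when [aStart, aStart + aQty) and [bStart, bStart + bQty) share a register.
    public static bool Overlaps(ushort aStart, ushort aQty, ushort bStart, ushort bQty)
    {
        // ushort operands promote to int, so sums near 65535 cannot wrap.
        int aEnd = aStart + aQty;
        int bEnd = bStart + bQty;
        return aStart < bEnd && bStart < aEnd;
    }
}
```

With this helper, a write at `105` (qty 1) overlaps a cached `[100..110)`,
while a write to `[10..15)` does not overlap a cached `[15..20)`.
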
Three properties of the invalidator deserve calling out:

- **Exception responses do not invalidate.** A Modbus exception (code 01,
  02, 03, 04, or any other) means the write did not take effect on the
  PLC. The cached read is still consistent with the device, so the
  invalidator is not engaged.
- **Different unit IDs never invalidate each other.** In multi-drop and
  gateway setups, the unit IDs behind a shared socket address logically
  separate Modbus tables, so a write to one unit says nothing about
  another. `CacheKey.UnitId` discriminates.
- **Only FC03 and FC04 entries are evicted.** The cache never stores write
  responses, so the invalidator's function-code filter is defensive
  rather than load-bearing.

## Bounded Capacity (LRU)

Each `ResponseCache` instance is capped at `Cache.MaxEntriesPerPlc`
(default 1000). When the dictionary is at the cap and a fresh insert
arrives, `EvictLeastRecentlyUsed` walks the entries and removes the one
with the smallest `CacheEntry.LastUsedTick`. The linear scan is
intentional: at 1000 entries the scan is cheaper than the network
round-trip the cache is saving, and a sorted secondary structure would
add complexity for no measurable win.

`LastUsedTick` is a monotonic 64-bit counter incremented on every hit and
every fresh insert. Using the counter rather than `DateTimeOffset.UtcNow`
keeps the hot path free of clock calls and survives wall-clock skew.

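A minimal sketch of tick-based LRU with a linear eviction scan follows. It
is single-threaded and generic for brevity; `TickLruCache` and its members
are hypothetical names and do not reproduce `ResponseCache`, whose entries
also carry expiry and length fields.

```csharp
using System.Collections.Generic;

// Hypothetical, single-threaded illustration of tick-based LRU eviction.
public sealed class TickLruCache<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, (TValue Value, long LastUsedTick)> _entries = new();
    private readonly int _maxEntries;
    private long _tick; // monotonic counter, bumped on every hit and every insert

    public TickLruCache(int maxEntries) => _maxEntries = maxEntries;

    public int Count => _entries.Count;

    public bool TryGet(TKey key, out TValue value)
    {
        if (_entries.TryGetValue(key, out var entry))
        {
            _entries[key] = (entry.Value, ++_tick); // a hit refreshes recency
            value = entry.Value;
            return true;
        }
        value = default!;
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        // Evict only when a new key would push the dictionary past the cap.
        if (!_entries.ContainsKey(key) && _entries.Count >= _maxEntries)
            EvictLeastRecentlyUsed();
        _entries[key] = (value, ++_tick);
    }

    private void EvictLeastRecentlyUsed()
    {
        bool found = false;
        TKey oldestKey = default!;
        long oldestTick = long.MaxValue;
        foreach (var pair in _entries) // linear scan for the smallest tick
        {
            if (pair.Value.LastUsedTick < oldestTick)
            {
                oldestTick = pair.Value.LastUsedTick;
                oldestKey = pair.Key;
                found = true;
            }
        }
        if (found) _entries.Remove(oldestKey);
    }
}
```

The counter never consults the clock, so recency ordering in this sketch is
unaffected by wall-clock adjustments.
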
A background task drives proactive expiry. The constructor starts a
`PeriodicTimer` at `Cache.EvictionIntervalMs` (default 5000 ms; values
under 100 ms are clamped to 100 ms to prevent tight loops) and the
eviction loop sweeps every entry whose `ExpiresAtUtc` has passed. The
loop is the safety net that keeps abandoned entries (say, those for a
PLC whose upstream clients have all dropped) from holding memory until
process exit. Lazy expiry on `TryGet` still removes entries on demand
when traffic is steady; the background loop only matters under low- or
zero-traffic conditions.

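The clamp and the sweep can be sketched as two pure helpers. `ExpirySweep`
is a hypothetical name, and treating an entry as expired when
`ExpiresAtUtc <= now` (rather than strictly earlier) is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helpers: interval floor and expiry sweep in isolation.
public static class ExpirySweep
{
    // Mirrors the documented floor: intervals under 100 ms are raised to 100 ms.
    public static TimeSpan ClampInterval(int evictionIntervalMs) =>
        TimeSpan.FromMilliseconds(Math.Max(evictionIntervalMs, 100));

    // Removes every key whose expiry is at or before 'now'; returns the number removed.
    public static int RemoveExpired<TKey>(
        Dictionary<TKey, DateTimeOffset> expiresAtUtc, DateTimeOffset now) where TKey : notnull
    {
        var expired = new List<TKey>();
        foreach (var pair in expiresAtUtc)
        {
            if (pair.Value <= now) expired.Add(pair.Key);
        }
        // Remove after the enumeration so the dictionary is not mutated mid-loop.
        foreach (var key in expired)
        {
            expiresAtUtc.Remove(key);
        }
        return expired.Count;
    }
}
```

In a background loop, a `PeriodicTimer` constructed with
`ClampInterval(...)` would call `RemoveExpired` after each
`WaitForNextTickAsync` tick.
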
## Long-TTL Safety Gate

`MbproxyOptionsValidator.ValidateCacheTtl` rejects any explicit
`CacheTtlMs > 60_000` unless `Cache.AllowLongTtl = true`. The same gate
applies to `PlcOptions.DefaultCacheTtlMs`. The rejection runs at config
bind / hot-reload time, so a misconfigured `appsettings.json` fails fast
before the cache sees the value.

The gate exists to catch the "left at 1 hour by accident" mistake: a
deployment where a developer set `CacheTtlMs = 3_600_000` for a debugging
session and the value survived into production. Operators who legitimately
need long TTLs (slow-moving setpoints, configuration values that change
once per shift) flip `Cache.AllowLongTtl` to `true` as the explicit
acknowledgement that the long staleness window is intentional.

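The gate's rule can be written as a single comparison. The sketch below is
an illustration only: `LongTtlGate.TryAccept`, its parameters, and the error
text are hypothetical and are not the body of
`MbproxyOptionsValidator.ValidateCacheTtl`.

```csharp
// Hypothetical helper: the 60-second threshold rule in isolation.
public static class LongTtlGate
{
    public const int LongTtlThresholdMs = 60_000;

    // Returns true when the TTL is acceptable; otherwise fills 'error' and returns false.
    public static bool TryAccept(string optionPath, int ttlMs, bool allowLongTtl, out string error)
    {
        // Only values strictly above the threshold need the explicit opt-in.
        if (ttlMs > LongTtlThresholdMs && !allowLongTtl)
        {
            error = $"{optionPath} = {ttlMs} ms exceeds {LongTtlThresholdMs} ms; " +
                    "set Cache.AllowLongTtl = true to accept the longer staleness window.";
            return false;
        }
        error = string.Empty;
        return true;
    }
}
```

Under this rule a TTL of exactly `60_000` passes without the flag, and
`3_600_000` passes only when `allowLongTtl` is `true`.
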
## Cache and the Rewriter

The BCD rewriter runs **once** on the cache-miss path: the backend reader
task decodes the response through `BcdPduPipeline` and only then hands the
decoded bytes to `ResponseCache.Set`. Cache hits return the stored
post-rewriter bytes directly.

This division has two consequences worth restating:

- **The rewriter cost is amortised across hits.** A high cache hit ratio
  on a tag-dense PLC drops the per-request rewriter cost from "every
  response" to "every cache-miss response," which on a hot register at
  TTL=500 ms is one-in-many.
- **The cached payload is decoupled from the rewriter implementation.**
  An entry stored under one rewriter does not get re-transformed if the
  rewriter changes. Entries age out under TTL and are replaced by fresh
  entries decoded under the current rewriter; there is no in-place
  recomputation pass.

## Hot-Reload Semantics

Configuration changes propagate through `IOptionsMonitor<MbproxyOptions>`.
The cache reacts to five kinds of change:

| Change | Cache behaviour |
|--------|----------------|
| Tag's `CacheTtlMs` changed (`0 → N`, `N → 0`, `N → M`) | Entire PLC cache is flushed via `ResponseCache.Clear()`; entries re-populate on demand under the new TTL. |
| New PLC added / removed | New PLC starts with an empty cache; removed PLC's `ResponseCache` is disposed with the multiplexer. |
| `Cache.AllowLongTtl` flipped | Validation runs on the next reload only; existing entries are unaffected. |
| `Cache.MaxEntriesPerPlc` changed | Existing entries are unaffected; the new cap applies to subsequent inserts. |
| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues with its old period; subsequent loops use the new interval. |

Per-tag flush granularity is intentionally not implemented. The clean move
is "any tag-list change to a PLC → drop every entry for that PLC and let
the natural traffic re-populate." Tracking which keys correspond to which
tag IDs adds bookkeeping for no operational win: a tag-list reload is
already a once-in-a-while event, and the rebuild cost on the affected
PLC's hot keys is one round-trip per key under traffic.

See [`../Features/HotReload.md`](../Features/HotReload.md) for the
broader `IOptionsMonitor` propagation model.

## Cache Survives Backend Disconnects

A cached entry's data was valid when stored. A subsequent backend
disconnect does not retroactively invalidate it: the value the upstream
client sees on a hit is the value the PLC reported within the TTL
window, irrespective of whether the backend socket is up at the moment
of the hit. This is the cache's most operationally visible property
during PLC outages: upstream consumers that read hot tags within the
cache window continue to receive responses while the listener supervisor
is in `recovering` state.

The companion rule on the write side keeps the invariant consistent:
**invalidations during a `recovering` listener state are skipped**. If
the backend is down, an FC06 or FC16 write did not reach the PLC, so the
cached read is still consistent with the device's actual state. Skipping
the invalidation matches reality: the write did not take effect, so the
read is not stale.

## No Persistence

The cache is purely in-memory. Process restart wipes every entry. There
is no file-backed snapshot, no Redis or other external store, and no
last-known-good replay. A restarted service rebuilds its cache from
fresh backend round-trips driven by upstream traffic, exactly as it
would after a TTL-induced flush.

This is intentional, for two reasons. First, the staleness contract is
bounded by `CacheTtlMs` measured from when the data was first read, and a
persisted entry would re-emerge with an unknown wall-clock age: every
invariant the cache offers would need a freshness field, freshness
arithmetic on load, and recovery against a clock that may have jumped.
Second, the operational model is that the proxy is a stateless
transformer; treating its cache as durable state would change the
deployment story for no measurable production benefit.

## Counter Accounting

`ProxyCounters` exposes five cache counters per PLC, plus a hit ratio
derived from two of them, surfaced on the status page as both per-PLC and
fleet-aggregate values:

- **`cacheHitCount`**: FC03/FC04 requests served from the cache. Bumped
  inside `OnUpstreamFrameAsync` when `ResponseCache.TryGet` returns true.
- **`cacheMissCount`**: FC03/FC04 requests whose resolved TTL was
  positive but whose key was not in the cache (or whose entry had
  expired). The identity `cacheHitCount + cacheMissCount = total
  cache-eligible FC03/FC04 requests` holds; reads whose effective TTL
  is `0` (uncached) increment neither counter.
- **`cacheHitRatio`**: derived on the status page snapshot as
  `cacheHitCount / (cacheHitCount + cacheMissCount)` when the
  denominator is non-zero.
- **`cacheInvalidations`**: count of cache entries invalidated by
  successful FC06/FC16 write responses, summed across writes.
- **`cacheEntryCount`**: point-in-time snapshot of
  `ResponseCache.Count` (Tier-2 memory-watch KPI).
- **`cacheBytes`**: point-in-time approximation of cached PDU bytes,
  computed as the running sum of `CacheEntry.Length` across entries
  (Tier-2 memory-watch KPI).

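The ratio derivation is simple enough to show in full. In this sketch,
`CacheStats.HitRatio` is a hypothetical helper, and returning `null` for a
zero denominator is an assumption: the text above only says the ratio is
computed when the denominator is non-zero.

```csharp
// Hypothetical helper: the hit-ratio derivation in isolation.
public static class CacheStats
{
    // Returns hits / (hits + misses), or null when no cache-eligible read has been counted.
    public static double? HitRatio(long hits, long misses)
    {
        long total = hits + misses;
        if (total == 0) return null; // assumed behaviour for an empty denominator
        return (double)hits / total;
    }
}
```

Reads whose effective TTL is `0` never reach either counter, so they do not
dilute a ratio computed this way.
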
The structured log events `mbproxy.cache.hit`, `mbproxy.cache.miss`,
`mbproxy.cache.store`, `mbproxy.cache.invalidated`, and
`mbproxy.cache.flushed` (defined in `CacheLogEvents`) mirror the counter
increments at Debug level for incident-time diagnosis. Counters are the
steady-state observability surface; the events are for tracing one
request through the cache when something looks wrong. See
[`../Operations/StatusPage.md`](../Operations/StatusPage.md) and
[`../Reference/LogEvents.md`](../Reference/LogEvents.md).

## Design-Contract Note

The cache changes the proxy's posture from "purely transparent except
for BCD rewriting" to "transparent by default, with an opt-in cache
layer." The transition is deliberate and operator-driven: setting
`CacheTtlMs > 0` on a tag is the explicit consent to the staleness
window, and a deployment that ships no positive TTLs is observationally
indistinguishable from one compiled without the cache code path.

There is no global switch, no implicit warm-up, and no behavioural
divergence from the transparent baseline until the operator opts in
tag-by-tag. The cache is the only place in the proxy where an upstream
read can resolve to a value that did not just round-trip the wire, and
its engagement is gated entirely by the per-tag and per-PLC TTL
configuration described above.

## Related Documentation

- [`./ConnectionModel.md`](./ConnectionModel.md): TxId multiplexing,
  correlation map, and the backend socket the cache short-circuits on a
  hit.
- [`./ReadCoalescing.md`](./ReadCoalescing.md): sits below the cache in
  the lookup order; cache hits short-circuit coalescing entirely.
- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md): the
  `BcdPduPipeline` whose post-decode bytes the cache stores.
- [`../Features/HotReload.md`](../Features/HotReload.md): the
  `IOptionsMonitor` propagation that drives the per-PLC flush on
  tag-list change.
- [`../Operations/Configuration.md`](../Operations/Configuration.md):
  binding for `BcdTagOptions.CacheTtlMs`,
  `PlcOptions.DefaultCacheTtlMs`, and the `Cache` section
  (`AllowLongTtl`, `MaxEntriesPerPlc`, `EvictionIntervalMs`).
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md): exposes
  `cacheHitCount`, `cacheMissCount`, `cacheHitRatio`,
  `cacheInvalidations`, `cacheEntryCount`, and `cacheBytes`.
- [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md):
  the operator view of cache-served reads while a backend is in
  `recovering` state.
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md): full
  `mbproxy.cache.*` event catalogue with event IDs.
- [`../Testing/Simulator.md`](../Testing/Simulator.md): the
  `pymodbus` DL205 stand-in used by the end-to-end cache tests.
- [`../design.md`](../design.md): canonical design decisions and
  rationale.