Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.
Architecture/
- Overview.md (150 lines) — listener topology, request flow, per-PLC isolation
- ConnectionModel.md (247 lines) — TxId multiplexer, watchdog, disconnect cascade
- ReadCoalescing.md (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
- ResponseCache.md (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation
Features/
- BcdRewriting.md (252 lines) — codec, CDAB, FC scope, partial-overlap policy
- HotReload.md (189 lines) — IOptionsMonitor + per-change-kind reconcile rules
Operations/
- Configuration.md (422 lines) — every Mbproxy:* option + validation rules
- StatusPage.md (334 lines) — admin endpoint surface, every JSON field
- Troubleshooting.md (364 lines) — diagnosis playbook keyed to log events
Reference/
- LogEvents.md (499 lines) — 28 events across 7 categories, grep-verified
Testing/
- Simulator.md (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk
Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.
Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Response Cache
The response cache is an opt-in per-tag, bounded-staleness layer that serves FC03 and FC04 reads from in-process memory. It sits above read coalescing in the request path so a hit avoids both the coalescing entry and the backend round-trip entirely.
Cache Contract
The cache is off by default for every tag. CacheTtlMs = 0 on every BCD
tag is the default state, and a deployment that ships without any TTL
configuration behaves identically to one compiled without the cache at all
— no in-memory entries are created, every FC03/FC04 read falls through to
the coalescing-then-backend path, and counters that track cache activity
stay at zero.
Operators opt a tag in by setting a positive CacheTtlMs. That positive
value is the explicit acknowledgement of the staleness window: the operator
is stating, "I am willing for upstream clients to see a value up to N
milliseconds old in exchange for taking the read off the backend." There is
no implicit cache enablement. There is no global cache toggle that turns
caching on for previously-uncached tags. Every cached tag is one whose
configuration has a positive TTL on its line.
This stance is the design-contract pivot the cache introduces: before it, the proxy is purely transparent except for BCD rewriting. With the cache, the proxy is transparent by default, with an opt-in cache layer the operator can engage tag-by-tag.
TTL Resolution Order
Each FC03/FC04 read range resolves to one effective TTL through three tiers:
- Explicit per-tag. BcdTagOptions.CacheTtlMs on the tag entry. A non-null value wins regardless of the per-PLC default. An explicit 0 here disables caching for that tag even when the PLC default is positive.
- Per-PLC default. PlcOptions.DefaultCacheTtlMs applies to any tag whose explicit CacheTtlMs is null (unset). A 0 default means "no caching by default at this PLC."
- Zero. With nothing set at either tier, the resolved TTL is 0 and the read is uncached.
BcdTagMap.ResolveCacheTtlMs(startAddress, qty) implements the per-read
resolution. It enumerates the BCD tags whose register footprints intersect
the requested range and returns the smallest positive TTL across the hits.
It returns 0 if the range covers no configured tags, or if any covered tag
resolves to a TTL of zero or less.
public int ResolveCacheTtlMs(ushort startAddress, ushort qty)
{
    if (!TryGetForRange(startAddress, qty, out var hits) || hits.Count == 0)
        return 0;
    int min = int.MaxValue;
    foreach (var hit in hits)
    {
        int ttl = hit.Tag.CacheTtlMs;
        if (ttl <= 0) return 0;   // one uncached tag makes the whole read uncached
        if (ttl < min) min = ttl; // otherwise track the narrowest TTL
    }
    return min == int.MaxValue ? 0 : min;
}
The hit.Tag.CacheTtlMs value resolved on each BcdTag already reflects
the explicit-then-default order — the options binder resolves the per-tag
override against the per-PLC default at config build time, so the runtime
hot path sees a single integer per tag.
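The explicit-then-default order described above can be sketched as a single null-coalescing step. This is only an illustration: the helper name ResolveEffectiveTtlMs is hypothetical, the per-PLC default is assumed to be nullable, and treating negative values as uncached is an assumption, since the real binder is not shown here.

```csharp
using System;

// Hypothetical helper; not the proxy's actual options binder.
public static class TtlResolution
{
    // tagTtlMs stands in for BcdTagOptions.CacheTtlMs (null = unset).
    // plcDefaultTtlMs stands in for PlcOptions.DefaultCacheTtlMs (assumed nullable).
    public static int ResolveEffectiveTtlMs(int? tagTtlMs, int? plcDefaultTtlMs)
    {
        // Tier 1: explicit per-tag value, including an explicit 0.
        // Tier 2: per-PLC default. Tier 3: zero (uncached).
        int resolved = tagTtlMs ?? plcDefaultTtlMs ?? 0;

        // Assumption: a negative value is treated the same as 0.
        return Math.Max(resolved, 0);
    }
}
```

Under these assumptions, an explicit 0 on the tag beats a positive PLC default because the null-coalescing operator only falls through on null, not on zero.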
Multi-Tag Range TTL Rule
When a single FC03/FC04 read covers multiple configured BCD tags, the effective TTL is the minimum across them:
range covers tags { A:TTL=500, B:TTL=2000, C:TTL=100 } → effective TTL = 100
range covers tags { A:TTL=500, B:TTL=0 (uncached) } → effective TTL = 0
range covers tags { A:TTL=500 } → effective TTL = 500
range covers no configured tags → effective TTL = 0
If any covered tag has CacheTtlMs = 0, the whole read is uncached. The
rationale is conservative-by-design: a multi-tag read whose narrowest TTL
is, for example, 100 ms cannot be served safely from an entry that was
stored under a tag with TTL 2 s, because that entry's freshness was only
guaranteed by the longer window. Rather than partition a range read across
heterogeneous TTLs or invent inheritance rules that an operator would have
to reason about per-deployment, the cache refuses to serve any multi-tag
read whose narrowest covered TTL is zero. Operators who want a tag cached
in isolation but uncached when read alongside an uncached neighbour get the
expected behaviour by leaving the neighbour at CacheTtlMs = 0.
A read whose range covers no configured BCD tags also resolves to 0.
There is nothing to be conservative about because the cache only serves
ranges that contain rewriter-tracked tags — a read of plain non-BCD
registers does not engage the cache regardless of any per-PLC default.
Lookup Order
The multiplexer's FC03/FC04 path consults three tiers in fixed order:
- Cache. When _ctx.Cache is wired and BcdTagMap.ResolveCacheTtlMs returns a positive TTL for the read range, ResponseCache.TryGet is called against a CacheKey(unitId, fc, startAddress, qty). A hit splices the cached payload onto a fresh MBAP header carrying the original upstream TxId, pushes the frame onto that pipe's response channel, and returns without engaging coalescing or the backend at all.
- Coalesce. On a cache miss (or when the resolved TTL is zero), the request is offered to InFlightByKeyMap.TryAttachOrCreate. A hit attaches the new party to a peer's in-flight request.
- Backend. On a coalescing miss, the request opens a proxy TxId, registers a CorrelationMap entry, runs the BCD rewriter on any FC06 or FC16 payload, and queues the frame onto the outbound channel.
The cache check happens before the multiplexer's
EnsureBackendConnectedAsync call. A cache hit serves the upstream even
when the backend socket is currently disconnected or recovering. This is
not an accident — the cached payload's freshness is bounded by its TTL,
not by the liveness of the backend socket. See
../Operations/Troubleshooting.md for
the operator view of cache-served reads during a backend outage.
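The fixed tier order can be illustrated with a small routing function. This is a sketch, not the multiplexer's code: ServedBy, LookupOrder.Route, and the two delegates are hypothetical stand-ins for ResponseCache.TryGet and InFlightByKeyMap.TryAttachOrCreate.

```csharp
using System;

// Hypothetical names for illustration only.
public enum ServedBy { Cache, CoalescedPeer, Backend }

public static class LookupOrder
{
    public static ServedBy Route(int effectiveTtlMs, Func<bool> tryCache, Func<bool> tryAttach)
    {
        // Tier 1: the cache is consulted only when the resolved TTL is positive.
        if (effectiveTtlMs > 0 && tryCache()) return ServedBy.Cache;

        // Tier 2: attach to a peer's in-flight request if one exists.
        if (tryAttach()) return ServedBy.CoalescedPeer;

        // Tier 3: fall through to a fresh backend round-trip.
        return ServedBy.Backend;
    }
}
```

Note that with a zero TTL the cache delegate is never invoked, which matches the statement that uncached reads skip straight to coalescing.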
Storage Format: Post-Rewriter Bytes
CacheEntry.PduBytes holds the post-rewriter response PDU body — the
function code byte, the byte count, and the rewriter-decoded register
data, with no MBAP header. The backend reader task decodes the response
through BcdPduPipeline first and only then hands the rewritten payload
to ResponseCache.Set.
internal sealed record CacheEntry(
    byte[] PduBytes,
    DateTimeOffset CachedAtUtc,
    DateTimeOffset ExpiresAtUtc,
    int Length,
    long LastUsedTick);
Storing post-rewriter bytes is both a CPU optimisation and a correctness guarantee:
- CPU. A cache hit returns ready-to-send bytes. The rewriter does not re-run per hit; only the MBAP header is regenerated to carry the upstream's original TxId.
- Correctness. An entry decoded against an earlier rewriter version never gets retroactively re-transformed against a newer version. If the rewriter's behaviour changes mid-process (it does not today, but the guarantee is durable across future changes), in-flight cached entries age out under their TTL and are replaced by fresh entries decoded through the new rewriter. A bidirectional re-encode never happens to an already-stored entry.
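To make the "only the MBAP header is regenerated" step concrete, the following sketch rebuilds a Modbus TCP frame around a stored PDU. The 7-byte MBAP layout (transaction ID, protocol ID 0, length, unit ID) comes from the Modbus TCP specification; the MbapFraming class and BuildResponseFrame method are hypothetical names, not the proxy's own.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical helper; the proxy's real framing code is not shown in this document.
public static class MbapFraming
{
    public static byte[] BuildResponseFrame(ushort upstreamTxId, byte unitId, ReadOnlySpan<byte> pdu)
    {
        var frame = new byte[7 + pdu.Length];

        // Bytes 0-1: transaction ID, taken from the upstream request.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), upstreamTxId);
        // Bytes 2-3: protocol ID, always 0 for Modbus.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);
        // Bytes 4-5: length of everything after this field (unit ID + PDU).
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(1 + pdu.Length));
        // Byte 6: unit ID.
        frame[6] = unitId;

        // The cached post-rewriter PDU is copied verbatim; no decode runs here.
        pdu.CopyTo(frame.AsSpan(7));
        return frame;
    }
}
```

The cached bytes are never modified, so the same stored PDU can serve many upstream requests that differ only in TxId.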
Write Invalidation by Address Range Overlap
A successful (non-exception) FC06 or FC16 response invalidates every
cached FC03 or FC04 entry whose address range
[StartAddress, StartAddress + Qty) overlaps the write range
[writeStart, writeStart + writeQty). The pure overlap math lives in
CacheInvalidator.FindOverlapping:
int writeEnd = writeStart + writeQty; // half-open upper bound
foreach (var key in haystack)
{
    if (key.UnitId != unitId) continue;
    if (key.Fc != 0x03 && key.Fc != 0x04) continue;
    int keyEnd = key.StartAddress + key.Qty;
    // Overlap iff writeStart < keyEnd AND key.StartAddress < writeEnd.
    if (writeStart < keyEnd && key.StartAddress < writeEnd)
        hits.Add(key);
}
Worked examples on a single unit ID:
Write to register 105 (qty=1)
└─ invalidates cached FC03 [100..110) — register 105 is inside the cached range
└─ leaves cached FC03 [200..210) untouched
Write to registers [10..15) (qty=5)
└─ leaves cached FC03 [15..20) untouched — half-open intervals, 15 is not in [10..15)
Write to registers [98..108) (qty=10)
└─ invalidates cached FC03 [100..110) — ranges overlap on [100..108)
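The worked examples can be checked with a standalone version of the half-open overlap test. RangeOverlap.Overlaps is a hypothetical helper that isolates only the interval comparison; the unit ID and function-code filters from CacheInvalidator.FindOverlapping are left out.

```csharp
// Hypothetical helper isolating the interval math.
public static class RangeOverlap
{
    // Both ranges are half-open: [start, start + qty).
    public static bool Overlaps(int keyStart, int keyQty, int writeStart, int writeQty)
    {
        int keyEnd = keyStart + keyQty;
        int writeEnd = writeStart + writeQty;
        return writeStart < keyEnd && keyStart < writeEnd;
    }
}

// Usage, mirroring the examples above:
//   RangeOverlap.Overlaps(100, 10, 105, 1)   -> true  (105 is inside [100..110))
//   RangeOverlap.Overlaps(200, 10, 105, 1)   -> false
//   RangeOverlap.Overlaps(15, 5, 10, 5)      -> false (15 is not in [10..15))
//   RangeOverlap.Overlaps(100, 10, 98, 10)   -> true  (overlap on [100..108))
```

Using int rather than ushort for the sums avoids overflow when a range ends at the top of the 16-bit address space.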
Three properties of the invalidator deserve calling out:
- Exception responses do not invalidate. A Modbus exception (code 01, 02, 03, 04, or any other) means the write did not take effect on the PLC. The cached read is still consistent with the device, so the invalidator is not engaged.
- Different unit IDs never invalidate each other. Multi-drop and gateway personalities behind a shared socket address logically separate Modbus tables. CacheKey.UnitId discriminates.
- Only FC03 and FC04 entries are evicted. The cache never stores write responses, so the invalidator's function-code filter is defensive rather than load-bearing.
Bounded Capacity (LRU)
Each ResponseCache instance is capped at Cache.MaxEntriesPerPlc
(default 1000). When the dictionary is at the cap and a fresh insert
arrives, EvictLeastRecentlyUsed walks the entries and removes the one
with the smallest CacheEntry.LastUsedTick. The linear scan is
intentional — at 1000 entries the scan is cheaper than the network
round-trip the cache is saving, and a sorted secondary structure would
add complexity for no measurable win.
LastUsedTick is a monotonic 64-bit counter incremented on every hit and
every fresh insert. Using the counter rather than DateTimeOffset.UtcNow
keeps the hot path free of clock calls and survives wall-clock skew.
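A minimal sketch of the capped dictionary with a monotonic tick and linear-scan eviction follows. TinyLruCache is a hypothetical, single-threaded stand-in: the real ResponseCache also tracks expiry and is presumably safe for concurrent use, neither of which is modelled here.

```csharp
using System.Collections.Generic;

// Hypothetical, not thread-safe; illustrates the eviction policy only.
public sealed class TinyLruCache<TKey, TValue> where TKey : notnull
{
    private readonly Dictionary<TKey, (TValue Value, long LastUsedTick)> _entries = new();
    private readonly int _maxEntries;
    private long _tick; // monotonic counter, no clock calls

    public TinyLruCache(int maxEntries) => _maxEntries = maxEntries;

    public int Count => _entries.Count;

    public bool TryGet(TKey key, out TValue value)
    {
        if (_entries.TryGetValue(key, out var entry))
        {
            _entries[key] = (entry.Value, ++_tick); // a hit refreshes recency
            value = entry.Value;
            return true;
        }
        value = default!;
        return false;
    }

    public void Set(TKey key, TValue value)
    {
        // Evict only when a new key would exceed the cap.
        if (!_entries.ContainsKey(key) && _entries.Count >= _maxEntries)
            EvictLeastRecentlyUsed();
        _entries[key] = (value, ++_tick);
    }

    private void EvictLeastRecentlyUsed()
    {
        bool found = false;
        TKey oldestKey = default!;
        long oldestTick = long.MaxValue;
        foreach (var (key, entry) in _entries) // linear scan by design
        {
            if (entry.LastUsedTick < oldestTick)
            {
                oldestTick = entry.LastUsedTick;
                oldestKey = key;
                found = true;
            }
        }
        if (found) _entries.Remove(oldestKey);
    }
}
```

With a cap of 2, inserting a and b, reading a, then inserting c evicts b, because the read moved a ahead of b in tick order.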
A background task drives proactive expiry. The constructor starts a
PeriodicTimer at Cache.EvictionIntervalMs (default 5000 ms; values
under 100 ms are clamped at 100 ms to prevent tight loops) and the
eviction loop sweeps every entry whose ExpiresAtUtc has passed. The
loop is the safety net that keeps abandoned entries — say, those for a
PLC whose upstream clients have all dropped — from holding memory until
process exit. Lazy expiry on TryGet still removes entries on demand
when traffic is steady; the background loop only matters under low- or
zero-traffic conditions.
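The background sweep might look roughly like the sketch below. ExpirySweep and its members are hypothetical, the dictionary maps a key straight to its expiry time for brevity, and treating an entry as expired when ExpiresAtUtc is less than or equal to the current time is an assumption. PeriodicTimer and WaitForNextTickAsync are the standard .NET 6+ APIs.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch; not the ResponseCache eviction loop itself.
public static class ExpirySweep
{
    // Intervals under 100 ms are clamped to 100 ms to prevent tight loops.
    public static TimeSpan ClampInterval(int evictionIntervalMs) =>
        TimeSpan.FromMilliseconds((double)Math.Max(evictionIntervalMs, 100));

    // Removes every entry whose expiry has passed; returns the number removed.
    public static int SweepExpired<TKey>(
        ConcurrentDictionary<TKey, DateTimeOffset> expiresAtUtc,
        DateTimeOffset nowUtc) where TKey : notnull
    {
        int removed = 0;
        foreach (var pair in expiresAtUtc) // safe to remove while enumerating
        {
            if (pair.Value <= nowUtc && expiresAtUtc.TryRemove(pair.Key, out _))
                removed++;
        }
        return removed;
    }

    public static async Task RunAsync<TKey>(
        ConcurrentDictionary<TKey, DateTimeOffset> expiresAtUtc,
        int evictionIntervalMs,
        CancellationToken ct) where TKey : notnull
    {
        using var timer = new PeriodicTimer(ClampInterval(evictionIntervalMs));
        try
        {
            while (await timer.WaitForNextTickAsync(ct))
                SweepExpired(expiresAtUtc, DateTimeOffset.UtcNow);
        }
        catch (OperationCanceledException)
        {
            // Normal shutdown path when the token is cancelled.
        }
    }
}
```

Because the timer's period is fixed at construction in this sketch, a changed interval would only take effect when a new loop is started.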
Long-TTL Safety Gate
MbproxyOptionsValidator.ValidateCacheTtl rejects any explicit
CacheTtlMs > 60_000 unless Cache.AllowLongTtl = true. The same gate
applies to PlcOptions.DefaultCacheTtlMs. The rejection runs at config
bind / hot-reload time, so a misconfigured appsettings.json fails fast
before the cache sees the value.
The gate exists to catch the "left at 1 hour by accident" mistake — a
deployment where a developer set CacheTtlMs = 3_600_000 for a debugging
session and the value survived into production. Operators who legitimately
need long TTLs (slow-moving setpoints, configuration values that change
once per shift) flip Cache.AllowLongTtl to true as the explicit
acknowledgement that the long staleness window is intentional.
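The gate reduces to a single predicate, sketched below. LongTtlGate.IsAllowed is a hypothetical name; the actual MbproxyOptionsValidator.ValidateCacheTtl signature and its error reporting are not shown in this document, and treating an unset (null) TTL as valid is an assumption.

```csharp
// Hypothetical predicate; the real validator's shape may differ.
public static class LongTtlGate
{
    public const int MaxTtlWithoutOptInMs = 60_000;

    // ttlMs stands in for either BcdTagOptions.CacheTtlMs or
    // PlcOptions.DefaultCacheTtlMs; null means "not configured".
    public static bool IsAllowed(int? ttlMs, bool allowLongTtl)
    {
        if (ttlMs is null) return true;                    // assumption: unset is valid
        if (ttlMs.Value <= MaxTtlWithoutOptInMs) return true;
        return allowLongTtl;                               // long TTLs need the opt-in
    }
}
```

A value of exactly 60_000 passes without the opt-in, since the rule rejects only values strictly greater than the limit.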
Cache and the Rewriter
The BCD rewriter runs once on the cache-miss path: the backend reader
task decodes the response through BcdPduPipeline and only then hands the
decoded bytes to ResponseCache.Set. Cache hits return the stored
post-rewriter bytes directly.
This division has two consequences worth restating:
- The rewriter cost is amortised across hits. A high cache hit ratio on a tag-dense PLC drops the per-request rewriter cost from "every response" to "every cache-miss response," which on a hot register at TTL=500 ms is one-in-many.
- The cached payload is decoupled from the rewriter implementation. An entry stored under one rewriter does not get re-transformed if the rewriter changes. Entries age out under TTL and are replaced by fresh entries decoded under the current rewriter — there is no in-place recomputation pass.
Hot-Reload Semantics
Configuration changes propagate through IOptionsMonitor<MbproxyOptions>.
The cache reacts to five kinds of change:
| Change | Cache behaviour |
|---|---|
| Tag's CacheTtlMs changed (0 → N, N → 0, N → M) | Entire PLC cache is flushed via ResponseCache.Clear(); entries re-populate on demand under the new TTL. |
| New PLC added / removed | New PLC starts with an empty cache; removed PLC's ResponseCache is disposed with the multiplexer. |
| Cache.AllowLongTtl flipped | Validation runs on the next reload only; existing entries are unaffected. |
| Cache.MaxEntriesPerPlc changed | Existing entries are unaffected; the new cap applies to subsequent inserts. |
| Cache.EvictionIntervalMs changed | Existing eviction loop continues with its old period; subsequent loops use the new interval. |
Per-tag flush granularity is intentionally not implemented. The clean move is "any tag-list change to a PLC → drop every entry for that PLC and let the natural traffic re-populate." Tracking which keys correspond to which tag IDs adds bookkeeping for no operational win — a tag-list reload is already a once-in-a-while event, and the rebuild cost on the affected PLC's hot keys is one round-trip per key under traffic.
See ../Features/HotReload.md for the
broader IOptionsMonitor propagation model.
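The whole-PLC flush rule can be sketched as a comparison of the old and new resolved TTLs for one PLC. Everything here is hypothetical: CacheReconcile, the string tag identifiers, and the dictionary shape are illustrative, and the real reconcile code may detect changes differently.

```csharp
using System.Collections.Generic;

// Hypothetical reconcile check for a single PLC.
public static class CacheReconcile
{
    // Keys are illustrative tag identifiers; values are resolved CacheTtlMs.
    public static bool RequiresFlush(
        IReadOnlyDictionary<string, int> oldTtls,
        IReadOnlyDictionary<string, int> newTtls)
    {
        // A tag added or removed counts as a tag-list change.
        if (oldTtls.Count != newTtls.Count) return true;

        foreach (var (tag, oldTtl) in oldTtls)
        {
            // A missing tag or any TTL change (0 -> N, N -> 0, N -> M) flushes.
            if (!newTtls.TryGetValue(tag, out int newTtl) || newTtl != oldTtl)
                return true;
        }
        return false;
    }
}
```

When the check returns true, the caller would clear the whole PLC cache rather than try to work out which keys belong to the changed tag.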
Cache Survives Backend Disconnects
A cached entry's data was valid when stored. A subsequent backend
disconnect does not retroactively invalidate it — the value the upstream
client sees on a hit is the value the PLC reported within the TTL
window, irrespective of whether the backend socket is up at the moment
of the hit. This is the cache's most operationally visible property
during PLC outages: upstream consumers that read hot tags within the
cache window continue to receive responses while the listener supervisor
is in recovering state.
The companion rule on the write side keeps the invariant consistent:
invalidations during a recovering listener state are skipped. If
the backend is down, an FC06 or FC16 write did not reach the PLC, so the
cached read is still consistent with the device's actual state. Skipping
the invalidation matches reality — the write did not take effect, so the
read is not stale.
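The conditions under which a write response engages the invalidator can be collected into one predicate. InvalidationGate.ShouldInvalidate is a hypothetical name, and passing the listener state as a boolean is a simplification. The 0x80 bit marking an exception response is standard Modbus.

```csharp
// Hypothetical gate; the real multiplexer may structure this differently.
public static class InvalidationGate
{
    public static bool ShouldInvalidate(byte responseFc, bool listenerRecovering)
    {
        // Backend down: the write never reached the PLC, so nothing is stale.
        if (listenerRecovering) return false;

        // Modbus exception responses set the high bit of the function code
        // (0x86 for FC06, 0x90 for FC16); the write did not take effect.
        bool isException = (responseFc & 0x80) != 0;
        if (isException) return false;

        // Only successful FC06 (0x06) and FC16 (0x10) responses invalidate.
        return responseFc == 0x06 || responseFc == 0x10;
    }
}
```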
No Persistence
The cache is purely in-memory. Process restart wipes every entry. There is no file-backed snapshot, no Redis or other external store, and no last-known-good replay. A restarted service rebuilds its cache from fresh backend round-trips driven by upstream traffic, exactly as it would after a TTL-induced flush.
This is intentional, for two reasons. First, the staleness contract is bounded
by CacheTtlMs measured from when the data was first read, and a
persisted entry would re-emerge with an unknown wall-clock age — every
invariant the cache offers would need a freshness field, freshness
arithmetic on load, and recovery against a clock that may have jumped.
Second, the operational model is that the proxy is a stateless
transformer; treating its cache as durable state would change the
deployment story for no measurable production benefit.
Counter Accounting
ProxyCounters exposes five cache counters per PLC, and the status page
derives a sixth value, cacheHitRatio, from two of them. They are surfaced
on the status page as both per-PLC and fleet-aggregate values:
- cacheHitCount: FC03/FC04 requests served from the cache. Bumped inside OnUpstreamFrameAsync when ResponseCache.TryGet returns true.
- cacheMissCount: FC03/FC04 requests whose resolved TTL was positive but whose key was not in the cache (or whose entry had expired). The identity cacheHitCount + cacheMissCount = total cache-eligible FC03/FC04 requests holds; reads whose effective TTL is 0 (uncached) increment neither counter.
- cacheHitRatio: derived on the status page snapshot as cacheHitCount / (cacheHitCount + cacheMissCount) when the denominator is non-zero.
- cacheInvalidations: count of cache entries invalidated by successful FC06/FC16 write responses, summed across writes.
- cacheEntryCount: point-in-time snapshot of ResponseCache.Count (Tier-2 memory-watch KPI).
- cacheBytes: point-in-time approximation of cached PDU bytes, computed as the running sum of CacheEntry.Length across entries (Tier-2 memory-watch KPI).
The structured log events mbproxy.cache.hit, mbproxy.cache.miss,
mbproxy.cache.store, mbproxy.cache.invalidated, and
mbproxy.cache.flushed (defined in CacheLogEvents) mirror the counter
increments at Debug level for incident-time diagnosis. Counters are the
steady-state observability surface; the events are for tracing one
request through the cache when something looks wrong. See
../Operations/StatusPage.md and
../Reference/LogEvents.md.
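The derived ratio is simple enough to sketch directly. CacheRatio.HitRatio is a hypothetical helper, and returning null when no cache-eligible request has been seen is an assumption: the text above only defines the ratio for a non-zero denominator.

```csharp
// Hypothetical helper; the status page's actual representation is not shown here.
public static class CacheRatio
{
    public static double? HitRatio(long cacheHitCount, long cacheMissCount)
    {
        long eligible = cacheHitCount + cacheMissCount;

        // Assumption: no eligible requests yet means "no ratio" rather than 0.
        if (eligible == 0) return null;

        return (double)cacheHitCount / eligible;
    }
}
```

For example, 3 hits and 1 miss give a ratio of 0.75, while reads with an effective TTL of 0 contribute to neither input and so never move the ratio.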
Design-Contract Note
The cache changes the proxy's posture from "purely transparent except
for BCD rewriting" to "transparent by default, with an opt-in cache
layer." The transition is deliberate and operator-driven: setting
CacheTtlMs > 0 on a tag is the explicit consent to the staleness
window, and a deployment that ships no positive TTLs is observationally
indistinguishable from one compiled without the cache code path.
There is no global switch, no implicit warm-up, and no behavioural divergence from the transparent baseline until the operator opts in tag-by-tag. The cache is the only place in the proxy where an upstream read can resolve to a value that did not just round-trip the wire, and its engagement is gated entirely by the per-tag and per-PLC TTL configuration described above.
Related Documentation
- ./ConnectionModel.md: TxId multiplexing, correlation map, and the backend socket the cache short-circuits on a hit.
- ./ReadCoalescing.md: sits below the cache in the lookup order; cache hits short-circuit coalescing entirely.
- ../Features/BcdRewriting.md: the BcdPduPipeline whose post-decode bytes the cache stores.
- ../Features/HotReload.md: the IOptionsMonitor propagation that drives the per-PLC flush on tag-list change.
- ../Operations/Configuration.md: binding for BcdTagOptions.CacheTtlMs, PlcOptions.DefaultCacheTtlMs, and the Cache section (AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs).
- ../Operations/StatusPage.md: exposes cacheHitCount, cacheMissCount, cacheHitRatio, cacheInvalidations, cacheEntryCount, and cacheBytes.
- ../Operations/Troubleshooting.md: the operator view of cache-served reads while a backend is in recovering state.
- ../Reference/LogEvents.md: full mbproxy.cache.* event catalogue with event IDs.
- ../Testing/Simulator.md: the pymodbus DL205 stand-in used by the end-to-end cache tests.
- ../design.md: canonical design decisions and rationale.