Files
wwtools/mbproxy/codereviews/2026-05-14/ResponseCache.md
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

62 lines
10 KiB
Markdown

# Phase 11 Response Cache — Code Review
Scope: `src/Mbproxy/Proxy/Cache/*`, `BcdTagOptions.CacheTtlMs`, `PlcOptions.DefaultCacheTtlMs`, `MbproxyOptions.CacheOptions`, `Configuration/ReloadValidator.cs` (long-TTL gate), and the wire-up in `PlcMultiplexer.cs` (cache→coalesce→backend lookup order, post-rewriter store, write invalidation). Cross-checked against `docs/design.md` "Response cache (Phase 11)", `docs/Architecture/ResponseCache.md`, `docs/plan/11-response-cache.md`. Tests under `tests/Mbproxy.Tests/Proxy/Cache/*`.
## Summary
- The cache contract is implemented correctly in the load-bearing places: lookup-order, post-rewriter store, address-range overlap invalidation, exception-skip, unit-id isolation, multi-tag `min(TTL)` with the conservative "any TTL=0 disables the range" rule, and pre-backend-connect short-circuit on hit.
- TTL resolution is split across the options binder (per-tag/per-PLC fold at build time, `BcdTagMapBuilder.Build` lines 107-110) and the per-read collapse in `BcdTagMap.ResolveCacheTtlMs` (lines 65-78). The hot path sees a single integer per tag.
- Concurrency is a single coarse `lock` (`ResponseCache.cs:30`); LRU is a 64-bit monotonic ticker rather than a clock; LRU eviction is a linear scan.
- Several lower-priority gaps exist: `mbproxy.cache.flushed` event is never emitted, write-invalidation runs the overlap matcher against a snapshot of *all* keys (no UnitId/FC pre-filter), and the long-TTL gate is enforced only against config-shaped TTLs (not against the resolved per-tag values surfaced to the runtime).
## Critical findings
**None.** No correctness, race, or stale-data leak found that violates the design contract.
## Major findings
1. **`mbproxy.cache.flushed` event never emitted.** Defined in `src/Mbproxy/Proxy/Cache/CacheLogEvents.cs:67-76` and referenced in the design doc (`docs/design.md:225`, `docs/Architecture/ResponseCache.md:347`) and `docs/Reference/LogEvents.md:374`, but no caller exists (Grep matches: only the definition). The hot-reload PLC-cache flush in `PlcListenerSupervisor.ReplaceContextAsync` (`src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:199-216`) drops the old cache via `Dispose()` rather than `Clear()` (so the bytes/entry-count drop is implicit). Operators reading the docs will look for an event that never appears.
2. **Long-TTL gate is enforced only against `BcdTagOptions.CacheTtlMs` and `PlcOptions.DefaultCacheTtlMs` at config-shape time.** `MbproxyOptionsValidator.ValidateCacheTtl` (`src/Mbproxy/Options/MbproxyOptions.cs:102-112`) and `ReloadValidator.CheckTtl` (`src/Mbproxy/Configuration/ReloadValidator.cs:120-129`) inspect the *raw* options. After the resolver folds the per-PLC default into per-tag values (`BcdTagMapBuilder.cs:107`), a tag whose explicit `CacheTtlMs` is null and whose PLC default is `60_001` would pass each individual gate (the per-PLC default check on line 97 does catch this), but a tag whose explicit value is `30_000` and inherited default would also be 30_000 — so the explicit interaction is fine. The resolved `BcdTag.CacheTtlMs` is, however, not re-checked anywhere — consistent with intent, but worth flagging if the per-tag fold ever grows arithmetic (e.g. clamping or summing). Note: `ReloadValidator` also runs `BcdTagMapBuilder.Build` (line 80) and discards its results — the resolved TTLs are never re-validated against `AllowLongTtl`.
3. **Hot-reload flush has no test coverage.** No test in `tests/Mbproxy.Tests/Proxy/Cache/*` exercises `ReplaceContextAsync` or the dispose-old-cache path. Given that this is the only mechanism that satisfies the "any tag-list change drops the entire PLC cache" rule, a test would be valuable. (See also the Supervisor review's Critical 1 — the running multiplexer captures `_ctx` at construction and never re-reads, so even when the supervisor swaps `_currentContext` the running mux keeps using the old cache. Until that is fixed, the "flush" semantics are only realised when the listener is later restarted by Polly.)
4. **Cache survives backend disconnect — not directly tested for invalidation skip.** The `FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulated` test (`ResponseCacheMultiplexerTests.cs:506-543`) covers the read side. There is no test for the design clause "Invalidations during a `recovering` listener state are skipped" (`docs/Architecture/ResponseCache.md:299-304`). The implementation gates this implicitly: invalidation only runs in `RunBackendReaderAsync` after a successful FC06/FC16 *response* arrives (`PlcMultiplexer.cs:497-509`), and a `Recovering` state means no backend reader → no response → no invalidation. The implicit gating is correct but fragile and undocumented in code comments.
## Minor findings
5. **`Invalidate` snapshots ALL keys with `_entries.Keys.ToArray()` then filters by UnitId/FC inside `CacheInvalidator.FindOverlapping`.** `ResponseCache.cs:157`. For a 1000-entry cache on a multi-unit gateway, every write allocates a 1000-element array and walks it. An obvious micro-optimisation is to filter inside the lock with the matcher. Acceptable at current scale (54 PLCs, 1000 entries cap, low write rate).
6. **`Dispose()` doesn't drain `_entries`.** `ResponseCache.cs:190-202` cancels the eviction loop but leaves entry buffers reachable until GC. Not a correctness issue; just worth a `Clear()` call.
7. **`SweepExpired` allocates a fresh `List<CacheKey>` every tick.** `ResponseCache.cs:261`. With `EvictionIntervalMs = 5000` and typical low expiry counts this is one allocation per 5 s — fine — but a reusable pooled list (or two-phase enumerate-and-conditional-remove using `Dictionary.Remove`'s 1-pass nature in newer runtimes) would eliminate it.
8. **`BuildCacheHitFrame` doesn't validate `cachedPdu.Length` against the MBAP 253-byte cap.** `PlcMultiplexer.cs:952-966`. Cache stores are guarded upstream (the backend reader enforces `MaxPduBodySize`), but the Length field arithmetic on line 956 silently overflows a `ushort` for >65534-byte PDUs. Defensive only.
9. **`elapsedMs` computed but unused.** `PlcMultiplexer.cs:445`. Dead code.
10. **`MaxEntriesPerPlc = 0` semantics.** `Set` is no-op (line 122) but the eviction task still runs every `EvictionIntervalMs`. Operators who set this to 0 to disable the cache will pay an idle timer cost. Document or short-circuit constructor.
11. **`Cache.AllowLongTtl` is documented as "reload-time check" but is also enforced at schema-validation time** (`MbproxyOptions.cs:74-77`, `ReloadValidator.cs:97-100`). Both layers agree but the duplication invites drift if one is updated.
## What looks good
- **Lookup order:** The cache check at `PlcMultiplexer.cs:610-629` happens before `EnsureBackendConnectedAsync` (line 634) and before the coalescing `TryAttachOrCreate` block (line 661). A hit returns immediately without creating an `InFlightByKeyMap` entry — verifiable by `CacheHit_ShortCircuits_Coalescing` test (`ResponseCacheMultiplexerTests.cs:304-343`).
- **Post-rewriter store ordering:** `_pipeline.Process(MbapDirection.ResponseToClient, ...)` runs at `PlcMultiplexer.cs:455-459`, then the cache `Set` at line 490. The `BcdDecodedBytes_AreCached_NotRawBcd` test verifies this end-to-end.
- **Multi-tag conservative rule:** `BcdTagMap.ResolveCacheTtlMs` returns 0 immediately on encountering any `ttl <= 0` (line 74), bailing the loop. Verified by `MultiTagRange_AnyZeroTtl_DisablesCachingForWholeRead` test.
- **Invalidator scope:** `CacheInvalidator.FindOverlapping` (`CacheInvalidator.cs:43-49`) explicitly checks both FC03 and FC04, and tests cover both adjacent-but-not-overlapping (register 15 case) and full/partial/disjoint overlap cases (`CacheInvalidatorTests.cs`). Different unitIds isolated correctly.
- **Exception skip:** `PlcMultiplexer.cs:469-471` checks `(fcInResponse & 0x80) != 0` before either store or invalidate. Consistent with the design clause "exception responses do not invalidate."
- **Counter parity:** Cache hit/miss only increment when `resolvedCacheTtlMs > 0` (`PlcMultiplexer.cs:613`), so uncached reads (TTL=0) bump neither, satisfying the `Hit + Miss = cache-eligible` identity. Tested by `MultiTagRange_AnyZeroTtl_DisablesCachingForWholeRead` asserting both counters at 0.
- **LRU correctness:** Tested by `LRU_TracksAccessOrder_AcrossGetAndSet`. Monotonic-counter design (`CacheEntry.cs:14-16`) is robust to wall-clock skew and clock calls.
- **Memory accounting:** `_approxBytes` decremented on every removal path (lazy expiry line 103, eviction line 228, sweep line 272, invalidate line 164, replace line 132, clear line 182). Tested by `ApproximateBytes_TracksSetReplaceAndInvalidate`.
- **Write invalidation in multiplexer is FC04-aware:** Even though writes only land on holding registers, `CacheInvalidator` evicts both FC03 and FC04 keys at the same address, matching the design's `(unitId, FC ∈ {3,4})` scope.
- **FC04 cacheability:** `PlcMultiplexer.cs:610` and `:473` both check `is 0x03 or 0x04` — input registers are cacheable, not just holding.
- **Pre-existing cache survives backend down:** `EnsureBackendConnectedAsync` is called only after the cache check returns false (line 634); `FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulated` verifies.
## Open questions
1. The design specifies `mbproxy.cache.flushed` events with `Reason` propagation. Without a caller, what's the intended emission point — in `ReplaceContextAsync`, in the supervisor's reseat handler, or somewhere in `ConfigReconciler`? See `CacheLogEvents.cs:67-76`.
2. Is `cacheBytes` intended to reflect dictionary overhead (key+entry headers) or only PDU bytes? Currently only PDU bytes (the `Length` field). For a 1000-entry cap with small PDUs, dictionary overhead may dominate actual heap; this matters for the "Tier-2 memory-watch KPI" framing.
3. When a tag-list reload would change `MaxEntriesPerPlc` (the `Cache.MaxEntriesPerPlc` knob, not per-tag TTL), the design says "Existing entries unaffected; cap applies to subsequent inserts." But `ReplaceContextAsync` runs only on per-PLC tag-list change — a `MaxEntriesPerPlc` change alone does NOT trigger a context swap, so the running `ResponseCache` keeps its constructor-time cap. Is this acceptable, or should `Cache` knob changes propagate live?
4. Does the `BcdTagMap.ResolveCacheTtlMs` allocation behaviour (allocates the hit list when *any* tag is in range, then immediately discards on TTL=0) need optimising for the hot path? At ~54 PLCs and modest read rates probably no, but at 1000+ FC03/s per PLC it would matter.