Eight area-focused reviews (BCD rewriter, multiplexer, response cache, supervisor + hot-reload, admin + diagnostics, hosting + options, test suite) plus an Overview that prioritises findings across areas, and a RemediationPlan that groups the work into three waves with per-item file:line citations and regression-test sketches. Findings call out: hot-reload tag-list/cache changes that don't reach the running multiplexer, a coalescing factory leak that hangs late attachers, backend-reader head-of-line block on a wedged upstream, stranded outbound frames after cascade, and ShutdownCoordinator double-stop ordering. Plus the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB, not standard binary), unreachable BcdValidationError.DuplicateAddress, mbproxy.cache.flushed event that's defined but never emitted, and missing test coverage for Cache.AllowLongTtl. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
10 KiB
Phase 11 Response Cache — Code Review
Scope: src/Mbproxy/Proxy/Cache/*, BcdTagOptions.CacheTtlMs, PlcOptions.DefaultCacheTtlMs, MbproxyOptions.CacheOptions, Configuration/ReloadValidator.cs (long-TTL gate), and the wire-up in PlcMultiplexer.cs (cache→coalesce→backend lookup order, post-rewriter store, write invalidation). Cross-checked against docs/design.md "Response cache (Phase 11)", docs/Architecture/ResponseCache.md, docs/plan/11-response-cache.md. Tests under tests/Mbproxy.Tests/Proxy/Cache/*.
Summary
- The cache contract is implemented correctly in the load-bearing places: lookup-order, post-rewriter store, address-range overlap invalidation, exception-skip, unit-id isolation, multi-tag
min(TTL)with the conservative "any TTL=0 disables the range" rule, and pre-backend-connect short-circuit on hit. - TTL resolution is split across the options binder (per-tag/per-PLC fold at build time,
BcdTagMapBuilder.Buildlines 107-110) and the per-read collapse inBcdTagMap.ResolveCacheTtlMs(lines 65-78). The hot path sees a single integer per tag. - Concurrency is a single coarse
lock(ResponseCache.cs:30); LRU is a 64-bit monotonic ticker rather than a clock; LRU eviction is a linear scan. - Several lower-priority gaps exist:
mbproxy.cache.flushedevent is never emitted, write-invalidation runs the overlap matcher against a snapshot of all keys (no UnitId/FC pre-filter), and the long-TTL gate is enforced only against config-shaped TTLs (not against the resolved per-tag values surfaced to the runtime).
Critical findings
None. No correctness, race, or stale-data leak found that violates the design contract.
Major findings
-
mbproxy.cache.flushedevent never emitted. Defined insrc/Mbproxy/Proxy/Cache/CacheLogEvents.cs:67-76and referenced in the design doc (docs/design.md:225,docs/Architecture/ResponseCache.md:347) anddocs/Reference/LogEvents.md:374, but no caller exists (Grep matches: only the definition). The hot-reload PLC-cache flush inPlcListenerSupervisor.ReplaceContextAsync(src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:199-216) drops the old cache viaDispose()rather thanClear()(so the bytes/entry-count drop is implicit). Operators reading the docs will look for an event that never appears. -
Long-TTL gate is enforced only against
BcdTagOptions.CacheTtlMsandPlcOptions.DefaultCacheTtlMsat config-shape time.MbproxyOptionsValidator.ValidateCacheTtl(src/Mbproxy/Options/MbproxyOptions.cs:102-112) andReloadValidator.CheckTtl(src/Mbproxy/Configuration/ReloadValidator.cs:120-129) inspect the raw options. After the resolver folds the per-PLC default into per-tag values (BcdTagMapBuilder.cs:107), a tag whose explicitCacheTtlMsis null and whose PLC default is60_001would pass each individual gate (the per-PLC default check on line 97 does catch this), but a tag whose explicit value is30_000and inherited default would also be 30_000 — so the explicit interaction is fine. The resolvedBcdTag.CacheTtlMsis, however, not re-checked anywhere — consistent with intent, but worth flagging if the per-tag fold ever grows arithmetic (e.g. clamping or summing). Note:ReloadValidatoralso runsBcdTagMapBuilder.Build(line 80) and discards its results — the resolved TTLs are never re-validated againstAllowLongTtl. -
Hot-reload flush has no test coverage. No test in
tests/Mbproxy.Tests/Proxy/Cache/*exercisesReplaceContextAsyncor the dispose-old-cache path. Given that this is the only mechanism that satisfies the "any tag-list change drops the entire PLC cache" rule, a test would be valuable. (See also the Supervisor review's Critical 1 — the running multiplexer captures_ctxat construction and never re-reads, so even when the supervisor swaps_currentContextthe running mux keeps using the old cache. Until that is fixed, the "flush" semantics are only realised when the listener is later restarted by Polly.) -
Cache survives backend disconnect — not directly tested for invalidation skip. The
FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulatedtest (ResponseCacheMultiplexerTests.cs:506-543) covers the read side. There is no test for the design clause "Invalidations during arecoveringlistener state are skipped" (docs/Architecture/ResponseCache.md:299-304). The implementation gates this implicitly: invalidation only runs inRunBackendReaderAsyncafter a successful FC06/FC16 response arrives (PlcMultiplexer.cs:497-509), and aRecoveringstate means no backend reader → no response → no invalidation. The implicit gating is correct but fragile and undocumented in code comments.
Minor findings
-
Invalidatesnapshots ALL keys with_entries.Keys.ToArray()then filters by UnitId/FC insideCacheInvalidator.FindOverlapping.ResponseCache.cs:157. For a 1000-entry cache on a multi-unit gateway, every write allocates a 1000-element array and walks it. An obvious micro-optimisation is to filter inside the lock with the matcher. Acceptable at current scale (54 PLCs, 1000 entries cap, low write rate). -
Dispose()doesn't drain_entries.ResponseCache.cs:190-202cancels the eviction loop but leaves entry buffers reachable until GC. Not a correctness issue; just worth aClear()call. -
SweepExpiredallocates a freshList<CacheKey>every tick.ResponseCache.cs:261. WithEvictionIntervalMs = 5000and typical low expiry counts this is one allocation per 5 s — fine — but a reusable pooled list (or two-phase enumerate-and-conditional-remove usingDictionary.Remove's 1-pass nature in newer runtimes) would eliminate it. -
BuildCacheHitFramedoesn't validatecachedPdu.Lengthagainst the MBAP 253-byte cap.PlcMultiplexer.cs:952-966. Cache stores are guarded upstream (the backend reader enforcesMaxPduBodySize), but the Length field arithmetic on line 956 silently overflows aushortfor >65534-byte PDUs. Defensive only. -
elapsedMscomputed but unused.PlcMultiplexer.cs:445. Dead code. -
MaxEntriesPerPlc = 0semantics.Setis no-op (line 122) but the eviction task still runs everyEvictionIntervalMs. Operators who set this to 0 to disable the cache will pay an idle timer cost. Document or short-circuit constructor. -
Cache.AllowLongTtlis documented as "reload-time check" but is also enforced at schema-validation time (MbproxyOptions.cs:74-77,ReloadValidator.cs:97-100). Both layers agree but the duplication invites drift if one is updated.
What looks good
- Lookup order: The cache check at
PlcMultiplexer.cs:610-629happens beforeEnsureBackendConnectedAsync(line 634) and before the coalescingTryAttachOrCreateblock (line 661). A hit returns immediately without creating anInFlightByKeyMapentry — verifiable byCacheHit_ShortCircuits_Coalescingtest (ResponseCacheMultiplexerTests.cs:304-343). - Post-rewriter store ordering:
_pipeline.Process(MbapDirection.ResponseToClient, ...)runs atPlcMultiplexer.cs:455-459, then the cacheSetat line 490. TheBcdDecodedBytes_AreCached_NotRawBcdtest verifies this end-to-end. - Multi-tag conservative rule:
BcdTagMap.ResolveCacheTtlMsreturns 0 immediately on encountering anyttl <= 0(line 74), bailing the loop. Verified byMultiTagRange_AnyZeroTtl_DisablesCachingForWholeReadtest. - Invalidator scope:
CacheInvalidator.FindOverlapping(CacheInvalidator.cs:43-49) explicitly checks both FC03 and FC04, and tests cover both adjacent-but-not-overlapping (register 15 case) and full/partial/disjoint overlap cases (CacheInvalidatorTests.cs). Different unitIds isolated correctly. - Exception skip:
PlcMultiplexer.cs:469-471checks(fcInResponse & 0x80) != 0before either store or invalidate. Consistent with the design clause "exception responses do not invalidate." - Counter parity: Cache hit/miss only increment when
resolvedCacheTtlMs > 0(PlcMultiplexer.cs:613), so uncached reads (TTL=0) bump neither, satisfying theHit + Miss = cache-eligibleidentity. Tested byMultiTagRange_AnyZeroTtl_DisablesCachingForWholeReadasserting both counters at 0. - LRU correctness: Tested by
LRU_TracksAccessOrder_AcrossGetAndSet. Monotonic-counter design (CacheEntry.cs:14-16) is robust to wall-clock skew and clock calls. - Memory accounting:
_approxBytesdecremented on every removal path (lazy expiry line 103, eviction line 228, sweep line 272, invalidate line 164, replace line 132, clear line 182). Tested byApproximateBytes_TracksSetReplaceAndInvalidate. - Write invalidation in multiplexer is FC04-aware: Even though writes only land on holding registers,
CacheInvalidatorevicts both FC03 and FC04 keys at the same address, matching the design's(unitId, FC ∈ {3,4})scope. - FC04 cacheability:
PlcMultiplexer.cs:610and:473both checkis 0x03 or 0x04— input registers are cacheable, not just holding. - Pre-existing cache survives backend down:
EnsureBackendConnectedAsyncis called only after the cache check returns false (line 634);FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulatedverifies.
Open questions
- The design specifies
mbproxy.cache.flushedevents withReasonpropagation. Without a caller, what's the intended emission point — inReplaceContextAsync, in the supervisor's reseat handler, or somewhere inConfigReconciler? SeeCacheLogEvents.cs:67-76. - Is
cacheBytesintended to reflect dictionary overhead (key+entry headers) or only PDU bytes? Currently only PDU bytes (theLengthfield). For a 1000-entry cap with small PDUs, dictionary overhead may dominate actual heap; this matters for the "Tier-2 memory-watch KPI" framing. - When a tag-list reload would change
MaxEntriesPerPlc(theCache.MaxEntriesPerPlcknob, not per-tag TTL), the design says "Existing entries unaffected; cap applies to subsequent inserts." ButReplaceContextAsyncruns only on per-PLC tag-list change — aMaxEntriesPerPlcchange alone does NOT trigger a context swap, so the runningResponseCachekeeps its constructor-time cap. Is this acceptable, or shouldCacheknob changes propagate live? - Does the
BcdTagMap.ResolveCacheTtlMsallocation behaviour (allocates the hit list when any tag is in range, then immediately discards on TTL=0) need optimising for the hot path? At ~54 PLCs and modest read rates probably no, but at 1000+ FC03/s per PLC it would matter.