Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

10 KiB

Phase 11 Response Cache — Code Review

Scope: src/Mbproxy/Proxy/Cache/*, BcdTagOptions.CacheTtlMs, PlcOptions.DefaultCacheTtlMs, MbproxyOptions.CacheOptions, Configuration/ReloadValidator.cs (long-TTL gate), and the wire-up in PlcMultiplexer.cs (cache→coalesce→backend lookup order, post-rewriter store, write invalidation). Cross-checked against docs/design.md "Response cache (Phase 11)", docs/Architecture/ResponseCache.md, docs/plan/11-response-cache.md. Tests under tests/Mbproxy.Tests/Proxy/Cache/*.

Summary

  • The cache contract is implemented correctly in the load-bearing places: lookup-order, post-rewriter store, address-range overlap invalidation, exception-skip, unit-id isolation, multi-tag min(TTL) with the conservative "any TTL=0 disables the range" rule, and pre-backend-connect short-circuit on hit.
  • TTL resolution is split across the options binder (per-tag/per-PLC fold at build time, BcdTagMapBuilder.Build lines 107-110) and the per-read collapse in BcdTagMap.ResolveCacheTtlMs (lines 65-78). The hot path sees a single integer per tag.
  • Concurrency is a single coarse lock (ResponseCache.cs:30); LRU is a 64-bit monotonic ticker rather than a clock; LRU eviction is a linear scan.
  • Several lower-priority gaps exist: mbproxy.cache.flushed event is never emitted, write-invalidation runs the overlap matcher against a snapshot of all keys (no UnitId/FC pre-filter), and the long-TTL gate is enforced only against config-shaped TTLs (not against the resolved per-tag values surfaced to the runtime).

Critical findings

None. No correctness, race, or stale-data leak found that violates the design contract.

Major findings

  1. mbproxy.cache.flushed event never emitted. Defined in src/Mbproxy/Proxy/Cache/CacheLogEvents.cs:67-76 and referenced in the design doc (docs/design.md:225, docs/Architecture/ResponseCache.md:347) and docs/Reference/LogEvents.md:374, but no caller exists (Grep matches: only the definition). The hot-reload PLC-cache flush in PlcListenerSupervisor.ReplaceContextAsync (src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs:199-216) drops the old cache via Dispose() rather than Clear() (so the bytes/entry-count drop is implicit). Operators reading the docs will look for an event that never appears.

  2. Long-TTL gate is enforced only against BcdTagOptions.CacheTtlMs and PlcOptions.DefaultCacheTtlMs at config-shape time. MbproxyOptionsValidator.ValidateCacheTtl (src/Mbproxy/Options/MbproxyOptions.cs:102-112) and ReloadValidator.CheckTtl (src/Mbproxy/Configuration/ReloadValidator.cs:120-129) inspect the raw options. After the resolver folds the per-PLC default into per-tag values (BcdTagMapBuilder.cs:107), a tag whose explicit CacheTtlMs is null and whose PLC default is 60_001 would pass each individual gate (the per-PLC default check on line 97 does catch this), but a tag whose explicit value is 30_000 and inherited default would also be 30_000 — so the explicit interaction is fine. The resolved BcdTag.CacheTtlMs is, however, not re-checked anywhere — consistent with intent, but worth flagging if the per-tag fold ever grows arithmetic (e.g. clamping or summing). Note: ReloadValidator also runs BcdTagMapBuilder.Build (line 80) and discards its results — the resolved TTLs are never re-validated against AllowLongTtl.

  3. Hot-reload flush has no test coverage. No test in tests/Mbproxy.Tests/Proxy/Cache/* exercises ReplaceContextAsync or the dispose-old-cache path. Given that this is the only mechanism that satisfies the "any tag-list change drops the entire PLC cache" rule, a test would be valuable. (See also the Supervisor review's Critical 1 — the running multiplexer captures _ctx at construction and never re-reads, so even when the supervisor swaps _currentContext the running mux keeps using the old cache. Until that is fixed, the "flush" semantics are only realised when the listener is later restarted by Polly.)

  4. Cache survives backend disconnect — not directly tested for invalidation skip. The FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulated test (ResponseCacheMultiplexerTests.cs:506-543) covers the read side. There is no test for the design clause "Invalidations during a recovering listener state are skipped" (docs/Architecture/ResponseCache.md:299-304). The implementation gates this implicitly: invalidation only runs in RunBackendReaderAsync after a successful FC06/FC16 response arrives (PlcMultiplexer.cs:497-509), and a Recovering state means no backend reader → no response → no invalidation. The implicit gating is correct but fragile and undocumented in code comments.

Minor findings

  1. Invalidate snapshots ALL keys with _entries.Keys.ToArray() then filters by UnitId/FC inside CacheInvalidator.FindOverlapping. ResponseCache.cs:157. For a 1000-entry cache on a multi-unit gateway, every write allocates a 1000-element array and walks it. An obvious micro-optimisation is to filter inside the lock with the matcher. Acceptable at current scale (54 PLCs, 1000 entries cap, low write rate).

  2. Dispose() doesn't drain _entries. ResponseCache.cs:190-202 cancels the eviction loop but leaves entry buffers reachable until GC. Not a correctness issue; just worth a Clear() call.

  3. SweepExpired allocates a fresh List<CacheKey> every tick. ResponseCache.cs:261. With EvictionIntervalMs = 5000 and typical low expiry counts this is one allocation per 5 s — fine — but a reusable pooled list (or two-phase enumerate-and-conditional-remove using Dictionary.Remove's 1-pass nature in newer runtimes) would eliminate it.

  4. BuildCacheHitFrame doesn't validate cachedPdu.Length against the MBAP 253-byte cap. PlcMultiplexer.cs:952-966. Cache stores are guarded upstream (the backend reader enforces MaxPduBodySize), but the Length field arithmetic on line 956 silently overflows a ushort for >65534-byte PDUs. Defensive only.

  5. elapsedMs computed but unused. PlcMultiplexer.cs:445. Dead code.

  6. MaxEntriesPerPlc = 0 semantics. Set is no-op (line 122) but the eviction task still runs every EvictionIntervalMs. Operators who set this to 0 to disable the cache will pay an idle timer cost. Document or short-circuit constructor.

  7. Cache.AllowLongTtl is documented as "reload-time check" but is also enforced at schema-validation time (MbproxyOptions.cs:74-77, ReloadValidator.cs:97-100). Both layers agree but the duplication invites drift if one is updated.

What looks good

  • Lookup order: The cache check at PlcMultiplexer.cs:610-629 happens before EnsureBackendConnectedAsync (line 634) and before the coalescing TryAttachOrCreate block (line 661). A hit returns immediately without creating an InFlightByKeyMap entry — verifiable by CacheHit_ShortCircuits_Coalescing test (ResponseCacheMultiplexerTests.cs:304-343).
  • Post-rewriter store ordering: _pipeline.Process(MbapDirection.ResponseToClient, ...) runs at PlcMultiplexer.cs:455-459, then the cache Set at line 490. The BcdDecodedBytes_AreCached_NotRawBcd test verifies this end-to-end.
  • Multi-tag conservative rule: BcdTagMap.ResolveCacheTtlMs returns 0 immediately on encountering any ttl <= 0 (line 74), bailing the loop. Verified by MultiTagRange_AnyZeroTtl_DisablesCachingForWholeRead test.
  • Invalidator scope: CacheInvalidator.FindOverlapping (CacheInvalidator.cs:43-49) explicitly checks both FC03 and FC04, and tests cover both adjacent-but-not-overlapping (register 15 case) and full/partial/disjoint overlap cases (CacheInvalidatorTests.cs). Different unitIds isolated correctly.
  • Exception skip: PlcMultiplexer.cs:469-471 checks (fcInResponse & 0x80) != 0 before either store or invalidate. Consistent with the design clause "exception responses do not invalidate."
  • Counter parity: Cache hit/miss only increment when resolvedCacheTtlMs > 0 (PlcMultiplexer.cs:613), so uncached reads (TTL=0) bump neither, satisfying the Hit + Miss = cache-eligible identity. Tested by MultiTagRange_AnyZeroTtl_DisablesCachingForWholeRead asserting both counters at 0.
  • LRU correctness: Tested by LRU_TracksAccessOrder_AcrossGetAndSet. Monotonic-counter design (CacheEntry.cs:14-16) is robust to wall-clock skew and clock calls.
  • Memory accounting: _approxBytes decremented on every removal path (lazy expiry line 103, eviction line 228, sweep line 272, invalidate line 164, replace line 132, clear line 182). Tested by ApproximateBytes_TracksSetReplaceAndInvalidate.
  • Write invalidation in multiplexer is FC04-aware: Even though writes only land on holding registers, CacheInvalidator evicts both FC03 and FC04 keys at the same address, matching the design's (unitId, FC ∈ {3,4}) scope.
  • FC04 cacheability: PlcMultiplexer.cs:610 and :473 both check is 0x03 or 0x04 — input registers are cacheable, not just holding.
  • Pre-existing cache survives backend down: EnsureBackendConnectedAsync is called only after the cache check returns false (line 634); FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_IfCachePrePopulated verifies.

Open questions

  1. The design specifies mbproxy.cache.flushed events with Reason propagation. Without a caller, what's the intended emission point — in ReplaceContextAsync, in the supervisor's reseat handler, or somewhere in ConfigReconciler? See CacheLogEvents.cs:67-76.
  2. Is cacheBytes intended to reflect dictionary overhead (key+entry headers) or only PDU bytes? Currently only PDU bytes (the Length field). For a 1000-entry cap with small PDUs, dictionary overhead may dominate actual heap; this matters for the "Tier-2 memory-watch KPI" framing.
  3. When a tag-list reload would change MaxEntriesPerPlc (the Cache.MaxEntriesPerPlc knob, not per-tag TTL), the design says "Existing entries unaffected; cap applies to subsequent inserts." But ReplaceContextAsync runs only on per-PLC tag-list change — a MaxEntriesPerPlc change alone does NOT trigger a context swap, so the running ResponseCache keeps its constructor-time cap. Is this acceptable, or should Cache knob changes propagate live?
  4. Does the BcdTagMap.ResolveCacheTtlMs allocation behaviour (allocates the hit list when any tag is in range, then immediately discards on TTL=0) need optimising for the hot path? At ~54 PLCs and modest read rates probably no, but at 1000+ FC03/s per PLC it would matter.