Files
wwtools/mbproxy/codereviews/2026-05-16/Multiplexing.md
T
Joseph Doherty b222362ce0 mbproxy: remediate the 2026-05-16 code-review findings
Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.

Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 18:08:06 -04:00

23 KiB

Code Review — Backend Connection Layer (Multiplexing + Cache)

Date: 2026-05-16 Branch: mbproxy-webui-dashboard (HEAD 0308490) Scope: src/Mbproxy/Proxy/Multiplexing/ (PlcMultiplexer, UpstreamPipe, CorrelationMap, InFlightByKeyMap, InFlightRequest, TxIdAllocator, CoalescingKey, *LogEvents.cs), src/Mbproxy/Proxy/Cache/ (ResponseCache, CacheEntry, CacheKey, CacheInvalidator, CacheLogEvents), src/Mbproxy/Proxy/MbapFrame.cs.

Summary

The connection layer is well-structured and the threading invariants documented in docs/Architecture/ConnectionModel.md are largely honoured by the code: single backend writer/reader, per-pipe write loop, ConcurrentDictionary-backed correlation map, and a claim-then-dispatch watchdog. The lock-ordering discipline (map lock → no async inside it; connect gate serialising connect vs. teardown) is sound, and the prior reviews' remediations appear intact. However this review found one wire-protocol correctness bug that can deliver a stale value to a client across a write (a real data-integrity hazard for an SCADA proxy), one resource-leak path, and several major correctness gaps around partial-PDU parsing, cache-invalidation arithmetic overflow, and a watchdog/late-attach race window. None of the findings is a deadlock or crash, but the cache-vs-write ordering issue is genuine silent-wrong-value territory and should be treated as the headline item.

Findings by severity: Critical: 1 · Major: 6 · Minor: 7


Critical

C1 — Cache hit can serve a value contradicted by an in-flight write to the same register (read-after-write inversion)

Files: PlcMultiplexer.cs:826-857 (cache lookup), PlcMultiplexer.cs:674-695 (invalidation on FC06/FC16 response).

What's wrong. Cache invalidation for a write fires only in RunBackendReaderAsync after the FC06/FC16 response lands from the PLC (lines 674-695). The cache lookup on a read (lines 826-857) happens with no awareness of writes that are currently in flight but not yet acked. Consider this realistic sequence on one PLC, all on the same register range:

  1. Client A reads [100..110) → cache miss → backend round-trip → response stored in cache (TTL 500 ms).
  2. Client B writes register 105 (FC06) → request enqueued, travels to PLC, PLC applies it.
  3. Client C reads [100..110)cache hit → C receives the pre-write value.
  4. The FC06 response finally arrives → invalidation runs → cache entry for [100..110) dropped.

Between steps 2 and 4 (one full backend round-trip, plus outbound-channel queue depth, plus ECOM scan time — easily tens of ms, far less than a 500 ms TTL) the cache continues to serve the old value even though the PLC has already accepted the new one. Client C is told register 105 is the old value after the write that changed it was already accepted upstream by the proxy. For a Modbus proxy fronting SCADA, that is a silent wrong value delivered to a client — the exact failure mode the review brief calls Critical.

The doc (ResponseCache.md "Write Invalidation") frames invalidation as "a successful FC06/FC16 response invalidates" — i.e. the design intentionally invalidates on the response, not the request. That is defensible only if the cache also refuses to serve a range while a write to an overlapping range is in flight. It does not: there is no in-flight-write tracking on the read path.

Impact. Silent stale value across a write, bounded by one backend round-trip rather than by CacheTtlMs. Operationally most visible when a client writes a setpoint then another client (or the same client on a different connection) immediately reads it back and sees the old value. Note coalescing does not have this problem — it dies on response — but the cache extends the staleness window past the write.

Recommendation. Invalidate (or suppress) on the request side, not only the response side. When an FC06/FC16 frame is parsed in OnUpstreamFrameAsync (the FC06/FC16 branch around lines 805-817), immediately call responseCache.Invalidate(unitId, startAddr, qty) before the write is enqueued, and additionally track the overlapping range as "write pending" so reads that arrive before the write response lands also miss. The simplest correct fix: invalidate on request enqueue and keep the existing response-side invalidation as a backstop. The minor cost (a write may evict an entry that the write later turns out to fail with an exception) is acceptable — a failed write just causes a cache miss and a fresh read, which is always safe; serving a stale value is not.


Major

M1 — EnsureBackendConnectedAsync leaks the backend socket if _disposed flips during connect

File: PlcMultiplexer.cs:287-358.

What's wrong. A caller can enter EnsureBackendConnectedAsync, pass the _disposed check at line 289, acquire _connectGate, and run a multi-second Polly connect (lines 310-324). If DisposeAsync runs concurrently, it sets _disposed = true, cancels _disposeCts, and calls TearDownBackendAsync — which acquires the gate (or times out after 2 s and proceeds gate-less), and at that moment _backendSocket is still null because the connecting task has not yet reached line 341. TearDownBackendAsync sees oldSocket is null && oldCts is null and returns without closing anything. The connecting task then completes its ConnectAsync, takes _backendLock, and assigns a fully live Socket plus three running tasks (_backendWriterTask/ReaderTask/HeartbeatTask) into the now-disposed multiplexer. Nothing ever disposes that socket or joins those tasks; _disposeCts is already disposed so CancellationTokenSource.CreateLinkedTokenSource(_disposeCts.Token) at line 338 throws ObjectDisposedException, leaving backend connected and undisposed.

Impact. Leaked TCP socket to the PLC (one of the ECOM's 4 client slots held forever) plus three orphaned background tasks per occurrence. Occurs on a hot-reload PLC-remove or service stop that races a cold-start connect — not common, but it happens under exactly the churn the supervisor is designed to handle.

Recommendation. After a successful connect, re-check _disposed (and _disposeCts.IsCancellationRequested) under _backendLock before publishing the socket; if disposal won the race, dispose backend and return false. Wrap the CreateLinkedTokenSource in the same guarded block so a disposed _disposeCts cannot throw past the socket assignment.

M2 — Backend reader drops a frame whose PDU body length disagrees with the PLC, with no resync

File: PlcMultiplexer.cs:540-574, UpstreamPipe.cs:111-163.

What's wrong. Both FillAsync loops trust the MBAP Length field absolutely. If the backend (or a middlebox) ever produces a frame whose Length field is wrong — too short — the reader consumes length-1 body bytes, then loops and reinterprets the next bytes (which are really still part of the previous PDU) as a fresh 7-byte MBAP header. From that point every subsequent frame on the socket is mis-framed: proxyTxId is garbage, CorrelationMap.TryRemove misses, frames are silently dropped (line 580-584), and every real in-flight request leaks until the watchdog times it out with 0x0B. The socket never faults, so no cascade fires; the PLC effectively goes dark for BackendRequestTimeoutMs repeatedly. The only oversized-frame guard (line 561) catches length-1 > 253 but not a too small / mis-aligned length.

Impact. A single corrupt length field on the wire desynchronises the backend stream indefinitely. Recovery depends entirely on the backend eventually closing the socket. Realistic on flaky industrial networks; the brief explicitly lists "middlebox drops a packet" as a covered scenario.

Recommendation. This is hard to fully solve without a length-validation/resync strategy, but at minimum: when RunBackendReaderAsync removes a correlation entry it can sanity-check the response PDU length against the request (e.g. an FC03 response body should be 2 + 2*qty); a mismatch is strong evidence of desync and should break to force a teardown+cascade rather than silently fan out a wrong-sized payload. At a minimum, document that the reader assumes a well-framed backend and add a desync-suspected counter when TryRemove misses repeatedly.

M3 — FC03/FC04 cache hit fans out a payload whose register count is not validated against the request

Files: PlcMultiplexer.cs:826-857 (BuildCacheHitFrame), PlcMultiplexer.cs:1356-1370.

What's wrong. CacheKey includes Qty, so a hit only occurs for an identical (unit,fc,start,qty) — good. But BuildCacheHitFrame splices cached.PduBytes onto a fresh MBAP header and sets Length = 1 + pduLen with no upper-bound check. The stored PDU came from the backend reader (line 645-646, snapshot of pduBodyLen bytes which is bounded to ≤253 at line 561), so today it cannot exceed spec. However nothing asserts the cached PDU's declared byte-count actually matches 2*qty. If a backend ever returned a short/long FC03 body that still passed the ≤253 check, that malformed payload is now cached and replayed to every future hit for the lifetime of the TTL — the cache amplifies one bad backend response into many bad client responses. Coalescing has the same single-response blast radius but only for the concurrently-attached parties; the cache widens it across time.

Impact. Lower likelihood than C1/M2 but the same class of harm — a wrong-shaped frame delivered to clients, amplified by caching.

Recommendation. When storing an FC03/FC04 response (line 641-667), validate the PDU body: frame[HeaderSize+1] (byte count) must equal 2*qty and pduBodyLen must equal 2 + 2*qty. Skip the cache store (and ideally skip the fan-out / force teardown) on mismatch. This also strengthens M2's resync story.

M4 — CacheInvalidator.FindOverlapping and the FC16 parse can integer-overflow the half-open upper bound

Files: CacheInvalidator.cs:38,46, PlcMultiplexer.cs:812-817.

What's wrong. writeEnd = writeStart + writeQty and keyEnd = key.StartAddress + key.Qty are computed as int from two ushorts — that part is fine (max 0x1FFFE, no overflow). But the FC16 parse at PlcMultiplexer.cs:815-816 reads qty straight from the wire with no clamp: a malicious or buggy client can send FC16 with qty = 0xFFFF. That qty is then used both as the InFlightRequest.Qty and, on the FC06/FC16 response, passed to postCache.Invalidate(unitId, startAddress, qty). The proxy never validates that qty is within the DL260's FC16 cap of 100 (per dl205.md / CLAUDE.md) — the comment says "the PLC enforces the cap with exception 03", which is true for the write itself, but the invalidation still runs with the bogus 65535-register span because invalidation is gated only on isException == false. If the PLC accepts a smaller write and returns success, an attacker-supplied wide qty cannot reach here — but an FC16 request whose declared qty does not match its byte-count payload, accepted by a lax backend, would invalidate a 64K-register span and flush most of the cache. More concretely: the FC16 request qty and the response qty are assumed identical (the response echoes start+qty), but the code uses the request-side inFlight.Qty for invalidation (line 687) — so a request that lies about qty drives invalidation regardless of what the PLC actually wrote.

Impact. Cache-flush amplification / mild DoS on the cache from a malformed FC16. Not memory-unsafe (the int math does not overflow), but a contract violation: invalidation should reflect what was written, and it currently reflects what the client claimed.

Recommendation. Validate FC16 framing in OnUpstreamFrameAsync: require qty in 1..123 (or the DL260 cap of 100) and require frame.Length >= pduOffset + 6 + 2*qty and byteCount == 2*qty; reject (exception 03) otherwise rather than forwarding. Also prefer driving invalidation from the FC16 response PDU's echoed start/qty rather than the request's.

M5 — Watchdog can miss a late-attached coalescing party, hanging that upstream until its socket times out

Files: PlcMultiplexer.cs:1126-1161, InFlightByKeyMap.cs:59-83.

What's wrong. The watchdog snapshots stale entries from CorrelationMap (line 1123), then for each one does TryRemove from CorrelationMap (line 1131) and only afterwards removes the CoalescingKey from InFlightByKeyMap (line 1160). Between the CorrelationMap.TryRemove and the _inFlightByKey.TryRemove there is a window where a new FC03/FC04 request can call AttachOrCreate, find the still-present coalescing entry, and append itself to req.InterestedParties. The watchdog then walks req.InterestedParties at line 1165 — but it captured req from the CorrelationMap.TryRemove out-param, and InterestedParties is the same mutable List<InterestedParty>, so a party appended after the foreach started its enumeration would throw InvalidOperationException ("collection modified") — caught by the outer catch (Exception) at line 1195, which logs and exits the watchdog loop entirely. A party appended before the foreach but after the CorrelationMap.TryRemove is delivered the 0x0B (fine), but a party that attached and whose request was never enqueued to the backend (the coalescing entry is now orphaned — no proxy TxId of its own) gets nothing if the enumeration already passed it.

The doc (ReadCoalescing.md "Multi-writer multi-reader safety") claims the watchdog "removes the coalescing key before it walks req.InterestedParties". The code does the opposite ordering: CorrelationMap removal first, then the walk at 1165, then _inFlightByKey removal at 1160 is actually before the walk — re-reading: line 1157-1161 (_inFlightByKey.TryRemove) runs before line 1165 (foreach). So the key is removed before the walk. Good. But the CoalescingKey removal at 1160 only happens for req.Fc is 0x03 or 0x04; a non-coalescing FC still has no key. The real residual hole: a late attach in the window between CorrelationMap.TryRemove (1131) and _inFlightByKey.TryRemove (1160) appends to the list, and the foreach at 1165 then enumerates a list that was mutated after TryRemove. If the append lands during the foreach, InvalidOperationException kills the watchdog (caught at 1195 → loop exits → no more timeouts ever fire for this PLC).

Impact. A rare race can permanently disable the timeout watchdog for one PLC, after which any lost backend response leaks its correlation entry forever and hangs upstream clients indefinitely — defeating the entire reason the watchdog exists.

Recommendation. Remove the CoalescingKey from _inFlightByKey before removing from CorrelationMap, mirroring the backend reader's ordering (reader does CorrelationMap.TryRemove then _inFlightByKey.TryRemove — actually the reader has the same ordering; see N-note). The robust fix: in InFlightByKeyMap, hand fan-out callers a frozen copy of the party list at removal time, or snapshot req.InterestedParties.ToArray() immediately after the claim and iterate the copy. Apply the same snapshot in RunBackendReaderAsync's fan-out loop (line 710) for symmetry.

M6 — RunBackendReaderAsync cascade-on-EOF/fault is fire-and-forget; an exception inside TearDownBackendAsync is unobserved

Files: PlcMultiplexer.cs:753-766, 538-537, RunBackendHeartbeat/Watchdog similar.

What's wrong. Reader EOF (line 756), reader fault (765), and writer fault (536) all start TearDownBackendAsync(...) with _ = Task.Run-style fire-and-forget (just _ = TearDownBackendAsync(...)). TearDownBackendAsync is async Task; if it throws synchronously before its first await — or after, on a path the try/finally doesn't cover — the exception lands on an unobserved Task. The code guards the common case (if (!_disposeCts.IsCancellationRequested)) to avoid the disposed-_connectGate race, but _connectGate.WaitAsync can still throw ObjectDisposedException if disposal flips between the check and the call — that path is handled inside TearDownBackendAsync (line 398-403, returns). The residual risk is narrower than M1 but the pattern (unobserved fire-and-forget async Task from three call sites) is fragile: any future exception added to TearDownBackendAsync's pre-await section becomes a process-level UnobservedTaskException.

Impact. Latent; today probably benign. Worth hardening because all three of the layer's failure-detection paths funnel through it.

Recommendation. Wrap the fire-and-forget in a helper that attaches a continuation logging any fault: _ = TearDownBackendAsync(...).ContinueWith(t => _logger.LogError(t.Exception, ...), TaskContinuationOptions.OnlyOnFaulted). Or make the call sites await it where the calling context allows (the reader/writer tasks can await since they are about to exit anyway).


Minor

N1 — UpstreamPipe.RunReadLoopAsync permits a 7-byte zero-body frame but OnUpstreamFrameAsync cannot route it

Files: UpstreamPipe.cs:133-141, PlcMultiplexer.cs:775-797.

A Length < 1 frame is forwarded as a header-only byte[7] (line 136-138). OnUpstreamFrameAsync then reads fcByte = frame.Length > pduOffset ? ... : 0fcByte = 0, falls through every FC branch, allocates a proxy TxId, and forwards a 7-byte frame to the backend. A 7-byte MBAP frame with Length=0 is itself malformed Modbus (Length must be ≥2: UnitId+FC). The proxy should reject it rather than allocate a TxId and forward garbage. Recommend: in the read loop, treat length < 2 as a protocol error and close the pipe (consistent with the oversized-frame handling).

N2 — Backend reader's _inFlightByKey.TryRemove runs after CorrelationMap.TryRemove but the doc invariant wants it before fan-out only

File: PlcMultiplexer.cs:580-607.

The ordering is correct for the stated invariant (key removed at 602-607, fan-out at 710), but the CoalescingKey is reconstructed from inFlight.UnitId/Fc/StartAddress/Qty (line 604-605) — if inFlight were ever a heartbeat with Fc=0x03 this would spuriously probe the map; it is guarded by the IsHeartbeat early-continue at 595-596, so this is safe today. Worth a one-line comment that the FC03/FC04 branch is heartbeat-free by construction, so a future refactor doesn't reorder the IsHeartbeat check below it.

N3 — TxIdAllocator.Release silently no-ops a double-release, masking the documented TearDownBackendAsync race

Files: TxIdAllocator.cs:119-129, PlcMultiplexer.cs:377-385.

The "KNOWN RACE" comment block at PlcMultiplexer.cs:377-385 describes a double-release that frees a legitimately in-flight slot. Release cannot detect this — it just checks _inUse[id]. Consider having the allocator track a generation/epoch per slot, or at minimum increment a doubleReleaseSuspected counter when Release is called on an already-free slot, so the (admittedly very rare) silent-request-drop the comment accepts becomes observable in production rather than invisible.

N4 — ResponseCache LRU eviction is O(n) per insert and the doc's 1000-entry assumption is unenforced at the call path

File: ResponseCache.cs:206-230.

EvictLeastRecentlyUsed is a full linear scan, acceptable at the documented 1000-entry cap. But _maxEntries comes from Cache.MaxEntriesPerPlc which is operator-settable with no documented ceiling in this file; a fat-fingered MaxEntriesPerPlc = 1_000_000 turns every cache insert into a million-element scan under the lock, stalling the backend reader for every cache-miss store. Recommend either documenting/validating a hard ceiling or switching to an intrusive LRU (linked-list + dict) if large caps must be supported.

N5 — SnapshotOlderThan allocates a List every watchdog tick even when nothing is stale

File: CorrelationMap.cs:72-81, PlcMultiplexer.cs:1123-1124.

The watchdog ticks every BackendRequestTimeoutMs/4 (≥100 ms) and always allocates a fresh List, then checks .Count == 0. On an idle fleet of 54 PLCs that is 540 throwaway lists/sec. Trivial, but SnapshotOlderThan could return Array.Empty when the map is empty, or the watchdog could skip the call when _correlation.Count == 0.

N6 — CacheEntry.CapturedTags replay on a cache hit re-stamps observations "now" for every hit, including hits served during a backend outage

File: PlcMultiplexer.cs:844-850.

When the cache serves a hit during a recovering backend, the debug view's capture.Record(...) is called with CaptureDirection.Read and the current time, making the detail page show a "fresh" read of a tag that has not actually been read from the PLC for up to CacheTtlMs. This is arguably misleading for an operator debugging during an outage — the value is cache-aged, not live. Consider tagging replayed observations as cache-served, or stamping them with the entry's CachedAtUtc rather than now.

N7 — Doc drift: ConnectionModel.md says the cascade closes "every attached pipe" but the inline comment at the cascade still references the superseded "in flight" wording

File: PlcMultiplexer.cs:448-450.

The comment reads "Close every attached pipe that had a request in flight; the others will simply re-issue" — directly contradicted by the next line ("Per docs/.../ConnectionModel.md, ALL attached upstreams cascade") and by upstreamCount = _pipes.Count. The code is correct (all pipes cascade); the first comment sentence is stale and should be deleted to avoid confusing a future reader into thinking idle pipes survive.


Notes on things checked and found correct

  • TxId wraparound (TxIdAllocator.cs): the forward scan with _inFlightCount < SlotCount guard correctly terminates; _wrapCount increments on 0xFFFF→0x0000 cursor roll. No off-by-one. Reuse-distance maximisation works as documented.
  • Claim-then-dispatch in the watchdog (PlcMultiplexer.cs:1131): the TryRemove race against a real response is correctly handled — whoever wins TryRemove owns dispatch; the loser skips. No double-delivery of a real response + 0x0B for the non-coalesced case.
  • MBAP TxId preservation: proxy TxId is written at request enqueue (lines 1014-1015 / 1080-1081) and the original is restored per-party on the response (lines 736-737); heartbeat and cache-hit and exception frames all carry the correct TxId. Verified correct.
  • Single-party fan-out buffer reuse (line 732-734): the Count == 1 fast path reuses the buffer; multi-party clones per party. No aliasing bug.
  • Connect gate vs. teardown: _connectGate correctly serialises EnsureBackendConnectedAsync against TearDownBackendAsync; the bounded 2 s teardown wait and its documented best-effort fallback are reasonable (the residual race is C-noted at N3, not a new finding).
  • InFlightByKeyMap lock discipline: no async work happens under the lock; the factory is invoked under the lock but does only synchronous allocation/TryAdd. No lock-ordering inversion against TxIdAllocator's lock (allocator lock is always taken inside the map lock, never the reverse).
  • Cache recovering-state invalidation skip: the structural argument in the inline comment (lines 676-685) holds — no backend reader exists in recovering, so no FC06/FC16 response can drive an invalidation. Correct as written (but see C1, which is a different and real problem on the read side).