Files
Joseph Doherty f2c6669444 mbproxy/codereviews: 2026-05-14 in-depth review + remediation plan
Eight area-focused reviews (BCD rewriter, multiplexer, response cache,
supervisor + hot-reload, admin + diagnostics, hosting + options, test
suite) plus an Overview that prioritises findings across areas, and a
RemediationPlan that groups the work into three waves with per-item
file:line citations and regression-test sketches.

Findings call out: hot-reload tag-list/cache changes that don't reach the
running multiplexer, a coalescing factory leak that hangs late attachers,
backend-reader head-of-line block on a wedged upstream, stranded outbound
frames after cascade, and ShutdownCoordinator double-stop ordering. Plus
the unconventional 32-bit BCD wire format (two base-10000 digits in CDAB,
not standard binary), unreachable BcdValidationError.DuplicateAddress,
mbproxy.cache.flushed event that's defined but never emitted, and missing
test coverage for Cache.AllowLongTtl.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:15:34 -04:00

14 KiB
Raw Permalink Blame History

PlcMultiplexer Concurrency Review

Scope: src/Mbproxy/Proxy/Multiplexing/*, src/Mbproxy/Proxy/{PlcListener,PerPlcContext,ProxyWorker,ProxyCounters}.cs, the relevant Options POCOs, and the matching tests under tests/Mbproxy.Tests/Proxy/Multiplexing/* + ProxyForwardingTests.cs. Cross-checked against docs/design.md (Connection model, Read coalescing), docs/Architecture/ConnectionModel.md, docs/Architecture/ReadCoalescing.md.

Summary

  • The architecture is sound and the documented invariants (single backend reader/writer, claim-then-dispatch in the watchdog, remove-from-key-map-before-fanout) are upheld. Most paths are well-defended.
  • Two correctness defects in the coalescing path can deliver responses without restoring the original TxId or duplicate counter increments; one in the cache path lets a hit survive a freshly-cascaded backend.
  • A handful of leaks and lifecycle races exist around _connectGate, the outbound channel after teardown, the _disposed flag, and the second backend-CTS dispose.
  • Test coverage is solid for happy-path multiplexing/coalescing/saturation/cascade/watchdog but does not exercise the watchdog↔response race or coalescing-mid-cascade.

Critical Findings

C1. Coalescing factory failure can leak the entry, deliver no response, and hang every late attacher

PlcMultiplexer.cs:692-700 — when _correlation.TryAdd fails inside the factory, the code releases the TxId, logs, and returns the stub inFlight. The map keeps the just-stored entry under key but inFlightForSend stays null, so the post-lock branch at line 724 takes the saturation cleanup (_inFlightByKey.TryRemove(key, out _)) and sends an exception only to the leader. Any late attacher that joined between the factory call and the TryRemove is appended to a List<InterestedParty> that no one will ever drain — its pipe hangs until the upstream socket closes. Same pathology if _allocator.TryAllocate returns false inside the factory: the stub is stored under key, late attaches accumulate, and only the leader gets exception 0x04. The factory must either (a) not store on failure, or (b) the post-lock cleanup must walk all parties on the stub before removing. (See also InFlightByKeyMap.cs:82-83: _entries[key] = req runs unconditionally.)

C2. Coalesce-hit late attaches against an FC03/FC04 cache hit can be silently dropped

PlcMultiplexer.cs:603-624 — for an FC03/FC04 cache hit, the path returns immediately at line 623 without consulting _inFlightByKey. Meanwhile, a peer in flight on the exact same (unit, fc, start, qty) may already be coalescing late attaches (line 645-718). The cache lookup happens before EnsureBackendConnectedAsync and before coalescing — but if the cached entry was just stored by the backend reader at line 490 while a coalesced-leader's response is still in fan-out, a late upstream that arrives during fan-out can take the cache hit path and return correctly. The real hazard is the inverse: at RunBackendReaderAsync line 490 the cache Set happens before the InFlight TryRemove at line 441. A new request arriving between those two steps could miss the cache (entry not yet stored… actually the order is reversed: cache Set is at 490 and _inFlightByKey.TryRemove is at 441, so cache is populated after the key is removed — correct order). However, between line 441 and line 490 a new identical FC03 will: cache miss (not yet set), then check _inFlightByKey — missing too — and open a fresh round-trip. That's wasteful but not corrupting. Downgrade to Major.

C3. Cache hit can be served against a backend that has just cascaded

PlcMultiplexer.cs:610-624 — by design the cache "survives backend disconnects," but the contract in ConnectionModel.md says cascade closes every attached upstream pipe. A new pipe that connects mid-cascade and immediately reads a cached tag returns successfully, while a peer attached one millisecond earlier just had its socket closed. This is an inconsistent visibility model — but the design explicitly accepts it for cached reads. Note however: after a cascade, _inFlightByKey.DrainAll() runs (line 324) without delivering anything to the parties on the drained entries. Those parties' pipes are then disposed at line 338 so the hang in C1 doesn't apply here. Acceptable but worth a doc note.

C4. Backend reader's send to upstream uses the backend CTS, so a slow upstream blocks the whole reader

PlcMultiplexer.cs:545await party.Pipe.SendResponseAsync(outFrame, ct) uses the backend reader's ct (= _backendCts.Token). If the upstream pipe's _responseChannel is full (capacity 16) and the backend writer keeps producing for it, this WriteAsync blocks inside the single backend reader task, stalling responses to every other PLC client sharing that backend socket. The bounded outbound channel only backpressures inbound; on the response side a single wedged upstream stops the world for that PLC. Mitigations: drop-on-full for response channel, write with timeout, or fan-out tasks per party.

Major Findings

M1. _disposed is non-volatile and read on every hot path

PlcMultiplexer.cs:86, checked at lines 142, 181, 216, 566. A non-volatile bool set on the disposing thread is not guaranteed to be observed by other threads under the .NET memory model on weakly-ordered platforms (ARM). Same in UpstreamPipe.cs:52,77. Recommend volatile bool or use Interlocked.Exchange(ref int, 1) pattern.

M2. _disposeCts is disposed while watchdog/cancel paths may still touch it

PlcMultiplexer.cs:209 disposes _disposeCts after a 2-second WaitAsync on the watchdog. If the watchdog took >2 s to settle (e.g. blocked on a slow SendResponseAsync), it will subsequently call Task.Delay(tickMs, ct) against a disposed CTS → ObjectDisposedException, which the outer catch (Exception ex) (line 909) logs as an error. Best-effort, but noisy. Same risk: oldCts?.Dispose() at line 350 races with RunBackendReaderAsync's ct.IsCancellationRequested check at line 392 and with the catch handler.

M3. _connectGate is never disposed

PlcMultiplexer.cs:283. SemaphoreSlim holds an internal WaitHandle lazily; in practice rarely realized so leak is small, but it's a missed dispose on a long-lived object.

M4. Counter parity contract violated when allocator saturates on a coalescing miss

PlcMultiplexer.cs:721-733 — the miss is counted at line 721 (IncrementCoalescedMiss). If the factory then fails (saturation/duplicate-key), the path at line 729 sends exception 04 but never decrements the miss counter. The doc's identity coalescedHitCount + coalescedMissCount = total FC03+FC04 still holds in raw counting terms (each FC03 entered the path once), but the operator semantically expects "miss = a fresh backend round-trip." Saturation produced no round-trip. Document this, or move the increment after success.

M5. Watchdog snapshot enumerates a ConcurrentDictionary non-atomically across multiple ticks

CorrelationMap.cs:75-79 uses foreach (var kvp in _entries) which is a snapshot enumerator but does not take a moment-in-time lock. An entry added after the enumerator started may or may not be seen this tick — fine. The issue is that the watchdog reads kvp.Value.SentAtUtc and may evaluate against an entry whose true SentAtUtc is later than threshold because the entry was added between the snapshot copy at the dictionary level and the field read. Since InFlightRequest is immutable, the field read is consistent, and a too-young entry just won't satisfy <= — so this is benign. Worth a comment.

M6. Watchdog comment says "1-second floor" but code uses 100 ms

PlcMultiplexer.cs:844-846 — the doc string and ConnectionModel.md both say "100 ms floor"; the inline comment at line 845 says "1-second floor". Confusing. The 100 ms floor in code is correct.

M7. Pre-rewriter FC byte read after rewriter mutates the buffer

PlcMultiplexer.cs:468byte fcInResponse = frame[MbapFrame.HeaderSize]; // post-rewriter, but the FC byte is never rewritten. True for FC03/04/06/16 responses, but if a future rewriter ever touches byte 0 of the PDU body this becomes silently wrong. Use inFlight.Fc from the request side instead.

M8. RunBackendWriterAsync doesn't drain remaining frames on cascade

PlcMultiplexer.cs:362-385 — when the writer faults at line 380 it triggers a cascade. Frames already enqueued in _outboundChannel are stranded (the channel is not completed and no subsequent writer task will ever exist against the same channel — EnsureBackendConnectedAsync wires a fresh task at line 269, which then drains those stranded frames into the new backend socket). Those frames carry old proxy TxIds whose correlation entries were just dropped by TearDownBackendAsync. The PLC will respond to TxIds that have no correlation entry, the reader logs nothing (line 423-428: silent drop), the upstream peers hang waiting for the watchdog. Either complete the channel on cascade and recreate it on reconnect, or drain-and-discard before re-running the writer.

M9. IncrementCacheMiss always increments before the fall-through to coalescing

PlcMultiplexer.cs:626 — for a cache-eligible read, the miss is counted, then the request falls through to coalescing. If two upstream peers issue the same read, one stores the response in cache, the other coalesces onto it. Result: one CacheMiss, one CoalescedMiss, one CoalescedHit (the second hits coalesce, not cache). If two peers issue, peer A is the leader: A → CacheMiss, CoalescedMiss; B → CacheMiss, CoalescedHit. Per-second the operator reads CacheMiss = 2 but only one actually had to go to backend. This is reasonable but worth documenting.

Minor Findings

  • PlcMultiplexer.cs:445-450: Dead code — elapsedMs is computed and never used; the actual EWMA conversion happens two lines below.
  • UpstreamPipe.cs:262-263: return firstRead && remaining == count ? false : false; — both branches return false. The conditional is meaningless; just return false;.
  • PlcMultiplexer.cs:269-270: Backend tasks are created with CancellationToken.None so they cannot be observed via that token; relies on cts2.Token propagating cancellation through SendAsync/ReceiveAsync. Fine but worth a comment.
  • PlcMultiplexer.cs:537-540: The Count == 1 reuse of the original frame buffer is correct only because the BCD rewriter has already run. After patching outFrame[0..2] for a single party we mutate the same buffer the cache Set at line 478 already snapshotted (Buffer.BlockCopy), so cache integrity holds. Good — but a comment would help reviewers.
  • InFlightByKeyMap.cs:62 signature: public bool TryAttachOrCreate(...) always returns true; the bool is dead.
  • Test gap: No test exercises C1 (factory-saturation/duplicate-key with multiple attachers in flight). No test exercises the watchdog↔response race (response arriving in the same tick as SnapshotOlderThan). No test exercises cascade racing with a brand-new accept (does the new pipe get attached to the freshly-empty _pipes and survive, or does it land in a zombie multiplexer mid-teardown?).
  • ProxyWorker.cs:64-194: There is no add/remove reconciliation path here — this is delegated to ConfigReconciler via Attach(_supervisors, opts). Out of scope for this audit but worth verifying that reconciler tear-down respects the same disposal ordering.

What Looks Good

  • TxIdAllocator is straightforward and demonstrably correct: single lock on the hot path, scan-from-cursor with wrap detection, defensive double-release no-op, exhaustion returns false cleanly. Stress test at TxIdAllocatorTests.cs:101 covers concurrent allocate/release for duplicate detection.
  • Claim-then-dispatch in the watchdog (PlcMultiplexer.cs:863): TryRemove is the single source of truth; a response that wins the race makes the watchdog skip silently. Symmetric in the response path (line 423).
  • InFlightByKeyMap lock discipline — every state mutation under the same object lock; remove-before-iterate ordering at the multiplexer (line 441) preserves the no-late-attach invariant for fan-out. The Concurrent_AttachOrCreate_From_Two_Threads test (16×500 ops, single key, unbounded MaxParties → exactly one create) is a clean concurrency proof.
  • TearDownBackendAsync idempotency via the _backendLock-protected null-swap (line 290-301) is correct; second invocation early-returns at line 303.
  • BuildExceptionFrame correctly produces a real 9-byte Modbus exception frame so client libraries see a normal ModbusException.
  • Per-call context cloning via WithCurrentRequest (PerPlcContext.cs:61) avoids cross-talk between concurrent multiplexed responses on the shared PerPlcContext.
  • Counters are uniformly Interlocked — no ++ on shared longs, peak via CAS, EWMA via fixed-point CAS loop.

Open Questions

  1. Is the design intent that FC03/FC04 with MaxParties = int.MaxValue is supported in production, or is the test-only setting an antipattern? InFlightByKeyMapTests.cs:208 uses it — a real backend response with 8000 parties means the reader walks 8000 frame clones synchronously inside the single backend reader (C4 amplified).
  2. When EnsureBackendConnectedAsync returns false, OnUpstreamFrameAsync disposes the upstream pipe (line 636). Is that the right behavior given the cache-hit-survives-backend-down contract? An upstream that needs a non-cached read will be killed even if its next read would have been served from cache.
  3. The cascade walks _pipes.Values.ToArray() (line 335). A pipe accepted between snapshot and _pipes.Clear() (line 341) is not disposed and not counted in upstreamCount. The next OnUpstreamFrameAsync it receives will see _disposed == false and try to reuse the multiplexer — likely hits EnsureBackendConnectedAsync and proceeds normally. Intentional?
  4. RunBackendReaderAsync line 550 does _ = TearDownBackendAsync(...) fire-and-forget. The reader task itself then exits cleanly. But the cascade's "best-effort join" at line 348 awaits the same reader task (reader = _backendReaderTask) → self-join? Trace: TearDown snapshots reader from the field, the field is then set to null inside the lock, and reader.WaitAsync is awaited. Since the reader returned and started cascade fire-and-forget, the await on its task completes immediately — fine. Worth a comment confirming the intent.