mbproxy: Wave 1 fixes from 2026-05-14 code review

Resolves the four critical correctness defects and the ShutdownCoordinator
double-stop ordering bug called out in codereviews/2026-05-14/Overview.md.
Tests: 362 pass / 0 fail (baseline 358 + 4 new W1 regression tests).

W1.1 — Context swap on running multiplexer.
  PlcMultiplexer._ctx becomes volatile with a new ReplaceContext() method
  that re-registers the cache stats provider on the (preserved) counters.
  PlcListener exposes its multiplexer; PlcListenerSupervisor.ReplaceContextAsync
  swaps the running mux first, then disposes the old cache. Hot-reload
  tag-list changes and the cache-flush-on-reload contract now actually take
  effect on the next PDU instead of waiting for the next listener fault.

W1.2 — Coalescing factory leak.
  When the InFlightByKey factory soft-fails (allocator saturation or duplicate
  TxId), the cleanup path now TryRemoves the stub and walks every party on it
  (including late attachers) to deliver Modbus exception 0x04. Previously
  only the leader got the exception; late attachers waited forever for a
  response that no backend round-trip would ever fire.

W1.3 — Backend-reader head-of-line block.
  UpstreamPipe gains TrySendResponse for non-blocking enqueue. The per-PLC
  backend reader's fan-out loop uses it instead of awaiting SendResponseAsync,
  so a wedged upstream's full bounded response channel can no longer stall
  the single backend reader and starve every other client on that PLC. New
  responseDropForFullUpstream counter on ProxyCounters / CounterSnapshot
  records the drops.

W1.4 — Stranded outbound frames after cascade.
  TearDownBackendAsync acquires _connectGate and drains any frames left in
  _outboundChannel after the writer task faulted/cancelled, releasing their
  proxy TxIds back to the allocator. Without this, a fresh
  EnsureBackendConnectedAsync racing the cascade would send stranded frames
  with old TxIds onto the new backend socket; the responses would arrive
  with no correlation entry and the upstream peers would hang on the
  watchdog until BackendRequestTimeoutMs.

W1.5 — Delete ShutdownCoordinator (Option B).
  Drain logic moved into ProxyWorker.StopAsync. AdminEndpointHost is no
  longer registered as IHostedService; ProxyWorker drives its lifecycle
  directly so admin starts after listeners are bound and stops AFTER the
  in-flight drain (the design's documented contract). Admin is resolved
  lazily in ExecuteAsync to break the circular DI graph
  (Admin -> StatusSnapshotBuilder -> ProxyWorker). GracefulShutdownTimeoutMs
  is now read fresh from IOptionsMonitor.CurrentValue at stop time, so a
  hot-reloaded value is honoured. Removes ShutdownCoordinator + tests.

New tests:
  PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu
  PlcMultiplexerTests.ReplaceContext_NewCache_NextReadGoesToBackend_NotOldCache
  UpstreamPipeTests.TrySendResponse_WhenChannelFull_ReturnsFalse_WithoutBlocking
  UpstreamPipeTests.TrySendResponse_AfterDispose_ReturnsFalse

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-14 05:16:13 -04:00
parent f2c6669444
commit ce32c5cee8
14 changed files with 614 additions and 532 deletions
@@ -180,35 +180,40 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
RecoveryAttempts: Interlocked.CompareExchange(ref _recoveryAttempts, 0, 0));
/// <summary>
/// Atomically swaps the per-PLC context (tag map) without restarting the listener.
/// Atomically swaps the per-PLC context (tag map + optional response cache) on the
/// running listener AND its live multiplexer.
///
/// <para><b>Transition window</b>: there is a brief overlap where the old
/// <see cref="PlcListener"/> is running its accept loop with the old context while the
/// new context reference is being written. The volatile write ensures that the very
/// next <c>PlcListener</c> constructed inside the Polly loop (on any subsequent fault
/// recovery) picks up <paramref name="newCtx"/>. Existing in-flight upstream pipes
/// served by the current multiplexer keep their reference to the context captured at
/// multiplexer construction time; they finish on the old map. New connections after
/// this call use the new map. This is the correct design — partial-BCD rewrites
/// mid-request would be worse than a one-request gap.</para>
/// <para><b>Phase 12 (W1.1)</b> — previously this method only updated the supervisor's
/// <c>_currentContext</c> slot, which meant the running <see cref="PlcMultiplexer"/>
/// kept using the OLD context (it captured the reference at construction). A reload
/// only became visible on the next listener fault. Now the swap propagates into the
/// running mux via <see cref="PlcMultiplexer.ReplaceContext"/>, so the very next PDU
/// sees the new tag map / new cache. Counters are preserved (the new context carries
/// the same <c>ProxyCounters</c> instance) so operator history is not reset.</para>
///
/// <para>This method is intentionally lightweight: it performs only the volatile write
/// and returns immediately. The <paramref name="ct"/> parameter is present for API
/// symmetry with start/stop and to accommodate future async expansion.</para>
/// <para><b>Old cache lifecycle</b>: the supervisor disposes the outgoing context's
/// cache AFTER the multiplexer has been swapped to the new context. By that point no
/// more reads or writes can land on the old cache. Per the design contract, any
/// tag-list change drops the entire PLC cache.</para>
/// </summary>
public Task ReplaceContextAsync(PerPlcContext newCtx, CancellationToken ct)
{
// Phase 11: dispose the outgoing context's response cache (if any) so its
// eviction loop terminates. The "any tag-list reload flushes the affected PLC's
// whole cache" doctrine is satisfied here — the new context constructs its own
// fresh cache, the old cache is dropped wholesale.
var oldCache = _currentContext?.Cache;
// Volatile write: the next PlcListener created in RunSupervisorAsync will see
// the new context. The accept loop itself does not hold a direct reference to
// _currentContext — it was captured at PlcListener construction time.
// Volatile write: the next PlcListener created in RunSupervisorAsync (on any
// subsequent fault recovery) will pick up newCtx through this slot.
_currentContext = newCtx;
// Phase 12 (W1.1) — push the swap into the running multiplexer so existing
// connections see the new tag map / new cache on their next PDU. _currentListener
// may be null between Polly retry attempts; in that case the next listener built
// inside the Polly loop will pick up newCtx through _currentContext above.
_currentListener?.Multiplexer?.ReplaceContext(newCtx);
// Phase 12 (W1.1 + W2.8 prereq) — drop the outgoing cache AFTER the swap so the
// running multiplexer can no longer reach it. Dispose stops the eviction loop and
// releases the timer. (The cache.flushed log event is W2.8 work; this Wave-1 fix
// is the "no longer in use, safe to drop" piece.)
if (oldCache is not null && !ReferenceEquals(oldCache, newCtx.Cache))
oldCache.Dispose();