mbproxy: Wave 2 fixes from 2026-05-14 code review

Resolves the 21 Major findings catalogued in
codereviews/2026-05-14/RemediationPlan.md (Wave 2). Tests: 370 pass / 0 fail
(baseline 363 + 7 new W2 regression tests).

Multiplexer / concurrency:
  W2.1  ConfigReconciler.Attach now threads the live coalescingAccessor through
        to add/restart-built supervisors so a hot-reload of
        ReadCoalescing.{Enabled,MaxParties} propagates to PLCs added or
        restarted via reload.
  W2.2  PlcMultiplexer._disposed and UpstreamPipe._disposed are now volatile
        for ARM/portability defense.
  W2.3  ProxyWorker._supervisors / ConfigReconciler._supervisors switched from
        Dictionary to ConcurrentDictionary; reconciler uses TryRemove. The
        outer Apply is serialised by a semaphore but the inner Add/Remove/
        Restart Task.WhenAll continuations run in parallel.
  W2.4  Counter parity for cache miss + coalescing-saturation miss documented
        inline (per-design contract; behavior unchanged).
  W2.5  _disposeCts.Dispose() and _connectGate.Dispose() guarded against late
        watchdog ticks.
  W2.6  _connectGate disposed in DisposeAsync.
  W2.7  Inline doc clarifying the post-rewriter FC byte read.

Cache / hot-reload:
  W2.8  PlcListenerSupervisor.ReplaceContextAsync now calls Clear() to capture
        the entry count, emits mbproxy.cache.flushed, then disposes the old
        cache. Previously the event was defined but never emitted.
  W2.9  Inline doc explaining the implicit "skip cache invalidation while
        recovering" gating (no backend reader during recovery → no FC06/FC16
        response → no invalidation).
  W2.10 ReloadValidator now re-checks resolved per-tag CacheTtlMs against
        Cache.AllowLongTtl after BcdTagMapBuilder folds the per-PLC default.

BCD rewriter:
  W2.11 Duplicate addresses detected within Global itself and within the per-PLC
        Add list itself, BEFORE the working dictionary collapses keys. Cross-list
        collisions (Global vs Add) remain the documented width-override pattern.
        Previously the DuplicateAddress error was unreachable dead code.
  W2.12 OverlappingHighRegister reports each colliding pair exactly once
        (canonicalised low/high pair tracked in a HashSet).
  W2.13 FC16 32-bit write rejects clientLow > 9999 or clientHigh > 9999 BEFORE
        the high*10000+low reconstruction. Without this guard, (high=9999,
        low=9999) silently re-encoded as (high=9998, low=9999), losing 1 from
        the high word.
  W2.14 FC16 validates pdu.Length >= 6 + qty*2 upfront — no half-rewritten
        requests when a malformed client claims more registers than it ships.

Supervisor:
  W2.15 WaitForInitialBindAttemptAsync now backed by TaskCompletionSource
        instead of 10ms busy-poll. Resolves race against fast Stopped→Bound→
        Stopped transitions and hangs when the supervisor task throws.
  W2.16 StartAsync refuses re-entry on a non-Stopped supervisor (was leaking
        the previous _supervisorCts).
  W2.17 New TransitionTo helper writes _state, _lastBindError, and (optionally)
        _recoveryAttempts under one lock. Snapshot() reads under the same lock
        so the status page never reports an inconsistent triple. Truncate
        helper extracted (was copy-pasted across three sites).
  W2.18 MbproxyOptionsValidator + ReloadValidator reject Connection.{Backend
        ConnectTimeoutMs, BackendRequestTimeoutMs, GracefulShutdownTimeoutMs}
        <= 0. Misconfigured 0 produces immediate CancelAfter(0) failures.

Hosting / diagnostics:
  W2.20 ProxyWorker.StopAsync supervisor-stop deadline now reads from
        IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs
        (was hard-coded 5s).
  W2.21 src/Mbproxy/appsettings.json deleted; the published file is now a Link
        to install/mbproxy.config.template.json so the binary ships with a
        usable, fully-commented example config instead of an empty stub. Tests
        strip the inherited file from their bin via an AfterTargets="Build"
        Target so they don't pick up the template's example PLCs.
  W2.22 invalidBcdWarnings (PlcPdusStatus) and codeOther (ExceptionCounts)
        added to StatusDto, plumbed through StatusSnapshotBuilder, surfaced
        in StatusHtmlRenderer table cells.
  W2.23 EventLogBridge caches EventLog.SourceExists at construction so Emit
        doesn't hit the registry on every Error+ log line.

New regression tests:
  ReloadValidatorTests:
    Validate_PerTagCacheTtl_Above60s_Without_AllowLongTtl_Fails
    Validate_PerTagCacheTtl_Above60s_With_AllowLongTtl_Passes
    Validate_ResolvedTtl_FromPerPlcDefault_AboveCap_Fails
    Validate_ZeroBackendConnectTimeoutMs_Fails
    Validate_NegativeGracefulShutdownTimeoutMs_Fails
  BcdPduPipelineTests:
    FC16_32Bit_ClientHighOrLowAbove9999_PassesThroughRaw_WithInvalidBcdWarning
    FC16_TruncatedRegisterData_PassesThroughRaw_NoPartialRewrite

Reworked tests in BcdTagMapBuilderTests for the W2.11 contract (Global dup,
Add dup, Add-overrides-Global accepted as width override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-14 05:48:44 -04:00
parent ce32c5cee8
commit e66b17fe5f
21 changed files with 545 additions and 163 deletions
@@ -88,7 +88,11 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
private Task? _backendReaderTask;
private readonly CancellationTokenSource _disposeCts = new();
private bool _disposed;
// Phase 12 (W2.2) — volatile so the disposing thread's write is observed by every
// hot-path reader (OnUpstreamFrameAsync, ReplaceContext, Attach, etc.) without a
// separate fence. On x86/x64 plain reads happen to give acquire-release semantics, so
// this is defense for ARM hosts and future portability.
private volatile bool _disposed;
private Task? _watchdogTask;
public PlcMultiplexer(
@@ -240,7 +244,11 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
}
_pipes.Clear();
_disposeCts.Dispose();
// Phase 12 (W2.5, W2.6) — guard the CTS dispose against a watchdog tick that
// raced past the WaitAsync above (e.g. a slow Task.Delay completion observing
// cancellation late). Also dispose the connect-gate semaphore.
try { _disposeCts.Dispose(); } catch (ObjectDisposedException) { /* already disposed */ }
try { _connectGate.Dispose(); } catch (ObjectDisposedException) { /* already disposed */ }
}
// ── Backend connect / teardown ────────────────────────────────────────────
@@ -522,9 +530,14 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// cache-eligible (resolvedTtlMs > 0).
// * FC06/FC16 successful responses invalidate every cached entry whose
// address range overlaps the write.
//
// Phase 12 (W2.7) — exception bit comes from the post-rewriter buffer
// (the rewriter never touches the FC byte today, but reading from
// inFlight.Fc would lose the exception bit). The base FC for routing
// decisions uses inFlight.Fc — the request side knows what was sent.
if (_ctx.Cache is { } postCache)
{
byte fcInResponse = frame[MbapFrame.HeaderSize]; // post-rewriter, but the FC byte is never rewritten
byte fcInResponse = frame[MbapFrame.HeaderSize];
bool isException = (fcInResponse & 0x80) != 0;
if (!isException)
@@ -555,6 +568,16 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
}
else if (inFlight.Fc is 0x06 or 0x10)
{
// Phase 12 (W2.9) — the design contract "invalidations during a
// recovering listener state are skipped" (design.md:203) is
// upheld IMPLICITLY here: invalidation only fires inside the
// backend reader task when a non-exception FC06/FC16 response
// arrives. A `Recovering` listener has no backend reader (the
// multiplexer is torn down between recovery attempts), so no
// response can land here, so no invalidation. The gating is
// structural, not conditional. If a future change ever produces
// a write response off the live backend, an explicit recovering-
// state check would need to be added.
int invalidated = postCache.Invalidate(
inFlight.UnitId, inFlight.StartAddress, inFlight.Qty);
if (invalidated > 0)
@@ -692,6 +715,12 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
return;
}
// Per design contract: "miss" = "fell through to coalescing/backend".
// When two upstream peers issue the same cache-eligible read, both increment
// CacheMiss; only one then opens a backend round-trip (the second coalesces
// onto the first via the InFlightByKey path below). So `CacheMiss` does NOT
// equal "produced a backend round-trip" — it equals "did not find a fresh
// cache entry". The identity `Hit + Miss = cache-eligible requests` holds.
_ctx.Counters.IncrementCacheMiss();
CacheLogEvents.Miss(_logger, _plc.Name, unitId, fcByte, startAddr, qty);
}
@@ -786,7 +815,11 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
return;
}
// Coalesce miss: we just opened a fresh in-flight entry.
// Coalesce miss: this request did not attach to an in-flight peer. Per the
// design contract `coalescedHitCount + coalescedMissCount = total FC03/FC04`,
// so even saturation-failure paths (factory below returns null inFlightForSend)
// count as a miss — every FC03/FC04 entered the coalescing path exactly once.
// "Miss" here means "did not coalesce", NOT "produced a backend round-trip".
_ctx.Counters.IncrementCoalescedMiss();
CoalescingLogEvents.Miss(_logger, _plc.Name, unitId, fcByte, startAddr, qty);
@@ -49,7 +49,9 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
// Internal CTS lets the multiplexer signal "drop this pipe now" without waiting for
// the upstream socket to close cleanly.
private readonly CancellationTokenSource _cts = new();
private bool _disposed;
// Phase 12 (W2.2) — volatile so writes from DisposeAsync are observed by IsAlive /
// TrySendResponse on other threads without a fence.
private volatile bool _disposed;
// Phase 9: per-pipe forwarded-PDU counter (replaces the per-pair counter from the
// 1:1 model). Read by the status page.