6 Commits

Author SHA1 Message Date
Joseph Doherty 0868613890 mbproxy: add keepalive / connection monitoring
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:40:54 -04:00
Joseph Doherty 1a2856526a mbproxy: strip historical phase/wave/plan references from source comments
Comments described the *history* of how the code arrived (phase numbers,
wave IDs, review IDs, dated TODOs) instead of what it does today. That
scaffolding rotted as the codebase evolved. Cleaned 60 source files +
.gitignore; behaviour unchanged (387/387 tests still pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:04:30 -04:00
Joseph Doherty 2545237973 mbproxy: close remaining 5 W3 test gaps from 2026-05-14 review
Closes the 5 "easily addable" W3 test gaps left after the prior W3 commit;
the 5 race-hard gaps remain documented as known omissions per the plan.
Tests: 382 pass / 0 fail (baseline 378 + 4 net new methods — the supervisor
runtime-fault test replaces the existing placeholder).

  #11 BcdCodecTests.Encode16_IntMinValue_Throws_OutOfRange_NoArithmeticSurprise
      Locks the (uint)value > Max16 boundary check against int.MinValue. The
      cast becomes 0x80000000 which is well above 9999, so the throw fires
      cleanly. Prevents regression to a two-sided int comparison that would
      underflow.

  #15 BcdPduPipelineTests.FC03_Request_QtyAbove128_AtNonBcdAddress_PassesThroughUnchanged
      DL205/DL260 caps FC03/FC04 at qty=128 (DL260/dl205.md). The proxy must
      NOT truncate the qty field — passing through unchanged lets the PLC's
      own validator return exception 03 to the client (transparent contract
      for FCs/addresses the rewriter doesn't own).

  #4 SupervisorTests.Supervisor_RuntimeFault_OnRunningListener_RecoversAndRebinds
      Replaces the previous placeholder. Genuinely faults the running listener
      mid-life by stopping its underlying TcpListener via reflection (the
      single externally-observable hook to force the accept loop's
      AcceptAsync to throw ObjectDisposedException). Asserts the supervisor
      transitions to Recovering, re-binds via the Polly pipeline, and bumps
      RecoveryAttempts.

  #10 HotReloadE2ETests.E2E_ReadCoalescingEnabled_FlipAtRuntime_PropagatesToOptionsMonitor
      Validates that flipping Mbproxy.Resilience.ReadCoalescing.Enabled at
      runtime via hot-reload propagates through the live IOptionsMonitor.
      The W2.1 fix wires the accessor through to add/restart supervisors;
      the multiplexer reads it per-PDU (unit-tested separately). Proving
      IOptionsMonitor sees the new value is sufficient for the contract.

  #16 ConfigReconcilerTests.Apply_ManyConcurrentReloads_With_PlcChurn_NoCorruption
      Stress-tests the W2.3 ConcurrentDictionary fix. 16 concurrent applies
      cycle through 8 distinct PLC rosters, driving Add+Remove churn against
      the live supervisor dict. Without W2.3 the inner Task.WhenAll
      continuations would corrupt Dictionary<,> and crash with
      KeyNotFoundException / ArgumentException. Asserts every apply succeeds,
      no orphan supervisors remain, and the reload counter equals 16.

The 5 deterministically-race-hard gaps (#5 TxId saturation propagation, #6
coalescing factory leak under saturation, #7 backend-reader head-of-line
block, #8 watchdog↔response race, #9 cascade↔new-accept race) remain open
by design — reproducing those races deterministically requires test seams in
production code or stress-style tests that flake on slow CI. The Wave-1
fixes are still verified at the unit-contract level
(UpstreamPipeTests.TrySendResponse_WhenChannelFull, etc.).

This closes everything actionable in codereviews/2026-05-14/RemediationPlan.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:17:03 -04:00
Joseph Doherty 7ead3581ab mbproxy: Wave 3 cleanups, docs, and test gaps from 2026-05-14 review
Closes the Wave 3 (cleanup) tier of codereviews/2026-05-14/RemediationPlan.md.
Tests: 378 pass / 0 fail (baseline 370 + 8 new W3 regression tests).

Code cleanups:
  * PlcMultiplexer: removed dead `elapsedMs` calculation (the actual EWMA
    conversion uses Stopwatch ticks two lines below).
  * UpstreamPipe.FillAsync: dropped the meaningless `firstRead && remaining
    == count ? false : false` ternary; both branches were `false`.
  * InFlightByKeyMap.TryAttachOrCreate (always returned `true`) renamed to
    `AttachOrCreate` and made `void`. Test sites updated to drop the dead
    `bool ok = ...; ok.ShouldBeTrue();` assertions.
  * BcdCodec.HasBadNibble promoted from private to internal; the duplicate
    copy in BcdPduPipeline removed and the call sites updated to
    `BcdCodec.HasBadNibble`.
  * PlcMultiplexer watchdog comment fixed: said "1-second floor", code uses
    100 ms. Now both agree.
  * StatusSnapshotBuilder: simplified the unreachable
    `RemoteEp?.ToString() ?? RemoteEp?.Address.ToString() ?? "?"` to
    `RemoteEp?.ToString() ?? "?"`.
  * Mbproxy.csproj: stale "deferred" Polly comment replaced with a real
    description of where Polly is used (BackendConnect + ListenerRecovery).

Doc updates:
  * README: added a callout about the unconventional 32-bit BCD wire format
    ("two base-10000 digits in CDAB", not standard binary CDAB Int32) so
    integrators using off-the-shelf clients learn about the silent-corruption
    hazard before configuring writes.
  * docs/design.md: clarified `cacheMissCount` and `coalescedMissCount`
    semantics — "miss" means "did not find a fresh entry / did not coalesce",
    NOT "produced a backend round-trip". Operators wanting actual backend
    traffic should compute `miss − coalescedHit − exception04`.
  * docs/Architecture/ResponseCache.md: documented the structural
    "skip invalidation while recovering" gating (no backend reader during
    recovery → no FC06/FC16 response → no invalidation).
  * docs/Operations/Configuration.md: noted that the Event Log sink is the
    custom EventLogBridge, not Serilog.Sinks.EventLog (W2.23 cached check).
  * docs/plan/README.md: added a Phase 12 row pointing at the remediation
    plan and linking out to codereviews/2026-05-14/.

Test additions (W3 high-value gaps):
  * BcdPduPipelineTests:
    - FC16_WriteStartsOnHighWord_Of32BitPair_PassesThroughRaw_WithPartialWarning
      (symmetric inverse of the existing low-side partial-overlap test).
    - FC03_Mixed_16Bit_32Bit_AndNonBcd_InOneRead_OnlyConfiguredSlotsRewritten
      (mixed-slot routing in a single FC03 read).
    - FC16_Response_PassesThroughUnchanged_RegardlessOfTagMap (FC16 response
      carries no register data; rewriter must pass through).
  * AdminEndpointTests:
    - NonGetMethod_AgainstAdminRoutes_Returns405 (Theory: POST/PUT/DELETE/
      PATCH against `/` and `/status.json` must return 405; guards against
      an accidental MapPost being added later).
  * HotReloadE2ETests:
    - E2E_TagListReload_OnCacheablePlc_EmitsCacheFlushedEvent (validates the
      W2.8 cache.flushed wiring end-to-end via the real FileSystemWatcher
      reload path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:06:52 -04:00
Joseph Doherty e66b17fe5f mbproxy: Wave 2 fixes from 2026-05-14 code review
Resolves the 21 Major findings catalogued in
codereviews/2026-05-14/RemediationPlan.md (Wave 2). Tests: 370 pass / 0 fail
(baseline 363 + 7 new W2 regression tests).

Multiplexer / concurrency:
  W2.1  ConfigReconciler.Attach now threads the live coalescingAccessor through
        to add/restart-built supervisors so a hot-reload of
        ReadCoalescing.{Enabled,MaxParties} propagates to PLCs added or
        restarted via reload.
  W2.2  PlcMultiplexer._disposed and UpstreamPipe._disposed are now volatile
        for ARM/portability defense.
  W2.3  ProxyWorker._supervisors / ConfigReconciler._supervisors switched from
        Dictionary to ConcurrentDictionary; reconciler uses TryRemove. The
        outer Apply is serialised by a semaphore but the inner Add/Remove/
        Restart Task.WhenAll continuations run in parallel.
  W2.4  Counter parity for cache miss + coalescing-saturation miss documented
        inline (per-design contract; behavior unchanged).
  W2.5  _disposeCts.Dispose() and _connectGate.Dispose() guarded against late
        watchdog ticks.
  W2.6  _connectGate disposed in DisposeAsync.
  W2.7  Inline doc clarifying the post-rewriter FC byte read.

Cache / hot-reload:
  W2.8  PlcListenerSupervisor.ReplaceContextAsync now calls Clear() to capture
        the entry count, emits mbproxy.cache.flushed, then disposes the old
        cache. Previously the event was defined but never emitted.
  W2.9  Inline doc explaining the implicit "skip cache invalidation while
        recovering" gating (no backend reader during recovery → no FC06/FC16
        response → no invalidation).
  W2.10 ReloadValidator now re-checks resolved per-tag CacheTtlMs against
        Cache.AllowLongTtl after BcdTagMapBuilder folds the per-PLC default.

BCD rewriter:
  W2.11 Duplicate addresses detected within Global itself and within the per-PLC
        Add list itself, BEFORE the working dictionary collapses keys. Cross-list
        collisions (Global vs Add) remain the documented width-override pattern.
        Previously the DuplicateAddress error was unreachable dead code.
  W2.12 OverlappingHighRegister reports each colliding pair exactly once
        (canonicalised low/high pair tracked in a HashSet).
  W2.13 FC16 32-bit write rejects clientLow > 9999 or clientHigh > 9999 BEFORE
        the high*10000+low reconstruction. Without this guard, (high=9999,
        low=9999) silently re-encoded as (high=9998, low=9999), losing 1 from
        the high word.
  W2.14 FC16 validates pdu.Length >= 6 + qty*2 upfront — no half-rewritten
        requests when a malformed client claims more registers than it ships.

Supervisor:
  W2.15 WaitForInitialBindAttemptAsync now backed by TaskCompletionSource
        instead of 10ms busy-poll. Resolves race against fast Stopped→Bound→
        Stopped transitions and hangs when the supervisor task throws.
  W2.16 StartAsync refuses re-entry on a non-Stopped supervisor (was leaking
        the previous _supervisorCts).
  W2.17 New TransitionTo helper writes _state, _lastBindError, and (optionally)
        _recoveryAttempts under one lock. Snapshot() reads under the same lock
        so the status page never reports an inconsistent triple. Truncate
        helper extracted (was copy-pasted across three sites).
  W2.18 MbproxyOptionsValidator + ReloadValidator reject Connection.{Backend
        ConnectTimeoutMs, BackendRequestTimeoutMs, GracefulShutdownTimeoutMs}
        <= 0. Misconfigured 0 produces immediate CancelAfter(0) failures.

Hosting / diagnostics:
  W2.20 ProxyWorker.StopAsync supervisor-stop deadline now reads from
        IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs
        (was hard-coded 5s).
  W2.21 src/Mbproxy/appsettings.json deleted; the published file is now a Link
        to install/mbproxy.config.template.json so the binary ships with a
        usable, fully-commented example config instead of an empty stub. Tests
        strip the inherited file from their bin via an AfterTargets="Build"
        Target so they don't pick up the template's example PLCs.
  W2.22 invalidBcdWarnings (PlcPdusStatus) and codeOther (ExceptionCounts)
        added to StatusDto, plumbed through StatusSnapshotBuilder, surfaced
        in StatusHtmlRenderer table cells.
  W2.23 EventLogBridge caches EventLog.SourceExists at construction so Emit
        doesn't hit the registry on every Error+ log line.

New regression tests:
  ReloadValidatorTests:
    Validate_PerTagCacheTtl_Above60s_Without_AllowLongTtl_Fails
    Validate_PerTagCacheTtl_Above60s_With_AllowLongTtl_Passes
    Validate_ResolvedTtl_FromPerPlcDefault_AboveCap_Fails
    Validate_ZeroBackendConnectTimeoutMs_Fails
    Validate_NegativeGracefulShutdownTimeoutMs_Fails
  BcdPduPipelineTests:
    FC16_32Bit_ClientHighOrLowAbove9999_PassesThroughRaw_WithInvalidBcdWarning
    FC16_TruncatedRegisterData_PassesThroughRaw_NoPartialRewrite

Reworked tests in BcdTagMapBuilderTests for the W2.11 contract (Global dup,
Add dup, Add-overrides-Global accepted as width override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:48:44 -04:00
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00