Compare commits

14 Commits

Author SHA1 Message Date
Joseph Doherty b330faff03 mbproxy: cross-platform support — Linux/systemd alongside Windows
Make the service build, run, and install on Linux as a first-class
target while keeping the Windows Service + Event Log behaviour intact.

- Build: drop the hardcoded win-x64 RID — single-file publish now works
  for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts.
- Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host —
  Windows Event Log under the SCM, local syslog under systemd
  (Serilog.Sinks.SyslogMessages), none for interactive runs. The
  EventLog truncation helper is extracted so it is testable cross-OS.
- Host: Program.cs registers AddSystemd() alongside AddWindowsService().
- Config: a RID-conditioned appsettings template ships Windows or Unix
  paths; both templates are schema-validated by a test.
- Install: systemd unit (Type=exec) plus install.sh / uninstall.sh.
  Also fixes two cross-platform bugs found while testing: install.ps1
  and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in
  PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths.
- Docs updated across README, CLAUDE.md, and docs/ for dual-platform.

413 tests pass on Windows; 374 (all non-simulator) on Linux.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:41:59 -04:00
Joseph Doherty 0868613890 mbproxy: add keepalive / connection monitoring
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).
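
The two mechanisms described above can be sketched roughly as follows. This is a Python illustration, not the project's C# code: `enable_keepalive` and `build_fc03_heartbeat` are hypothetical names, and the unit id, register address, and quantity used for the heartbeat are assumptions, since the commit does not say which register the synthetic FC03 reads.

```python
import socket
import struct


def enable_keepalive(sock: socket.socket) -> None:
    """Ask the OS to send TCP keepalive probes on an otherwise idle socket."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)


def build_fc03_heartbeat(tx_id: int, unit_id: int = 1, address: int = 0) -> bytes:
    """Build a Modbus TCP 'read one holding register' request frame.

    unit_id and address are illustrative defaults, not values from the commit.
    """
    # PDU: function code 0x03, start address, quantity = 1 register.
    pdu = struct.pack(">BHH", 0x03, address, 1)
    # MBAP header: transaction id, protocol id (always 0), length, unit id.
    # The length field counts the unit id byte plus the PDU.
    mbap = struct.pack(">HHHB", tx_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu
```

The OS probes catch a dead TCP path at the transport level, while the application-level read also confirms that the device still answers Modbus requests.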

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:40:54 -04:00
Joseph Doherty 7466a46aa7 mbproxy/docs: retire superseded design/plan docs and dissolve DL260/
The standalone design.md, kpi.md, operations.md, and the docs/plan/
phase tree were point-in-time planning artefacts now superseded by the
topic-organized docs/ tree (Architecture/, Features/, Operations/,
Reference/, Testing/). The DL260/ folder mixed a device-reference doc, a
test fixture, a sample test, and a screenshot; its contents now live in
their natural homes (dl205.md + mbtcp_settings.JPG under docs/Reference/,
dl205.json next to its launcher in tests/sim/, sample test dropped).

All cross-references in the surviving docs, README, CLAUDE.md, the config
template, and source comments are repointed to the new locations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:37:48 -04:00
Joseph Doherty 0a603f94d0 mbproxy/README: route publish steps through install/publish.ps1
The README was telling readers to memorise the dotnet publish flag set; now
it points at the script that captures both flavours so the documented path
matches what we actually run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:55:07 -04:00
Joseph Doherty ee1ae89e25 mbproxy/install: add publish.ps1 for the two single-file build flavours
Captures the dotnet publish invocations used to produce the self-contained
(~100 MB, bundles .NET 10) and framework-dependent (~1.5 MB, requires .NET
10 preinstalled) win-x64 single-file Mbproxy.exe builds, so re-cutting a
release isn't institutional knowledge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:53:22 -04:00
Joseph Doherty 1a2856526a mbproxy: strip historical phase/wave/plan references from source comments
Comments described the *history* of how the code arrived (phase numbers,
wave IDs, review IDs, dated TODOs) instead of what it does today. That
scaffolding rotted as the codebase evolved. Cleaned 60 source files +
.gitignore; behaviour unchanged (387/387 tests still pass).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 13:04:30 -04:00
Joseph Doherty b3b8313e9c mbproxy: Wave 6 — wire ProxyCounters.AddBytes (bytes counters were always 0)
The 10-min stress test (1.46 M PDUs through the proxy) revealed that
status.json's bytes.upstreamIn / bytes.upstreamOut counters always read 0
because ProxyCounters.AddBytes was defined but never called from anywhere.
Same shape as the original review's W2.22 finding (counter wired in DTO +
HTML but no increment site), missed for the bytes counters specifically.

Wired six increment sites in PlcMultiplexer:

  OnUpstreamFrameAsync (request side, parsed frame)
    AddBytes(up: frame.Length, down: 0) — counted ONCE per parsed frame
    regardless of subsequent routing (cache hit, coalesce, backend
    round-trip, exception).

  RunBackendReaderAsync fan-out (response side, after TrySendResponse=true)
    AddBytes(up: 0, down: outFrame.Length) per delivered party. With
    coalescing, one backend response fans out to N parties and produces
    N × frame.Length bytes leaving the proxy upstream-side. Drops
    (TrySendResponse=false) increment ResponseDropForFullUpstream
    instead.

  Cache hit path
    AddBytes(up: 0, down: hitFrame.Length) for the BuildCacheHitFrame
    response (no backend round-trip but still bytes leaving the proxy).

  Saturation cleanup (W1.2 path, both branches)
    AddBytes(up: 0, down: excFrame.Length) per delivered exception 0x04.

  Non-coalescing-path saturation
    AddBytes(up: 0, down: excFrame.Length) for the single exception 0x04.

  Watchdog timeout exception delivery
    AddBytes(up: 0, down: excFrame.Length) per delivered exception 0x0B.

Backend-side bytes (proxy ↔ PLC) are NOT counted by these counters — the
field name is `BytesUpstreamIn/Out` which is upstream-only by contract.
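
The response-side accounting can be sketched as below. This is a Python illustration under stated assumptions: `Counters` and `fan_out` are hypothetical names, and `queue.Queue.put_nowait` stands in for the non-blocking TrySendResponse. Only the rule "count bytes per delivered party, count a drop otherwise" comes from the commit message.

```python
import queue
from dataclasses import dataclass


@dataclass
class Counters:
    bytes_upstream_in: int = 0
    bytes_upstream_out: int = 0
    response_drop_for_full_upstream: int = 0

    def add_bytes(self, up: int, down: int) -> None:
        self.bytes_upstream_in += up
        self.bytes_upstream_out += down


def fan_out(frame: bytes, parties: list, counters: Counters) -> int:
    """Deliver one backend response to every coalesced party without blocking.

    Returns the number of parties that actually received the frame.
    """
    delivered = 0
    for pipe in parties:
        try:
            pipe.put_nowait(frame)  # stand-in for TrySendResponse
        except queue.Full:
            # A full upstream pipe is a drop, not bytes leaving the proxy.
            counters.response_drop_for_full_upstream += 1
            continue
        counters.add_bytes(up=0, down=len(frame))
        delivered += 1
    return delivered
```

With N healthy parties the outbound counter grows by N times the frame length, which matches the coalescing note above.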

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 09:30:48 -04:00
Joseph Doherty 59d0b5deb9 mbproxy: Wave 5 — fixes from third re-review pass
Closes findings from the third focused re-review pass on the post-W4-followup
state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md).

W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync
  The hot-reload OnChange handler at AdminEndpointHost.StartAsync did
  fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort
  was hot-reloaded during shutdown, the queued Task could land between
  StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take
  the lock, and bind a fresh Kestrel WebApplication on the new port —
  resurrecting admin AFTER the host considered it shut down. Worse, if
  DisposeAsync had already run _lock.Dispose, the queued Task throws
  ObjectDisposedException as an unobserved Task exception. Fix: _disposed
  guard at the top of the OnChange lambda AND inside the queued Task.Run,
  plus try/catch (ObjectDisposedException) around _lock.WaitAsync and
  _lock.Release.

W5/m2 — inFlightAtCancel computed AFTER base.StopAsync
  The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync
  (so the multiplexers' counter providers were still wired), but it computed
  the snapshot AFTER base.StopAsync(cancellationToken). Between those two
  lines, in-flight requests whose responses arrive get removed from
  _correlation, and the watchdog can clear stale entries. The reported
  count therefore drifted downward from "in-flight at signal time" to
  "in-flight at compute time." Fix: snapshot at the very top of StopAsync
  before any cancellation is propagated.

W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort)
  When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the
  body runs unprotected. A concurrent EnsureBackendConnectedAsync that
  DOES hold the gate may TryAllocate a TxId that collides (after wraparound
  in the allocator's forward scan) with one being released by the channel
  drain. The double-release would mark the new request's slot as free even
  though it's legitimately in-flight, allowing the next allocation to reuse
  the same slot and CorrelationMap.TryAdd to fail (silent request drop).
  Probability is very low (gate timeout AND new accept landing AND TxId
  collision in 65,536-slot space); the only consequence is one dropped
  request the client retries. Documented inline at PlcMultiplexer.cs near
  the gateHeld declaration as accepted best-effort behaviour.

W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor
  Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per
  re-review verdict.

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:13:47 -04:00
Joseph Doherty 9251c564c1 mbproxy: resolve remaining items from ReReviewAfterRemediation.md
Closes the latent + minor + test-discipline items left after Wave 4. Updates
the re-review doc with a final resolution table — every actionable finding
now marked Resolved or Accepted with rationale.

NM3 — _supervisorCts leaks on re-Start
  StartAsync now disposes the previous CTS before reassigning. Idempotent:
  a try/catch (ObjectDisposedException) covers the very-first-Start case
  where the field-init CTS is still fresh.

NM4 — W2.15 TCS is single-shot
  _firstAttemptCompleted is no longer readonly; StartAsync re-creates it
  after the W2.16 guard so a re-Started supervisor's
  WaitForInitialBindAttemptAsync doesn't observe the previous run's signal.

Nm6 — _admin GetService<> returns null silently
  ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered.
  Preserves the loud-failure intent from the original IHostedService
  registration without forcing test hosts to wire admin.

Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard
  Added a volatile bool _disposed flag with an early-return at the top of
  DisposeAsync. Symmetry with PlcMultiplexer; protects against
  ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton
  again on host shutdown.

T3 — RemoveInheritedAppsettings only fires on Build
  AfterTargets="Build;Publish" + a second Delete against $(PublishDir)
  so a `dotnet publish` against the test csproj doesn't ship the example
  PLCs from the linked install template.

T4 — Stale TryAttachOrCreate_*_ReturnsTrue_* test method names
  Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool
  return.

Accepted (with rationale documented in ReReviewAfterRemediation.md):
  Nm2 — CoalescedHit semantic is per-design
  Nm4 — _lastBindError preservation on clean exit is intentional forensics
  Nm5 — EventLogBridge has no injectable logger
  Nm8 — Cosmetic log noise
  T1  — Reflection on private fields documented as maintenance trap

Tests: 387 pass / 0 fail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 07:02:21 -04:00
Joseph Doherty 7a435957ee mbproxy: Wave 4 — fix issues introduced by the Wave-1/2 fixes
Closes the new findings from the post-remediation re-review
(codereviews/2026-05-14/ReReviewAfterRemediation.md):

NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero
  Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was
  meant to replace. Supervisor.StopAsync nulls the per-mux counter
  provider before the drain loop runs, so CountInFlight always returns 0
  and the drain budget is never spent on actual draining. Fix: snapshot
  the in-flight count BEFORE supervisor stop, drop the theatrical
  post-stop loop, and report InFlightAtCancel as the snapshot count
  (= the number of in-flight requests dropped by the stop). The
  supervisor stop IS the drain — there is nothing to drain that
  wouldn't be killed by the stop itself.

NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable
  Without a token, a long Polly-wrapped EnsureBackendConnectedAsync
  against an unreachable host could hold the gate for the full
  BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync
  (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound
  the wait with a 2 s teardown deadline; on timeout proceed
  best-effort without the gate. Worst-case consequence is one orphaned
  in-flight cycle on the dying backend, surfaced to upstream as
  exception 0x0B by the watchdog.

NM2 — ReplaceContext non-atomic ctx + provider swap
  Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If
  `_ctx` was swapped first, a snapshot taken in the gap would still hold
  the OLD adapter wrapping the OLD cache — which the supervisor disposes
  immediately after we return. Fix: set the provider FIRST, then swap
  `_ctx`. Snapshots in the swap window now read either (old, old) or
  (new, new), never (old-after-disposed).

NM5 — Self-cascade ObjectDisposedException after dispose
  Writer/reader fault catches fired `_ = TearDownBackendAsync(...)`
  unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the
  fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as
  an unobserved Task exception. Fix: skip self-cascade when
  `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit
  TearDown anyway.

Nm1 — Saturation cleanup uses await SendResponseAsync
  W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync,
  which would serialise on a wedged late-attacher's full bounded channel
  and stall delivery to its peers — contradicting the W1.3 doctrine that
  the fan-out path must never await per-pipe writes. Fix: use
  TrySendResponse and increment ResponseDropForFullUpstream on drop.

T2 — WatchdogVsResponse_Race seeded Random fragility
  Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms;
  Random's output sequence is not guaranteed to stay the same across .NET
  major versions (.NET 6 already moved the unseeded generator to xoshiro),
  so a runtime upgrade could land all samples on one side of the deadline
  and break the "both branches must fire" assertion. Fix: deterministic
  counter-based alternation (15 fast + 15 slow across 30 iterations),
  guaranteed by construction.
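
The T2 fix can be illustrated with a small Python sketch. The concrete delays (350 ms and 449 ms) and the even/odd split are assumptions; the commit only specifies the original [350, 450) ms window, the 400 ms watchdog, and the 15 fast + 15 slow outcome.

```python
WATCHDOG_MS = 400
ITERATIONS = 30


def delay_for_iteration(i: int) -> int:
    """Even iterations answer before the deadline, odd ones after it."""
    return 350 if i % 2 == 0 else 449


delays = [delay_for_iteration(i) for i in range(ITERATIONS)]
fast = sum(1 for d in delays if d < WATCHDOG_MS)  # response beats the watchdog
slow = sum(1 for d in delays if d > WATCHDOG_MS)  # watchdog fires first
```

Because the split depends only on the loop counter, both branches are exercised regardless of how any random number generator behaves on a given runtime.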

Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS
single-shot semantics) are unfixed: no caller actually re-Starts a
supervisor today; both become real only if the reconciler ever changes
to re-Start instead of dispose-and-rebuild. Documented in the re-review.

Tests: 387 pass / 0 fail. Three back-to-back race-test runs in
isolation all green (T2 alternation is deterministic).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:52:33 -04:00
Joseph Doherty 53f842a655 mbproxy: close all 5 race-hard W3 test gaps from 2026-05-14 review
Closes the 5 deterministically-race-hard test gaps that were previously
documented as known omissions (#5–9 in codereviews/2026-05-14/RemediationPlan.md).
Tests: 387 pass / 0 fail (baseline 382 + 5 new race tests). Three back-to-back
runs in isolation all green — no observable flakes.

Each test reaches the relevant code path deterministically by either:
  - reaching into the multiplexer's private state via reflection (only used
    for pre-saturating the TxIdAllocator — the test path that's externally
    impossible to hit otherwise without spawning 65,536 real connections),
  - constructing a backend stub that exercises the timing window directly, or
  - asserting only the externally-observable contract that holds across all
    valid interleavings (no-double-delivery, no-hang) rather than asserting
    a specific ordering that flakes.

W3 #5 — TxIdAllocator_Saturated_NextRequest_GetsException04_WithOriginalTxId
  Pre-saturates the multiplexer's _allocator via reflection (TryAllocate
  ×65536), then sends one FC06 write. The next request hits the
  !_allocator.TryAllocate branch immediately and the test verifies exception
  04 with the original TxId echoed.

W3 #6 — TxIdAllocator_Saturated_TwoConcurrentIdenticalReads_BothPipesGetException04
  Pre-saturates the allocator, then fires two concurrent identical FC03 reads
  from two pipes. Both pipes must receive exception 04 — regardless of whether
  pipe B coalesces onto pipe A's stub (W1.2's deliver-to-late-attachers path)
  OR opens its own factory failure path. The contract verified is "no late
  attacher hangs" — the externally-observable invariant from the W1.2 fix.

W3 #7 — SlowUpstream_DoesNotStallPeerResponses_DropCounterIncrements
  Wedges upstream A by leaving its socket-receive side undrained, pumps 500
  FC03 requests through A so the bounded response channel + kernel buffer
  fill, then sends one request from a healthy upstream B. B's response must
  arrive within seconds (would block forever pre-W1.3) and A's
  ResponseDropForFullUpstream counter must increment — proving the W1.3
  TrySendResponse non-blocking fan-out works as designed.

W3 #8 — WatchdogVsResponse_Race_AlwaysExactlyOneOutcome_PerRequest
  Custom SlowResponseBackend stub responds at a randomized 350–450 ms delay
  while BackendRequestTimeoutMs=400. Across 30 iterations, the request races
  the watchdog's per-tick scan. The contract asserts: every request gets
  exactly ONE response (normal or exception 0x0B), the original TxId is
  always echoed, and BOTH branches are exercised (proving the race window is
  real). The W1 claim-then-dispatch design (CorrelationMap.TryRemove as the
  single source of truth) makes this contract hold across all interleavings.

W3 #9 — CascadeVsNewAccept_StressChurn_NoCrash_NoHang
  Stress-test: 3 cascade cycles, 8 concurrent connect+request tasks per
  cycle. Backend is killed mid-cascade-storm to force teardown to race the
  in-flight connect attempts. After all churn the multiplexer must still
  serve a normal request. The originally-flagged race (a pipe added between
  _pipes.Values.ToArray() and _pipes.Clear() in TearDownBackendAsync) is
  microseconds wide and not deterministically reproducible without test
  seams; this stress test instead proves the no-crash-under-churn property
  that operators care about.
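
The claim-then-dispatch contract from W3 #8 can be sketched as follows. This is a minimal Python illustration, not the project's implementation: the class mirrors only the TryAdd/TryRemove names from the commit, and the lock-protected dict is an assumption standing in for whatever concurrent map the C# code uses.

```python
import threading


class CorrelationMap:
    """Maps an in-flight TxId to its request; removal is the claim."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._entries = {}

    def try_add(self, tx_id: int, request: bytes) -> bool:
        with self._lock:
            if tx_id in self._entries:
                return False
            self._entries[tx_id] = request
            return True

    def try_remove(self, tx_id: int):
        """Atomically claim the entry; None means someone else already did."""
        with self._lock:
            return self._entries.pop(tx_id, None)


def race_once(cmap: CorrelationMap, tx_id: int) -> list:
    """Let a backend response and the watchdog race for the same TxId."""
    outcomes = []

    def on_response() -> None:
        if cmap.try_remove(tx_id) is not None:
            outcomes.append("response")

    def on_watchdog() -> None:
        if cmap.try_remove(tx_id) is not None:
            outcomes.append("exception 0x0B")

    threads = [threading.Thread(target=on_response),
               threading.Thread(target=on_watchdog)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return outcomes
```

Whichever side wins the removal dispatches its outcome and the loser does nothing, so each request sees exactly one result under every interleaving.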

Helpers added:
  DrainAllocator(PlcMultiplexer) — reflection-based saturation primitive,
    only used by tests #5 and #6.
  SlowResponseBackend — backend stub with caller-supplied per-request delay
    via a Func<int>, only used by test #8.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:29:44 -04:00
Joseph Doherty 2545237973 mbproxy: close remaining 5 W3 test gaps from 2026-05-14 review
Closes the 5 "easily addable" W3 test gaps left after the prior W3 commit;
the 5 race-hard gaps remain documented as known omissions per the plan.
Tests: 382 pass / 0 fail (baseline 378 + 4 net new methods — the supervisor
runtime-fault test replaces the existing placeholder).

  #11 BcdCodecTests.Encode16_IntMinValue_Throws_OutOfRange_NoArithmeticSurprise
      Locks the (uint)value > Max16 boundary check against int.MinValue. The
      cast becomes 0x80000000 which is well above 9999, so the throw fires
      cleanly. Prevents regression to a two-sided int comparison that would
      underflow.

  #15 BcdPduPipelineTests.FC03_Request_QtyAbove128_AtNonBcdAddress_PassesThroughUnchanged
      DL205/DL260 caps FC03/FC04 at qty=128 (DL260/dl205.md). The proxy must
      NOT truncate the qty field — passing through unchanged lets the PLC's
      own validator return exception 03 to the client (transparent contract
      for FCs/addresses the rewriter doesn't own).

  #4 SupervisorTests.Supervisor_RuntimeFault_OnRunningListener_RecoversAndRebinds
      Replaces the previous placeholder. Genuinely faults the running listener
      mid-life by stopping its underlying TcpListener via reflection (the
      single externally-observable hook to force the accept loop's
      AcceptAsync to throw ObjectDisposedException). Asserts the supervisor
      transitions to Recovering, re-binds via the Polly pipeline, and bumps
      RecoveryAttempts.

  #10 HotReloadE2ETests.E2E_ReadCoalescingEnabled_FlipAtRuntime_PropagatesToOptionsMonitor
      Validates that flipping Mbproxy.Resilience.ReadCoalescing.Enabled at
      runtime via hot-reload propagates through the live IOptionsMonitor.
      The W2.1 fix wires the accessor through to add/restart supervisors;
      the multiplexer reads it per-PDU (unit-tested separately). Proving
      IOptionsMonitor sees the new value is sufficient for the contract.

  #16 ConfigReconcilerTests.Apply_ManyConcurrentReloads_With_PlcChurn_NoCorruption
      Stress-tests the W2.3 ConcurrentDictionary fix. 16 concurrent applies
      cycle through 8 distinct PLC rosters, driving Add+Remove churn against
      the live supervisor dict. Without W2.3 the inner Task.WhenAll
      continuations would corrupt Dictionary<,> and crash with
      KeyNotFoundException / ArgumentException. Asserts every apply succeeds,
      no orphan supervisors remain, and the reload counter equals 16.
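
The boundary check locked in by test #11 can be sketched in Python. This is an illustration only: `encode16` is a hypothetical stand-in for the C# codec, and the mask emulates an unchecked `(uint)` cast of a 32-bit int. The 1234 to 0x1234 mapping follows the BCD convention used by the project.

```python
MAX16 = 9999


def encode16(value: int) -> int:
    """Encode 0..9999 as 16-bit BCD, one decimal digit per nibble."""
    # Emulate C#'s unchecked (uint) cast: negative inputs wrap to huge values,
    # so a single unsigned comparison rejects both negatives and overflows.
    as_uint = value & 0xFFFFFFFF
    if as_uint > MAX16:
        raise ValueError(f"{value} is outside 0..{MAX16}")
    thousands, rest = divmod(value, 1000)
    hundreds, rest = divmod(rest, 100)
    tens, ones = divmod(rest, 10)
    return (thousands << 12) | (hundreds << 8) | (tens << 4) | ones
```

For the smallest 32-bit int the masked value is 0x80000000, far above 9999, so the range check fires before any digit arithmetic runs.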

The 5 deterministically-race-hard gaps (#5 TxId saturation propagation, #6
coalescing factory leak under saturation, #7 backend-reader head-of-line
block, #8 watchdog↔response race, #9 cascade↔new-accept race) remain open
by design — reproducing those races deterministically requires test seams in
production code or stress-style tests that flake on slow CI. The Wave-1
fixes are still verified at the unit-contract level
(UpstreamPipeTests.TrySendResponse_WhenChannelFull, etc.).

This closes everything actionable in codereviews/2026-05-14/RemediationPlan.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:17:03 -04:00
Joseph Doherty 7ead3581ab mbproxy: Wave 3 cleanups, docs, and test gaps from 2026-05-14 review
Closes the Wave 3 (cleanup) tier of codereviews/2026-05-14/RemediationPlan.md.
Tests: 378 pass / 0 fail (baseline 370 + 8 new W3 regression tests).

Code cleanups:
  * PlcMultiplexer: removed dead `elapsedMs` calculation (the actual EWMA
    conversion uses Stopwatch ticks two lines below).
  * UpstreamPipe.FillAsync: dropped the meaningless `firstRead && remaining
    == count ? false : false` ternary; both branches were `false`.
  * InFlightByKeyMap.TryAttachOrCreate (always returned `true`) renamed to
    `AttachOrCreate` and made `void`. Test sites updated to drop the dead
    `bool ok = ...; ok.ShouldBeTrue();` assertions.
  * BcdCodec.HasBadNibble promoted from private to internal; the duplicate
    copy in BcdPduPipeline removed and the call sites updated to
    `BcdCodec.HasBadNibble`.
  * PlcMultiplexer watchdog comment fixed: said "1-second floor", code uses
    100 ms. Now both agree.
  * StatusSnapshotBuilder: simplified the unreachable
    `RemoteEp?.ToString() ?? RemoteEp?.Address.ToString() ?? "?"` to
    `RemoteEp?.ToString() ?? "?"`.
  * Mbproxy.csproj: stale "deferred" Polly comment replaced with a real
    description of where Polly is used (BackendConnect + ListenerRecovery).

Doc updates:
  * README: added a callout about the unconventional 32-bit BCD wire format
    ("two base-10000 digits in CDAB", not standard binary CDAB Int32) so
    integrators using off-the-shelf clients learn about the silent-corruption
    hazard before configuring writes.
  * docs/design.md: clarified `cacheMissCount` and `coalescedMissCount`
    semantics — "miss" means "did not find a fresh entry / did not coalesce",
    NOT "produced a backend round-trip". Operators wanting actual backend
    traffic should compute `miss − coalescedHit − exception04`.
  * docs/Architecture/ResponseCache.md: documented the structural
    "skip invalidation while recovering" gating (no backend reader during
    recovery → no FC06/FC16 response → no invalidation).
  * docs/Operations/Configuration.md: noted that the Event Log sink is the
    custom EventLogBridge, not Serilog.Sinks.EventLog (W2.23 cached check).
  * docs/plan/README.md: added a Phase 12 row pointing at the remediation
    plan and linking out to codereviews/2026-05-14/.

Test additions (W3 high-value gaps):
  * BcdPduPipelineTests:
    - FC16_WriteStartsOnHighWord_Of32BitPair_PassesThroughRaw_WithPartialWarning
      (symmetric inverse of the existing low-side partial-overlap test).
    - FC03_Mixed_16Bit_32Bit_AndNonBcd_InOneRead_OnlyConfiguredSlotsRewritten
      (mixed-slot routing in a single FC03 read).
    - FC16_Response_PassesThroughUnchanged_RegardlessOfTagMap (FC16 response
      carries no register data; rewriter must pass through).
  * AdminEndpointTests:
    - NonGetMethod_AgainstAdminRoutes_Returns405 (Theory: POST/PUT/DELETE/
      PATCH against `/` and `/status.json` must return 405; guards against
      an accidental MapPost being added later).
  * HotReloadE2ETests:
    - E2E_TagListReload_OnCacheablePlc_EmitsCacheFlushedEvent (validates the
      W2.8 cache.flushed wiring end-to-end via the real FileSystemWatcher
      reload path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:06:52 -04:00
Joseph Doherty e66b17fe5f mbproxy: Wave 2 fixes from 2026-05-14 code review
Resolves the 21 Major findings catalogued in
codereviews/2026-05-14/RemediationPlan.md (Wave 2). Tests: 370 pass / 0 fail
(baseline 363 + 7 new W2 regression tests).

Multiplexer / concurrency:
  W2.1  ConfigReconciler.Attach now threads the live coalescingAccessor through
        to add/restart-built supervisors so a hot-reload of
        ReadCoalescing.{Enabled,MaxParties} propagates to PLCs added or
        restarted via reload.
  W2.2  PlcMultiplexer._disposed and UpstreamPipe._disposed are now volatile
        for ARM/portability defense.
  W2.3  ProxyWorker._supervisors / ConfigReconciler._supervisors switched from
        Dictionary to ConcurrentDictionary; reconciler uses TryRemove. The
        outer Apply is serialised by a semaphore but the inner Add/Remove/
        Restart Task.WhenAll continuations run in parallel.
  W2.4  Counter parity for cache miss + coalescing-saturation miss documented
        inline (per-design contract; behavior unchanged).
  W2.5  _disposeCts.Dispose() and _connectGate.Dispose() guarded against late
        watchdog ticks.
  W2.6  _connectGate disposed in DisposeAsync.
  W2.7  Inline doc clarifying the post-rewriter FC byte read.

Cache / hot-reload:
  W2.8  PlcListenerSupervisor.ReplaceContextAsync now calls Clear() to capture
        the entry count, emits mbproxy.cache.flushed, then disposes the old
        cache. Previously the event was defined but never emitted.
  W2.9  Inline doc explaining the implicit "skip cache invalidation while
        recovering" gating (no backend reader during recovery → no FC06/FC16
        response → no invalidation).
  W2.10 ReloadValidator now re-checks resolved per-tag CacheTtlMs against
        Cache.AllowLongTtl after BcdTagMapBuilder folds the per-PLC default.

BCD rewriter:
  W2.11 Duplicate addresses detected within Global itself and within the per-PLC
        Add list itself, BEFORE the working dictionary collapses keys. Cross-list
        collisions (Global vs Add) remain the documented width-override pattern.
        Previously the DuplicateAddress error was unreachable dead code.
  W2.12 OverlappingHighRegister reports each colliding pair exactly once
        (canonicalised low/high pair tracked in a HashSet).
  W2.13 FC16 32-bit write rejects clientLow > 9999 or clientHigh > 9999 BEFORE
        the high*10000+low reconstruction. Without this guard, (high=9999,
        low=9999) silently re-encoded as (high=9998, low=9999), losing 1 from
        the high word.
  W2.14 FC16 validates pdu.Length >= 6 + qty*2 upfront — no half-rewritten
        requests when a malformed client claims more registers than it ships.
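
The two FC16 guards (W2.13 and W2.14) can be sketched as below. This is a Python illustration with hypothetical names (`combine_bcd32`, `split_bcd32`, `fc16_has_full_payload`). It assumes the PDU starts at the function code byte, which is consistent with the `6 + qty*2` bound: function code, 2-byte address, 2-byte quantity, 1-byte count, then the register data.

```python
def combine_bcd32(client_low: int, client_high: int) -> int:
    """Rebuild a value from two base-10000 digits, rejecting bad digits first."""
    if client_low > 9999 or client_high > 9999:
        raise ValueError("each 16-bit word must be a base-10000 digit (0..9999)")
    return client_high * 10000 + client_low


def split_bcd32(value: int) -> tuple:
    """Split a value back into (low, high) base-10000 digits."""
    return value % 10000, value // 10000


def fc16_has_full_payload(pdu: bytes) -> bool:
    """True when an FC16 request carries all the register bytes it claims."""
    if len(pdu) < 6:
        return False
    qty = int.from_bytes(pdu[3:5], "big")  # register count from the header
    return len(pdu) >= 6 + qty * 2
```

Without the digit guard, an out-of-range word carries into its neighbour: `(low=10000, high=0)` reconstructs to 10000, which splits back as `(low=0, high=1)`, so the words written to the device differ from the words the client sent.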

Supervisor:
  W2.15 WaitForInitialBindAttemptAsync now backed by TaskCompletionSource
        instead of 10ms busy-poll. Resolves race against fast Stopped→Bound→
        Stopped transitions and hangs when the supervisor task throws.
  W2.16 StartAsync refuses re-entry on a non-Stopped supervisor (was leaking
        the previous _supervisorCts).
  W2.17 New TransitionTo helper writes _state, _lastBindError, and (optionally)
        _recoveryAttempts under one lock. Snapshot() reads under the same lock
        so the status page never reports an inconsistent triple. Truncate
        helper extracted (was copy-pasted across three sites).
  W2.18 MbproxyOptionsValidator + ReloadValidator reject Connection.{Backend
        ConnectTimeoutMs, BackendRequestTimeoutMs, GracefulShutdownTimeoutMs}
        <= 0. Misconfigured 0 produces immediate CancelAfter(0) failures.

Hosting / diagnostics:
  W2.20 ProxyWorker.StopAsync supervisor-stop deadline now reads from
        IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs
        (was hard-coded 5s).
  W2.21 src/Mbproxy/appsettings.json deleted; the published file is now a Link
        to install/mbproxy.config.template.json so the binary ships with a
        usable, fully-commented example config instead of an empty stub. Tests
        strip the inherited file from their bin via an AfterTargets="Build"
        Target so they don't pick up the template's example PLCs.
  W2.22 invalidBcdWarnings (PlcPdusStatus) and codeOther (ExceptionCounts)
        added to StatusDto, plumbed through StatusSnapshotBuilder, surfaced
        in StatusHtmlRenderer table cells.
  W2.23 EventLogBridge caches EventLog.SourceExists at construction so Emit
        doesn't hit the registry on every Error+ log line.

New regression tests:
  ReloadValidatorTests:
    Validate_PerTagCacheTtl_Above60s_Without_AllowLongTtl_Fails
    Validate_PerTagCacheTtl_Above60s_With_AllowLongTtl_Passes
    Validate_ResolvedTtl_FromPerPlcDefault_AboveCap_Fails
    Validate_ZeroBackendConnectTimeoutMs_Fails
    Validate_NegativeGracefulShutdownTimeoutMs_Fails
  BcdPduPipelineTests:
    FC16_32Bit_ClientHighOrLowAbove9999_PassesThroughRaw_WithInvalidBcdWarning
    FC16_TruncatedRegisterData_PassesThroughRaw_NoPartialRewrite

Reworked tests in BcdTagMapBuilderTests for the W2.11 contract (Global dup,
Add dup, Add-overrides-Global accepted as width override).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 05:48:44 -04:00
123 changed files with 5297 additions and 4377 deletions
+1 -1
@@ -23,7 +23,7 @@ When in doubt about where content belongs, default to pushing it deeper. `DOCS-G
 - [`graccesscli/`](graccesscli/README.md) — `.NET Framework 4.8 / x86` CliFx-based CLI for automating Galaxy configuration through the ArchestrA GRAccess COM interop.
 - [`grdb/`](grdb/README.md) — SQL/DDL exploration of the Galaxy Repository SQL database (queries, schema, hierarchy/tag-name translation).
 - [`histdb/`](histdb/README.md) — LLM-oriented reference for AVEVA Historian retrieval (extension tables, `wwXxx` time-domain extensions, retrieval modes/options, alarm-event SQL, REST API). Distilled from the official Historian Retrieval Guide.
-- [`mbproxy/`](mbproxy/README.md) — `.NET 10` Windows Service that proxies Modbus TCP for a fleet of ~54 DL205/DL260 PLCs: inline bidirectional BCD rewriting, single-backend-conn TxId multiplexing (lifts the H2-ECOM100 4-client cap), in-flight read coalescing, and opt-in per-tag response caching.
+- [`mbproxy/`](mbproxy/README.md) — `.NET 10` background service (Windows Service or Linux systemd unit) that proxies Modbus TCP for a fleet of ~54 DL205/DL260 PLCs: inline bidirectional BCD rewriting, single-backend-conn TxId multiplexing (lifts the H2-ECOM100 4-client cap), in-flight read coalescing, and opt-in per-tag response caching.
 - [`mxaccesscli/`](mxaccesscli/README.md) — `.NET Framework 4.8 / x86` CliFx-based CLI for reading, writing, and subscribing to System Platform tags via the **MxAccess** COM proxy (`LMXProxyServerClass`).
 - [`secrets/`](secrets/README.md) — Self-hosted Infisical CLI + `secret` PowerShell helper for fetching credentials from `https://infisical.dohertylan.com` instead of inlining plaintext.
+2 -1
@@ -1,13 +1,14 @@
# Build output # Build output
bin/ bin/
obj/ obj/
publish-out/
# Visual Studio artifacts # Visual Studio artifacts
.vs/ .vs/
*.user *.user
*.suo *.suo
# Test simulator Python venv (phase 01 onward) # Test simulator Python venv
tests/sim/.venv/ tests/sim/.venv/
# mbproxy runtime logs (default location, see appsettings.json) # mbproxy runtime logs (default location, see appsettings.json)
+27 -24
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
## What this is ## What this is
`mbproxy` is a **C# .NET 10** background service (Windows Service) that sits **inline as a Modbus TCP proxy** in front of a fleet of **~54 AutomationDirect DirectLOGIC DL205 / DL260** equipment controllers. It is pre-configured with two pieces of static data: `mbproxy` is a **C# .NET 10** background service — a **Windows Service** or a **Linux systemd unit** that sits **inline as a Modbus TCP proxy** in front of a fleet of **~54 AutomationDirect DirectLOGIC DL205 / DL260** equipment controllers. It is pre-configured with two pieces of static data:
1. **A list of BCD tags** — the holding/input registers (by Modbus address and bit width) that the controllers store in DirectLOGIC's native BCD encoding (`V2000 = 1234` is stored on the wire as `0x1234`, *not* `0x04D2`). 1. **A list of BCD tags** — the holding/input registers (by Modbus address and bit width) that the controllers store in DirectLOGIC's native BCD encoding (`V2000 = 1234` is stored on the wire as `0x1234`, *not* `0x04D2`).
2. **A list of equipment controller IP addresses** (~54 entries) for the DL205/DL260 fleet. Each controller speaks Modbus TCP on port 502 via either the built-in DL260 Ethernet port or an H2-ECOM100 / H2-EBC100 coprocessor. 2. **A list of equipment controller IP addresses** (~54 entries) for the DL205/DL260 fleet. Each controller speaks Modbus TCP on port 502 via either the built-in DL260 Ethernet port or an H2-ECOM100 / H2-EBC100 coprocessor.
@@ -21,7 +21,7 @@ The integration win is that upstream consumers (Wonderware / Historian / OPC UA
## Architecture ## Architecture
The full design plan is in **[`docs/design.md`](docs/design.md)** — settled 2026-05-13, updated for Phase 9 multiplexing on 2026-05-14. Headline choices the agent should keep in mind without opening that file: The full architecture is documented under **[`docs/`](docs/)** — see the `Architecture/`, `Features/`, `Operations/`, `Reference/`, and `Testing/` pages. Headline choices the agent should keep in mind without opening those files:
- **One `TcpListener` per PLC** (54 distinct ports). Each PLC has **one shared backend socket** owned by a `PlcMultiplexer`; many upstream clients are multiplexed onto that single backend via MBAP TxId rewriting (Phase 9). The H2-ECOM100's 4-client cap no longer caps upstream connections. - **One `TcpListener` per PLC** (54 distinct ports). Each PLC has **one shared backend socket** owned by a `PlcMultiplexer`; many upstream clients are multiplexed onto that single backend via MBAP TxId rewriting (Phase 9). The H2-ECOM100's 4-client cap no longer caps upstream connections.
- **Transparent by default; opt-in cached** (Phase 11). Every byte passes through unchanged except the MBAP TxId field (rewritten by the multiplexer on each request and restored on each response) and FC03/FC04 response payloads + FC06/FC16 request payloads at configured BCD addresses (re-encoded between BCD nibbles and binary integers). With Phase 11, FC03/FC04 reads for tags whose `CacheTtlMs > 0` may be served from a per-PLC in-process cache without backend traffic; the cache is **OFF by default** per tag. - **Transparent by default; opt-in cached** (Phase 11). Every byte passes through unchanged except the MBAP TxId field (rewritten by the multiplexer on each request and restored on each response) and FC03/FC04 response payloads + FC06/FC16 request payloads at configured BCD addresses (re-encoded between BCD nibbles and binary integers). With Phase 11, FC03/FC04 reads for tags whose `CacheTtlMs > 0` may be served from a per-PLC in-process cache without backend traffic; the cache is **OFF by default** per tag.
@@ -31,60 +31,63 @@ The full design plan is in **[`docs/design.md`](docs/design.md)** — settled 20
- **`appsettings.json` is hot-reloadable** via `IOptionsMonitor<MbproxyOptions>`; tag-list changes propagate per-PDU, PLC add/remove flows through the supervisor. A tag-list reload flushes the affected PLC's response cache (per-tag granularity intentionally not done in v1). - **`appsettings.json` is hot-reloadable** via `IOptionsMonitor<MbproxyOptions>`; tag-list changes propagate per-PDU, PLC add/remove flows through the supervisor. A tag-list reload flushes the affected PLC's response cache (per-tag granularity intentionally not done in v1).
- **Polly bounded retries** on backend connect (3 attempts at 100ms / 500ms / 2000ms). No retries on mid-request failures (FC06/FC16 are non-idempotent on BCD tags). A per-request watchdog in the multiplexer surfaces Modbus exception 0x0B to the upstream client if a backend response never arrives within `BackendRequestTimeoutMs`. - **Polly bounded retries** on backend connect (3 attempts at 100ms / 500ms / 2000ms). No retries on mid-request failures (FC06/FC16 are non-idempotent on BCD tags). A per-request watchdog in the multiplexer surfaces Modbus exception 0x0B to the upstream client if a backend response never arrives within `BackendRequestTimeoutMs`.
- **Backend disconnect cascades upstream**: when the shared backend socket dies, every attached upstream pipe is closed in the same cycle (counter `BackendDisconnectCascades`); clients reconnect on their next request. - **Backend disconnect cascades upstream**: when the shared backend socket dies, every attached upstream pipe is closed in the same cycle (counter `BackendDisconnectCascades`); clients reconnect on their next request.
- **Keepalive / connection monitoring** (ON by default, `Connection.Keepalive`): OS `SO_KEEPALIVE` on backend and accepted upstream sockets, plus a per-PLC application heartbeat — a synthetic FC03 qty=1 read fired on an idle backend socket (`BackendHeartbeatIdleMs`). An unanswered heartbeat proactively tears the backend down (counters `backendHeartbeatsSent/Failed`, `backendIdleDisconnects`). The DL260 has no FC08, so the probe is a real register read. See [`docs/Architecture/Keepalive.md`](docs/Architecture/Keepalive.md).
- **Read-only Kestrel admin port** (default 8080) exposes `GET /` (auto-refreshing HTML) and `GET /status.json` with service-wide and per-PLC counters (including Phase-9 mux fields, Phase-10 coalescing fields, and Phase-11 cache fields `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes`). - **Read-only Kestrel admin port** (default 8080) exposes `GET /` (auto-refreshing HTML) and `GET /status.json` with service-wide and per-PLC counters (including Phase-9 mux fields, Phase-10 coalescing fields, and Phase-11 cache fields `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes`).
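
The TxId rewriting described in the bullets above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's `PlcMultiplexer`: the `TxIdRewriter` name and its members are hypothetical, it assumes a frame that begins with the 7-byte MBAP header (transaction id in bytes 0-1, big-endian), and it leaves out per-client routing, the request watchdog, and thread safety.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Illustrative only: TxIdRewriter is a hypothetical name, not a type from mbproxy.
public sealed class TxIdRewriter
{
    // backend TxId -> original upstream TxId, for requests still in flight
    private readonly Dictionary<ushort, ushort> _upstreamByBackend = new();
    private ushort _next;

    // Swap the client's TxId (MBAP bytes 0-1, big-endian) for a backend-unique one.
    public void RewriteRequest(Span<byte> frame)
    {
        ushort upstream = BinaryPrimitives.ReadUInt16BigEndian(frame);
        ushort backend = unchecked(++_next); // wraps at 65535; a real mux must skip ids still in flight
        _upstreamByBackend[backend] = upstream;
        BinaryPrimitives.WriteUInt16BigEndian(frame, backend);
    }

    // Restore the client's original TxId on the matching response.
    // Returns false when the response matches no in-flight request.
    public bool RestoreResponse(Span<byte> frame)
    {
        ushort backend = BinaryPrimitives.ReadUInt16BigEndian(frame);
        if (!_upstreamByBackend.Remove(backend, out ushort upstream)) return false;
        BinaryPrimitives.WriteUInt16BigEndian(frame, upstream);
        return true;
    }
}
```

Only the first two bytes of each frame are touched, which matches the "transparent by default" claim: the unit id, function code, and payload pass through unchanged unless a BCD rewrite applies.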
Anything beyond this short list — JSON schema, propagation table, stable log event names, status counter catalog, test plan — lives in `docs/design.md`. Open that doc before writing code; keep it in sync when decisions change. Anything beyond this short list lives in the `docs/` tree: the appsettings.json schema in [`docs/Operations/Configuration.md`](docs/Operations/Configuration.md), config propagation in [`docs/Features/HotReload.md`](docs/Features/HotReload.md), stable log event names in [`docs/Reference/LogEvents.md`](docs/Reference/LogEvents.md), the status counter catalog in [`docs/Operations/StatusPage.md`](docs/Operations/StatusPage.md), and the simulator-backed test fixture in [`docs/Testing/Simulator.md`](docs/Testing/Simulator.md). Open the relevant page before writing code; keep it in sync when decisions change.
## Current state ## Current state
**Implementation complete through Phase 11.** Phases 00–08 shipped the production-ready 1:1-model service; Phase 9 swapped the connection layer for the TxId-multiplexed model; Phase 10 added in-flight read coalescing on top; Phase 11 added an opt-in per-tag response cache (bounded staleness, OFF by default — see "Response cache" in `docs/design.md`). The service is production-ready as a Windows Service: **Implementation complete through Phase 11.** Phases 00–08 shipped the production-ready 1:1-model service; Phase 9 swapped the connection layer for the TxId-multiplexed model; Phase 10 added in-flight read coalescing on top; Phase 11 added an opt-in per-tag response cache (bounded staleness, OFF by default — see [`docs/Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md)). The service is production-ready as a **Windows Service or a Linux systemd unit**:
- Test count grew through Phase 11 (see `tests/Mbproxy.Tests/` for the current suite; previous baseline was 325 = 282 unit + 43 E2E). - Test count grew through Phase 11 (see `tests/Mbproxy.Tests/` for the current suite; previous baseline was 325 = 282 unit + 43 E2E).
- Single-file self-contained publish (`dotnet publish -c Release -r win-x64`). - Single-file self-contained publish for `win-x64` **and** `linux-x64` (`dotnet publish -c Release -r <rid>`) — the RID is supplied per publish, never hardcoded in the csproj.
- PowerShell install/uninstall scripts under `install/`. - Install/uninstall scripts under `install/`: PowerShell (`install.ps1` / `uninstall.ps1`) for the Windows Service; shell (`install.sh` / `uninstall.sh` + the `mbproxy.service` unit) for systemd.
- Graceful shutdown with configurable drain timeout (`Connection.GracefulShutdownTimeoutMs`, default 10 s). - Graceful shutdown with configurable drain timeout (`Connection.GracefulShutdownTimeoutMs`, default 10 s) — driven by the Windows SCM stop signal or POSIX `SIGTERM`.
- Windows Event Log integration (Error+ events when running as a service). - Platform diagnostic sink for Error+ events, chosen once at the composition root by `DiagnosticSinkSelector`: Windows Application Event Log under the SCM, local syslog under systemd, none for interactive/dev runs. The systemd unit is `Type=exec` (not `notify`).
- Read-only HTTP status page at `AdminPort` (default 8080) — surfaces Phase-9 mux fields alongside Phase-7 counters. - Read-only HTTP status page at `AdminPort` (default 8080) — surfaces Phase-9 mux fields alongside Phase-7 counters.
- `connectsSuccess` / `connectsFailed` counters wired in `PlcMultiplexer`. - `connectsSuccess` / `connectsFailed` counters wired in `PlcMultiplexer`.
- Phase 9 per-request watchdog defends against any backend that drops or mis-echoes a response (real-world packet loss; pymodbus 3.13 simulator's concurrent-multiplexed-request bug). - Phase 9 per-request watchdog defends against any backend that drops or mis-echoes a response (real-world packet loss; pymodbus 3.13 simulator's concurrent-multiplexed-request bug).
- `AssemblyInformationalVersion` set to `1.0.0` (CI can override via `/p:InformationalVersion=...`). - `AssemblyInformationalVersion` set to `1.0.0` (CI can override via `/p:InformationalVersion=...`).
The human-facing entry point is **[`README.md`](README.md)**. All design decisions remain in [`docs/design.md`](docs/design.md). The human-facing entry point is **[`README.md`](README.md)**. All design decisions live in the [`docs/`](docs/) tree.
Constraints that still apply to this codebase (do not change without updating the design doc): Constraints that still apply to this codebase (do not change without updating the relevant `docs/` page):
- The csproj targets **.NET 10** (`net10.0`). This is the **only** tool in `wwtools/` not pinned to .NET Framework 4.8 / x86. - The csproj targets **.NET 10** (`net10.0`). This is the **only** tool in `wwtools/` not pinned to .NET Framework 4.8 / x86.
- The sample test `DL260/DL205BcdQuirkTests.cs` is a pattern reference only — its types are not available in this project.
## Device quirks (read before writing Modbus code) ## Device quirks (read before writing Modbus code)
The DL205/DL260 family is *almost* Modbus-spec-compliant, but every category below has at least one trap. The authoritative reference is **[`DL260/dl205.md`](DL260/dl205.md)** — read it end-to-end before touching the wire protocol. Highlights that bear directly on this proxy: The DL205/DL260 family is *almost* Modbus-spec-compliant, but every category below has at least one trap. The authoritative reference is **[`docs/Reference/dl205.md`](docs/Reference/dl205.md)** — read it end-to-end before touching the wire protocol. Highlights that bear directly on this proxy:
- **BCD-by-default numeric encoding.** `V2000 = 1234` stores `0x1234` on the wire, not `0x04D2`. This is the entire reason this service exists. - **BCD-by-default numeric encoding.** `V2000 = 1234` stores `0x1234` on the wire, not `0x04D2`. This is the entire reason this service exists.
- **CDAB word order for 32-bit values.** Low word first, big-endian bytes within each word. `0xAABBCCDD` lands as `[0xCC 0xDD][0xAA 0xBB]`. - **CDAB word order for 32-bit values.** Low word first, big-endian bytes within each word. `0xAABBCCDD` lands as `[0xCC 0xDD][0xAA 0xBB]`.
- **Octal V-memory ↔ decimal Modbus translation.** `V2000` octal = decimal 1024 = Modbus PDU `0x0400`. Config addresses are PDU-decimal, **not** octal V-memory and **not** 1-based 4xxxx. - **Octal V-memory ↔ decimal Modbus translation.** `V2000` octal = decimal 1024 = Modbus PDU `0x0400`. Config addresses are PDU-decimal, **not** octal V-memory and **not** 1-based 4xxxx.
- **FC03/FC04 max qty = 128** (above spec's 125). **FC16 max qty = 100** (below spec's 123). The proxy passes these through; the PLC enforces the cap with exception 03. - **FC03/FC04 max qty = 128** (above spec's 125). **FC16 max qty = 100** (below spec's 123). The proxy passes these through; the PLC enforces the cap with exception 03.
- **Max 4 concurrent TCP clients per ECOM100.** Direct constraint on this proxy's 1:1 connection model — see [`docs/design.md`](docs/design.md) → "Connection model" for the band-aid-vs-rearchitect decision tree if this becomes a real problem. - **Max 4 concurrent TCP clients per ECOM100.** This is why the proxy uses a single TxId-multiplexed backend socket per PLC — see [`docs/Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) for how the connection model lifts this cap.
- **No TCP keepalive from the device.** Middleboxes typically drop idle sockets at 2–5 min. With the 1:1 model, backend liveness tracks upstream client liveness; if both are idle long enough, the path dies on its own and the next request reconnects. - **No TCP keepalive from the device.** Middleboxes typically drop idle sockets at 2–5 min. The proxy compensates with its own keepalive — `SO_KEEPALIVE` on every socket plus an idle backend FC03 heartbeat (see the Architecture summary and [`docs/Architecture/Keepalive.md`](docs/Architecture/Keepalive.md)).
- **Register 0 is valid** on DL205/DL260 in factory "absolute" addressing mode — don't probe-skip it. - **Register 0 is valid** on DL205/DL260 in factory "absolute" addressing mode — don't probe-skip it.
- **As-deployed PLC parameters** (captured in `DL260/mbtcp_settings.JPG`): port 502, "Use Concept data structures (Longs/Reals)" enabled, "Swap bytes" enabled, "Use Zero Based Addressing" **unchecked**, Register type = Binary, max coil read 1976 / coil write 800 / register read 122 / register write 100. The proxy must speak Modbus as-is; these settings describe the wire it'll see. - **As-deployed PLC parameters** (captured in `docs/Reference/mbtcp_settings.JPG`): port 502, "Use Concept data structures (Longs/Reals)" enabled, "Swap bytes" enabled, "Use Zero Based Addressing" **unchecked**, Register type = Binary, max coil read 1976 / coil write 800 / register read 122 / register write 100. The proxy must speak Modbus as-is; these settings describe the wire it'll see.
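
The first three quirks above come down to a few lines of arithmetic, sketched below. The `Dl205Wire` class and its method names are hypothetical and not taken from the mbproxy codebase. The sketch assumes the direct octal-to-decimal V-memory mapping that the bullet describes, and it shows CDAB only for a plain 32-bit binary value.

```csharp
using System;

// Illustrative helpers; names are hypothetical, not from the mbproxy source.
public static class Dl205Wire
{
    // 0x1234 on the wire (four BCD nibbles) -> decimal 1234.
    public static int DecodeBcd16(ushort raw)
    {
        int value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int digit = (raw >> shift) & 0xF;
            if (digit > 9) throw new FormatException($"Invalid BCD nibble in 0x{raw:X4}");
            value = value * 10 + digit;
        }
        return value;
    }

    // Decimal 0..9999 -> BCD nibbles, e.g. 1234 -> 0x1234.
    public static ushort EncodeBcd16(int value)
    {
        if (value is < 0 or > 9999) throw new ArgumentOutOfRangeException(nameof(value));
        int raw = 0;
        for (int shift = 0; shift < 16; shift += 4)
        {
            raw |= (value % 10) << shift;
            value /= 10;
        }
        return (ushort)raw;
    }

    // "V2000" (octal) -> Modbus PDU address 1024 (0x0400).
    public static ushort VMemoryToPdu(string vAddress) =>
        Convert.ToUInt16(vAddress.TrimStart('V', 'v'), 8);

    // CDAB: low word first, big-endian bytes within each word.
    public static byte[] ToCdabBytes(uint value) => new[]
    {
        (byte)(value >> 8), (byte)value,          // low word:  CC DD
        (byte)(value >> 24), (byte)(value >> 16)  // high word: AA BB
    };
}
```

For instance, `Dl205Wire.DecodeBcd16(0x1234)` yields 1234, whereas reading the same register as a binary integer would give 4660; that gap is the one the proxy closes for upstream clients.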
## Resource index ## Resource index
| Task | Go to | | Task | Go to |
| --- | --- | | --- | --- |
| Full architecture / design plan (decisions, schema, log events, status counters, test plan) | [`docs/design.md`](docs/design.md) | | Architecture — listener topology, request flow, per-PLC isolation | [`docs/Architecture/Overview.md`](docs/Architecture/Overview.md) |
| Phase-by-phase implementation plan (parallel-safety, phase gates, per-phase test list) | [`docs/plan/README.md`](docs/plan/README.md) | | Connection model — single backend socket per PLC, TxId multiplexing, request-timeout watchdog, disconnect cascade | [`docs/Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) |
| Dashboard KPI catalogue — what's exposed today and proposed additions (rates, percentiles, availability, fleet aggregates) | [`docs/kpi.md`](docs/kpi.md) | | Keepalive / connection monitoring — TCP `SO_KEEPALIVE` + backend FC03 heartbeat | [`docs/Architecture/Keepalive.md`](docs/Architecture/Keepalive.md) |
| DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits, exception codes, oddities) | [`DL260/dl205.md`](DL260/dl205.md) | | In-flight read coalescing / opt-in response cache | [`docs/Architecture/ReadCoalescing.md`](docs/Architecture/ReadCoalescing.md), [`docs/Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) |
| pymodbus simulator profile that models those quirks as concrete register values | [`DL260/dl205.json`](DL260/dl205.json) | | BCD rewriting (codec, CDAB word order, FC03/04/06/16 scope) and config hot-reload | [`docs/Features/BcdRewriting.md`](docs/Features/BcdRewriting.md), [`docs/Features/HotReload.md`](docs/Features/HotReload.md) |
| Example integration test pattern (xUnit + Shouldly + simulator fixture) | [`DL260/DL205BcdQuirkTests.cs`](DL260/DL205BcdQuirkTests.cs) | | Operations — full appsettings.json reference, status page / JSON fields, troubleshooting playbook | [`docs/Operations/Configuration.md`](docs/Operations/Configuration.md), [`docs/Operations/StatusPage.md`](docs/Operations/StatusPage.md), [`docs/Operations/Troubleshooting.md`](docs/Operations/Troubleshooting.md) |
| As-deployed PLC Modbus parameters screenshot | [`DL260/mbtcp_settings.JPG`](DL260/mbtcp_settings.JPG) | | Stable `mbproxy.*` log event-name catalog | [`docs/Reference/LogEvents.md`](docs/Reference/LogEvents.md) |
| DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits, exception codes, oddities) | [`docs/Reference/dl205.md`](docs/Reference/dl205.md) |
| pymodbus simulator profile that models those quirks as concrete register values | [`tests/sim/dl205.json`](tests/sim/dl205.json) |
| As-deployed PLC Modbus parameters screenshot | [`docs/Reference/mbtcp_settings.JPG`](docs/Reference/mbtcp_settings.JPG) |
## Maintenance ## Maintenance
Documentation doctrine for `wwtools/` lives in [`../DOCS-GUIDE.md`](../DOCS-GUIDE.md). The three-layer rules apply: Documentation doctrine for `wwtools/` lives in [`../DOCS-GUIDE.md`](../DOCS-GUIDE.md). The three-layer rules apply:
- **[`README.md`](README.md)** is the canonical human entry point (Layer-2 per DOCS-GUIDE). It routes to deep docs; it does not duplicate them. Update it when the service's public surface or install steps change. - **[`README.md`](README.md)** is the canonical human entry point (Layer-2 per DOCS-GUIDE). It routes to deep docs; it does not duplicate them. Update it when the service's public surface or install steps change.
- This `CLAUDE.md` stays a router for LLM coding agents. Deep design decisions live in [`docs/design.md`](docs/design.md); device quirks live in [`DL260/dl205.md`](DL260/dl205.md). When you change a design decision, update `docs/design.md` first (it's the source of truth) and only mirror the change into the Architecture summary above if it shifts one of the headline bullets. - This `CLAUDE.md` stays a router for LLM coding agents. Deep design decisions live in the [`docs/`](docs/) tree; device quirks live in [`docs/Reference/dl205.md`](docs/Reference/dl205.md). When you change a design decision, update the relevant page under `docs/` first (it's the source of truth) and only mirror the change into the Architecture summary above if it shifts one of the headline bullets.
- When the service's task→tool mapping changes in the root index, update [`../CLAUDE.md`](../CLAUDE.md) too. - When the service's task→tool mapping changes in the root index, update [`../CLAUDE.md`](../CLAUDE.md) too.
- Any further work beyond Phase 08 belongs in a new design revision (dated, in `docs/design.md`) and a new phase plan. - Any further design changes belong in the relevant `docs/` page (`Architecture/`, `Features/`, `Operations/`, `Reference/`, or `Testing/`).
-56
@@ -1,56 +0,0 @@
using Shouldly;
using Xunit;

namespace ZB.MOM.WW.OtOpcUa.Driver.Modbus.IntegrationTests.DL205;

/// <summary>
/// Verifies DL205/DL260 binary-coded-decimal register handling against the
/// <c>dl205.json</c> pymodbus profile. HR[1072] = 0x1234 on the profile represents
/// decimal 1234 (BCD nibbles). Reading it as <see cref="ModbusDataType.Int16"/> would
/// return 0x1234 = 4660; the <see cref="ModbusDataType.Bcd16"/> path decodes 1234.
/// </summary>
[Collection(ModbusSimulatorCollection.Name)]
[Trait("Category", "Integration")]
[Trait("Device", "DL205")]
public sealed class DL205BcdQuirkTests(ModbusSimulatorFixture sim)
{
    [Fact]
    public async Task DL205_BCD16_decodes_HR1072_as_decimal_1234()
    {
        if (sim.SkipReason is not null) Assert.Skip(sim.SkipReason);
        if (!string.Equals(Environment.GetEnvironmentVariable("MODBUS_SIM_PROFILE"), "dl205",
                StringComparison.OrdinalIgnoreCase))
        {
            Assert.Skip("MODBUS_SIM_PROFILE != dl205 — skipping (standard profile does not seed HR[1072]).");
        }

        var options = new ModbusDriverOptions
        {
            Host = sim.Host,
            Port = sim.Port,
            UnitId = 1,
            Timeout = TimeSpan.FromSeconds(2),
            Tags =
            [
                new ModbusTagDefinition("DL205_Count_Bcd",
                    ModbusRegion.HoldingRegisters, Address: 1072,
                    DataType: ModbusDataType.Bcd16, Writable: false),
                new ModbusTagDefinition("DL205_Count_Int16",
                    ModbusRegion.HoldingRegisters, Address: 1072,
                    DataType: ModbusDataType.Int16, Writable: false),
            ],
            Probe = new ModbusProbeOptions { Enabled = false },
        };

        await using var driver = new ModbusDriver(options, driverInstanceId: "dl205-bcd");
        await driver.InitializeAsync("{}", TestContext.Current.CancellationToken);

        var results = await driver.ReadAsync(["DL205_Count_Bcd", "DL205_Count_Int16"],
            TestContext.Current.CancellationToken);

        results[0].StatusCode.ShouldBe(0u);
        results[0].Value.ShouldBe(1234, "DL205 BCD register 0x1234 represents decimal 1234 per the DirectLOGIC convention");
        results[1].StatusCode.ShouldBe(0u);
        results[1].Value.ShouldBe((short)0x1234, "same register read as Int16 returns the raw 0x1234 = 4660 value — proves BCD path is distinct");
    }
}
+50 -28
@@ -1,12 +1,14 @@
# mbproxy # mbproxy
A .NET 10 Windows Service that sits inline as a Modbus TCP proxy in front of a fleet of AutomationDirect DirectLOGIC DL205/DL260 controllers, rewriting BCD-encoded registers bidirectionally so upstream clients can read and write them as plain integers. The proxy also offers an opt-in per-tag response cache (default OFF) for FC03/FC04 reads with bounded operator-configured staleness — see [`docs/Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) before enabling it. A .NET 10 background service — a **Windows Service** or a **Linux systemd unit** that sits inline as a Modbus TCP proxy in front of a fleet of AutomationDirect DirectLOGIC DL205/DL260 controllers, rewriting BCD-encoded registers bidirectionally so upstream clients can read and write them as plain integers. The proxy also offers an opt-in per-tag response cache (default OFF) for FC03/FC04 reads with bounded operator-configured staleness — see [`docs/Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) before enabling it.
> ⚠ **32-bit BCD wire format is "two base-10000 digits in CDAB", not standard CDAB binary Int32.** A 32-bit BCD tag at address `A` decodes as `decimal = high * 10_000 + low` where `low` is the register at `A` and `high` is the register at `A+1`. Each word independently must be 0–9999. Standard Modbus clients (NModbus, FluentModbus, Wonderware DAServer) that interpret CDAB as straight binary Int32 will silently corrupt any value > 9999 on writes and read garbage on reads. Configure your client to send/receive each register as a separate base-10000 BCD digit pair, not as a single binary Int32. Full details in [`docs/Features/BcdRewriting.md`](docs/Features/BcdRewriting.md).
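
A client-side split and join for that format might look like the sketch below. The `Bcd32Words` helper is hypothetical and belongs to neither mbproxy nor any Modbus library. It assumes each register is exchanged with the proxy as a plain integer in the range 0 to 9999, as the warning above describes.

```csharp
using System;

// Hypothetical client-side helper; not part of mbproxy or any Modbus client library.
public static class Bcd32Words
{
    // Split a decimal 0..99_999_999 into the two registers a client would write:
    // index 0 -> address A (low base-10000 digit), index 1 -> address A+1 (high digit).
    public static ushort[] Split(int value)
    {
        if (value is < 0 or > 99_999_999) throw new ArgumentOutOfRangeException(nameof(value));
        return new[] { (ushort)(value % 10_000), (ushort)(value / 10_000) };
    }

    // Recombine the two registers read from A (low) and A+1 (high).
    public static int Join(ushort low, ushort high)
    {
        if (low > 9999 || high > 9999)
            throw new FormatException("Each word must be in the range 0..9999.");
        return high * 10_000 + low;
    }
}
```

As a worked case, 12,345,678 splits into `low = 5678` and `high = 1234`. A client that instead wrote it as a binary Int32 (`0x00BC614E`) would place 24910 in the low word, which falls outside the 0 to 9999 range the format allows.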
## Hard constraints / prerequisites ## Hard constraints / prerequisites
- **Windows 10 / Server 2019 or later, 64-bit.** No Linux or Docker support — the service uses `Microsoft.Extensions.Hosting.WindowsServices` and the Windows Event Log. - **Windows (10 / Server 2019+) or Linux (any systemd distro), 64-bit.** Ships as a Windows Service (Application Event Log integration) or a systemd unit (syslog integration); builds single-file for `win-x64` and `linux-x64`. macOS is not a deployment target — it runs only as a foreground console process.
- **Modbus TCP backends reachable** from the proxy host on port 502 (or the port configured per PLC). The H2-ECOM100 module caps simultaneous connections at **4 per PLC** — a fifth upstream client will fail to connect. - **Modbus TCP backends reachable** from the proxy host on port 502 (or the port configured per PLC). The H2-ECOM100 module caps simultaneous connections at **4 per PLC** — a fifth upstream client will fail to connect.
- **Admin rights** to install the service (`install.ps1` requires elevation). - **Admin / root rights** to install the service (`install.ps1` requires elevation; `install.sh` requires root).
- **No COM dependency** — this is a pure .NET 10 socket-level proxy (unlike the `.NET Framework 4.8 / x86` siblings in this repo). - **No COM dependency** — this is a pure .NET 10 socket-level proxy (unlike the `.NET Framework 4.8 / x86` siblings in this repo).
- **Python 3.10+** on the test machine to run the pymodbus-backed E2E simulator (not needed to run the service in production). - **Python 3.10+** on the test machine to run the pymodbus-backed E2E simulator (not needed to run the service in production).
@@ -14,27 +16,23 @@ A .NET 10 Windows Service that sits inline as a Modbus TCP proxy in front of a f
``` ```
src/Mbproxy/ Main C# project (net10.0, Microsoft.NET.Sdk.Worker) src/Mbproxy/ Main C# project (net10.0, Microsoft.NET.Sdk.Worker)
tests/Mbproxy.Tests/ xUnit v3 test project (314 unit + 48 E2E tests) tests/Mbproxy.Tests/ xUnit v3 test project (unit + simulator-backed E2E tests)
install/ PowerShell install/uninstall scripts and config template install/ Install/uninstall + publish scripts (PowerShell + shell), systemd unit, config templates
docs/ Architecture, features, operations, reference, and testing docs docs/ Architecture, features, operations, reference, and testing docs
DL260/ DL205/DL260 reference material and pymodbus simulator profile
``` ```
## Resource index ## Resource index
| Task | Go to | | Task | Go to |
|---|---| |---|---|
| End-to-end architectural design (entry point — routes into focused docs below) | [`docs/design.md`](docs/design.md) | | Architecture entry point — listener topology, request flow, per-PLC isolation | [`docs/Architecture/Overview.md`](docs/Architecture/Overview.md) |
| Phase-by-phase implementation plan and history | [`docs/plan/README.md`](docs/plan/README.md) | | DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits) | [`docs/Reference/dl205.md`](docs/Reference/dl205.md) |
| Install, upgrade, uninstall, log file locations, first-install smoke checklist | [`docs/operations.md`](docs/operations.md) | | pymodbus simulator profile (register seeds for E2E tests) | [`tests/sim/dl205.json`](tests/sim/dl205.json) |
| Dashboard KPI catalog | [`docs/kpi.md`](docs/kpi.md) |
| DL205/DL260 Modbus quirks (BCD, CDAB, octal V-memory, FC limits) | [`DL260/dl205.md`](DL260/dl205.md) |
| pymodbus simulator profile (register seeds for E2E tests) | [`DL260/dl205.json`](DL260/dl205.json) |
| Agent-oriented coding guide (architecture bullets, device quirks, phase context) | [`CLAUDE.md`](CLAUDE.md) | | Agent-oriented coding guide (architecture bullets, device quirks, phase context) | [`CLAUDE.md`](CLAUDE.md) |
## Detailed documentation ## Detailed documentation
The `docs/` tree is organized by topic. Start with [`docs/design.md`](docs/design.md) for the canonical end-to-end design; jump to the focused pages below when you need depth on one area. The `docs/` tree is organized by topic. Start with [`Architecture/Overview.md`](docs/Architecture/Overview.md) for the end-to-end picture; jump to the focused pages below when you need depth on one area.
### Architecture ### Architecture
@@ -42,6 +40,7 @@ The `docs/` tree is organized by topic. Start with [`docs/design.md`](docs/desig
- [`Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) — Single backend connection per PLC, TxId multiplexing, request-timeout watchdog, disconnect cascade. - [`Architecture/ConnectionModel.md`](docs/Architecture/ConnectionModel.md) — Single backend connection per PLC, TxId multiplexing, request-timeout watchdog, disconnect cascade.
- [`Architecture/ReadCoalescing.md`](docs/Architecture/ReadCoalescing.md) — In-flight FC03/FC04 deduplication via `InFlightByKeyMap`. - [`Architecture/ReadCoalescing.md`](docs/Architecture/ReadCoalescing.md) — In-flight FC03/FC04 deduplication via `InFlightByKeyMap`.
- [`Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) — Opt-in per-tag response cache with bounded operator-configured staleness. - [`Architecture/ResponseCache.md`](docs/Architecture/ResponseCache.md) — Opt-in per-tag response cache with bounded operator-configured staleness.
- [`Architecture/Keepalive.md`](docs/Architecture/Keepalive.md) — TCP `SO_KEEPALIVE` on every socket plus an idle-backend FC03 heartbeat.
### Features ### Features
@@ -56,7 +55,7 @@ The `docs/` tree is organized by topic. Start with [`docs/design.md`](docs/desig
### Reference ### Reference
- [`Reference/LogEvents.md`](docs/Reference/LogEvents.md) — Stable `mbproxy.*` event catalog (28 events across 7 categories). - [`Reference/LogEvents.md`](docs/Reference/LogEvents.md) — Stable `mbproxy.*` event catalog (31 events across 8 categories).
### Testing ### Testing
@@ -70,13 +69,27 @@ The `docs/` tree is organized by topic. Start with [`docs/design.md`](docs/desig
dotnet build Mbproxy.slnx -c Debug dotnet build Mbproxy.slnx -c Debug
``` ```
**Publish (Release, single-file self-contained, win-x64):** **Publish (Release, single-file):**
```powershell ```powershell
dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true -o C:\build\mbproxy-publish .\install\publish.ps1 -Clean # win-x64 (default)
.\install\publish.ps1 -Rid linux-x64 -Clean # cross-publish for linux-x64
``` ```
The published output is a single `Mbproxy.exe` (~100 MB). The self-contained publish bundles the full .NET 10 + ASP.NET Core runtime. No .NET installation is required on the target machine. On a Linux build host, use the shell counterpart:
```bash
./install/publish.sh --clean # linux-x64 (default)
```
Each run produces both flavours under `publish-out\`:
| Flavour | Path (win-x64) | Size | Target prerequisite |
|---|---|---|---|
| Self-contained | `publish-out\self-contained\Mbproxy.exe` | ~100 MB | None — bundles .NET 10 + ASP.NET Core runtime |
| Framework-dependent | `publish-out\framework-dependent\Mbproxy.exe` | ~1.6 MB | .NET 10 + ASP.NET Core preinstalled |
On `linux-x64` the binary is `Mbproxy` (no extension) and ships the Linux config template. Pass `-OutputDir`/`-o` to publish elsewhere; omit `-Clean`/`--clean` to skip the wipe. The scripts wrap `dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r <rid> [-p:SelfContained=false]` — run that directly if you only need one flavour.
**Run tests:** **Run tests:**
@@ -97,25 +110,34 @@ Edit `src/Mbproxy/appsettings.json` to configure PLCs before running. The admin
## Install ## Install
Full detail is in [`docs/operations.md`](docs/operations.md). Quick path: The `install/` directory holds the publish, install, and uninstall scripts for both platforms.
**Windows** — elevated PowerShell:
```powershell ```powershell
# 1. Publish .\install\publish.ps1 -Clean
dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true -o C:\build\mbproxy-publish .\install\install.ps1 -PublishOutput .\publish-out\self-contained -Start
# Config is placed at %ProgramData%\mbproxy\appsettings.json — edit it, then:
# 2. Install (elevated PowerShell) # Restart-Service mbproxy
.\install\install.ps1 -PublishOutput C:\build\mbproxy-publish -Start
# 3. Edit the config that was placed at %ProgramData%\mbproxy\appsettings.json
# 4. Verify
Invoke-WebRequest http://localhost:8080/ -UseBasicParsing Invoke-WebRequest http://localhost:8080/ -UseBasicParsing
``` ```
**Linux** — root / `sudo` on a systemd host:
```bash
./install/publish.sh --clean
sudo ./install/install.sh --publish-dir ./publish-out/self-contained
# Config is placed at /etc/mbproxy/appsettings.json — edit it, then:
# sudo systemctl restart mbproxy
curl http://localhost:8080/
```
`uninstall.ps1` / `uninstall.sh` reverse the install; both archive log files rather than deleting them. The systemd unit runs mbproxy as `Type=exec` under a dedicated `mbproxy` service account.
## Maintenance ## Maintenance
Documentation doctrine for this repo: [`../DOCS-GUIDE.md`](../DOCS-GUIDE.md). Documentation doctrine for this repo: [`../DOCS-GUIDE.md`](../DOCS-GUIDE.md).
- This README routes to deep docs — it does not duplicate them. - This README routes to deep docs — it does not duplicate them.
- Design decisions: [`docs/design.md`](docs/design.md) is the source of truth. - Design decisions and rationale live in the `docs/` tree (Architecture, Features, Operations, Reference, Testing).
- When the service's public surface or task→tool mapping changes, update this README and the root [`../CLAUDE.md`](../CLAUDE.md) index row. - When the service's public surface or task→tool mapping changes, update this README and the root [`../CLAUDE.md`](../CLAUDE.md) index row.
@@ -0,0 +1,148 @@
# Re-Review After Remediation — 2026-05-14
Re-review of the codebase after the six-commit remediation of the original 2026-05-14 review (Wave 1 → `ce32c5c`, Wave 2 → `e66b17f`, Wave 3 → `7ead358`, the easy 5 → `2545237`, the race-hard 5 → `53f842a`). Conducted via three parallel area-focused passes. **Eyes on what the fixes themselves introduced**, not what the original review already found.
**Scope:** every src/ and tests/ change in `53a7111..HEAD` (37 files, ~+2000/700 lines).
## Status
> **All actionable findings resolved across two re-review passes.** Wave 4 (`7a43595`) closed NC1 + NM1 + NM2 + NM5 + Nm1 + T2. Wave 4-followup (`9251c56`) closed NM3 + NM4 + Nm6 + Nm7 + T3 + T4. A third focused pass surfaced one more major (W5/M1) and two cosmetics (W5/m1, W5/m2); Wave 5 (this commit) resolved M1 + m2 and documented m1 as accepted best-effort.
>
> **Final test count:** 387 pass / 0 fail.
## Headline
The remediation was structurally sound. The re-review found:
- **1 critical finding** — the W1.5 drain loop inherited the very `ShutdownCoordinator` bug it was meant to replace. **Resolved** by Wave 4 (`7a43595`) snapshotting the in-flight count BEFORE supervisor stop and deleting the theatrical post-stop loop.
- **5 major findings** clustered in W1.4 cascade gating and W1.1 + W1.5 lifecycle ordering. **All resolved** by Wave 4 + Wave 4-followup.
- **8 minor findings** + **4 test-discipline findings**. Most resolved; the rest accepted with documented rationale.
## Resolution table
| ID | Severity | Finding | Status | Commit |
|----|----------|---------|--------|--------|
| **NC1** | Critical | `ProxyWorker.StopAsync` drain loop structurally always-zero — inherited the original `ShutdownCoordinator` bug | ✅ **Resolved** | `7a43595` |
| **NM1** | Major | `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable — disposal can be blocked indefinitely | ✅ **Resolved** | `7a43595` |
| **NM2** | Major | `ReplaceContext` writes `_ctx` and re-registers cache stats provider non-atomically — snapshot can read a disposed cache | ✅ **Resolved** | `7a43595` |
| **NM3** | Major | `_supervisorCts` leaks across `StartAsync` re-entry despite W2.16 guard | ✅ **Resolved** | (W4-followup) |
| **NM4** | Major | W2.15 TCS never re-armed — supervisor effectively single-shot | ✅ **Resolved** | (W4-followup) |
| **NM5** | Major | Self-cascade swallows `ObjectDisposedException` from `_connectGate` after disposal | ✅ **Resolved** | `7a43595` |
| **Nm1** | Minor | Saturation cleanup uses `await SendResponseAsync` (blocking) per attached pipe | ✅ **Resolved** | `7a43595` |
| **Nm2** | Minor | W1.2 increments `CoalescedHit` for late attachers that ultimately receive exception 04 | ⚪ **Accepted** | (doc) |
| **Nm3** | Minor | Both supervisor-stop and drain shared `gracefulMs` budget | ✅ **Resolved** | `7a43595` (drain deleted) |
| **Nm4** | Minor | `finally` preserves `_lastBindError` after clean cancellation | ⚪ **Accepted** | (by design) |
| **Nm5** | Minor | `EventLogBridge` no startup log of armed state | ⚪ **Accepted** | (low value) |
| **Nm6** | Minor | `_admin` lazy resolution returns `null` silently if registration absent | ✅ **Resolved** | (W4-followup) |
| **Nm7** | Minor | `AdminEndpointHost.DisposeAsync` no `_disposed` guard | ✅ **Resolved** | (W4-followup) |
| **Nm8** | Minor | `TearDownBackendAsync` cosmetic log-noise on queued cascades | ⚪ **Accepted** | (cosmetic) |
| **T1** | Test | Reflection coupling on private field names | ⚪ **Accepted** | (commented in code) |
| **T2** | Test | `WatchdogVsResponse_Race` seeded `Random` cross-runtime fragility | ✅ **Resolved** | `7a43595` |
| **T3** | Test | `RemoveInheritedAppsettings` only fires on Build, not Publish | ✅ **Resolved** | (W4-followup) |
| **T4** | Test | Stale `TryAttachOrCreate_*_ReturnsTrue_*` test method names after W3 dropped the bool | ✅ **Resolved** | (W4-followup) |
**Resolved: 13/18. Accepted: 5/18.**
## Third pass — final findings (Wave 5)
A third focused review pass on the post-W4-followup state turned up these additional items:
| ID | Severity | Finding | Status | Commit |
|----|----------|---------|--------|--------|
| **W5/M1** | Major | `AdminEndpointHost` `OnChange` callback can resurrect a Kestrel app after `StopAsync` returned (no `_disposed` check inside the fire-and-forget Task.Run lambda) | ✅ **Resolved** | (W5) |
| **W5/m1** | Minor | `TearDownBackendAsync` gate-not-held path: a concurrent freshly-allocated TxId can collide with one being released by the channel drain → silent request drop. Probability very low (gate timeout AND new accept AND TxId collision in 65,536-slot space). | ⚪ **Accepted** | (W5 — inline doc comment in `PlcMultiplexer.cs`) |
| **W5/m2** | Minor | `inFlightAtCancel` was computed AFTER `base.StopAsync` — narrower window than the field name promises | ✅ **Resolved** | (W5) |
| **W5/m3** | Cosmetic | `CountInFlight` allocates a 35-field `CounterSnapshot` record per supervisor on shutdown | ⚪ **Accepted** (skip) | — |
**W5/M1 fix detail.** Added `if (_disposed) return;` at the top of the `OnChange` lambda AND inside the queued `Task.Run`, plus `try/catch (ObjectDisposedException)` around `_lock.WaitAsync` and `_lock.Release()` so a hot-reload of `AdminPort` during shutdown can no longer resurrect a fresh Kestrel WebApplication on the new port after the host considered admin shut down.
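The double guard described above can be sketched as follows. This is a minimal illustration, not the actual `AdminEndpointHost` code: the class name `ReloadGuardSketch`, the `Rebinds` and `LastPort` members, and returning the `Task` (the real path is fire-and-forget) are assumptions made so the behaviour can be observed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for AdminEndpointHost; only the guard shape comes from the text.
public sealed class ReloadGuardSketch : IAsyncDisposable
{
    private readonly SemaphoreSlim _lock = new(1, 1);
    private volatile bool _disposed;

    public int Rebinds { get; private set; }
    public int LastPort { get; private set; }

    // Would be wired to an options-change callback in a real host.
    public Task OnChange(int newPort)
    {
        if (_disposed) return Task.CompletedTask;        // guard 1: before queuing any work
        return Task.Run(async () =>
        {
            if (_disposed) return;                       // guard 2: inside the queued work
            try { await _lock.WaitAsync(); }
            catch (ObjectDisposedException) { return; }  // lock torn down during shutdown
            try
            {
                LastPort = newPort;                      // placeholder for "restart the listener on newPort"
                Rebinds++;
            }
            finally
            {
                try { _lock.Release(); }
                catch (ObjectDisposedException) { }      // disposed while we held the lock
            }
        });
    }

    public ValueTask DisposeAsync()
    {
        _disposed = true;
        _lock.Dispose();
        return ValueTask.CompletedTask;
    }
}
```

A change notification that arrives after `DisposeAsync` returns leaves `Rebinds` untouched, which is the "no resurrection" property the fix is after.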
**W5/m2 fix detail.** Moved `int inFlightAtCancel = CountInFlight();` to BEFORE `await base.StopAsync(cancellationToken)`. Now the count actually reflects "in-flight at the moment the host signalled stop" — not "in-flight at the moment we got around to computing it after the cancel propagated."
**W5/m1 acceptance.** Documented inline at `PlcMultiplexer.cs:TearDownBackendAsync` near the `gateHeld` flag declaration. The race requires three coincidences (gate-timeout + new accept landing during cascade + TxId collision); the only consequence is one dropped request that the client retries on its next attempt.
**W5/m3 skip.** Trivial per-PLC allocation (~5 KB on a 54-PLC fleet, called once per shutdown). Optimising it would require exposing a single-field accessor on `ProxyCounters`; not worth the surface change.
---
## Resolved findings — what landed
### NC1 — `ProxyWorker.StopAsync` drain loop is structurally always-zero
**Resolution:** `7a43595` snapshots `inFlightAtCancel = CountInFlight()` BEFORE calling `supervisor.StopAsync(...)`. The post-stop drain loop and `drainCts` are deleted entirely. The supervisor stop IS the drain — there's nothing to wait for that wouldn't be killed by the stop itself. `mbproxy.shutdown.complete` now reports a meaningful "requests dropped by stop" count.
### NM1 — `TearDownBackendAsync._connectGate.WaitAsync()` uncancellable
**Resolution:** `7a43595` bounds the wait with a 2-second teardown CTS. On timeout the body proceeds best-effort without the gate; `gateHeld` flag tracks whether we acquired it so the `finally` only releases when appropriate. `ObjectDisposedException` from a disposed semaphore short-circuits to a clean return.
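A rough sketch of the bounded gate wait, assuming a `SemaphoreSlim` gate. The names `BoundedGateSketch`, `GateOutcome`, and the `teardownBody` delegate are hypothetical; only the timeout-bounded wait, the `gateHeld` flag, and the `ObjectDisposedException` short-circuit are taken from the description above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public enum GateOutcome { Held, BestEffort, AlreadyDisposed }

public static class BoundedGateSketch
{
    public static async Task<GateOutcome> TearDownAsync(
        SemaphoreSlim connectGate, Action teardownBody, TimeSpan gateTimeout)
    {
        bool gateHeld = false;
        try
        {
            // Bound the wait so disposal can never block indefinitely on the gate.
            using var teardownCts = new CancellationTokenSource(gateTimeout);
            try
            {
                await connectGate.WaitAsync(teardownCts.Token);
                gateHeld = true;
            }
            catch (OperationCanceledException)
            {
                // Timed out: proceed best-effort without the gate.
            }
            catch (ObjectDisposedException)
            {
                return GateOutcome.AlreadyDisposed; // gate already disposed: clean return
            }

            teardownBody();
            return gateHeld ? GateOutcome.Held : GateOutcome.BestEffort;
        }
        finally
        {
            // Release only if this call actually acquired the gate.
            if (gateHeld) connectGate.Release();
        }
    }
}
```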
### NM2 — `ReplaceContext` non-atomic ctx + provider swap
**Resolution:** `7a43595` swapped the order: provider FIRST, then `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new) — never (old-after-disposed).
### NM3 — `_supervisorCts` leaks across `StartAsync` re-entry
**Resolution:** W4-followup. `StartAsync` now does `try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { }` before reassigning, with the catch covering the very-first-Start case where the field-init CTS is still fresh.
### NM4 — W2.15 TCS never re-armed (supervisor single-shot)
**Resolution:** W4-followup. `_firstAttemptCompleted` is now non-readonly and re-created in `StartAsync` after the W2.16 guard. A re-Started supervisor's `WaitForInitialBindAttemptAsync` no longer observes the previous run's signal.
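The NM3 and NM4 fixes share one idea: every `StartAsync` gets fresh per-run state. A minimal sketch under that assumption follows; the class `RestartableSupervisorSketch` and its synchronous `Start`/`Stop`/`SignalFirstAttempt` methods are hypothetical simplifications, while the field names mirror those quoted above.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class RestartableSupervisorSketch
{
    private CancellationTokenSource _supervisorCts = new();

    // Non-readonly so each Start can install a fresh, unsignalled completion source.
    private TaskCompletionSource _firstAttemptCompleted = NewSignal();

    private static TaskCompletionSource NewSignal() =>
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    public void Start()
    {
        // Dispose the previous run's CTS before replacing it, so it does not leak.
        try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { }
        _supervisorCts = new CancellationTokenSource();

        // Re-arm: waiters on this run must not observe the previous run's signal.
        _firstAttemptCompleted = NewSignal();
    }

    public void SignalFirstAttempt() => _firstAttemptCompleted.TrySetResult();

    public Task WaitForInitialBindAttemptAsync() => _firstAttemptCompleted.Task;

    public void Stop() => _supervisorCts.Cancel();
}
```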
### NM5 — Self-cascade `ObjectDisposedException` after dispose
**Resolution:** `7a43595` gated the writer + reader fault-path `_ = TearDownBackendAsync(...)` calls behind `if (!_disposeCts.IsCancellationRequested)`. DisposeAsync runs an explicit teardown anyway, so the fire-and-forget path was redundant *and* exception-throwing.
### Nm1 — Saturation cleanup uses `await SendResponseAsync`
**Resolution:** `7a43595` replaced the per-party `await SendResponseAsync` with `TrySendResponse` and `IncrementResponseDropForFullUpstream` on drop. Same doctrine as W1.3.
### Nm3 — `gracefulMs` used twice
**Resolution:** Implicit — Wave 4's NC1 fix deleted the drain loop, so the budget is now used exactly once (supervisor stop). Worst-case shutdown is `gracefulMs + 2 s admin`.
### Nm6 — `_admin` lazy resolution silent null
**Resolution:** W4-followup. `ProxyWorker.ExecuteAsync` now logs a Warning when `GetService<AdminEndpointHost>()` returns null, surfacing a botched composition without blocking startup. Previous IHostedService registration would have hard-errored in DI; this preserves the loud-failure intent without forcing callers to register admin in unit-test hosts.
### Nm7 — `AdminEndpointHost.DisposeAsync` no `_disposed` guard
**Resolution:** W4-followup. Added a `volatile bool _disposed` flag and a guard at the top of `DisposeAsync`; symmetry with `PlcMultiplexer`.
### T2 — `WatchdogVsResponse_Race` seeded `Random` fragility
**Resolution:** `7a43595` replaced `new Random(12345) → rng.Next(350, 450)` with a counter-based alternation `(n & 1) == 1 ? 350 : 450`. Guaranteed 15 fast + 15 slow across 30 iterations regardless of runtime.
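The alternation is simple enough to check directly. The sketch below assumes the loop counter `n` runs from 0 to 29; the helper names are illustrative, not taken from the test code.

```csharp
using System.Linq;

public static class AlternatingDelaySketch
{
    // Odd iterations take the 350 ms branch, even iterations the 450 ms branch.
    public static int DelayMs(int n) => (n & 1) == 1 ? 350 : 450;

    // Counts how many iterations land on each branch.
    public static (int Fast, int Slow) Count(int iterations)
    {
        int fast = Enumerable.Range(0, iterations).Count(n => DelayMs(n) == 350);
        return (fast, iterations - fast);
    }
}
```

Because the split depends only on the parity of `n`, it is identical on every runtime, unlike the sequence produced by a seeded `Random`.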
### T3 — `RemoveInheritedAppsettings` only on Build
**Resolution:** W4-followup. `AfterTargets="Build;Publish"` and a second `Delete` against `$(PublishDir)appsettings.json` (guarded by `Condition="'$(PublishDir)' != ''"` so a plain Build doesn't trip it).
### T4 — Stale `TryAttachOrCreate_*` test names
**Resolution:** W4-followup. Renamed three test methods to drop `Try` and `_ReturnsTrue_` to match the W3 `AttachOrCreate` (void-returning) signature.
---
## Accepted findings — what wasn't fixed and why
### Nm2 — `CoalescedHit` increments for late attachers that get exception 04
**Why accepted:** The counter is correct per the design contract (`coalescedHit + coalescedMiss = total FC03/FC04 requests`). A late attacher *did* coalesce onto a stub; the fact that the stub later delivered an exception instead of a real response is orthogonal to the coalesce-counting semantics. Decrementing on saturation cleanup would break the parity invariant that operator dashboards rely on. Documented inline at `PlcMultiplexer.cs:809-812` (W2.4 doc clarification covers this).
### Nm4 — `_lastBindError` preserved after clean cancellation
**Why accepted:** Intentional. The field's contract is "last fault while running" — useful for operators triaging *why* a supervisor exited (was it a bind error vs a clean stop?). Clearing it on clean exit would lose forensic information. The `finally` block at `PlcListenerSupervisor.cs:435-436` already documents this with a comment.
### Nm5 — `EventLogBridge` no startup log of armed state
**Why accepted:** The bridge is a Serilog sink and doesn't have an injectable logger (would require restructuring to use `Log.ForContext` or similar). The class doc at `EventLogBridge.cs:18-20` already tells operators "must be created by `install.ps1` before the service starts". The W2.23 caching means a missing source is silent, but the architectural alternative (log from the bridge constructor using `Log.Logger`) creates a startup-ordering dependency on Serilog being fully wired before any sink construction. Low value vs the surface change.
### Nm8 — `TearDownBackendAsync` log-noise on queued cascades
**Why accepted:** Cosmetic. Each cascade logs its own `BackendDisconnected` event with its own counts; queued cascades on the gate would fire mostly-zero events. Log filtering is the operator's tool here.
### T1 — Reflection on private field names in race tests
**Why accepted:** The two reflection sites (`DrainAllocator` in `PlcMultiplexerTests`, the runtime-fault test in `SupervisorTests`) are the only externally-impossible-to-hit primitives that prove the saturation/runtime-fault contracts. The alternatives are (a) introducing test-only `internal` accessors that pollute production code, (b) exposing fields via `[InternalsVisibleTo]` (which Mbproxy.csproj already does for the test project — but the *fields* themselves are private, not internal), or (c) skipping the tests. The reflection coupling is documented as a maintenance trap in the test code. A future rename refactor breaks at run-time, not compile-time — but the tests' xmldoc explicitly warns about this.
---
## Verified clean (sampled, not exhaustive)
The original re-review listed the following as verified clean by inspection during the post-Wave-3 review pass; nothing in Wave 4 / W4-followup invalidated these:
- **W2.3 ConcurrentDictionary migration** on `_supervisors` — all mutations atomic; status-page enumeration lock-free; Restart's "remove + add" two-step is per-key (parallel keys disjoint by name).
- **W2.1 coalescingAccessor propagation** — both Add and Restart paths receive it; Reseat correctly does not (same supervisor, same multiplexer, same accessor).
- **W2.13 OOR check** — multiplication is bounded by the guard; even the worst case, `9999 * 10_000 + 9999 = 99_999_999`, fits in int32 without overflow.
- **W2.14 byteCount validation** — strict `<` check passes a perfectly-sized PDU; trailing-byte case correct.
- **W2.10 resolved-TTL re-check** — `BcdTagMapBuilder.Build` is called exactly once per PLC at validation; no duplicate work.
- **W2.18 ConnectionOptions validation** — both `MbproxyOptionsValidator` and `ReloadValidator` reject `<= 0`; no bypass path.
- **W3 `HasBadNibble` dedupe** — clean; the codec's internal helper is the single source of truth.
- **W2.15 TCS signalled in every exit path** of `RunSupervisorAsync` — no hang on `WaitForInitialBindAttemptAsync` for the first run. (W4 / NM4 added the re-arm for subsequent Starts.)
- **W2.17 `TransitionTo` lock contract** — both writers use it; `Snapshot` reads under the same lock; no torn triples.
- **`TxIdAllocator.Release` double-call is benign** (`TxIdAllocator.cs:121-129` checks `if (_inUse[id])`); the W1.4 channel drain releasing a TxId already released by the correlation drain is safe.
- **W1.1 in-PDU snapshot consistency** — `OnUpstreamFrameAsync` reads `_ctx.Cache` and `_ctx.TagMap.ResolveCacheTtlMs` non-atomically; the only mid-PDU swap visible would change cache eligibility, not produce corrupted output. Downstream `WithCurrentRequest` snapshots TagMap+Cache for the rewriter, so the rewrite itself is consistent.
- **W2.7 cache-FC byte sourced from post-rewriter buffer** — correct; the rewriter never touches the FC byte but the source must remain `frame[…]` to capture the exception bit.
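The W2.13 bound in the list above can be worked through in code. This sketch assumes the expression combines two four-digit groups (each 0 to 9999) into one integer; the document does not say what the operands are, so `BcdRangeSketch`, `TryCombine`, `high`, and `low` are hypothetical names.

```csharp
public static class BcdRangeSketch
{
    public const int MaxDigitGroup = 9999;

    // Combines two four-digit groups; the range guard runs before the multiplication.
    public static bool TryCombine(int high, int low, out int value)
    {
        value = 0;
        if (high < 0 || high > MaxDigitGroup || low < 0 || low > MaxDigitGroup)
            return false; // out of range: reject before multiplying

        // With both operands guarded, the largest result is 99_999_999,
        // far below int.MaxValue (2_147_483_647); 'checked' is belt-and-suspenders.
        value = checked(high * 10_000 + low);
        return true;
    }
}
```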
## Closed
The 2026-05-14 review series — original review → 4 remediation waves → first re-review → wave 4 + followup → second re-review → wave 5 — is now closed. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green. Every actionable finding resolved or explicitly accepted with rationale.
+3 -2
@@ -4,7 +4,7 @@ The proxy holds one persistent backend TCP socket per PLC and multiplexes many u
## Why One Backend Connection Per PLC ## Why One Backend Connection Per PLC
An earlier design opened a fresh backend socket for each accepted upstream client (1:1 pairs). That model collapsed against the **AutomationDirect H2-ECOM100**, which caps simultaneous TCP clients at **4 per PLC** (see [`../../DL260/dl205.md`](../../DL260/dl205.md) under "Behavioural Oddities"). The fifth upstream client to attach to a busy PLC was refused at connect, with no recourse other than waiting for an existing pair to drop. An earlier design opened a fresh backend socket for each accepted upstream client (1:1 pairs). That model collapsed against the **AutomationDirect H2-ECOM100**, which caps simultaneous TCP clients at **4 per PLC** (see [`../Reference/dl205.md`](../Reference/dl205.md) under "Behavioural Oddities"). The fifth upstream client to attach to a busy PLC was refused at connect, with no recourse other than waiting for an existing pair to drop.
Multiplexing replaces 1:N upstream-to-backend with N:1 upstream-to-multiplexer-to-backend: Multiplexing replaces 1:N upstream-to-backend with N:1 upstream-to-multiplexer-to-backend:
@@ -240,8 +240,9 @@ The per-request timeout watchdog described above is the production defence again
- [`./Overview.md`](./Overview.md) — proxy architecture entry point - [`./Overview.md`](./Overview.md) — proxy architecture entry point
- [`./ReadCoalescing.md`](./ReadCoalescing.md) — FC03/FC04 fan-out built on `InterestedParties` - [`./ReadCoalescing.md`](./ReadCoalescing.md) — FC03/FC04 fan-out built on `InterestedParties`
- [`./ResponseCache.md`](./ResponseCache.md) — per-PLC FC03/FC04 cache layered in front of this multiplexer - [`./ResponseCache.md`](./ResponseCache.md) — per-PLC FC03/FC04 cache layered in front of this multiplexer
- [`./Keepalive.md`](./Keepalive.md) — TCP keepalive and the backend heartbeat that keeps this socket warm
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `Connection.BackendConnectTimeoutMs`, `Connection.BackendRequestTimeoutMs`, retry tuning - [`../Operations/Configuration.md`](../Operations/Configuration.md) — `Connection.BackendConnectTimeoutMs`, `Connection.BackendRequestTimeoutMs`, retry tuning
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades` counters - [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — `inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades` counters
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.multiplex.*` structured log events - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.multiplex.*` structured log events
- [`../Testing/Simulator.md`](../Testing/Simulator.md) — pymodbus 3.13.0 deferred-handler quirk in detail - [`../Testing/Simulator.md`](../Testing/Simulator.md) — pymodbus 3.13.0 deferred-handler quirk in detail
- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205/DL260 quirks including the 4-client ECOM cap - [`../Reference/dl205.md`](../Reference/dl205.md) — DL205/DL260 quirks including the 4-client ECOM cap
+76
@@ -0,0 +1,76 @@
# Keepalive & Connection Monitoring
The DL205/DL260 ECOM does not emit TCP keepalives (see [`../Reference/dl205.md`](../Reference/dl205.md) → "Behavioural Oddities"). An idle socket can be silently dropped by a middlebox (switch, firewall, NAT), typically after 2-5 minutes. The proxy holds one **persistent backend socket per PLC** ([`./ConnectionModel.md`](./ConnectionModel.md)) plus many accepted upstream client sockets, so it needs its own keepalive on both sides.
Keepalive is **enabled by default** and is governed by the `Connection.Keepalive` option block (see [`../Operations/Configuration.md`](../Operations/Configuration.md)). Set `Connection.Keepalive.Enabled = false` to restore pre-keepalive behaviour exactly.
## Two mechanisms
| Mechanism | Scope | Detects |
|-----------|-------|---------|
| OS TCP keepalive (`SO_KEEPALIVE`) | Backend socket **and** accepted upstream sockets | A peer whose TCP stack is gone (host down, cable pulled, half-open socket). |
| Application heartbeat (FC03 probe) | Backend socket only | The above **plus** a middlebox idle-drop and an ECOM that is connected-but-not-answering Modbus. |
The application heartbeat is the load-bearing mechanism; OS keepalive is a cheap belt-and-suspenders that also covers the window between heartbeat ticks.
## Backend: OS TCP keepalive
`SocketKeepalive.Apply` sets `SO_KEEPALIVE` plus the idle-time / probe-interval / probe-count tunables on the backend `Socket` right after it is created in `PlcMultiplexer.EnsureBackendConnectedAsync`. The tunables come from `Connection.Keepalive.Tcp*`. Socket options are applied **at connect time** — a hot-reload of the `Tcp*` values only affects backend sockets opened *after* the change.
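As an illustration of what such a helper might do, the sketch below uses the standard .NET socket options. The method signature and parameter names are assumptions; the real `SocketKeepalive.Apply` and the exact `Tcp*` option names may differ.

```csharp
using System.Net.Sockets;

public static class SocketKeepaliveSketch
{
    // Hypothetical signature. The three TCP-level tunables are expressed in seconds / probe count.
    public static void Apply(Socket socket, int idleSeconds, int intervalSeconds, int probeCount)
    {
        // Turn keepalive on for this socket.
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);

        // Idle time before the first probe, gap between probes, and probes before giving up.
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, idleSeconds);
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, intervalSeconds);
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, probeCount);
    }
}
```

Because these are per-socket options, they are set once when the socket is created and are not revisited afterwards, which is consistent with a reload only affecting sockets opened later.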
## Backend: application heartbeat
A per-`PlcMultiplexer` background loop (`RunBackendHeartbeatAsync`) is started alongside the backend writer and reader on each successful connect, under the same `_backendCts`, and dies with them on teardown.
- The multiplexer tracks `_lastBackendActivityUtc`, updated by **both** the writer (on every send) and the reader (on every received frame). Real traffic in either direction therefore suppresses the heartbeat.
- Each tick (a quarter of `BackendHeartbeatIdleMs`, floored at 500 ms), if the socket has been idle longer than `BackendHeartbeatIdleMs`, the loop issues a **synthetic FC03 qty=1 read** at `BackendHeartbeatProbeAddress` (default 0 = `V0`, valid on DL205/DL260). FC08 (Diagnostics) is **not** supported by the DL260 ECOM, so the probe must be a real register read.
- The probe targets the unit ID of the most recent upstream request, so it reaches the same Modbus unit the real clients successfully use.
- The probe takes a real proxy TxId and a `CorrelationMap` entry flagged `InFlightRequest.IsHeartbeat`. It is enqueued straight onto the backend outbound channel, **bypassing** the read-coalescing and response-cache paths.
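The tick arithmetic and the probe frame from the list above can be sketched as follows. The helper names are hypothetical; the frame layout is the standard Modbus TCP MBAP header followed by an FC03 request PDU.

```csharp
using System;
using System.Buffers.Binary;

public static class HeartbeatSketch
{
    // Tick period: a quarter of the idle threshold, never below 500 ms.
    public static int TickMs(int backendHeartbeatIdleMs) =>
        Math.Max(backendHeartbeatIdleMs / 4, 500);

    // A probe is due only when no traffic has moved in either direction for longer than the threshold.
    public static bool IsIdle(DateTime lastActivityUtc, DateTime nowUtc, int backendHeartbeatIdleMs) =>
        (nowUtc - lastActivityUtc).TotalMilliseconds > backendHeartbeatIdleMs;

    // Builds a 12-byte Modbus TCP FC03 request that reads one register.
    public static byte[] BuildProbe(ushort txId, byte unitId, ushort address)
    {
        var frame = new byte[12];
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0), txId);    // transaction id
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2), 0);       // protocol id, always 0
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4), 6);       // bytes after the length field
        frame[6] = unitId;                                               // unit id of the last upstream request
        frame[7] = 0x03;                                                 // FC03: read holding registers
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(8), address); // probe address
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(10), 1);      // quantity = 1
        return frame;
    }
}
```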
### Heartbeat response
The backend reader recognises an `IsHeartbeat` correlation entry, refreshes the idle timer (already done on frame receipt), frees the TxId, and **drops the payload** — no rewriter, no cache write-through, no fan-out, and no round-trip-EWMA sample (the synthetic probe never pollutes the client-facing RTT metric).
### Heartbeat timeout
If a probe is not answered within `BackendRequestTimeoutMs`, the per-request timeout watchdog ([`./ConnectionModel.md`](./ConnectionModel.md) → "Per-Request Timeout Watchdog") finds the stale `IsHeartbeat` entry and — instead of dispatching a 0x0B exception to a (non-existent) upstream party — calls `TearDownBackendAsync`, cascading every attached upstream pipe.
This is a **proactive** version of the existing backend-disconnect cascade: the dead path is found during idle instead of corrupting the next real client request. Reconnect stays lazy — the heartbeat keeps an *existing* backend warm, it never resurrects a dead one and adds no eager-reconnect spinner. Clients reconnect on their next request, exactly as for an organic cascade.
`BackendHeartbeatIdleMs` must be greater than `BackendRequestTimeoutMs` (enforced by the reload validator) — a heartbeat interval at or below the request timeout would fire continuously.
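The watchdog branch and the validator rule in this section reduce to two small decisions. The sketch below is an illustration only; `TimeoutAction`, `WatchdogSketch`, and the parameter names are hypothetical and do not reflect the real types.

```csharp
public enum TimeoutAction { None, SendException0B, TearDownBackend }

public static class WatchdogSketch
{
    // Decides what the watchdog does with one in-flight entry of the given age.
    public static TimeoutAction OnTick(bool isHeartbeat, double ageMs, int backendRequestTimeoutMs)
    {
        if (ageMs <= backendRequestTimeoutMs) return TimeoutAction.None;

        // A stale heartbeat has no upstream party to notify, so the backend is torn down instead.
        return isHeartbeat ? TimeoutAction.TearDownBackend : TimeoutAction.SendException0B;
    }

    // The reload-validator rule: the idle threshold must strictly exceed the request timeout.
    public static bool IsValidConfig(int backendHeartbeatIdleMs, int backendRequestTimeoutMs) =>
        backendHeartbeatIdleMs > backendRequestTimeoutMs;
}
```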
## Upstream: OS TCP keepalive
`SocketKeepalive.Apply` is also called on each accepted client `Socket` in the `UpstreamPipe` constructor. This is the **only** standard keepalive available on the upstream side: Modbus TCP is strictly client-initiated, so the proxy — a server to its clients — cannot send an unsolicited application heartbeat to a client. OS keepalive lets the proxy's TCP stack probe each client; a dead or half-open client then faults the pipe's read loop, the pipe is disposed, and its correlation / coalescing slots are freed instead of leaking until the proxy next tries to write.
## Counters
Per-PLC, exposed on the status page (see [`../Operations/StatusPage.md`](../Operations/StatusPage.md)):
| Counter | Meaning |
|---------|---------|
| `backendHeartbeatsSent` | Heartbeat probes issued on idle backend sockets. |
| `backendHeartbeatsFailed` | Probes not answered within `BackendRequestTimeoutMs`. |
| `backendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat (event count — distinct from `disconnectCascades`, which counts cascaded pipes). |
## Log events
`mbproxy.keepalive.*` — see [`../Reference/LogEvents.md`](../Reference/LogEvents.md):
- `mbproxy.keepalive.heartbeat.sent` (Debug)
- `mbproxy.keepalive.heartbeat.timeout` (Warning)
- `mbproxy.keepalive.backend.idle_disconnect` (Information)
## Hot reload
`Connection.Keepalive` is read through a live accessor (`Func<KeepaliveOptions>`), so a reload of `appsettings.json` propagates without a listener restart:
- The **heartbeat** interval and probe address are re-read on every loop tick.
- The **TCP socket options** are applied at connect/accept time, so a reload affects only sockets opened after the change.
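A minimal sketch of the live-accessor idea, assuming the options are modelled as an immutable record. `KeepaliveOptionsSketch` and `HeartbeatLoopSketch` are hypothetical, and whether `BackendHeartbeatIdleMs` lives directly on the keepalive options type is an assumption.

```csharp
using System;

// Hypothetical shape; property names follow the options mentioned in this document.
public sealed record KeepaliveOptionsSketch(bool Enabled, int BackendHeartbeatIdleMs, int BackendHeartbeatProbeAddress);

public sealed class HeartbeatLoopSketch
{
    private readonly Func<KeepaliveOptionsSketch> _options;

    public HeartbeatLoopSketch(Func<KeepaliveOptionsSketch> options) => _options = options;

    // Called once per tick: re-reads the live options instead of caching a snapshot,
    // so a configuration reload takes effect on the next tick.
    public int CurrentTickMs()
    {
        var o = _options();
        return Math.Max(o.BackendHeartbeatIdleMs / 4, 500);
    }
}
```

Passing a delegate rather than a value is what lets the loop pick up a reload without being restarted.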
## Related documentation
- [`./ConnectionModel.md`](./ConnectionModel.md) — backend socket lifecycle, the timeout watchdog, and the disconnect cascade this feature hooks into
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — the `Connection.Keepalive` option block
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — keepalive counters
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.keepalive.*` events
- [`../Reference/dl205.md`](../Reference/dl205.md) — the device "no keepalive" oddity and FC03/FC08 support
+1 -3
@@ -6,7 +6,7 @@ This document is the entry point for readers new to the codebase. It sketches th
## Runtime Shape ## Runtime Shape
The process is a single .NET 10 Generic Host worker. `Microsoft.Extensions.Hosting.WindowsServices` registers the host as a Windows Service so the same binary runs interactively (for development) or under the SCM (in production). All configuration binds from `appsettings.json` through `IOptionsMonitor<MbproxyOptions>`, which makes the tag list and PLC roster hot-reloadable without process restart. `ProxyWorker` is the long-lived `BackgroundService` that owns startup, shutdown, and the listener supervisors for every PLC. A small Kestrel admin endpoint runs in the same process to serve the read-only status page. The process is a single .NET 10 Generic Host worker. It registers both `Microsoft.Extensions.Hosting.WindowsServices` and `Microsoft.Extensions.Hosting.Systemd` — each a no-op off its own init system — so the same binary runs interactively (for development), as a Windows Service under the SCM, or as a Linux systemd unit. All configuration binds from `appsettings.json` through `IOptionsMonitor<MbproxyOptions>`, which makes the tag list and PLC roster hot-reloadable without process restart. `ProxyWorker` is the long-lived `BackgroundService` that owns startup, shutdown, and the listener supervisors for every PLC. A small Kestrel admin endpoint runs in the same process to serve the read-only status page.
There is no in-process database, no message broker, and no persistent cache file: state is per-PLC, in-memory, and ephemeral. Restarting the service drops every in-flight request and every cached response. Upstream clients are expected to reconnect and reissue; the proxy never replays a request on their behalf. There is no in-process database, no message broker, and no persistent cache file: state is per-PLC, in-memory, and ephemeral. Restarting the service drops every in-flight request and every cached response. Upstream clients are expected to reconnect and reissue; the proxy never replays a request on their behalf.
@@ -145,6 +145,4 @@ The simulator used by the end-to-end test suite — a `pymodbus`-based stand-in
- [`../Operations/Configuration.md`](../Operations/Configuration.md) — `appsettings.json` schema and tag list shape. - [`../Operations/Configuration.md`](../Operations/Configuration.md) — `appsettings.json` schema and tag list shape.
- [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — the Kestrel admin endpoint and counter catalog. - [`../Operations/StatusPage.md`](../Operations/StatusPage.md) — the Kestrel admin endpoint and counter catalog.
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — stable structured log event names. - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — stable structured log event names.
- [`../design.md`](../design.md) — canonical design decisions and rationale.
- [`../Testing/Simulator.md`](../Testing/Simulator.md) — `pymodbus` DL205 simulator used by the end-to-end suite. - [`../Testing/Simulator.md`](../Testing/Simulator.md) — `pymodbus` DL205 simulator used by the end-to-end suite.
- [`../plan/README.md`](../plan/README.md) — phase plan with per-phase test inventory.
+11 -2
@@ -303,6 +303,17 @@ cached read is still consistent with the device's actual state. Skipping
the invalidation matches reality — the write did not take effect, so the the invalidation matches reality — the write did not take effect, so the
read is not stale. read is not stale.
The skip is **structural**, not conditional. Cache invalidation only
fires inside the per-PLC backend reader task, after a non-exception
FC06/FC16 response arrives from the PLC. A `recovering` supervisor has
torn down its multiplexer and there is no backend reader, so no response
can land and the invalidation path is never entered. This is the
reasoning the code at `Proxy/Multiplexing/PlcMultiplexer.cs` documents
inline (W2.9). If a future change ever produced a write response off the
live backend (e.g. a mocked-response path), an explicit `Recovering`
check would need to be added at the invalidator call site to keep the
skip semantics correct.
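The gating described above can be sketched in a few lines. This is a minimal Python illustration, not the code in `PlcMultiplexer.cs`; the names `is_exception` and `should_invalidate` are hypothetical. It relies only on the standard Modbus rule that an exception response echoes the function code with the high bit set.

```python
# FC06 = write single register, FC16 (0x10) = write multiple registers.
WRITE_FUNCTION_CODES = {0x06, 0x10}


def is_exception(function_code: int) -> bool:
    """Modbus exception responses echo the request function code with bit 7 set."""
    return bool(function_code & 0x80)


def should_invalidate(response_function_code: int) -> bool:
    """Return True only for a successful FC06/FC16 response read off the backend."""
    if is_exception(response_function_code):
        # The write did not take effect, so cached reads are not stale.
        return False
    return response_function_code in WRITE_FUNCTION_CODES
```

In this model a `recovering` PLC never reaches `should_invalidate` at all, because nothing is reading responses from the backend. An explicit `Recovering` check of the kind mentioned above would appear here as an extra argument.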
## No Persistence ## No Persistence
The cache is purely in-memory. Process restart wipes every entry. There The cache is purely in-memory. Process restart wipes every entry. There
@@ -394,5 +405,3 @@ configuration described above.
`mbproxy.cache.*` event catalogue with event IDs. `mbproxy.cache.*` event catalogue with event IDs.
- [`../Testing/Simulator.md`](../Testing/Simulator.md) — the - [`../Testing/Simulator.md`](../Testing/Simulator.md) — the
`pymodbus` DL205 stand-in used by the end-to-end cache tests. `pymodbus` DL205 stand-in used by the end-to-end cache tests.
- [`../design.md`](../design.md) — canonical design decisions and
rationale.
+3 -3
@@ -4,7 +4,7 @@ The BCD rewriter is the inline codec that translates DirectLOGIC's native Binary
## Why BCD Rewriting Exists ## Why BCD Rewriting Exists
The DL205 / DL260 family stores numeric V-memory register values in native BCD, not binary. The decimal integer `1234` in `V2000` lands on the Modbus wire as `0x1234` (nibbles `1`, `2`, `3`, `4`) — not as the binary `0x04D2`. See [`../../DL260/dl205.md`](../../DL260/dl205.md) for the device-side rationale and the V-memory ↔ Modbus translation rules. The DL205 / DL260 family stores numeric V-memory register values in native BCD, not binary. The decimal integer `1234` in `V2000` lands on the Modbus wire as `0x1234` (nibbles `1`, `2`, `3`, `4`) — not as the binary `0x04D2`. See [`../Reference/dl205.md`](../Reference/dl205.md) for the device-side rationale and the V-memory ↔ Modbus translation rules.
Upstream consumers (Wonderware, Historian, OPC UA gateways, generic Modbus clients written against the standard) expect plain binary integers. Asking every consumer to BCD-decode the wire is brittle: each consumer would carry the same tag list, the same word-order quirks, and the same risk of drift. The rewriter centralises that translation so the rest of the world sees plain `Int16` / `Int32` and the proxy is the single source of truth for "which addresses are BCD." Upstream consumers (Wonderware, Historian, OPC UA gateways, generic Modbus clients written against the standard) expect plain binary integers. Asking every consumer to BCD-decode the wire is brittle: each consumer would carry the same tag list, the same word-order quirks, and the same risk of drift. The rewriter centralises that translation so the rest of the world sees plain `Int16` / `Int32` and the proxy is the single source of truth for "which addresses are BCD."
@@ -18,7 +18,7 @@ A 32-bit BCD value spans a register pair at `Address` and `Address+1` in CDAB (l
- The register at `Address+1` holds the **high 4 BCD digits**. - The register at `Address+1` holds the **high 4 BCD digits**.
- Decoded decimal = `Decode16(high) * 10_000 + Decode16(low)`. - Decoded decimal = `Decode16(high) * 10_000 + Decode16(low)`.
This follows directly from DirectLOGIC's CDAB word convention (see [`../../DL260/dl205.md`](../../DL260/dl205.md) → Word Order). This follows directly from DirectLOGIC's CDAB word convention (see [`../Reference/dl205.md`](../Reference/dl205.md) → Word Order).
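The decode rule can be sketched in Python as follows. `decode16` and `decode32` are illustrative names mirroring the `Decode16` formula above, not the rewriter's actual API. Raising on a non-decimal nibble is a choice made for this sketch; how the proxy treats invalid BCD is not shown in this section.

```python
def decode16(word: int) -> int:
    """Decode one 16-bit register holding four BCD digits, e.g. 0x1234 -> 1234."""
    value = 0
    for shift in (12, 8, 4, 0):
        digit = (word >> shift) & 0xF
        if digit > 9:
            raise ValueError(f"0x{word:04X} is not valid BCD")
        value = value * 10 + digit
    return value


def decode32(low_word: int, high_word: int) -> int:
    """Decode a CDAB register pair: low word at Address, high word at Address+1."""
    return decode16(high_word) * 10_000 + decode16(low_word)
```

For the register pair `[0x1234][0x5678]`, `decode32(0x1234, 0x5678)` evaluates to `5678 * 10_000 + 1234 = 56781234`.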
Worked example — the register pair `[0x1234][0x5678]` reads on the wire as the low word `0x1234` first and the high word `0x5678` second: Worked example — the register pair `[0x1234][0x5678]` reads on the wire as the low word `0x1234` first and the high word `0x5678` second:
@@ -249,4 +249,4 @@ A few invariants the rewriter relies on and the test suite enforces:
- [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) — diagnosing partial-overlap warnings - [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md) — diagnosing partial-overlap warnings
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.rewrite.*` event catalogue - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — `mbproxy.rewrite.*` event catalogue
- [`../Testing/Simulator.md`](../Testing/Simulator.md) — the `dl205.json` simulator profile that encodes BCD test fixtures - [`../Testing/Simulator.md`](../Testing/Simulator.md) — the `dl205.json` simulator profile that encodes BCD test fixtures
- [`../../DL260/dl205.md`](../../DL260/dl205.md) — DL205 / DL260 BCD encoding, CDAB word order, and V-memory ↔ Modbus translation - [`../Reference/dl205.md`](../Reference/dl205.md) — DL205 / DL260 BCD encoding, CDAB word order, and V-memory ↔ Modbus translation
+1 -1
@@ -6,7 +6,7 @@ A save to `appsettings.json` propagates to a running `mbproxy` without restartin
`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`. Every consumer reads its options through `IOptionsMonitor<MbproxyOptions>` instead of capturing a one-shot `IOptions<T>` snapshot at construction. When the framework's `FileSystemWatcher` sees the file change, it re-parses the JSON, re-binds the option tree, and notifies subscribers through `IOptionsMonitor.OnChange`. `Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`. Every consumer reads its options through `IOptionsMonitor<MbproxyOptions>` instead of capturing a one-shot `IOptions<T>` snapshot at construction. When the framework's `FileSystemWatcher` sees the file change, it re-parses the JSON, re-binds the option tree, and notifies subscribers through `IOptionsMonitor.OnChange`.
The chosen mechanism is deliberate. There is no custom file watcher, no IPC channel, no admin-port mutation endpoint, and no SIGHUP-style trigger. An operator edits the file in place (or a deployment tool atomically rewrites it) and the running service catches up. The reload contract is identical whether the service is running interactively or as a Windows Service under the SCM. The chosen mechanism is deliberate. There is no custom file watcher, no IPC channel, no admin-port mutation endpoint, and no SIGHUP-style trigger. An operator edits the file in place (or a deployment tool atomically rewrites it) and the running service catches up. The reload contract is identical whether the service is running interactively, as a Windows Service under the SCM, or as a Linux systemd unit.
The `OnChange` callback can fire multiple times for a single logical save because text editors on Windows commonly use a rename-and-replace pattern that produces two or three `FileSystemWatcher` events. The reconciler debounces these inside its own background loop with a 250 ms quiescent window so a single save produces a single apply. The `OnChange` callback can fire multiple times for a single logical save because text editors on Windows commonly use a rename-and-replace pattern that produces two or three `FileSystemWatcher` events. The reconciler debounces these inside its own background loop with a 250 ms quiescent window so a single save produces a single apply.
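The debounce behaviour can be modelled as a pure function over event timestamps. This is a sketch of a trailing-edge debounce under the stated 250 ms window, not the reconciler's actual loop; `collapse_events` is a hypothetical name, and treating a gap of exactly 250 ms as the start of a new burst is an assumption.

```python
def collapse_events(event_times_ms, quiet_window_ms=250):
    """Collapse raw change notifications into apply times.

    A burst ends once no further event arrives within quiet_window_ms of the
    previous one; the apply fires quiet_window_ms after the burst's last event.
    """
    apply_times = []
    last_event = None
    for t in sorted(event_times_ms):
        if last_event is not None and t - last_event >= quiet_window_ms:
            # The previous burst went quiet before this event arrived.
            apply_times.append(last_event + quiet_window_ms)
        last_event = t
    if last_event is not None:
        apply_times.append(last_event + quiet_window_ms)
    return apply_times
```

Three watcher events at 0, 40, and 90 ms (one editor save) yield a single apply at 340 ms, while a second save a second later produces its own apply.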
+33 -4
@@ -7,8 +7,11 @@
The configuration loader resolves `appsettings.json` relative to the executable. The configuration loader resolves `appsettings.json` relative to the executable.
- **Development run** (`dotnet run`): `src/Mbproxy/appsettings.json` next to the build output. - **Development run** (`dotnet run`): `src/Mbproxy/appsettings.json` next to the build output.
- **Single-file publish** (`dotnet publish -c Release -r win-x64`): `appsettings.json` next to `Mbproxy.exe` in the publish folder. - **Single-file publish** (`dotnet publish -c Release -r <rid>`): `appsettings.json` next to the published binary. A `win-x64` publish ships `install/mbproxy.config.template.json`; a `linux-x64` publish ships `install/mbproxy.linux.config.template.json` (same keys, Unix log path) — each linked into the bundle as `appsettings.json`.
- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`. The install script copies the template at `install/mbproxy.config.template.json` to this path the first time only — an existing file is preserved across reinstalls. - **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`, seeded by `install.ps1` from `mbproxy.config.template.json`.
- **Installed as a systemd unit**: `/etc/mbproxy/appsettings.json` (the unit's `WorkingDirectory`), seeded by `install.sh` from the Linux template.
In both installed cases the install script copies the template only when no config already exists — an existing file is preserved across reinstalls.
The file is loaded with `reloadOnChange: true`. All consumers read through `IOptionsMonitor<MbproxyOptions>`, so a save propagates without restarting the service. See [`../Features/HotReload.md`](../Features/HotReload.md) for per-key propagation semantics. The file is loaded with `reloadOnChange: true`. All consumers read through `IOptionsMonitor<MbproxyOptions>`, so a save propagates without restarting the service. See [`../Features/HotReload.md`](../Features/HotReload.md) for per-key propagation semantics.
@@ -51,11 +54,19 @@ Every supported key under `Mbproxy:*`, populated to a representative default:
// Read-only HTTP status page. Set to 0 to disable. // Read-only HTTP status page. Set to 0 to disable.
"AdminPort": 8080, "AdminPort": 8080,
// Backend connection / request / shutdown timeouts. // Backend connection / request / shutdown timeouts and keepalive.
"Connection": { "Connection": {
"BackendConnectTimeoutMs": 3000, "BackendConnectTimeoutMs": 3000,
"BackendRequestTimeoutMs": 3000, "BackendRequestTimeoutMs": 3000,
"GracefulShutdownTimeoutMs": 10000 "GracefulShutdownTimeoutMs": 10000,
"Keepalive": {
"Enabled": true,
"TcpIdleTimeMs": 30000,
"TcpProbeIntervalMs": 5000,
"TcpProbeCount": 4,
"BackendHeartbeatIdleMs": 30000,
"BackendHeartbeatProbeAddress": 0
}
}, },
// Polly resilience policies. // Polly resilience policies.
@@ -86,6 +97,8 @@ Every supported key under `Mbproxy:*`, populated to a representative default:
`Serilog` configuration is documented in [`./Troubleshooting.md`](./Troubleshooting.md) and lives outside the `Mbproxy` section. `Serilog` configuration is documented in [`./Troubleshooting.md`](./Troubleshooting.md) and lives outside the `Mbproxy` section.
> The Windows Event Log sink is **not** the standard `Serilog.Sinks.EventLog` package. It is a custom `EventLogBridge` (`src/Mbproxy/Diagnostics/EventLogBridge.cs`) that writes Error+ events to the `mbproxy` source under `Application` only when the service runs under the SCM. Event Log source registration is intentionally NOT attempted at runtime (the service account may not be admin); `install.ps1` registers the source at install time. Don't add `Serilog.Sinks.EventLog` — the bridge would duplicate every event. The bridge caches the source-exists check at construction (Phase 12 / W2.23), so a missing source produces no per-event registry traffic.
## `Mbproxy.AdminPort` ## `Mbproxy.AdminPort`
Port for the read-only HTTP status server. Binds to all interfaces on startup. Port for the read-only HTTP status server. Binds to all interfaces on startup.
@@ -167,6 +180,21 @@ Operational sizing notes:
- A 3 s request timeout is generous compared with typical DL205/DL260 scan times (a few ms to tens of ms for FC03 of 100 registers). The slack absorbs PLC scan-overlap jitter without faulting the upstream client. - A 3 s request timeout is generous compared with typical DL205/DL260 scan times (a few ms to tens of ms for FC03 of 100 registers). The slack absorbs PLC scan-overlap jitter without faulting the upstream client.
- `GracefulShutdownTimeoutMs` should be less than the Service Control Manager's stop deadline. The default 10 s suits a fleet of 54 PLCs; on a much larger fleet, raise both the SCM wait hint and this value in lockstep. - `GracefulShutdownTimeoutMs` should be less than the Service Control Manager's stop deadline. The default 10 s suits a fleet of 54 PLCs; on a much larger fleet, raise both the SCM wait hint and this value in lockstep.
## `Mbproxy.Connection.Keepalive`
TCP keepalive and backend heartbeat settings. Source: `KeepaliveOptions.cs`. Enabled by default — the DL205/DL260 ECOM never emits TCP keepalives, so an idle socket can otherwise be silently dropped by middleboxes after 2-5 minutes. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md) for the full design.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `Enabled` | bool | `true` | Master switch. When `false`, neither `SO_KEEPALIVE` nor the backend heartbeat is applied and the proxy behaves exactly as a pre-keepalive build. |
| `TcpIdleTimeMs` | int | `30000` | `SO_KEEPALIVE` idle time before the OS sends its first probe. Applied to the backend socket and accepted upstream sockets. |
| `TcpProbeIntervalMs` | int | `5000` | `SO_KEEPALIVE` interval between probes once idle. |
| `TcpProbeCount` | int | `4` | `SO_KEEPALIVE` unanswered probes before the OS declares the socket dead. |
| `BackendHeartbeatIdleMs` | int | `30000` | After this much backend idle, the proxy issues a synthetic FC03 qty=1 read to keep the path warm and prove the ECOM still answers Modbus. Must be greater than `BackendRequestTimeoutMs`. |
| `BackendHeartbeatProbeAddress` | int | `0` | Modbus PDU address the heartbeat FC03 probe reads. Address `0` (`V0`) is valid on DL205/DL260 in factory absolute mode. Range `[0, 65535]`. |
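As a rough illustration of what the three `Tcp*` fields map to at the socket level, the sketch below applies the same defaults with Python's standard `socket` module. The proxy itself is a .NET service, so this is not its code; `apply_keepalive` is a hypothetical helper, and the per-socket timer constants are guarded because not every platform exposes them.

```python
import socket


def apply_keepalive(sock, idle_ms=30000, interval_ms=5000, probe_count=4):
    """Enable SO_KEEPALIVE and, where available, the per-socket keepalive timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The TCP_KEEP* options take whole seconds; each constant is platform-dependent.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, max(1, idle_ms // 1000))
    if hasattr(socket, "TCP_KEEPINTVL"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, max(1, interval_ms // 1000))
    if hasattr(socket, "TCP_KEEPCNT"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probe_count)
```

With the defaults, a silent peer would be declared dead after roughly 30 s of idle plus 4 probes at 5 s intervals, about 50 s in total.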
On hot reload, the heartbeat interval and probe address are re-read on every loop tick. The `Tcp*` socket options are applied at connect/accept time, so a reload affects only sockets opened after the change. A reload where `BackendHeartbeatIdleMs <= BackendRequestTimeoutMs` is rejected — a heartbeat interval at or below the request timeout would fire continuously.
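The two constraints stated in this section can be expressed as a small validation routine. This is a sketch only: `validate_keepalive` and its error strings are hypothetical, and the real validator may check more than these two rules.

```python
def validate_keepalive(heartbeat_idle_ms, request_timeout_ms, probe_address):
    """Return a list of validation errors; an empty list means the values are acceptable."""
    errors = []
    if heartbeat_idle_ms <= request_timeout_ms:
        errors.append("BackendHeartbeatIdleMs must be greater than BackendRequestTimeoutMs")
    if not 0 <= probe_address <= 65535:
        errors.append("BackendHeartbeatProbeAddress must be in [0, 65535]")
    return errors
```

The shipped defaults (`30000`, `3000`, `0`) pass; setting the heartbeat idle time equal to the request timeout does not.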
## `Mbproxy.Resilience` ## `Mbproxy.Resilience`
Polly retry pipelines for backend connect, listener bind, and the in-flight read coalescer. Source: `ResilienceOptions.cs`. Polly retry pipelines for backend connect, listener bind, and the in-flight read coalescer. Source: `ResilienceOptions.cs`.
@@ -389,6 +417,7 @@ A reduced view of [`../Features/HotReload.md`](../Features/HotReload.md), restri
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream connections for that PLC. | | `Plcs[i]` removed | Supervisor stops the listener and closes all upstream connections for that PLC. |
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. | | `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
| `Connection.Backend*TimeoutMs` | Next backend connect or request uses the new value. | | `Connection.Backend*TimeoutMs` | Next backend connect or request uses the new value. |
| `Connection.Keepalive` heartbeat fields | Re-read on every heartbeat loop tick. `Tcp*` socket options apply to backend/upstream sockets opened after the change. |
| `AdminPort` | Requires a service restart — the Kestrel admin host is built once at startup. | | `AdminPort` | Requires a service restart — the Kestrel admin host is built once at startup. |
| `Resilience.ReadCoalescing.Enabled` | Hot-reloadable; in-flight coalesced entries drain naturally. | | `Resilience.ReadCoalescing.Enabled` | Hot-reloadable; in-flight coalesced entries drain naturally. |
| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` | Tag-map reseat for the affected PLC drops that PLC's entire cache. | | `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` | Tag-map reseat for the affected PLC drops that PLC's entire cache. |
+22 -7
@@ -135,6 +135,16 @@ These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is
| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. | | `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. |
| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. | | `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |
### Keepalive counters
These fields describe the backend keepalive heartbeat. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.backendHeartbeatsSent` | `long` | `CounterSnapshot.BackendHeartbeatsSent` | Synthetic FC03 heartbeat probes issued on this PLC's idle backend socket. |
| `backend.backendHeartbeatsFailed` | `long` | `CounterSnapshot.BackendHeartbeatsFailed` | Heartbeat probes not answered within `BackendRequestTimeoutMs`. Each failure tears the backend down. |
| `backend.backendIdleDisconnects` | `long` | `CounterSnapshot.BackendIdleDisconnects` | Backend teardowns triggered by a failed heartbeat — an event count, distinct from `disconnectCascades` (which counts cascaded pipes). Sustained growth means a PLC is repeatedly going dark while idle. |
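A monitoring script might consume these three fields as sketched below. The JSON paths follow the table above; `heartbeat_failures` is a hypothetical helper, and PLCs are identified by array index because the per-PLC name key is not shown in this section.

```python
import json


def heartbeat_failures(status_json_text):
    """Return (index, sent, failed, idle_disconnects) for PLCs with a failed heartbeat."""
    status = json.loads(status_json_text)
    flagged = []
    for index, plc in enumerate(status.get("plcs", [])):
        backend = plc.get("backend", {})
        failed = backend.get("backendHeartbeatsFailed", 0)
        if failed > 0:
            flagged.append((
                index,
                backend.get("backendHeartbeatsSent", 0),
                failed,
                backend.get("backendIdleDisconnects", 0),
            ))
    return flagged
```

Feeding it the body of `/status.json` returns an empty list for a healthy fleet.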
### Bytes ### Bytes
| JSON path | Type | Source | Meaning | | JSON path | Type | Source | Meaning |
@@ -224,7 +234,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 88691, "cacheMissCount": 88691,
"cacheInvalidations": 6203, "cacheInvalidations": 6203,
"cacheEntryCount": 47, "cacheEntryCount": 47,
"cacheBytes": 18512 "cacheBytes": 18512,
"backendHeartbeatsSent": 412,
"backendHeartbeatsFailed": 0,
"backendIdleDisconnects": 0
}, },
"bytes": { "bytes": {
"upstreamIn": 4108290, "upstreamIn": 4108290,
@@ -267,7 +280,10 @@ A representative two-PLC deployment, ~2 hours into a run:
"cacheMissCount": 0, "cacheMissCount": 0,
"cacheInvalidations": 0, "cacheInvalidations": 0,
"cacheEntryCount": 0, "cacheEntryCount": 0,
"cacheBytes": 0 "cacheBytes": 0,
"backendHeartbeatsSent": 0,
"backendHeartbeatsFailed": 0,
"backendIdleDisconnects": 0
}, },
"bytes": { "upstreamIn": 0, "upstreamOut": 0 } "bytes": { "upstreamIn": 0, "upstreamOut": 0 }
} }
@@ -282,10 +298,10 @@ The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy
Structure: Structure:
1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable. 1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable.
2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell. 2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell, keepalive cell.
3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span. 3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span.
The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet). The coalescing and cache cells each render as `<pct>% (<hits>)`. When neither has been exercised (`hit + miss == 0`), the cell renders an em-dash to keep the column narrow. The keepalive cell shows the heartbeat-sent count, with `(fail N, idle-disc N)` appended only when either is non-zero. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
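The cell formats described above can be sketched as two small functions. These are illustrative Python stand-ins for the logic in `StatusHtmlRenderer`, not its actual code; the function names and the whole-number percentage rounding are assumptions.

```python
EM_DASH = "\u2014"


def ratio_cell(hits, misses):
    """Render '<pct>% (<hits>)', or an em-dash when the feature was never exercised."""
    total = hits + misses
    if total == 0:
        return EM_DASH
    return f"{100 * hits / total:.0f}% ({hits})"


def keepalive_cell(sent, failed, idle_disconnects):
    """Render the heartbeat-sent count, adding failure detail only when non-zero."""
    if failed == 0 and idle_disconnects == 0:
        return str(sent)
    return f"{sent} (fail {failed}, idle-disc {idle_disconnects})"
```

A PLC with 412 heartbeats and no failures renders as `412`; one with 10 sent, 2 failed, and 1 idle disconnect renders as `10 (fail 2, idle-disc 1)`.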
The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view. The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view.
@@ -317,9 +333,9 @@ curl -s http://mbproxy-host:8080/status.json |
Prometheus-style scrapers should poll `/status.json` directly and translate fields into their own metric names; the service does not expose Prometheus exposition format. Prometheus-style scrapers should poll `/status.json` directly and translate fields into their own metric names; the service does not expose Prometheus exposition format.
## Where the KPIs Live ## Scope of This Document
This document covers the **endpoint surface**: what is on the wire and how each field is computed. The **dashboard composition** — which counters roll up into which Grafana panels, alerting thresholds, fleet-aggregate definitions — lives in [`../kpi.md`](../kpi.md). Keep the two documents disjoint: when a new counter is added, list it here; when a new panel or rate calculation is added, add it to `kpi.md`. This document covers the **endpoint surface**: what is on the wire and how each field is computed. When a new counter is added, list it here.
## Related Documentation ## Related Documentation
@@ -331,4 +347,3 @@ This document covers the **endpoint surface**: what is on the wire and how each
- [`./Configuration.md`](./Configuration.md) — `Mbproxy.AdminPort` and other option keys. - [`./Configuration.md`](./Configuration.md) — `Mbproxy.AdminPort` and other option keys.
- [`./Troubleshooting.md`](./Troubleshooting.md) — using these counters to diagnose specific failure modes. - [`./Troubleshooting.md`](./Troubleshooting.md) — using these counters to diagnose specific failure modes.
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — event-id catalogue including `mbproxy.admin.bind.failed`. - [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — event-id catalogue including `mbproxy.admin.bind.failed`.
- [`../kpi.md`](../kpi.md) — dashboard catalog that consumes these counters.
+26 -3
@@ -2,7 +2,9 @@
Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps. Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.
The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the Windows Application Event Log under source `mbproxy`. The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log` on Windows, or `/var/log/mbproxy/mbproxy-<date>.log` on Linux. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the **Windows Application Event Log** (Windows Service) or the **local syslog / journal** (systemd) under source `mbproxy` — view the latter with `journalctl -t mbproxy` or `journalctl -u mbproxy`.
Paths and service commands below are written for Windows (`%ProgramData%`, `sc.exe`); the systemd equivalents are `/etc/mbproxy` + `/var/log/mbproxy` and `systemctl start|stop|status mbproxy`.
## Service Startup Failures ## Service Startup Failures
@@ -101,7 +103,7 @@ The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The l
Test-NetConnection -ComputerName <plc-ip> -Port 502 Test-NetConnection -ComputerName <plc-ip> -Port 502
``` ```
2. Verify the host/port in `appsettings.json` matches the PLC's actual settings (see `DL260/mbtcp_settings.JPG` for the as-deployed values). 2. Verify the host/port in `appsettings.json` matches the PLC's actual settings (see `docs/Reference/mbtcp_settings.JPG` for the as-deployed values).
3. If `Test-NetConnection` succeeds but the proxy still fails, inspect the upstream client count for that PLC on the status page — if it is at 4 and a new connect attempt fires, the ECOM cap is the cause. 3. If `Test-NetConnection` succeeds but the proxy still fails, inspect the upstream client count for that PLC on the status page — if it is at 4 and a new connect attempt fires, the ECOM cap is the cause.
4. If the PLC has rebooted, the supervisor retries automatically on the Polly backend-connect pipeline (3 attempts at 100ms / 500ms / 2000ms per upstream request). 4. If the PLC has rebooted, the supervisor retries automatically on the Polly backend-connect pipeline (3 attempts at 100ms / 500ms / 2000ms per upstream request).
@@ -124,7 +126,28 @@ The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The l
1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds. 1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair. 2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause; reduce the upstream client's poll interval below the middlebox idle timeout to keep traffic flowing. 3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause. Keepalive is enabled by default and should already be preventing this — confirm `Connection.Keepalive.Enabled` is `true` and that `BackendHeartbeatIdleMs` is comfortably below the middlebox idle timeout. See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md).
### Backend keepalive heartbeat failing
**Symptom.** A PLC's backend connection is torn down while idle — no client was actively talking to it. `plcs[].backend.backendIdleDisconnects` increments and the upstream clients (if any were attached) are cascaded.
**Where to look.**
- Log events: `mbproxy.keepalive.heartbeat.timeout` (Warning) followed by `mbproxy.keepalive.backend.idle_disconnect` (Information).
- Status fields: `plcs[].backend.backendHeartbeatsSent`, `backendHeartbeatsFailed`, `backendIdleDisconnects`.
**Root causes.**
- The ECOM is reachable at the IP layer but no longer answering Modbus (firmware hang, ECOM reset mid-session).
- The path died between heartbeats and the heartbeat was the first request to discover it — this is the feature working as intended (the failure is found during idle, not on a client request).
- `BackendHeartbeatProbeAddress` points at an address the PLC rejects. The default (0 = `V0`) is safe on DL205/DL260; only an operator override could break this.
**Remediation.**
1. A single idle-disconnect that recovers on the next client request needs no action — the proxy reconnected the path proactively.
2. Repeated idle-disconnects on one PLC mean it keeps going dark while idle. Investigate the device and the network path; the proxy has no state to repair.
3. If `backendHeartbeatsFailed` climbs but the PLC answers real client requests fine, check that `BackendHeartbeatProbeAddress` is a register the device actually serves.
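To check by hand whether a register is served, it can help to see what an FC03 qty=1 read looks like on the wire. The sketch below builds a standard Modbus TCP request frame with Python's `struct` module. `build_fc03_request` is a hypothetical helper, and the transaction id and unit id used in the example are assumptions; this section does not state which values the proxy's heartbeat uses.

```python
import struct


def build_fc03_request(transaction_id, unit_id, address, quantity=1):
    """Build a Modbus TCP ADU: 7-byte MBAP header followed by the FC03 PDU."""
    # PDU: function code 0x03, starting address, register count (all big-endian).
    pdu = struct.pack(">BHH", 0x03, address, quantity)
    # MBAP: transaction id, protocol id (always 0), byte count of unit id + PDU, unit id.
    mbap = struct.pack(">HHHB", transaction_id, 0, len(pdu) + 1, unit_id)
    return mbap + pdu
```

`build_fc03_request(1, 1, 0).hex()` gives `000100000006010300000001`: a 12-byte frame reading one register at address `0`. A device that rejects the address answers with function code `0x83` and an exception code instead of register data.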
### Request timeout watchdog firing ### Request timeout watchdog firing
+49 -3
@@ -6,9 +6,9 @@ The stable catalog of every `mbproxy.*` event name the service emits, with its l
The service uses [Serilog](https://serilog.net/) wired through the `Microsoft.Extensions.Logging` bridge. Three sinks are configured (see `src/Mbproxy/HostingExtensions.cs`): The service uses [Serilog](https://serilog.net/) wired through the `Microsoft.Extensions.Logging` bridge. Three sinks are configured (see `src/Mbproxy/HostingExtensions.cs`):
- **Console** — written to stdout for interactive `--console` runs and for the SCM stdout capture. - **Console** — stdout; captured by the Windows SCM or by systemd-journald.
- **Rolling file** under `%ProgramData%\mbproxy\logs\` (`mbproxy-<date>.log`). - **Rolling file** — `%ProgramData%\mbproxy\logs\` on Windows, `/var/log/mbproxy/` on Linux (`mbproxy-<date>.log`).
- **Windows Event Log** — only when running as a Windows Service, and only for events at `Error` and above (see `src/Mbproxy/Diagnostics/EventLogBridge.cs`). - **Platform diagnostic sink** — `Error`+ events only. `DiagnosticSinkSelector` picks it once at the composition root: the **Windows Application Event Log** under the SCM (`EventLogBridge`), **local syslog** under systemd (`SyslogBridge`), or none for interactive/dev runs.
Every event uses source-generated `[LoggerMessage]` definitions, so the property names below match the message template token-for-token. The default minimum level is `Information`; lower the floor for `Mbproxy.*` categories via the standard `Logging:LogLevel` configuration to surface `Debug` events such as the coalesce and cache traces. Every event uses source-generated `[LoggerMessage]` definitions, so the property names below match the message template token-for-token. The default minimum level is `Information`; lower the floor for `Mbproxy.*` categories via the standard `Logging:LogLevel` configuration to surface `Debug` events such as the coalesce and cache traces.
@@ -385,6 +385,51 @@ Fires whenever the entire per-PLC cache is wiped at once — primarily after a b
**Operator action:** none unless flushes happen on a tight loop, which would indicate the backend connection itself is unstable. **Operator action:** none unless flushes happen on a tight loop, which would indicate the backend connection itself is unstable.
## Keepalive
See [`../Architecture/Keepalive.md`](../Architecture/Keepalive.md) for the backend heartbeat design.
### mbproxy.keepalive.heartbeat.sent
**Level:** Debug &middot; **EventId:** 150 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/KeepaliveLogEvents.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Plc` | `string` | Configured PLC name. |
| `ProxyTxId` | `ushort` | Proxy-allocated TxId carrying the synthetic FC03 probe. |
| `Address` | `ushort` | Modbus address the probe reads (`BackendHeartbeatProbeAddress`). |
Fires each time the heartbeat loop issues a probe on an idle backend socket — at most one per `BackendHeartbeatIdleMs` per idle PLC.
**Operator action:** none. Debug-level; useful only when confirming the heartbeat is alive.
### mbproxy.keepalive.heartbeat.timeout
**Level:** Warning &middot; **EventId:** 151 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/KeepaliveLogEvents.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Plc` | `string` | Configured PLC name. |
| `ProxyTxId` | `ushort` | Proxy TxId of the unanswered probe. |
| `ElapsedMs` | `long` | Milliseconds from probe send to timeout. |
Fires when a heartbeat probe is not answered within `BackendRequestTimeoutMs` — the backend is connected but no longer answering Modbus.
**Operator action:** check the PLC and the network path. Paired with `mbproxy.keepalive.backend.idle_disconnect` for the same PLC.
### mbproxy.keepalive.backend.idle_disconnect
**Level:** Information &middot; **EventId:** 152 &middot; **Source:** `src/Mbproxy/Proxy/Multiplexing/KeepaliveLogEvents.cs`
| Property | Type | Meaning |
|----------|------|---------|
| `Plc` | `string` | Configured PLC name. |
| `ElapsedMs` | `long` | Milliseconds the failed heartbeat waited before the teardown. |
Fires when a failed heartbeat triggers a proactive backend teardown. Every attached upstream pipe is cascaded; clients reconnect on their next request. This is the keepalive feature doing its job — finding a dead path during idle instead of on the next real request.
**Operator action:** none if isolated. Repeated idle-disconnects on one PLC indicate it keeps going dark while idle — investigate the device or the network path.
## BCD Rewriter
### mbproxy.rewrite.partial_bcd
@@ -495,5 +540,6 @@ Lifecycle events (`startup.*`, `listener.*`, `admin.*`, `shutdown.*`, `config.re
- [Response Cache](../Architecture/ResponseCache.md) — context for the `mbproxy.cache.*` events.
- [Status Page](../Operations/StatusPage.md) — counter equivalents for the high-volume Debug-level events.
- [Read Coalescing](../Architecture/ReadCoalescing.md) — context for the `mbproxy.coalesce.*` events.
- [Keepalive](../Architecture/Keepalive.md) — context for the `mbproxy.keepalive.*` events.
- [BCD Rewriting](../Features/BcdRewriting.md) — context for the `mbproxy.rewrite.*` and `mbproxy.exception.passthrough` events.
- [Hot Reload](../Features/HotReload.md) — context for the `mbproxy.config.reload.*` events.
@@ -267,29 +267,3 @@ Test names:
`DL205_5th_TCP_connection_refused`,
`DL205_socket_closes_on_malformed_MBAP`.
## References
1. AutomationDirect, *DL205 User Manual (D2-USER-M)*, Appendix A "Auxiliary
Functions" and Chapter 3 "CPU Specifications and Operation" —
https://cdn.automationdirect.com/static/manuals/d2userm/d2userm.html
2. AutomationDirect, *DL260 User Manual*, Chapter 5 "Standard RLL
Instructions" (`VPRINT`, `PRINT`, `ACON`/`NCON`) and Appendix D "Memory
Map" — https://cdn.automationdirect.com/static/manuals/d2userm/d2userm.html
3. Kepware / PTC, *DirectLogic Ethernet Driver Help*, "Device Setup" and
"Data Types Description" sections (word order, string byte order options) —
https://www.kepware.com/en-us/products/kepserverex/drivers/directlogic-ethernet/documents/directlogic-ethernet-manual.pdf
4. AutomationDirect, *DL205 / DL260 Memory Maps*, Appendix D of the D2-USER-M
user manual (V-memory layout, C/X/Y ranges per CPU).
5. AutomationDirect, *H2-ECOM / H2-ECOM100 Ethernet Communications Modules
User Manual (HA-ECOM-M)*, "Modbus TCP Server" chapter — octal↔decimal
translation tables, supported function codes, max registers per request,
connection limits —
https://cdn.automationdirect.com/static/manuals/hxecomm/hxecomm.html
6. Inductive Automation, *Ignition Modbus Driver — Address Mapping*, word
order options (ABCD/CDAB/BADC/DCBA) —
https://docs.inductiveautomation.com/docs/8.1/ignition-modules/opc-ua/drivers/modbus-v2
7. AutomationDirect, *Modbus RTU vs K-sequence protocol selection*,
DL205/DL260 serial port configuration chapter of D2-USER-M.
8. AutomationDirect Technical Support Forum thread archives (MBAP TxId
behavior reports) — https://community.automationdirect.com/ (search:
"ECOM100 transaction id"). _Unconfirmed_ operator reports only.

+9 -8
@@ -4,9 +4,9 @@ The pymodbus DL205 simulator stands in for real DL205/DL260 hardware in the E2E
## Why a Simulator
`mbproxy` targets a fleet of AutomationDirect DL205/DL260 controllers that test machines do not have. The pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) already models the device-side quirks (BCD nibbles at known holding-register addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings) as concrete register seeds. The harness wraps that profile in an xUnit `IAsyncLifetime` fixture so every E2E test class opens against a fresh known-good DL-series target without manual setup. `mbproxy` targets a fleet of AutomationDirect DL205/DL260 controllers that test machines do not have. The pymodbus profile at [`../../tests/sim/dl205.json`](../../tests/sim/dl205.json) already models the device-side quirks (BCD nibbles at known holding-register addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings) as concrete register seeds. The harness wraps that profile in an xUnit `IAsyncLifetime` fixture so every E2E test class opens against a fresh known-good DL-series target without manual setup.
The device-side rationale for each seed (why HR 1072 is `0x1234`, why FC03 caps at 128, etc.) lives in [`../../DL260/dl205.md`](../../DL260/dl205.md). The harness exists to make that profile addressable from xUnit tests; it does not duplicate the device documentation. The device-side rationale for each seed (why HR 1072 is `0x1234`, why FC03 caps at 128, etc.) lives in [`../Reference/dl205.md`](../Reference/dl205.md). The harness exists to make that profile addressable from xUnit tests; it does not duplicate the device documentation.
## Harness Layout
@@ -72,7 +72,7 @@ if (_sim.SkipReason is not null)
Assert.Skip(_sim.SkipReason); Assert.Skip(_sim.SkipReason);
``` ```
The unit-test suite (any test without `[Trait("Category", "E2E")]`) runs without any Python at all. CI machines must have Python 3.10+ and PowerShell 7+; local developers running only unit tests need nothing extra. The phase-01 gate (see [`../plan/README.md`](../plan/README.md)) explicitly verifies that on a machine with Python and pymodbus installed, none of the smoke tests skip — a skip on a properly equipped CI machine is treated as an environment failure, not a test pass. The unit-test suite (any test without `[Trait("Category", "E2E")]`) runs without any Python at all. CI machines must have Python 3.10+ and PowerShell 7+; local developers running only unit tests need nothing extra. The unit-test suite's no-skip policy explicitly verifies that on a machine with Python and pymodbus installed, none of the smoke tests skip — a skip on a properly equipped CI machine is treated as an environment failure, not a test pass.
The skip reasons the fixture produces map cleanly onto the recovery action:
@@ -146,7 +146,7 @@ The connection-model rationale for why the multiplexer produces multi-frame recv
## Simulator Profile
`DL260/dl205.json` is the pymodbus server config. It seeds the registers the E2E tests assert against: `tests/sim/dl205.json` is the pymodbus server config. It seeds the registers the E2E tests assert against:
| Address | Width | Seeded value | Used to prove |
|---------|-------|--------------|---------------|
@@ -155,7 +155,7 @@ The connection-model rationale for why the multiplexer produces multi-frame recv
| HR 1072 | uint16 | `0x1234` (raw BCD nibbles) | Single-register FC03 BCD decode through the proxy |
| HR 1080/1081 | uint16 pair | CDAB-ordered 32-bit BCD | 32-bit BCD decode across the word pair |
The full register map and the device-side rationale for each entry live in [`../../DL260/dl205.md`](../../DL260/dl205.md). The full register map and the device-side rationale for each entry live in [`../Reference/dl205.md`](../Reference/dl205.md).
Two profile-level settings are load-bearing for the harness:
@@ -166,7 +166,7 @@ The `write` block in the JSON controls which ranges accept FC06/FC16. Writes out
## Alternate Profiles
The `MODBUS_SIM_PROFILE` environment variable selects an alternate profile alongside `dl205.json`. This is the seam for scenario-specific simulators — for example, a profile with `"type exception": true` to verify the proxy does not depend on the default lax pymodbus behaviour, or a profile that seeds a specific partial-overlap test case at a known address. The existing pattern is `DL260/DL205BcdQuirkTests.cs`, which already drives the simulator with profile-driven assertions. When a new scenario needs its own profile, drop the JSON alongside `dl205.json` and select it via the env var rather than swapping the default — the default profile is the contract for the smoke tests and `MultiplexerE2ETests` and should not be silently mutated. The `MODBUS_SIM_PROFILE` environment variable selects an alternate profile alongside `dl205.json`. This is the seam for scenario-specific simulators — for example, a profile with `"type exception": true` to verify the proxy does not depend on the default lax pymodbus behaviour, or a profile that seeds a specific partial-overlap test case at a known address. When a new scenario needs its own profile, drop the JSON alongside `dl205.json` and select it via the env var rather than swapping the default — the default profile is the contract for the smoke tests and `MultiplexerE2ETests` and should not be silently mutated.
## Running the Simulator Standalone
@@ -231,5 +231,6 @@ The read direction proves the proxy rewrote the response; the write direction pr
- [Connection Model](../Architecture/ConnectionModel.md) — why the multiplexer's shared backend connection produces the multi-frame condition that triggers pymodbus's framer quirk
- [Troubleshooting](../Operations/Troubleshooting.md) — hang-diagnosis pattern for tests that exceed their `[Fact(Timeout)]`
- [Log Events](../Reference/LogEvents.md) — `mbproxy.multiplex.request.timeout` is the production watchdog against TxId mis-echo
- [DL205/DL260 device quirks](../../DL260/dl205.md) — device-side rationale for every register the simulator profile seeds - [DL205/DL260 device quirks](../Reference/dl205.md) — device-side rationale for every register the simulator profile seeds
- [Phase plan README](../plan/README.md) — Test discipline section that codifies the 5 000 ms default and the `--blame-hang-timeout` rule
Test discipline: E2E tests default to a 5 000 ms `[Fact(Timeout)]`, and `dotnet test` is run with `--blame-hang-timeout` to capture a dump on any hang.
-306
@@ -1,306 +0,0 @@
# mbproxy — design plan
Architectural design for the `mbproxy` Modbus TCP proxy service: how it fronts ~54 AutomationDirect DirectLOGIC DL205/DL260 controllers, rewrites BCD tags bidirectionally inline, and recovers from listener and backend failures. Settled in a design Q&A on 2026-05-13.
**Status:** plan; no code yet. Each decision below is load-bearing — change deliberately, not by drift.
Context (what the service does and why it exists) lives in [`../CLAUDE.md`](../CLAUDE.md) under "What this is" and "Purpose: bidirectional BCD rewrite". This file is the *how*. Device quirks the design depends on live in [`../DL260/dl205.md`](../DL260/dl205.md).
Runtime shape: **.NET 10 Generic Host** worker service registered as a **Windows Service** via `Microsoft.Extensions.Hosting.WindowsServices`.
## Listener topology — per-PLC port (one port → one PLC)
The host opens **one `TcpListener` per PLC** on a distinct port. Upstream clients reach a specific PLC by connecting to its assigned proxy port; no protocol-level routing is needed.
```
Client A ──┐
Client B ──┼──→ proxy:5020 ──→ PLC #1  (10.0.1.1:502)
           ├──→ proxy:5021 ──→ PLC #2  (10.0.1.2:502)
           │        ...
           └──→ proxy:5073 ──→ PLC #54 (10.0.1.54:502)
```
## Connection model — single backend socket per PLC, multiplexed via MBAP TxId rewriting
Each PLC has **one persistent backend TCP socket**, owned by a `PlcMultiplexer`. Many upstream client connections share that single backend socket; the multiplexer distinguishes their in-flight requests by **rewriting the MBAP transaction ID** on each request and restoring each client's original TxId on the matching response. Implemented in [Phase 09](plan/09-txid-multiplexing.md); replaced the prior 1:1 per-upstream-client backend-socket model.
```
Client A ─┐
Client B ─┼─→ proxy:5020 ─[ PlcMultiplexer ]─→ PLC #1 (10.0.1.1:502)
Client C ─┘                      │             (one persistent socket)
                     CorrelationMap[proxyTxId]
                     TxIdAllocator (16-bit space)
```
- **Upstream → multiplexer**: each accepted upstream socket is wrapped in an `UpstreamPipe` (read loop + bounded response channel). The pipe's read loop hands every parsed MBAP frame to the multiplexer's `OnUpstreamFrameAsync`, which allocates a free 16-bit `proxyTxId`, stores an `InFlightRequest` in a `CorrelationMap` keyed by that proxyTxId, BCD-rewrites the request payload, overwrites the MBAP header's TxId field with `proxyTxId`, and enqueues the frame into the per-PLC outbound channel.
- **Multiplexer → backend**: a single backend writer task drains the outbound channel and sends each frame to the PLC over the shared socket. A single backend reader task reads MBAP frames back, looks each up by `proxyTxId` in the correlation map, BCD-rewrites the response, restores each interested party's original TxId, and routes the frame to that party's `UpstreamPipe._responseChannel`. The single-writer / single-reader invariant on the backend socket eliminates the need for socket-level synchronisation.
- **Per-request timeout watchdog**: a periodic task scans the correlation map at a quarter of `Connection.BackendRequestTimeoutMs` and times out any in-flight request whose response has not arrived. Timed-out requests get a Modbus exception 0x0B (Gateway Target Device Failed To Respond) delivered to their upstream party and free their allocator slot. Without this watchdog, a single lost or mis-routed response would leak a correlation entry forever and hang the upstream pipe indefinitely.
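The TxId rewrite described in the bullets above can be sketched as follows. This is a minimal, single-threaded illustration under stated assumptions, not the `PlcMultiplexer` code: the type names `MbapTxId` and `TxIdMultiplexSketch`, the integer client id, and the linear free-slot search are all hypothetical. The only facts taken from the text are that the MBAP transaction ID is overwritten with a proxy-allocated 16-bit value on the way out and restored on the way back.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical helper: the MBAP header starts with a 2-byte big-endian transaction ID.
public static class MbapTxId
{
    public static ushort Read(ReadOnlySpan<byte> frame) =>
        BinaryPrimitives.ReadUInt16BigEndian(frame);

    public static void Write(Span<byte> frame, ushort txId) =>
        BinaryPrimitives.WriteUInt16BigEndian(frame, txId);
}

// Hypothetical sketch of allocator + correlation map for one PLC. Not thread-safe.
public sealed class TxIdMultiplexSketch
{
    // proxyTxId -> (owning client, that client's original TxId)
    private readonly Dictionary<ushort, (int ClientId, ushort OriginalTxId)> _inFlight = new();
    private ushort _next;

    // Rewrites the request's TxId in place and records the mapping.
    public ushort OnRequest(int clientId, Span<byte> frame)
    {
        if (_inFlight.Count > ushort.MaxValue)
            throw new InvalidOperationException("16-bit TxId space exhausted");

        // Skip ids that are still in flight; ushort arithmetic wraps at 65535.
        while (_inFlight.ContainsKey(_next)) _next++;
        ushort proxyTxId = _next++;

        _inFlight[proxyTxId] = (clientId, MbapTxId.Read(frame));
        MbapTxId.Write(frame, proxyTxId);
        return proxyTxId;
    }

    // Restores the original TxId in place and returns the owning client,
    // or null when the proxyTxId is unknown (for example, already timed out).
    public int? OnResponse(Span<byte> frame)
    {
        ushort proxyTxId = MbapTxId.Read(frame);
        if (!_inFlight.Remove(proxyTxId, out var entry)) return null;
        MbapTxId.Write(frame, entry.OriginalTxId);
        return entry.ClientId;
    }
}
```

Two clients that both send TxId `0x1234` receive distinct proxy TxIds on the backend socket, and each response comes back carrying `0x1234` again. A response with an unknown proxy TxId returns `null`, which is the case the watchdog above exists to bound.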
**Operational consequence (replaces the prior 4-client warning).** The H2-ECOM100's 4-concurrent-TCP-client cap (see [`../DL260/dl205.md`](../DL260/dl205.md) → Behavioral Oddities) no longer limits upstream-side connection count — the proxy holds exactly one slot per PLC regardless of how many upstream clients are attached. The wire-rate ceiling is unchanged (the ECOM internally serializes requests at ~2–10 ms per scan); the multiplexer shifts where serialization happens (proxy outbound queue vs PLC accept queue) rather than adding throughput.
> ⚠ **Backend disconnect cascades upstream.** When the backend socket dies (PLC reboot, network partition, middlebox idle drop), the multiplexer closes every attached upstream pipe in the same cycle and increments `BackendDisconnectCascades` by the upstream count. Clients reconnect on their own next request and the multiplexer Polly-reconnects to the backend on the first upstream frame.
> ⚠ **pymodbus 3.13.0 simulator quirk (test-only).** The pymodbus simulator's `ServerRequestHandler` stores a single `last_pdu` per connection and schedules deferred handlers via `asyncio.call_soon`. Two MBAP frames arriving in the same recv buffer (as the multiplexer can produce on its shared backend connection) overwrite `last_pdu` before the first handler runs, and both responses then carry the later request's TxId. The real DL260 ECOM does not suffer this — it echoes per-request TxIds correctly. Multiplexer correctness under truly concurrent backend traffic is therefore proved against a stub backend in `PlcMultiplexerTests`; the E2E suite paces requests to keep pymodbus in known-good single-PDU mode. The per-request watchdog is the production defence against any backend (real or simulated) that mis-echoes a TxId.
## Configuration — single `appsettings.json`
All configuration lives in one file, loaded via `Microsoft.Extensions.Configuration` and bound to typed POCOs. No sidecar YAML/CSV.
```jsonc
{
  "Mbproxy": {
    "BcdTags": {
      "Global": [
        { "Address": 1072, "Width": 16 },
        { "Address": 1080, "Width": 32 }
      ]
    },
    "Plcs": [
      {
        "Name": "Line1-Mixer",
        "ListenPort": 5020,
        "Host": "10.0.1.1",
        "BcdTags": {
          "Add": [ { "Address": 1200, "Width": 32 } ],
          "Remove": [ 1080 ]
        }
      },
      { "Name": "Line1-Conveyor", "ListenPort": 5021, "Host": "10.0.1.2" }
      // ... 54 PLC rows
    ],
    "AdminPort": 8080,
    "Connection": {
      "BackendConnectTimeoutMs": 3000,
      "BackendRequestTimeoutMs": 3000
    },
    "Resilience": {
      "BackendConnect": { "MaxAttempts": 3, "BackoffMs": [100, 500, 2000] },
      "ListenerRecovery": { "InitialBackoffMs": [1000, 2000, 5000, 15000, 30000], "SteadyStateMs": 30000 }
    },
    "Cache": {
      "AllowLongTtl": false,        // gate for any tag CacheTtlMs > 60_000
      "MaxEntriesPerPlc": 1000,
      "EvictionIntervalMs": 5000
    }
  }
}
```
A BCD tag may optionally carry `CacheTtlMs` (default 0 = off); a `PlcOptions` entry may optionally carry `DefaultCacheTtlMs` (default 0 = off). Resolution order: explicit per-tag → per-PLC default → 0.
**Hybrid tag resolution.** For each PLC, the effective BCD tag list is `Global ∪ Add ∖ Remove`. `Remove` matches by address; if the same address appears in both `Add` and `Global` the `Add` entry wins (this is how a width override is expressed). Validation at startup must:
- reject duplicate addresses within a single PLC's resolved list
- reject 32-bit entries that would have their high register overlap a separate 16-bit entry
- warn on `Remove` entries that don't match any global tag (probably stale config)
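A minimal sketch of that resolution and its three validation rules, under stated assumptions: `TagResolverSketch` is a hypothetical name, `Remove` is applied to the global list before `Add` is overlaid (one reading of the rule above), and "reject" is modelled as an exception. The `BcdTag` record is repeated here only so the block stands alone.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record BcdTag(ushort Address, byte Width); // Width is 16 or 32

// Hypothetical sketch, not the service's validator.
public static class TagResolverSketch
{
    public static (IReadOnlyList<BcdTag> Tags, IReadOnlyList<string> Warnings) Resolve(
        IEnumerable<BcdTag> global, IEnumerable<BcdTag> add, IEnumerable<ushort> remove)
    {
        var warnings = new List<string>();
        var byAddress = new SortedDictionary<ushort, BcdTag>();

        foreach (var tag in global)
        {
            if (byAddress.ContainsKey(tag.Address))
                throw new ArgumentException($"Duplicate global address {tag.Address}");
            byAddress[tag.Address] = tag;
        }

        // Remove matches by address; a miss is only a warning (probably stale config).
        foreach (ushort address in remove)
            if (!byAddress.Remove(address))
                warnings.Add($"Remove {address} matches no global tag");

        // Add wins over Global at the same address (this is how a width override is expressed).
        var seenAdd = new HashSet<ushort>();
        foreach (var tag in add)
        {
            if (!seenAdd.Add(tag.Address))
                throw new ArgumentException($"Duplicate Add address {tag.Address}");
            byAddress[tag.Address] = tag;
        }

        // Reject a 32-bit entry whose high register is also a separate 16-bit entry.
        foreach (var tag in byAddress.Values.Where(t => t.Width == 32 && t.Address < ushort.MaxValue))
            if (byAddress.TryGetValue((ushort)(tag.Address + 1), out var other) && other.Width == 16)
                throw new ArgumentException(
                    $"32-bit tag at {tag.Address} overlaps 16-bit tag at {other.Address}");

        return (byAddress.Values.ToList(), warnings);
    }
}
```

With the sample configuration above, `Line1-Mixer` resolves to the 16-bit tag at 1072 plus the 32-bit tag at 1200, and the global 32-bit tag at 1080 is dropped without a warning.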
## Configuration hot-reload
`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`, and all consumers read via `IOptionsMonitor<MbproxyOptions>` so a save to the config file propagates without restarting the service. Each change kind has explicit reconcile semantics:
| Change in appsettings | Propagation |
|-----------------------|-------------|
| `BcdTags.Global` add/remove/width | Rewriter dereferences the monitor per-PDU. Next PDU sees the new map; in-flight reads/writes are not retroactively touched. |
| `Plcs[i].BcdTags.{Add,Remove}` | Same — next-PDU resolution. |
| New `Plcs[i]` entry | Listener supervisor binds the new port subject to the same eager-then-auto-recover policy. |
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream client connections for that PLC. |
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
| `Connection.Backend*TimeoutMs` | Next backend connect/request uses the new value. In-flight operations keep their already-applied timeout. |
| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` (Phase 11) | Tag-map reseat for the affected PLC drops the entire PLC cache; entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented in v1. |
| `Cache.AllowLongTtl`, `Cache.MaxEntriesPerPlc`, `Cache.EvictionIntervalMs` (Phase 11) | `AllowLongTtl` is enforced on next reload-validation; `MaxEntriesPerPlc` applies to subsequent inserts (existing entries not pruned); `EvictionIntervalMs` is read by each fresh eviction loop. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. |
Every accepted reload emits `mbproxy.config.reload.applied` at Information with a summary of which PLCs were added/removed and the size of the tag-list delta.
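The all-or-nothing reload rule in the table's last row can be sketched like this. It is an illustration under assumptions, not the service's validator: `PlcSketch` and `ReloadValidationSketch` are hypothetical types, only two of the listed rejection causes (duplicate ports and long TTLs without `Cache.AllowLongTtl`) are modelled, and the logging is omitted.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical, trimmed-down view of one PLC entry: name, port, and its tags' CacheTtlMs values.
public sealed record PlcSketch(string Name, int ListenPort, IReadOnlyList<int> TagCacheTtlsMs);

public static class ReloadValidationSketch
{
    public const int LongTtlThresholdMs = 60_000;

    // Returns every validation error; an empty list means the candidate config is acceptable.
    public static IReadOnlyList<string> Validate(IReadOnlyList<PlcSketch> plcs, bool allowLongTtl)
    {
        var errors = new List<string>();

        foreach (var group in plcs.GroupBy(p => p.ListenPort).Where(g => g.Count() > 1))
            errors.Add($"Duplicate ListenPort {group.Key}: {string.Join(", ", group.Select(p => p.Name))}");

        if (!allowLongTtl)
            foreach (var plc in plcs)
                foreach (int ttl in plc.TagCacheTtlsMs.Where(t => t > LongTtlThresholdMs))
                    errors.Add($"{plc.Name}: CacheTtlMs {ttl} requires Cache.AllowLongTtl = true");

        return errors;
    }

    // A single error rejects the whole reload: the current config stays in effect.
    public static T Apply<T>(T current, T candidate, IReadOnlyList<string> errors) =>
        errors.Count == 0 ? candidate : current;
}
```

Collecting all errors before deciding, rather than stopping at the first, lets a single rejected-reload log entry name every problem in the file.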
## BCD tag shape
```csharp
public sealed record BcdTag(ushort Address, byte Width); // Width ∈ { 16, 32 }
```
- **16-bit BCD** — one register holds 4 BCD digits (0–9999). Wire value `0x1234` decodes to decimal 1234.
- **32-bit BCD** — a CDAB-ordered register pair at `Address` and `Address+1`. The register at `Address` holds the **low 4 digits**; the register at `Address+1` holds the **high 4 digits**. Decoded decimal = `high * 10000 + low`. This follows directly from DirectLOGIC's CDAB word order (see [`../DL260/dl205.md`](../DL260/dl205.md) → Word Order).
- **Unsigned only.** DL205/DL260 BCD is non-negative in the default ladder pattern; the proxy does not implement signed BCD.
- **Holding-register and input-register addresses share the same space.** The rewriter applies the configured tag list against both FC03 and FC04 reads.
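The two encodings above can be illustrated with a small codec. This is a sketch under assumptions: `BcdCodecSketch` is a hypothetical name, and throwing on a non-decimal nibble or an out-of-range value is a choice made here, since this document does not say how the proxy treats invalid BCD.

```csharp
using System;

// Hypothetical codec for unsigned DirectLOGIC-style BCD.
public static class BcdCodecSketch
{
    // 0x1234 (four BCD nibbles) -> 1234
    public static ushort Decode16(ushort raw)
    {
        int value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) throw new FormatException($"0x{raw:X4} is not valid BCD");
            value = value * 10 + nibble;
        }
        return (ushort)value;
    }

    // 1234 -> 0x1234
    public static ushort Encode16(ushort value)
    {
        if (value > 9999) throw new ArgumentOutOfRangeException(nameof(value));
        int raw = 0;
        for (int shift = 0; shift <= 12; shift += 4)
        {
            raw |= (value % 10) << shift; // lowest decimal digit goes into the lowest nibble
            value /= 10;
        }
        return (ushort)raw;
    }

    // CDAB pair: the register at Address holds the low 4 digits, Address+1 the high 4 digits.
    public static uint Decode32(ushort lowRegister, ushort highRegister) =>
        (uint)Decode16(highRegister) * 10000u + Decode16(lowRegister);

    public static (ushort Low, ushort High) Encode32(uint value)
    {
        if (value > 99_999_999) throw new ArgumentOutOfRangeException(nameof(value));
        return (Encode16((ushort)(value % 10000)), Encode16((ushort)(value / 10000)));
    }
}
```

For example, registers `0x5678` at `Address` and `0x1234` at `Address+1` decode to `1234 * 10000 + 5678`, and `Encode32` of that value returns the same pair in low-then-high order.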
## Read coalescing (Phase 10)
After Phase 10, FC03 / FC04 requests are additionally subject to **in-flight read coalescing** before they reach the backend. When two or more upstream clients send the same `(unitId, fc, startAddress, qty)` tuple within the in-flight window of an already-routed request, the multiplexer attaches each late arrival to the existing `InFlightRequest.InterestedParties` list instead of opening a second backend round-trip. The single backend response is fanned out to every attached party with each party's original MBAP TxId restored individually.
Properties:
- **Zero post-response staleness.** Coalescing operates entirely between "first request sent to backend" and "response received from backend" (microseconds to ~10 ms typical). Once the response is fanned out, the coalescing entry dies. Coalescing alone is NOT a cache layer — the value each upstream sees is the same value an uncoalesced request would have returned within the PLC's scan-time precision. (Phase 11 layers an opt-in cache on top — see "Response cache" below.)
- **Only FC03 / FC04.** Writes (FC06 / FC16) are non-idempotent on BCD tags and never coalesced. Different function codes never share a `CoalescingKey` even at the same address (FC03 and FC04 read different Modbus tables). Different `unitId` bytes never coalesce (different PLC personalities behind a shared socket).
- **Bounded fan-out via `MaxParties`** (default 32 in `Mbproxy.Resilience.ReadCoalescing.MaxParties`). Once an entry has `MaxParties` interested clients, the next arrival opens a fresh entry — bounds the response-fanout cost per entry at O(MaxParties) and shields the backend reader task from pathological pile-on.
- **Hot-reloadable on/off.** `Mbproxy.Resilience.ReadCoalescing.Enabled` defaults to `true`. Flipping it to `false` at runtime leaves running coalesced entries to drain naturally; subsequent FC03/04 requests take the Phase-9 (one round-trip per upstream request) path.
- **Transparency contract preserved.** Each upstream client still sees its own original MBAP TxId on the response. The BCD rewriter runs once on the shared response buffer; per-party copies are only made when fan-out has more than one party.
Counter accounting balance (per snapshot): `coalescedHitCount + coalescedMissCount` equals the total FC03 + FC04 requests seen since the multiplexer was constructed. Both counters increment regardless of whether the coalescing feature is enabled — `coalescedHitCount` is 0 when disabled, but every read still increments `coalescedMissCount`.
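The attach-or-open decision and the hit/miss accounting can be sketched as below. This is an illustration only: `InFlightRead` and `ReadCoalescerSketch` are hypothetical, parties are plain integers, the caller is assumed to route only FC03/FC04 reads here, and there is no locking or `Enabled` switch. The key tuple and the `MaxParties` default of 32 come from the text above.

```csharp
using System.Collections.Generic;

// Same (unitId, fc, startAddress, qty) tuple the text describes.
public readonly record struct CoalescingKey(byte UnitId, byte FunctionCode, ushort StartAddress, ushort Quantity);

// Hypothetical stand-in for one backend round-trip and the clients waiting on it.
public sealed class InFlightRead
{
    public InFlightRead(CoalescingKey key) => Key = key;
    public CoalescingKey Key { get; }
    public List<int> Parties { get; } = new();
}

public sealed class ReadCoalescerSketch
{
    private readonly int _maxParties;
    // Key -> the entry that late arrivals may still attach to.
    private readonly Dictionary<CoalescingKey, InFlightRead> _attachable = new();

    public ReadCoalescerSketch(int maxParties = 32) => _maxParties = maxParties;

    public long HitCount { get; private set; }
    public long MissCount { get; private set; }

    // SendToBackend is true only for the party that opened the entry.
    public (InFlightRead Entry, bool SendToBackend) Attach(CoalescingKey key, int partyId)
    {
        if (_attachable.TryGetValue(key, out var entry) && entry.Parties.Count < _maxParties)
        {
            entry.Parties.Add(partyId);
            HitCount++;
            return (entry, false);
        }

        // First arrival, or the existing entry is full: open a fresh entry.
        var fresh = new InFlightRead(key);
        fresh.Parties.Add(partyId);
        _attachable[key] = fresh;
        MissCount++;
        return (fresh, true);
    }

    // Called when the backend response arrives; returns the parties to fan out to.
    public IReadOnlyList<int> Complete(InFlightRead entry)
    {
        if (_attachable.TryGetValue(entry.Key, out var current) && ReferenceEquals(current, entry))
            _attachable.Remove(entry.Key);
        return entry.Parties;
    }
}
```

Every call to `Attach` increments exactly one of the two counters, which is the balance property stated above. A full entry stays in flight until `Complete` is called on it, but stops accepting new parties as soon as a fresh entry replaces it in the lookup.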
## Response cache (Phase 11) — opt-in bounded-staleness cache
**⚠ Design-contract pivot.** Through Phase 10 the proxy is *purely transparent* — every upstream read corresponds 1:1 to a recent backend round-trip (or, with Phase 10, to a peer's in-flight backend round-trip in the same microseconds-to-milliseconds window). Phase 11 changes that contract: the proxy gains an **opt-in per-tag response cache** that may serve upstream FC03/FC04 reads from in-process memory with bounded staleness up to the operator-configured `CacheTtlMs`. **The cache is OFF by default** (`CacheTtlMs = 0` on every BCD tag unless explicitly set); a fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. Operators opt tags in explicitly as their acknowledgement of the staleness window.
### Cache contract
- **Per-tag TTL.** Each BCD tag carries an optional `CacheTtlMs` (in `BcdTagOptions`). `CacheTtlMs = 0` (the default) disables caching for that tag. The TTL resolution order is **explicit per-tag → per-PLC `DefaultCacheTtlMs` → 0**.
- **Multi-tag read range: effective TTL = `min(TTLs)`.** When a single FC03/FC04 read covers multiple configured tags, the cache uses the smallest TTL among them. If any tag in the read range has `CacheTtlMs = 0`, the **whole read is uncached** — the conservative-by-design choice.
- **Lookup order: cache → coalesce → backend.** A cache hit short-circuits Phase 10's coalescing entirely. Only on a miss does the request engage coalescing (Phase 10) and then the Phase 9 backend send path.
- **Cache populates on demand only.** No polling, no predictive prefetch. Entries are created in the backend reader task **after** the BCD rewriter has run on the response — the cache stores **POST-rewriter bytes**, so hits never re-invoke the rewriter (CPU win + behaviour-stable).
- **Write invalidation by ADDRESS RANGE OVERLAP.** A successful FC06 / FC16 response (non-exception) invalidates every cached FC03/FC04 entry whose address range `[StartAddress, StartAddress + Qty)` overlaps the write range. A write to register 105 invalidates a cached `[100..110]` read but not a cached `[200..210]` read. Exception responses do not invalidate (the write didn't take effect).
- **Different unit IDs never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3,4})`.
- **Cache survives backend disconnects.** A cached entry's data was valid when stored; a disconnect does not retroactively invalidate it. Invalidations during a `recovering` listener state are skipped (the write never reached the backend, the cached read remains valid).
- **No persistence.** Process restart wipes the cache. No file/Redis backing store, no last-known-good snapshot.
- **Hot-reload flushes the entire PLC cache.** Any tag-list change to a PLC drops every cached entry for that PLC. Per-tag flush granularity is intentionally not done in v1 — the simple correctness move is "any tag-list reload → drop all entries for the affected PLC and let them re-populate."
- **TTL > 60 s requires `Cache.AllowLongTtl = true`.** Validation rejects reloads that set `CacheTtlMs > 60_000` without this opt-in. Prevents "left at 1 hour by accident" deployments.
- **LRU-bounded capacity.** Each PLC's cache is capped at `Cache.MaxEntriesPerPlc` (default 1000). When full, the next insert evicts the least-recently-used entry. A background eviction loop (interval `Cache.EvictionIntervalMs`, default 5000) also scans for expired entries.
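Three of the rules above reduce to small pure functions, sketched here under assumptions: `CachePolicySketch` is a hypothetical name, an unset TTL is modelled as `null` so that "explicit per-tag" can be told apart from "not configured", and a read that covers no configured tag is treated as uncached.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helpers for the cache contract; not the ResponseCache implementation.
public static class CachePolicySketch
{
    // Resolution order: explicit per-tag -> per-PLC DefaultCacheTtlMs -> 0 (off).
    public static int ResolveTtlMs(int? tagTtlMs, int? plcDefaultTtlMs) =>
        tagTtlMs ?? plcDefaultTtlMs ?? 0;

    // Multi-tag read: the smallest TTL wins, so any 0 makes the whole read uncached.
    public static int EffectiveTtlMs(IEnumerable<int> resolvedTtlsMs) =>
        resolvedTtlsMs.DefaultIfEmpty(0).Min();

    // Half-open ranges [start, start + qty) overlap when each starts before the other ends.
    public static bool Overlaps(int startA, int qtyA, int startB, int qtyB) =>
        startA < startB + qtyB && startB < startA + qtyA;
}
```

A write to register 105 (`start = 105, qty = 1`) overlaps a cached read starting at 100 with quantity 10, so that entry would be invalidated, while a cached read starting at 200 is left alone.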
### Cache and the rewriter
The BCD rewriter runs once on the cache-miss path (the backend reader task decodes the response and stores the decoded bytes in the cache). Cache hits return pre-decoded bytes directly without re-invoking the rewriter — this is both a CPU optimisation and a correctness guarantee (any future rewriter change would not retroactively re-transform an entry that was decoded against an earlier rewriter version).
### Hot-reload semantics
| Change | Cache behaviour |
|--------|----------------|
| Tag's `CacheTtlMs` changed (any direction, 0 → N, N → 0, N → M) | Entire PLC cache is flushed; entries re-populate on demand under the new TTL. |
| New PLC added / removed | New PLC starts with empty cache; removed PLC's cache is discarded with the multiplexer. |
| `Cache.AllowLongTtl` flipped | Validation runs on next reload; existing entries unaffected. |
| `Cache.MaxEntriesPerPlc` changed | Existing entries unaffected; cap applies to subsequent inserts. |
| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues until next dispose; subsequent loops use new interval. |
### Counter accounting
- `cacheHitCount` — FC03/FC04 requests served from the cache.
- `cacheMissCount` — FC03/FC04 requests that fell through to the coalescing/backend path. (Cache hit + Cache miss = total FC03/FC04 requests that were cache-eligible, i.e. whose resolved TTL was > 0; reads whose effective TTL is 0 increment neither.)
- `cacheInvalidations` — count of cache entries invalidated by FC06/FC16 write responses.
- `cacheEntryCount` — point-in-time snapshot of `ResponseCache.Count` (Tier-2 memory-watch KPI).
- `cacheBytes` — point-in-time approximation of cached PDU bytes (Tier-2 memory-watch KPI).
## Rewriter — function code scope
The rewriter inspects and rewrites payloads only for these function codes; every other FC (coils, discrete inputs, diagnostics, exception responses) passes through byte-for-byte:
| FC | Direction | Action |
|----|----------------|-----------------------------------------------------------------------|
| 03 | request + response | FC03 requests may be coalesced with peers before reaching the backend (see Phase-10 section above); response re-encodes covered BCD slots from raw nibbles → binary integer |
| 04 | request + response | Same coalescing eligibility as FC03; response re-encoding the same as FC03 (input-register table also surfaces V-memory) |
| 06 | request | Re-encode binary integer → BCD nibbles before forwarding |
| 06 | response | Decode BCD nibbles → binary integer on the echo (clients validate that the echoed value equals the value they sent; without this, NModbus-style clients throw on the round-trip) |
| 16 | request | Per-register over the configured slots, then forward |
**Partial-overlap policy.** A request that touches only ONE register of a configured 32-bit BCD pair (qty=1 at the low addr, or any read/write of the high addr alone) **passes through raw** with a `mbproxy.rewrite.partial_bcd` warning. The proxy never synthesises a Modbus exception for a partial-overlap — that response code is reserved for transport failure.
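Detecting the partial-overlap case amounts to comparing the request's register range with the tag's one- or two-register span. A minimal sketch, with the hypothetical names `BcdCoverage` and `PartialOverlapSketch`:

```csharp
using System;

public enum BcdCoverage { None, Full, Partial }

// Hypothetical classifier; the proxy's own rewriter is not shown in this document.
public static class PartialOverlapSketch
{
    // Classifies how the request range [start, start + qty) covers one configured BCD tag.
    public static BcdCoverage Classify(ushort tagAddress, byte tagWidth, ushort start, ushort qty)
    {
        int tagRegisters = tagWidth == 32 ? 2 : 1;
        int tagEnd = tagAddress + tagRegisters;
        int end = start + qty;

        // Number of the tag's registers that fall inside the request range.
        int covered = Math.Min(end, tagEnd) - Math.Max(start, tagAddress);
        if (covered <= 0) return BcdCoverage.None;
        return covered == tagRegisters ? BcdCoverage.Full : BcdCoverage.Partial;
    }
}
```

Under the policy above, `Full` is rewritten, `None` is left untouched, and `Partial` is forwarded raw with the `mbproxy.rewrite.partial_bcd` warning. A 16-bit tag can never be `Partial` because it spans a single register.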
## Failure modes — transparent pass-through with Polly-bounded backend connect
- **PLC returns a Modbus exception (codes 01–04)** → forward verbatim with the original MBAP transaction ID. The client sees the real DL205/DL260 exception.
- **Backend connect refused or initial connect timeout** → retry under a Polly resilience pipeline: 3 attempts at 100ms / 500ms / 2000ms backoff (tuned via `Resilience.BackendConnect`). If all attempts fail, the multiplexer closes the upstream client connection that triggered the connect.
- **Backend mid-stream broken socket** → the multiplexer's reader/writer task throws; the backend tear-down path cancels both tasks, drains the correlation map, and **cascades the disconnect by closing every attached upstream pipe**. The next upstream request to any pipe triggers a fresh backend connect through the Polly pipeline. `BackendDisconnectCascades` counter records the upstream-pipe count at each cascade event.
- **Backend request timeout** → the per-request watchdog times out any correlation entry older than `Connection.BackendRequestTimeoutMs`, delivers Modbus exception 0x0B (Gateway Target Device Failed To Respond) with the original TxId to the upstream party, and frees the proxy TxId. **No mid-request retries** — FC06 / FC16 are non-idempotent on BCD tags (a partial-applied multi-register write could leave a 32-bit BCD tag mid-transition), so every in-flight request is one-shot. The client interprets the 0x0B as a transport failure and reconnects through its normal path.
- **Partial-BCD overlap** → forward raw + warn (see Rewriter section).
- **One slow PLC does not stall the rest of the fleet.** Each PLC has its own `PlcMultiplexer`, with its own backend socket, correlation map, and outbound channel; per-PLC failures are local. A slow or dead backend on one PLC only impacts that PLC's clients.
- **Cache during backend recovery (Phase 11).** Cache hits remain valid during a `recovering` listener state — the data was fresh when cached, and recovery only affects future requests. Writes that arrive during recovery never reach the backend, so the invalidation never happens. This is consistent: the write also didn't take effect on the PLC. Cached entries simply remain until their TTL expires.
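For the backend request timeout case above, the frame delivered upstream is an ordinary Modbus TCP exception response. A minimal Python sketch of that frame follows; `build_timeout_response` is a hypothetical helper name, not the proxy's API, and the layout follows the standard MBAP header.

```python
import struct

GATEWAY_TARGET_FAILED = 0x0B  # "Gateway Target Device Failed To Respond"


def build_timeout_response(tx_id: int, unit_id: int, function_code: int) -> bytes:
    """Build a Modbus TCP exception frame for a timed-out request.

    MBAP header: transaction id, protocol id (always 0), length, unit id.
    The length field counts the unit id plus the two PDU bytes, hence 3.
    """
    return struct.pack(
        ">HHHBBB",
        tx_id,                  # original upstream TxId, echoed back unchanged
        0,                      # protocol id
        3,                      # unit id + function byte + exception code
        unit_id,
        function_code | 0x80,   # high bit marks an exception response
        GATEWAY_TARGET_FAILED,
    )
```

For example, a timed-out FC03 request with TxId `0x1234` on unit 1 yields the nine bytes `12 34 00 00 00 03 01 83 0B`.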
## Startup posture — eager, continue on per-port failure
At startup the host attempts to bind **all 54 listen sockets up front**. Each failure (port already in use, invalid IP, malformed PLC entry) is logged at Error and handed off to the listener supervisor (next section). The service proceeds with whichever PLCs bound on the first attempt; the rest converge in the background. Monitoring should alert on `mbproxy.startup.bind.failed` so missing PLCs aren't silently dropped, and watch for `mbproxy.listener.recovered` to confirm late binds eventually succeeded.
## Listener auto-recovery (Polly-backed supervisor)
Each PLC's listener runs under a **supervisor task** that owns its bind lifecycle. If a bind fails at startup, or if a listener faults at runtime (port stolen by another process, transient OS network reset), the supervisor reattempts via a Polly retry pipeline: 5 attempts at 1s / 2s / 5s / 15s / 30s backoff, then steady-state retries every 30s indefinitely (tuned via `Resilience.ListenerRecovery`). Each attempt logs at Debug; the bind that finally succeeds emits one `mbproxy.listener.recovered` Information event.
While a supervisor is between attempts, the corresponding PLC is reported as `listener.state = recovering` on the status page. Hot-reload uses the same supervisor to bring newly-added PLCs online and to tear down removed ones — there is exactly one code path for "bring up a listener" and one for "shut a listener down."
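The retry timing described in this section can be sketched as an infinite delay sequence. This Python generator only illustrates the schedule; it is not Polly, and `listener_recovery_delays` is a hypothetical name. The default values are the ones quoted above.

```python
from itertools import chain, islice, repeat
from typing import Iterator


def listener_recovery_delays(
    initial=(1, 2, 5, 15, 30), steady_state: int = 30
) -> Iterator[int]:
    """Yield retry delays in seconds: the initial ladder, then steady_state forever."""
    return chain(initial, repeat(steady_state))


# First seven delays a supervisor would wait between bind attempts.
first_seven = list(islice(listener_recovery_delays(), 7))
```

Because the sequence never ends, a supervisor that consumes it keeps retrying every 30 s until the bind succeeds or the listener is removed.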
## Logging — Serilog, structured, console + rolling file
Serilog wired through the Microsoft.Extensions.Logging bridge:
- **Console sink** for interactive `--console` runs.
- **Rolling-file sink** under `%ProgramData%\mbproxy\logs\`.
- **Windows Event Log sink** for Error+ events when the service is running under `Microsoft.Extensions.Hosting.WindowsServices`.
- **Default level** Information. Properties (`Plc`, `RemoteEp`, etc.) are emitted per message via `[LoggerMessage]` templates so log lines are greppable across the fleet.
Event names follow the convention `mbproxy.<area>.<noun>[.<state>]` and are part of the operator contract — once shipped they don't churn (renames require a major version bump). The full catalog of stable event names, their levels, properties, and operator implications lives in [`Reference/LogEvents.md`](Reference/LogEvents.md); each `*LogEvents.cs` static class (e.g. `MultiplexerLogEvents`, `CoalescingLogEvents`, `CacheLogEvents`, `RewriterLogEvents`) is the source of truth.
## Status page — read-only HTTP endpoint
A separate **Kestrel-hosted minimal API** runs on `Mbproxy.AdminPort` (default `8080`, distinct from the Modbus listen ports). The endpoint set is intentionally narrow — read-only telemetry; **no admin actions** (kick client, force reload, restart listener) are exposed:
- `GET /` — single self-contained HTML page rendering a table of all configured PLCs with their state and live counters. Auto-refreshes every 5s via a meta-refresh tag (no JS bundle, no external assets).
- `GET /status.json` — the same data as JSON for monitoring scrapers.
Authentication is assumed to live at the network layer (trusted internal segment behind a firewall). Surface that assumption in deployment docs when they exist.
**Service-wide fields:**
| Field | Meaning |
|-------|---------|
| `service.uptime` | Seconds since service start |
| `service.version` | Assembly informational version |
| `service.config.lastReloadUtc` | Timestamp of last accepted hot-reload (or `null`) |
| `service.config.reloadCount` | Number of reloads accepted since start |
| `service.config.reloadRejectedCount` | Number of reloads rejected since start |
| `listeners.bound` / `listeners.configured` | Bound listener count vs configured PLC count |
**Per-PLC fields** (one row per `Plcs[i]`):
| Field | Meaning |
|-------|---------|
| `name`, `host`, `listenPort` | Identity from config |
| `listener.state` | `bound` / `recovering` / `stopped` |
| `listener.lastBindError` | Most recent bind failure message (when `recovering`) |
| `listener.recoveryAttempts` | Polly retry count since last successful bind |
| `clients.connected` | Currently connected upstream client count |
| `clients.remoteEndpoints` | Array of `{ remote, connectedAtUtc, pdusForwarded }` |
| `pdus.forwarded` | Total PDUs (request+response) forwarded since start |
| `pdus.byFc` | `{ fc03, fc04, fc06, fc16, other }` request counts |
| `pdus.rewrittenSlots` | Count of register slots BCD-rewritten |
| `pdus.partialBcdWarnings` | Count of partial-overlap pass-throughs |
| `backend.connects.success` / `backend.connects.failed` | Polly-final-result counters |
| `backend.exceptions.byCode` | `{ "01": n, "02": n, "03": n, "04": n }` |
| `backend.lastRoundTripMs` | EWMA of recent successful round-trip times |
| `backend.coalescedHitCount` | FC03/04 requests that attached to an already-in-flight peer (Phase 10) |
| `backend.coalescedMissCount` | FC03/04 requests that opened a fresh backend round-trip (Phase 10). `Hit + Miss` = total FC03/04 requests |
| `backend.coalescedResponseToDeadUpstream` | Coalesced fan-out responses skipped because the attached upstream had already disconnected (Phase 10) |
| `backend.cacheHitCount` | FC03/04 reads served from the response cache (Phase 11) |
| `backend.cacheMissCount` | FC03/04 reads that fell through to coalescing/backend after a cache miss (Phase 11) |
| `backend.cacheInvalidations` | Cache entries invalidated by overlapping FC06/FC16 write responses (Phase 11) |
| `backend.cacheEntryCount` | Point-in-time snapshot of the per-PLC cache's entry count (Phase 11, Tier-2 memory-watch) |
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC (Phase 11, Tier-2 memory-watch) |
| `bytes.upstreamIn` / `bytes.upstreamOut` | Bytes forwarded each direction |
Counters are `System.Threading.Interlocked` longs read atomically per request; no locking on the read path.
## Test simulator — pymodbus DL260/DL205 server
The pymodbus profile at [`../DL260/dl205.json`](../DL260/dl205.json) already models the DL205/DL260 quirks (BCD nibbles at known addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings, etc.) as concrete register seeds. The test infrastructure wraps it as a managed lifecycle so every integration / e2e test gets a fresh known-good DL-series target without needing real hardware.
Harness shape (lives under `tests/sim/`):
- **Launcher script**: `tests/sim/run-dl205-sim.ps1` provisions a Python venv under `tests/sim/.venv` on first run (`python -m venv` + `pip install pymodbus`), then launches `pymodbus.server` with the `dl205.json` profile on a configurable port. Idempotent: re-runs reuse the venv.
- **xUnit fixture**: `Mbproxy.Tests.Sim.DL205SimulatorFixture : IAsyncLifetime` that:
  - `InitializeAsync`: spawns the simulator subprocess, polls `TcpClient.ConnectAsync` against the port until success or a 10 s deadline, captures stdout/stderr to test output.
  - `DisposeAsync`: signals graceful shutdown (Ctrl-C on the process group on Windows), then `Process.Kill(entireProcessTree: true)` as a safety net.
  - Exposes `Host`, `Port`, `LogTail` (last N lines of sim stderr for diagnosis).
- **Test collection**: `[CollectionDefinition(nameof(DL205SimulatorCollection))]` so the fixture is shared across all integration/e2e classes that opt in (cheap startup, expensive process churn).
- **Skip policy** — if Python or pymodbus isn't available and the auto-provision fails (no network, locked-down CI image, etc.), `InitializeAsync` records the reason and tests skip via `Assert.Skip(sim.SkipReason)`. CI must have Python 3.10+ available; local devs running only the rewriter unit tests need nothing extra.
- **Alternate profiles** — additional scenarios (e.g., a profile that seeds a specific partial-overlap test case, or a profile with strict `type exception: true` to verify the proxy doesn't depend on lax pymodbus behaviour) live alongside `dl205.json` and are selected via `MODBUS_SIM_PROFILE` env var, matching the pattern already established by [`../DL260/DL205BcdQuirkTests.cs`](../DL260/DL205BcdQuirkTests.cs).
The simulator IS the proxy's end-to-end test bed. A standard e2e test does:
1. Start the simulator at `127.0.0.1:<simPort>`.
2. Configure the proxy with one PLC entry `Host=127.0.0.1, Port=<simPort>, ListenPort=<proxyPort>`.
3. Start the proxy (in-process via `WebApplicationFactory`-style host construction).
4. Drive a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `127.0.0.1:<proxyPort>`.
5. Assert two directions:
   - **Read**: client sees the BCD-decoded integer (proxy rewrote the response).
   - **Write**: simulator's register state shows the BCD-encoded nibbles (proxy rewrote the request).
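The two assertions in step 5 rest on a BCD round trip. A minimal Python sketch of the 16-bit case follows; the function names are hypothetical, and the 32-bit pair (word order, carry across registers) is deliberately not modelled here.

```python
def bcd16_decode(register: int) -> int:
    """Decode a 16-bit BCD register (four decimal nibbles) into an integer."""
    value = 0
    for shift in (12, 8, 4, 0):                 # most significant nibble first
        nibble = (register >> shift) & 0xF
        if nibble > 9:
            raise ValueError(f"invalid BCD nibble {nibble:#x} in {register:#06x}")
        value = value * 10 + nibble
    return value


def bcd16_encode(value: int) -> int:
    """Encode an integer in 0..9999 as a 16-bit BCD register."""
    if not 0 <= value <= 9999:
        raise ValueError(f"{value} does not fit in four BCD digits")
    register = 0
    for shift in (0, 4, 8, 12):                 # least significant digit first
        value, digit = divmod(value, 10)
        register |= digit << shift
    return register
```

On a read the client would see `1234` where the simulator holds `0x1234`; on a write of `1234` the simulator would end up holding `0x1234`.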
## Testing
- **Unit tests** — drive the BCD rewriter with synthetic Modbus PDU byte arrays. No network, no simulator. Cover every FC03/04/06/16 × {single 16-bit, full 32-bit pair, partial-overlap low, partial-overlap high, mixed-with-non-BCD} cell.
- **Integration tests** — drive the proxy end-to-end against the pymodbus simulator described in the previous section, using a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `proxy:<listenPort>` and asserting the decoded value rather than the raw register bytes.
- **Auto-recovery tests** — bind a `TcpListener` on a target port BEFORE starting the proxy, assert that the supervisor enters `recovering` state, release the port, and assert the next supervisor attempt succeeds and `mbproxy.listener.recovered` fires. Also cover the runtime-fault path by forcing the accept loop to throw and asserting the supervisor reattempts.
- **Hot-reload tests** — write a temp `appsettings.json`, start the host, mutate the file (add a PLC, remove a PLC, change a global tag width), and assert: (a) supervisor adds/removes the affected listener, (b) the rewriter on the next PDU reflects the new tag map, (c) a malformed reload is rejected without breaking the running config. Cover both `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` paths.
- **Status page tests** — start the host, induce known events (connect 2 clients, force a backend exception, trigger a partial-BCD warning), and assert `GET /status.json` returns the expected counters. The HTML page is verified separately as a smoke test that the route returns 200 with `text/html`.
@@ -1,408 +0,0 @@
# mbproxy — Dashboard KPI catalogue
Recommended additions to the `/status.json` and `/` admin endpoint to make a production fleet dashboard genuinely useful, grouped by tier. Today's `/status.json` exposes raw cumulative counters; this doc describes what's typically *also* expected when those counters land in Grafana / Wonderware / a custom HMI.
**Scope.** This is a proposal, not a contract. The endpoint shape settled in [`design.md`](design.md) → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect.
**Reading guide.** Each KPI has:
- **Name** — short identifier matching the proxy's existing camelCase convention.
- **Definition** — what the number means.
- **Source** — where the value comes from (existing counter, new counter, derived).
- **Widget** — typical dashboard visualisation.
- **Alert** — common threshold or anomaly rule (where applicable).
- **Effort** — implementation cost in hours (rough order-of-magnitude).
## What's exposed today (recap)
For context — every recommended addition below is *in addition to* this list. Today's `/status.json` carries:
| Group | Fields |
|-------|--------|
| Service | `uptimeSeconds`, `version`, `configLastReloadUtc`, `configReloadCount`, `configReloadRejectedCount` |
| Listeners | `bound`, `configured` |
| Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` |
| Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) |
| Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` |
| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`, `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes` |
| Per-PLC bytes | `upstreamIn`, `upstreamOut` |
Counters are **cumulative since process start**. A restart resets them.
---
## Tier 1 — strongly recommended for production
These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard."
### 1.1 Rate metrics (per-PLC and fleet-wide)
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `pdus.ratePerSec.last1m` | PDU rate over the last 60 s | New per-PLC ring buffer (60 × 1 s samples) | Sparkline per PLC | None — informational | 4 h |
| `pdus.ratePerSec.last5m` | Same over 5 min | Same buffer at 300 s | Sparkline | None | shared |
| `errors.ratePerMin` | Sum of `exceptionsByCode.*` + `partialBcdWarnings` + `invalidBcdWarnings` per minute | Derived | Stat tile per PLC | > 10/min → page | 2 h |
| `bytes.ratePerSec.up` / `.down` | Bandwidth each direction | Derived from `bytesUpstreamIn/Out` deltas | Stacked area | None — informational | 2 h |
| `fleet.totalPdusPerSec` | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h |
**Why this matters.** Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing `rate(pdus_forwarded[1m])` on a 54-row fleet is the single most informative widget on the dashboard.
**Implementation note.** Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want them in `/status.json` directly, add a per-PLC `Mbproxy.Proxy.RateTracker` with a fixed-size circular buffer of 60 one-second samples and expose `RatePerSec1m`, `RatePerSec5m`.
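A fixed-size circular buffer of one-second samples could look like the Python sketch below. It is an illustration of the idea only: the class shape, method names, and the choice to pass timestamps in explicitly (which keeps it testable) are assumptions, not the design of `Mbproxy.Proxy.RateTracker`.

```python
class RateTracker:
    """Ring buffer of per-second counts; one slot per second of the window."""

    def __init__(self, window_s: int = 60):
        self.window_s = window_s
        self._counts = [0] * window_s
        self._seconds = [None] * window_s   # which wall-clock second each slot holds

    def record(self, now_s: float, count: int = 1) -> None:
        second = int(now_s)
        slot = second % self.window_s
        if self._seconds[slot] != second:   # slot still holds an older second: reuse it
            self._seconds[slot] = second
            self._counts[slot] = 0
        self._counts[slot] += count

    def rate_per_sec(self, now_s: float) -> float:
        newest = int(now_s)
        oldest = newest - self.window_s + 1
        total = sum(
            c for c, s in zip(self._counts, self._seconds)
            if s is not None and oldest <= s <= newest
        )
        return total / self.window_s
```

A 5-minute rate would be a second instance with `window_s=300`. Slots that have aged out of the window are ignored on read, so an idle PLC decays to zero without a background timer.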
### 1.2 Latency percentiles (replacing the bare EWMA)
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.roundTripMs.p50` | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h |
| `backend.roundTripMs.p95` | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared |
| `backend.roundTripMs.p99` | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared |
| `backend.roundTripMs.max1m` | Slowest single PDU in last 1 min | Same reservoir | Stat tile | > 5 s → page | shared |
**Why this matters.** The existing `lastRoundTripMs` is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently.
**Implementation note.** Use `Mbproxy.Proxy.LatencyReservoir`, a 256-sample reservoir with Vitter's Algorithm R for unbiased sampling under arbitrary throughput. Don't store every sample: a busy PLC at 100 PDU/s produces 6,000 samples/min, and across 54 PLCs that is 324,000 samples/min, which is too much to retain.
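Algorithm R itself is short. The Python sketch below shows the sampling step and a nearest-rank percentile; the class shape and method names are assumptions for illustration. Note that a plain reservoir samples uniformly over everything seen since it was created, so a "last 1 min" percentile would also need the reservoir to be reset or rotated per window, which is not shown.

```python
import math
import random


class LatencyReservoir:
    """Fixed-size uniform sample of a stream (Vitter's Algorithm R)."""

    def __init__(self, capacity: int = 256, seed=None):
        self.capacity = capacity
        self.samples = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, round_trip_ms: float) -> None:
        self.seen += 1
        if len(self.samples) < self.capacity:
            self.samples.append(round_trip_ms)      # fill phase: keep everything
            return
        j = self._rng.randrange(self.seen)          # uniform in [0, seen)
        if j < self.capacity:                       # happens with probability capacity / seen
            self.samples[j] = round_trip_ms

    def percentile(self, p: float) -> float:
        """Nearest-rank percentile over the current sample."""
        if not self.samples:
            raise ValueError("no samples recorded")
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]
```

Memory stays at 256 values per PLC regardless of throughput, at the cost of percentiles being estimates once more than 256 samples have been seen.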
### 1.3 Per-PLC availability ratio
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `listener.boundRatio.last1h` | Fraction of time in `bound` state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h |
| `listener.boundRatio.sinceStart` | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared |
| `listener.timeInRecoveringMs.last1h` | Total time spent recovering in last hour | Same tracker | Stat tile | > 60s → warn | shared |
**Why this matters.** `recoveryAttempts` tells you how many times something has flapped, but not how *much* downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly.
**Implementation note.** Each `PlcListenerSupervisor` already has a state machine. Add a `StateDurationTracker` that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window.
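A minimal Python sketch of such a tracker follows. It covers only the since-start ratio; the sliding one-hour window is omitted, and the class and method names are assumptions rather than the proxy's actual `StateDurationTracker`.

```python
class StateDurationTracker:
    """Accumulates time spent in each listener state across transitions."""

    def __init__(self, initial_state: str, now_s: float):
        self._state = initial_state
        self._since = now_s          # when the current state was entered
        self._start = now_s          # when tracking began
        self._totals = {}            # closed-out seconds per state

    def transition(self, new_state: str, now_s: float) -> None:
        elapsed = now_s - self._since
        self._totals[self._state] = self._totals.get(self._state, 0.0) + elapsed
        self._state = new_state
        self._since = now_s

    def ratio(self, state: str, now_s: float) -> float:
        lifetime = now_s - self._start
        if lifetime <= 0:
            return 0.0
        total = self._totals.get(state, 0.0)
        if self._state == state:     # include the still-open interval
            total += now_s - self._since
        return total / lifetime
```

A listener that is bound for 90 s, recovering for 10 s, and bound again for 100 s reports a bound ratio of 0.95 at the 200 s mark, which would trip the `< 0.99` warning threshold in the table above.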
### 1.4 Liveness / staleness signals
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `pdus.lastForwardedUtc` | Wall time of the most recent forwarded PDU | New `_lastForwardedTimestamp` per PLC | Stat tile | `now - value > 5 min AND clients.connected > 0` → page | 1 h |
| `clients.lastActivityUtc` | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in remoteEndpoints | None | 1 h |
| `staleClients.count` | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h |
**Why this matters.** Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with `clients.connected = 2` but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured.
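The alert rule for `pdus.lastForwardedUtc` in the table above is a two-part predicate. As a hedged Python sketch (the function name and the fixed 5-minute constant are illustrative choices, not part of the proxy):

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(minutes=5)


def is_stale(last_forwarded_utc: datetime, clients_connected: int, now_utc: datetime) -> bool:
    """True when clients are attached but no PDU has been forwarded for 5 minutes."""
    if clients_connected <= 0:
        return False                 # nobody connected: silence is expected
    return now_utc - last_forwarded_utc > STALE_AFTER
```

The `clients_connected` guard is what keeps an intentionally idle PLC from paging anyone.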
### 1.5 Service-wide fleet aggregates
These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `fleet.plcsHealthy` | Count of PLCs in `bound` state with no errors in last 5 min | Aggregate | Big number, green | < `listeners.configured - 2` → warn | 2 h |
| `fleet.plcsRecovering` | Count in `recovering` state | Aggregate | Big number, orange | > 0 → informational | shared |
| `fleet.plcsStopped` | Count in `stopped` state | Aggregate | Big number, grey | > 0 → page | shared |
| `fleet.plcsWithActiveErrors` | Count with `errors.ratePerMin > 0` | Aggregate | Big number, red | > 0 → page | shared |
| `fleet.totalClientsConnected` | Sum of `clients.connected` | Aggregate | Stat tile | None | 1 h |
| `fleet.totalRewrittenSlotsPerSec` | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared |
**Why this matters.** A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table.
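The header tiles are plain aggregates over the per-PLC rows. In the Python sketch below, the row shape is an assumption: `state` mirrors `listener.state`, and `errorsPerMin` is a hypothetical stand-in for `errors.ratePerMin`, used here as a simplification of "no errors in last 5 min".

```python
from collections import Counter


def fleet_header(plcs):
    """Summarise per-PLC rows into the fleet header tiles."""
    states = Counter(row["state"] for row in plcs)
    with_errors = sum(1 for row in plcs if row["errorsPerMin"] > 0)
    healthy = sum(
        1 for row in plcs
        if row["state"] == "bound" and row["errorsPerMin"] == 0
    )
    return {
        "plcsHealthy": healthy,
        "plcsRecovering": states["recovering"],   # Counter returns 0 for absent states
        "plcsStopped": states["stopped"],
        "plcsWithActiveErrors": with_errors,
    }
```

With 47 clean bound PLCs, 5 recovering, and 2 bound PLCs reporting errors, this produces the "47 healthy / 5 recovering / 2 errors" header used as the example above.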
### 1.6 Multiplexer state — **shipped in [Phase 9](plan/09-txid-multiplexing.md)**
The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.inFlightCount` | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) |
| `backend.maxInFlight` | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent — realistic only under pathological load) | (in Phase 9 scope) |
| `backend.txIdWraps` | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) |
| `backend.queueDepth` | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) |
| `backend.disconnectCascades` | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with `mbproxy.backend.failed` events | (in Phase 9 scope) |
**Why this matters.** Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's `lastRoundTripMs` measures wire latency only; queue depth reveals proxy-side backlog).
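To make `inFlightCount` and `txIdWraps` concrete, here is a hypothetical 16-bit TxId allocator in Python. It is a sketch of the general technique (a wrapping cursor that skips IDs still in flight), not the proxy's actual allocator; all names are assumptions.

```python
class TxIdAllocator:
    """Hands out 16-bit transaction IDs, skipping any that are still in flight."""

    def __init__(self, start: int = 0):
        self._next = start & 0xFFFF
        self.in_flight = set()
        self.wraps = 0               # times the cursor has rolled 0xFFFF -> 0x0000

    def allocate(self) -> int:
        if len(self.in_flight) >= 0x10000:
            raise RuntimeError("TxId space exhausted")
        while True:
            candidate = self._next
            self._next = (self._next + 1) & 0xFFFF
            if self._next == 0:      # cursor just moved past 0xFFFF
                self.wraps += 1
            if candidate not in self.in_flight:
                self.in_flight.add(candidate)
                return candidate

    def release(self, tx_id: int) -> None:
        self.in_flight.discard(tx_id)
```

In this model `len(in_flight)` corresponds to `backend.inFlightCount`, and the exhaustion check is the saturation condition the `maxInFlight` alert is meant to warn about long before it is reached.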
### 1.7 Read coalescing — **shipped in [Phase 10](plan/10-read-coalescing.md)**
Same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric. `coalescedHitCount + coalescedMissCount` equals total FC03/04 request count per snapshot — the math always balances.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.coalescedHitCount` | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) |
| `backend.coalescedMissCount` | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) |
| `backend.coalescingRatio` | `Hit / (Hit + Miss)` over the trailing window | Derived (dashboard) | Stat tile per PLC | None; a low ratio just means clients aren't synchronised on the same registers — informational | (in Phase 10 scope) |
| `backend.coalescedResponseToDeadUpstream` | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable (Tier 2 priority) | (in Phase 10 scope) |
**Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.
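Because the counters are cumulative since process start, a trailing-window ratio comes from the difference between two snapshots. A small Python sketch, assuming each snapshot is a dict carrying the two counter fields (the function name is illustrative):

```python
from typing import Optional


def coalescing_ratio(prev: dict, curr: dict) -> Optional[float]:
    """Hit / (Hit + Miss) between two /status.json snapshots of one PLC."""
    hits = curr["coalescedHitCount"] - prev["coalescedHitCount"]
    misses = curr["coalescedMissCount"] - prev["coalescedMissCount"]
    if hits < 0 or misses < 0:
        return None                  # counters went backwards: the process restarted
    total = hits + misses
    return hits / total if total else None   # no FC03/04 traffic in the window
```

Returning `None` rather than `0.0` for an empty or reset window keeps a dashboard from drawing a misleading zero.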
### 1.8 Response cache — **shipped in [Phase 11](plan/11-response-cache.md)**
With Phase 11, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default: operators opt tags in by setting `CacheTtlMs > 0` on a `BcdTagOptions` entry (or `DefaultCacheTtlMs > 0` on a `PlcOptions` entry).
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.cacheHitCount` | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) |
| `backend.cacheMissCount` | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) |
| `backend.cacheHitRatio` | `Hit / (Hit + Miss)` for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) |
| `backend.cacheInvalidations` | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) |
**Why this matters.** Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts.
---
## Tier 2 — nice-to-have
Reach for these once Tier 1 is solid. They add depth for specific operational scenarios.
### 2.1 Connection-cap saturation warning
> **Status: superseded by [Phase 9](plan/09-txid-multiplexing.md).** This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. After Phase 9 ships, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect — the 4-client cap on the ECOM is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is `backend.inFlightCount` (Tier 1.6) against the 65,535 TxId-allocator ceiling, but that's realistically unreachable under any normal load. **Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.**
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `clients.atCapWarning` | Boolean: `clients.connected >= 3` (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h |
| `clients.atCapBlocked` | Boolean: `clients.connected >= 4` (cap reached) | Derived | Cell highlight | True → page | shared |
**Why this mattered (pre-Phase-9).** The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see [design.md](design.md) → "Connection model" and [DL260/dl205.md](../DL260/dl205.md) → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments.
### 2.2 Error breakdown / heatmap
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `partialBcd.byClient` | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h |
| `invalidBcd.byAddress` | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h |
| `exceptions.byCodeRate` | Per-exception-code rate over 5 min | Derived from `exceptionsByCode.*` | Stacked bar | Code 04 (Slave Failure) spike → PLC in PROGRAM mode? | 2 h |
**Why this matters.** Once you've seen `partialBcdWarnings = 1247`, the next question is *which client* and *which tag*. Without dimensional breakdown, you have to SSH into the host and search the log file to find out.
### 2.3 Hot-reload cadence
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `config.reloadsPerHour` | Reload events per hour | Derived from `configReloadCount` | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h |
| `config.lastReloadDelta` | Summary of what changed on last reload | Already in `mbproxy.config.reload.applied` event; surface here | Text snippet | None — informational | 2 h |
**Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.
### 2.4a Response-cache memory — **shipped in [Phase 11](plan/11-response-cache.md)**
When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC `MaxEntriesPerPlc` cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob.
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) |
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) |
**Why this matters.** Cache entries are short-lived (TTLs are typically seconds, not minutes). A `cacheEntryCount` that sits at `MaxEntriesPerPlc` for long stretches says "the LRU is constantly evicting" — either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks. `cacheBytes` is the memory-side counter: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression.
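The fleet-wide estimate above is a straight product, worked here in Python with the figures from the paragraph (the function name is illustrative; 250 bytes per entry is the document's approximation, not a measured value):

```python
def fleet_cache_bytes(plc_count: int, entries_per_plc: int, bytes_per_entry: int) -> int:
    """Upper-bound cache payload if every PLC's cache is full."""
    return plc_count * entries_per_plc * bytes_per_entry


total = fleet_cache_bytes(54, 1000, 250)
print(f"{total / 1e6:.1f} MB ({total / 2**20:.1f} MiB)")   # prints "13.5 MB (12.9 MiB)"
```

This counts cached PDU bytes only; per-entry bookkeeping overhead in the real cache would add to it.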
### 2.4 Memory / process health
| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `process.workingSetMb` | `Process.GetCurrentProcess().WorkingSet64 / 1MB` | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h |
| `process.gcCollections.gen0/1/2` | GC counts per generation | `GC.CollectionCount(n)` | Sparkline | Gen-2 frequency → memory pressure | 0.5 h |
| `process.threadCount` | `Process.Threads.Count` | New | Stat tile | > 200 → leak? | 0.5 h |
**Why this matters.** A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch 90 % of common leak patterns. Each is one `Process` API call, no perf overhead.
---
## Real-time updates via SignalR
Today's status surface is poll-based: the HTML page uses a 5-second `meta-refresh`, and Prometheus / custom HMI scrapers hit `/status.json` on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a **live fleet dashboard with many panels open**, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping `bound → recovering`) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update.
**The recommendation is additive, not replacement.** Keep `/status.json` for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change.
### Why this is cheap to add
The `Microsoft.AspNetCore.App` framework reference that Phase 07 added to the csproj **already includes `Microsoft.AspNetCore.SignalR`** — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on `Mbproxy.AdminPort`. No additional port, no additional listener supervision, no additional shutdown path.
### Architecture
```
                                                                   ┌─→ Dashboard A (subscribed to "all")
ProxyWorker / Supervisors ──┐                                      │
ConfigReconciler ───────────┤                                      │
ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer")
ServiceCounters ────────────┘    (background loop +                │
                                  immediate-push paths)            └─→ Dashboard C (subscribed to "service")
```
- **`StatusHub : Hub`** — the SignalR endpoint mounted at `/hub/status` on `AdminPort`. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates.
- **`StatusBroadcaster : IHostedService`** — the background pusher. Holds a `Timer` (or `PeriodicTimer`) that ticks at `PushIntervalMs` (default 1000 ms), builds a `StatusResponse` via the existing `StatusSnapshotBuilder`, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes `PushEventAsync(name, props)` for the immediate-push paths.
- **Immediate-push wiring** — the existing log events (`mbproxy.listener.recovered`, `mbproxy.config.reload.applied`, `mbproxy.backend.failed`, `mbproxy.rewrite.partial_bcd`, etc.) gain a fan-out call to `broadcaster.PushEventAsync(...)` so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick.
### Hub contract
**Hub URL:** `http://<host>:<AdminPort>/hub/status`
**Hub groups** — clients subscribe to scopes; the server broadcasts to matching groups:
| Group | Receives |
|-------|----------|
| `all` | Every update for every PLC + every service-level event |
| `service` | Service-level events only (`mbproxy.config.*`, `mbproxy.admin.*`, `mbproxy.startup.*`, `mbproxy.shutdown.*`) |
| `plc:<Name>` | One PLC's snapshots + that PLC's events |
**Server-side methods** (client → server):
| Method | Purpose |
|--------|---------|
| `Task SubscribeFleet()` | Join group `all` |
| `Task SubscribeService()` | Join group `service` |
| `Task SubscribePlc(string name)` | Join group `plc:<name>` after validating that `name` exists in current options |
| `Task Unsubscribe()` | Leave every group; the connection stays open but receives nothing |
**Client-side callbacks** (server → client, named `On*` per SignalR convention):
| Callback | Payload | When |
|----------|---------|------|
| `OnSnapshot(StatusResponse snapshot)` | Full snapshot of the relevant scope (`all`, `service`, or a single PLC) | Sent once on subscribe so the dashboard has a baseline; thereafter on each re-subscribe after a reconnect and at the periodic re-baseline (every `FullSnapshotIntervalMs`) |
| `OnPatch(StatusPatch patch)` | Delta of fields that changed since the last push | Periodic — every `PushIntervalMs` if anything changed; skipped if nothing changed |
| `OnEvent(StatusEvent ev)` | Single discrete event: `{ name, levelString, plc?, propertiesJson, timestampUtc }` | Immediately — fan-out from the existing `[LoggerMessage]` event call sites |
`StatusPatch` carries only the fields that changed since the previous push: it's a `Dictionary<string, JsonElement>` keyed by JSON path (e.g., `"plcs[2].pdus.forwarded"`, `"plcs[2].listener.state"`). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle.
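One way to produce such a path-keyed delta is to walk the previous and current snapshots as JSON and record only the leaves that differ. The sketch below assumes both snapshots are already available as `JsonElement` values; `SnapshotDiff` is a hypothetical helper name, and it ignores removed properties and treats an array whose length changed as a single replaced value.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper (not part of mbproxy): path -> new value for changed leaves.
public static class SnapshotDiff
{
    public static Dictionary<string, JsonElement> Diff(JsonElement previous, JsonElement current)
    {
        var patch = new Dictionary<string, JsonElement>();
        Walk("", previous, current, patch);
        return patch;
    }

    private static void Walk(string path, JsonElement? prev, JsonElement cur,
                             Dictionary<string, JsonElement> patch)
    {
        if (prev is { } p && p.ValueKind == cur.ValueKind)
        {
            switch (cur.ValueKind)
            {
                case JsonValueKind.Object:
                    foreach (JsonProperty prop in cur.EnumerateObject())
                    {
                        JsonElement? child = p.TryGetProperty(prop.Name, out JsonElement found)
                            ? found
                            : (JsonElement?)null;
                        string childPath = path.Length == 0 ? prop.Name : $"{path}.{prop.Name}";
                        Walk(childPath, child, prop.Value, patch);
                    }
                    return;

                case JsonValueKind.Array when p.GetArrayLength() == cur.GetArrayLength():
                    int i = 0;
                    foreach (JsonElement item in cur.EnumerateArray())
                    {
                        Walk($"{path}[{i}]", p[i], item, patch);
                        i++;
                    }
                    return;

                case JsonValueKind.Array:
                    break; // length changed: fall through and replace the whole array

                default:
                    if (p.GetRawText() == cur.GetRawText())
                        return; // unchanged scalar
                    break;
            }
        }

        // New, changed, or differently-typed value: record it under its path.
        patch[path] = cur.Clone();
    }
}
```

With two snapshots that differ only in one PLC's forwarded-PDU counter, the result holds a single entry keyed `plcs[0].pdus.forwarded`.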
### What gets pushed, and when
| Update kind | Cadence | Volume per PLC | Channel |
|-------------|---------|----------------|---------|
| Counter increments (PDUs, bytes, rewrites) | Every `PushIntervalMs` if changed; coalesced | 1 patch / push tick / subscribed group | `OnPatch` |
| State transitions (`bound ↔ recovering ↔ stopped`) | Immediate | 1 event + 1 patch | `OnEvent` + `OnPatch` |
| Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | `OnEvent` |
| Hot-reload applied / rejected | Immediate | 1 event with `propertiesJson` summary | `OnEvent` |
| Periodic full snapshot | Every 60 s | 1 full snapshot | `OnSnapshot` |
The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth.
### Configuration
Extend `appsettings.json` with:
```jsonc
"Mbproxy": {
// ... existing keys ...
"Admin": {
"SignalR": {
"Enabled": true,
"PushIntervalMs": 1000, // patch cadence
"FullSnapshotIntervalMs": 60000, // periodic re-baseline
"MaxConcurrentClients": 32, // refuse new connections beyond this
"MaxGroupsPerClient": 8 // anti-runaway-subscription guard
}
}
}
```
The feature is cheap to leave off: if `SignalR.Enabled` is `false`, the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality, so the first release requires a restart to change them.
### Implementation outline
1. **Hub class**: `src/Mbproxy/Admin/StatusHub.cs`. Inherits `Hub`. Implements the three `Subscribe*` methods and `Unsubscribe`. `OnConnectedAsync` rejects the connection once the number of tracked connections reaches `MaxConcurrentClients` (track connections in a static `ConcurrentDictionary<string, byte>` keyed by `ConnectionId`; `Context.Items` is per-connection state and cannot serve as a global count).
2. **Broadcaster**: `src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService`. Constructor takes `IHubContext<StatusHub>`, `StatusSnapshotBuilder`, `IOptionsMonitor<MbproxyOptions>`. The push loop is a `while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... }` body; `PeriodicTimer` is preferred over `Timer` here because cancellation flows through the awaited tick.
3. **DTOs**: `StatusPatch` and `StatusEvent` records added to `StatusDto.cs`, registered with the source-gen `StatusJsonContext`.
4. **Event fan-out** — the existing `[LoggerMessage]` partial methods stay; add a thin `RealtimeLogEvents` wrapper class that logs AND calls `broadcaster.PushEventAsync(...)`. Call sites in supervisors / pipelines / reconciler swap to the wrapper. Keeps log-only call sites and broadcast-too call sites both readable.
5. **Hub mapping**: `AdminEndpointHost` adds `app.MapHub<StatusHub>("/hub/status")` if `SignalR.Enabled`. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint.
6. **Shutdown**: `StatusBroadcaster.StopAsync` cancels its pump and the hub's `Dispose` chain handles connection teardown. The existing `ShutdownCoordinator` deadline applies.
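The connection cap from step 1 could be isolated in a small tracker so it is testable without a hub. The sketch below is an assumption-laden illustration: `ConnectionLimiter` is a hypothetical name, and it swaps the `ConcurrentDictionary` mentioned above for a lock-guarded `HashSet` so that the check and the add happen atomically.

```csharp
using System.Collections.Generic;

// Hypothetical helper (not part of mbproxy): caps the number of tracked connections.
public sealed class ConnectionLimiter
{
    private readonly HashSet<string> _ids = new();
    private readonly object _gate = new();
    private readonly int _max;

    public ConnectionLimiter(int max) => _max = max;

    // Returns true if the connection is (now) tracked, false if the cap is reached.
    public bool TryEnter(string connectionId)
    {
        lock (_gate)
        {
            if (_ids.Contains(connectionId)) return true; // already admitted
            if (_ids.Count >= _max) return false;         // at capacity
            _ids.Add(connectionId);
            return true;
        }
    }

    // Safe to call for ids that were never admitted.
    public void Leave(string connectionId)
    {
        lock (_gate) { _ids.Remove(connectionId); }
    }

    public int Count
    {
        get { lock (_gate) { return _ids.Count; } }
    }
}
```

Registered as a singleton, the hub's `OnConnectedAsync` would call `TryEnter(Context.ConnectionId)` and `Context.Abort()` on `false`, and `OnDisconnectedAsync` would call `Leave`.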
### Test approach
Use the **`Microsoft.AspNetCore.SignalR.Client`** package (NuGet) in the test csproj only. Pattern:
```csharp
[Fact]
[Trait("Category", "E2E")]
public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException()
{
// Arrange: start host on a random AdminPort, build a SignalR client.
await using var connection = new HubConnectionBuilder()
.WithUrl($"http://localhost:{adminPort}/hub/status")
.Build();
var patches = new ConcurrentQueue<StatusPatch>();
connection.On<StatusPatch>("OnPatch", patches.Enqueue);
await connection.StartAsync(TestContext.Current.CancellationToken);
await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken);
// Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1).
// ... drive request through proxy ...
// Assert: a patch with backend.connectsFailed != 0 arrives within 500 ms.
var deadline = DateTime.UtcNow.AddMilliseconds(500);
while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")))
await Task.Delay(20, TestContext.Current.CancellationToken);
patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"));
}
```
Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly.
Coverage targets for the new tests:
1. `SignalR_Subscribe_DeliversInitialSnapshot`
2. `SignalR_Patch_FiresWithinPushInterval_AfterCounterChange`
3. `SignalR_Event_FiresWithin_100ms_OfListenerRecovered`
4. `SignalR_SubscribePlc_OnlyDeliversThatPlcEvents` — verifies group filtering
5. `SignalR_MaxConcurrentClients_RefusesExcess` — capacity guard
6. `SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs`
### Operational considerations
- **Authentication / authorisation.** Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy.
- **Transport.** SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport.
- **Backpressure.** `Hub.Clients.Group("all").SendAsync` does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy.
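- **Reconnection.** The .NET and browser SignalR clients reconnect automatically only when built with `WithAutomaticReconnect()` (JavaScript: `withAutomaticReconnect()`); the default retry delays are 0, 2, 10, and 30 seconds. A reconnected client gets a new connection and loses its group membership, so it must call its `Subscribe*` method again from the reconnected handler; the `OnSnapshot` sent on subscribe then re-baselines the dashboard. The periodic full snapshot every 60 s covers missed patches on a live connection, not lost subscriptions.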
- **Cardinality at scale.** 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The `MaxConcurrentClients` guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy.
- **CORS.** If dashboards run on a different origin (likely), enable CORS on the admin app for `/hub/status` only. Add `AdminCors.AllowedOrigins` to `appsettings.json` as an array of allowed origin strings; an empty array means same-origin only.
- **Logging.** SignalR's internal logs are noisy at Information. In `appsettings.json`, set the `Microsoft.AspNetCore.SignalR` category to `Warning` and `Microsoft.AspNetCore.Http.Connections` to `Warning` so the proxy's own event stream isn't drowned out.
### Effort estimate
| Work | Hours |
|------|-------|
| Hub + DTOs + broadcaster | 6 h |
| Event fan-out wiring (existing log events) | 3 h |
| AdminEndpointHost integration + appsettings binding | 2 h |
| E2E test suite (6 tests using SignalR .NET client) | 4 h |
| Documentation (this section graduates from proposal to fact; design.md update) | 1 h |
| **Total** | **~16 h** |
This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production.
---
## Implementation notes
### Where rates and percentiles should live
Two reasonable answers:
1. **Compute in the proxy, expose pre-computed values in `/status.json`.** Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change.
2. **Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates.** Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar.
**Recommendation:** ship Tier 1 rate metrics computed in-process for the operator who just opens `http://<host>:8080/` in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative.
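The in-process windowed values can be derived from the same cumulative counters that are exposed raw. A minimal sketch, assuming samples are taken on a single thread (for example from the snapshot path) and that `RateWindow` is a hypothetical helper rather than an existing type:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (not part of mbproxy): per-second rate over a sliding window
// of cumulative counter samples. Not thread-safe.
public sealed class RateWindow
{
    private readonly Queue<(DateTime AtUtc, long Value)> _samples = new();
    private readonly TimeSpan _window;
    private (DateTime AtUtc, long Value) _latest;

    public RateWindow(TimeSpan window) => _window = window;

    // Record the cumulative counter value observed at the given time.
    public void Sample(DateTime atUtc, long cumulativeValue)
    {
        _latest = (atUtc, cumulativeValue);
        _samples.Enqueue(_latest);

        // Drop samples older than the window, but keep at least two so a rate exists.
        while (_samples.Count > 2 && atUtc - _samples.Peek().AtUtc > _window)
            _samples.Dequeue();
    }

    // Increase per second between the oldest retained sample and the latest one.
    public double PerSecond
    {
        get
        {
            if (_samples.Count < 2) return 0;
            var oldest = _samples.Peek();
            double seconds = (_latest.AtUtc - oldest.AtUtc).TotalSeconds;
            return seconds <= 0 ? 0 : (_latest.Value - oldest.Value) / seconds;
        }
    }
}
```

Because the raw counter is never modified, a TSDB scraping `/status.json` computes its own rates independently of whatever window size is chosen here.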
### Counter additions vs computed values
A few proposed KPIs require **new counters in `ProxyCounters` or `ServiceCounters`**, not just derivations:
- `pdus.lastForwardedUtc` — new `long _lastForwardedTicks` on `ProxyCounters`, written and read through `Interlocked` / `Volatile` (C# does not allow `volatile` on a `long` field).
- `listener.boundRatio.*` — new `StateDurationTracker` on `PlcListenerSupervisor`.
- `partialBcd.byClient` / `invalidBcd.byAddress` — new `ConcurrentDictionary<string,long>` / `ConcurrentDictionary<ushort,long>` on `PerPlcContext`. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases).
- `process.*` — read fresh on every snapshot from `Process.GetCurrentProcess()` — no stored state.
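For the last-forwarded timestamp, note that C# does not permit the `volatile` modifier on 64-bit fields such as `long`, so a sketch of the field uses `Interlocked` for atomic writes and reads instead. `LastForwardedStamp` is a hypothetical wrapper used only for illustration; in the proxy the field would presumably sit directly on `ProxyCounters`.

```csharp
using System;
using System.Threading;

// Hypothetical wrapper (not part of mbproxy): lock-free "last forwarded" timestamp.
public sealed class LastForwardedStamp
{
    private long _ticks; // 0 means "nothing forwarded yet"

    // Called on the hot path after a PDU is forwarded.
    public void Mark(DateTime utcNow) => Interlocked.Exchange(ref _ticks, utcNow.Ticks);

    // Called by the snapshot builder; returns null until the first Mark.
    public DateTime? Read()
    {
        long ticks = Interlocked.Read(ref _ticks);
        return ticks == 0 ? (DateTime?)null : new DateTime(ticks, DateTimeKind.Utc);
    }
}
```

`Interlocked.Read` guarantees an untorn 64-bit read even in a 32-bit process, which a plain field read does not.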
### Snapshot serialization cost
`StatusResponse` is built per-request to `/status.json`. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., `invalidBcd.byAddress`) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep `/status.json` under a few hundred KB even when something goes badly wrong.
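The top-N cap can be applied at snapshot time, leaving the live counters untouched. A minimal sketch, assuming the dimensional counters are exposed as key/count pairs; `TopN.Cap` is a hypothetical helper name:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (not part of mbproxy): keeps only the highest-count entries.
public static class TopN
{
    public static Dictionary<TKey, long> Cap<TKey>(
        IEnumerable<KeyValuePair<TKey, long>> counts, int limit)
        where TKey : notnull
    {
        return counts
            .OrderByDescending(kv => kv.Value) // largest counts first
            .Take(limit)                       // drop everything past the cap
            .ToDictionary(kv => kv.Key, kv => kv.Value);
    }
}
```

A `ConcurrentDictionary<ushort, long>` can be passed directly, since it enumerates as key/value pairs; the serialized map then never exceeds `limit` entries regardless of how many distinct addresses have been seen.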
### Dashboard widget mapping (Grafana-style cheat sheet)
| Widget | Use for |
|--------|---------|
| **Stat (big number)** | Service-wide aggregates, counts, latest timestamps |
| **Gauge** | Ratios (availability, success rate, queue depth) |
| **Sparkline** | Rates, percentiles, time-series trends |
| **Stacked area** | Bandwidth, PDU-by-FC breakdown over time |
| **Heatmap** | Per-address / per-client dimensional breakdowns |
| **Cell-coloured table** | Per-PLC status (54 rows, one per PLC, columns of KPIs) |
### Backwards-compat policy
The fields currently in `/status.json` are **frozen** — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in [`design.md`](design.md) → "Status page" as the contract; new fields ship via PRs that update the contract first.
## Cross-references
- Field tables for what ships today: [`design.md`](design.md) → "Status page".
- Stable log event names (some KPIs are derivable by tailing these): [`design.md`](design.md) → "Logging" event-name table.
- Per-counter wiring lives in `src/Mbproxy/Proxy/ProxyCounters.cs` and `src/Mbproxy/ServiceCounters.cs`.
- The status HTML page is rendered by `src/Mbproxy/Admin/StatusHtmlRenderer.cs`; the JSON DTOs and source-gen context live in `src/Mbproxy/Admin/StatusDto.cs`.
@@ -1,176 +0,0 @@
# mbproxy operations runbook
Day-two operations reference for the mbproxy Windows Service: install, upgrade, configuration, logs, and troubleshooting.
## Install
### Prerequisites
- Windows 10 / Server 2019 or later (64-bit).
- PowerShell 5.1+ run as Administrator (the install script uses `#Requires -RunAsAdministrator`).
- The compiled publish output from `dotnet publish` (see [README.md](../README.md) for the exact command).
- Modbus TCP reachable from the proxy host to the PLCs on port 502.
- Port 8080 (or whatever `AdminPort` is set to) available for the status page.
### Steps
1. Publish the binaries on the build machine:
```powershell
dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true -o C:\build\mbproxy-publish
```
2. Copy the publish output to the target server (or run the install script locally if you built on the server).
3. Open an elevated PowerShell prompt and run the install script:
```powershell
.\install\install.ps1 -PublishOutput C:\build\mbproxy-publish -Start
```
The script:
- Copies binaries to `C:\Program Files\Mbproxy\` (configurable via `-InstallPath`).
- Registers the service with `sc.exe create`.
- Sets failure-recovery: restart after 60 s on first/second failure, no action on third.
- Creates `%ProgramData%\mbproxy\logs\` and sets ACLs if needed.
- Copies `mbproxy.config.template.json` to `%ProgramData%\mbproxy\appsettings.json` **only if no config exists**.
- Registers the Windows Event Log source `mbproxy`.
- With `-Start`, starts the service and waits up to 30 s for `RUNNING` state.
4. Edit `%ProgramData%\mbproxy\appsettings.json` to configure your PLC list and BCD tags. See the template for inline comments on every field.
5. If you edited the config before starting, start the service:
```powershell
sc.exe start mbproxy
```
6. Verify (smoke checklist — see [Smoke checklist](#first-install-smoke-checklist) below).
### Re-running install on an existing installation
The install script is idempotent. Re-running it:
- Stops the service if running.
- Overwrites the binaries.
- Updates the service config via `sc.exe config` (not `sc.exe create`).
- Preserves `%ProgramData%\mbproxy\appsettings.json` (never overwritten on update).
- Skips Event Log source creation if already registered.
## Upgrade procedure
1. Publish new binaries on the build machine (same command as install step 1).
2. Stop the service:
```powershell
sc.exe stop mbproxy
```
Wait for the service to reach `STOPPED` state — graceful shutdown drains in-flight PDUs (up to `Connection.GracefulShutdownTimeoutMs`, default 10 s).
3. Copy new binaries to `C:\Program Files\Mbproxy\` (or run `install.ps1 -PublishOutput ...` to automate steps 2-4):
```powershell
Copy-Item -Path C:\build\mbproxy-publish\* -Destination 'C:\Program Files\Mbproxy\' -Force
```
4. Start the service:
```powershell
sc.exe start mbproxy
```
5. Check the status page to confirm the new version:
```powershell
Invoke-RestMethod http://localhost:8080/status.json | Select-Object -ExpandProperty service
```
The `version` field should show the new build.
## Uninstall
```powershell
.\install\uninstall.ps1
```
Options:
- `-KeepConfig` — preserves `%ProgramData%\mbproxy\appsettings.json` for re-install.
- Log files are **always archived** to `%ProgramData%\mbproxy.archived-<timestamp>\logs\` regardless of `-KeepConfig`. They are never deleted.
## Configuration
The service reads `%ProgramData%\mbproxy\appsettings.json` at startup and watches it for changes while running. Most settings are hot-reloadable; a save triggers a re-bind of `IOptionsMonitor<MbproxyOptions>` and a per-change-kind reconcile.
- Full schema (every `Mbproxy:*` key, defaults, validation rules, examples): [`Operations/Configuration.md`](Operations/Configuration.md).
- Per-change-kind reconcile semantics (what propagates instantly vs. what requires a restart): [`Features/HotReload.md`](Features/HotReload.md).
If a reload is rejected (`mbproxy.config.reload.rejected` in the log), the service continues running with the previous config. Fix the JSON and save again — the next valid file write is accepted.
## Logs
### Location
Rolling log files live at: `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`
One file per day, retained for 30 days by default (controlled by `retainedFileCountLimit` in the Serilog config section).
### Windows Event Log
When running as a Windows Service, the `EventLogBridge` sink writes events at Error level and above to the Windows Application Event Log under source `mbproxy`. View with:
```powershell
Get-EventLog -LogName Application -Source mbproxy -Newest 20
```
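Or open Event Viewer → Windows Logs → Application, filter by source `mbproxy`. `Get-EventLog` exists only in Windows PowerShell 5.1; under PowerShell 7, use `Get-WinEvent -FilterHashtable @{ LogName = 'Application'; ProviderName = 'mbproxy' } -MaxEvents 20` instead.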
### Log survival after uninstall
`uninstall.ps1` **never deletes log files**. It moves `logs\` to a timestamped archive at `%ProgramData%\mbproxy.archived-<timestamp>\logs\` so post-crash diagnostics remain accessible.
## Status page
**URL:** `http://<proxy-host>:<AdminPort>/` (default port 8080; change via `Mbproxy.AdminPort` in `appsettings.json`).
Routes: `GET /` (auto-refreshing HTML, no external assets) and `GET /status.json` (same data as JSON for monitoring scrapers).
The full endpoint shape, every JSON field, counter semantics, and scraping examples live in [`Operations/StatusPage.md`](Operations/StatusPage.md). KPI catalog and dashboard guidance: [`kpi.md`](kpi.md).
## Common failure modes
The full diagnosis playbook — startup bind conflicts, backend connectivity, hot-reload validation errors, BCD rewrite anomalies, performance and queue-depth issues, response-cache anomalies, and graceful-shutdown problems — is keyed to log events and status counters in [`Operations/Troubleshooting.md`](Operations/Troubleshooting.md). The complete `mbproxy.*` event catalog with levels, properties, and operator implications is in [`Reference/LogEvents.md`](Reference/LogEvents.md).
## First-install smoke checklist
Run these commands after `install.ps1 -Start` to verify the deployment:
```powershell
# 1. Service is running
Get-Service mbproxy | Select-Object Status, DisplayName
# 2. Status page is reachable
Invoke-WebRequest http://localhost:8080/ -UseBasicParsing | Select-Object StatusCode
# 3. JSON endpoint returns expected fields
$status = Invoke-RestMethod http://localhost:8080/status.json
$status.service | Select-Object version, uptimeSeconds
$status.listeners
# 4. Log file exists and is recent
Get-Item "C:\ProgramData\mbproxy\logs\mbproxy-*.log" | Sort-Object LastWriteTime -Descending | Select-Object -First 1
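# 5. No Error events in the Event Log (a "No matches found" error is the good outcome).
#    Get-EventLog requires Windows PowerShell 5.1; it is not available in PowerShell 7.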
Get-EventLog -LogName Application -Source mbproxy -EntryType Error -Newest 5
# 6. Stop the service cleanly (graceful shutdown within 10 s)
$sw = [System.Diagnostics.Stopwatch]::StartNew()
sc.exe stop mbproxy
$deadline = [DateTime]::UtcNow.AddSeconds(15)
do { Start-Sleep 1 } until ((Get-Service mbproxy).Status -eq 'Stopped' -or [DateTime]::UtcNow -gt $deadline)
$sw.Stop()
Write-Host "Stop elapsed: $($sw.ElapsedMilliseconds) ms"
(Get-Service mbproxy).Status # Should be Stopped
```
**Note:** This checklist documents the expected steps. It was not executed on a dedicated clean VM (the proxy was developed and unit/E2E tested in-process). Run this checklist on first deployment to a production host.
@@ -1,179 +0,0 @@
# Phase 00 — Bootstrap
Scaffold the .NET 10 Worker Service project and the test project. Wire up Generic Host, Serilog, Windows-Service registration, and `MbproxyOptions` POCOs bound via `IOptionsMonitor`. No proxy logic yet — the service starts, logs "ready", and stops cleanly.
**Depends on:** nothing. Must run alone.
**Parallel-safe with:** nothing. Phase 00 owns the initial `.csproj` and solution; subsequent phases append.
## Goal
Produce a minimal but production-shaped host that all subsequent phases plug into. The host must:
- Target `.NET 10` (`net10.0`), be registered as a Windows Service via `Microsoft.Extensions.Hosting.WindowsServices`, and also run as a console under `dotnet run` for local dev.
- Load `appsettings.json` with `reloadOnChange: true`, bind the `"Mbproxy"` section to typed POCOs, and expose them via `IOptionsMonitor<MbproxyOptions>`.
- Use Serilog with console + rolling-file sinks under `%ProgramData%\mbproxy\logs\` (configurable, but default that location).
- Set `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` and `<Nullable>enable</Nullable>` in the csproj. These stay set forever.
## Outputs (files created in this phase)
```
Mbproxy.slnx
src/Mbproxy/Mbproxy.csproj
src/Mbproxy/Program.cs
src/Mbproxy/HostingExtensions.cs # AddMbproxyOptions, AddMbproxySerilog
src/Mbproxy/Options/MbproxyOptions.cs
src/Mbproxy/Options/BcdTagOptions.cs
src/Mbproxy/Options/PlcOptions.cs
src/Mbproxy/Options/ConnectionOptions.cs
src/Mbproxy/Options/ResilienceOptions.cs
src/Mbproxy/Options/BcdTagListOptions.cs # the Global + per-PLC Add/Remove DTOs
src/Mbproxy/Workers/HeartbeatWorker.cs # one-line "service alive" worker; deleted by phase 03
src/Mbproxy/appsettings.json # minimal default with empty Plcs array
tests/Mbproxy.Tests/Mbproxy.Tests.csproj
tests/Mbproxy.Tests/HostSmokeTests.cs
tests/Mbproxy.Tests/Options/MbproxyOptionsBindingTests.cs
.gitignore # add bin/, obj/, .vs/, *.user, tests/sim/.venv/, %ProgramData%\mbproxy\
```
No other files. Phase 00 does NOT create:
- BCD codec types (phase 02)
- Proxy types (phase 03)
- Listener supervisor (phase 05)
- Status page (phase 07)
## Tasks
1. **Create `Mbproxy.slnx`** referencing the two csprojs.
2. **`src/Mbproxy/Mbproxy.csproj`** — `<Project Sdk="Microsoft.NET.Sdk.Worker">`, `TargetFramework=net10.0`, `OutputType=Exe`, `Nullable=enable`, `TreatWarningsAsErrors=true`, `ImplicitUsings=enable`. PackageReferences:
- `Microsoft.Extensions.Hosting` (latest stable for .NET 10)
- `Microsoft.Extensions.Hosting.WindowsServices`
- `Serilog.Extensions.Hosting`
- `Serilog.Settings.Configuration`
- `Serilog.Sinks.Console`
- `Serilog.Sinks.File`
- `Polly` (referenced now so phase 04/05 don't have to touch this csproj for the package; usage is deferred)
3. **`Options/MbproxyOptions.cs`** and siblings — typed POCOs that mirror the appsettings schema in [`../design.md`](../design.md) → Configuration. Keep them plain DTOs (`public sealed class` with init-only properties). Use `IValidateOptions<MbproxyOptions>` for cross-field checks at the **schema** level only (no business rules like "duplicate addresses" — those move to phase 06 along with hot-reload).
4. **`HostingExtensions.cs`** — extension methods on `IHostApplicationBuilder` named `AddMbproxyOptions()` and `AddMbproxySerilog()`; the builder already exposes `Configuration`, so no separate `IConfiguration` parameter is needed. Keep `Program.cs` thin: read config, call the two extensions, register `HeartbeatWorker`, run.
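5. **`Program.cs`** — Generic Host registered as a Windows Service. With `Host.CreateApplicationBuilder(args)` the registration call is `builder.Services.AddWindowsService()` (`.UseWindowsService()` is the equivalent on the older `IHostBuilder` API), followed by `await builder.Build().RunAsync()`. Honour `--console` as a no-op flag for documentation symmetry with the design (the worker SDK plus the Windows Service registration already runs in console mode under `dotnet run`).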
6. **`Workers/HeartbeatWorker.cs`** — `BackgroundService` that logs `mbproxy.startup.ready` once after `Task.Delay(100)` (so Serilog has flushed) and then idles. This worker is deleted in phase 03 when the real listener supervisor takes over; it exists so phase 00's smoke test has something to assert.
7. **`appsettings.json`** — minimal, valid against the POCOs, with `Plcs: []`. Include the full key shape (`BcdTags.Global`, `AdminPort`, `Connection`, `Resilience`) so future phases just fill in values.
8. **`tests/Mbproxy.Tests/Mbproxy.Tests.csproj`** — Microsoft.NET.Sdk, `TargetFramework=net10.0`, same `Nullable`/`TreatWarningsAsErrors`. ProjectReference to `src/Mbproxy/Mbproxy.csproj`. PackageReferences:
- `Microsoft.NET.Test.Sdk`
- `xunit` (v3 if a stable release exists; v2 otherwise — record the decision in the csproj comment)
- `xunit.runner.visualstudio`
- `Shouldly`
9. **`HostSmokeTests.cs`** — build the host with `Host.CreateApplicationBuilder` against a synthetic config, start it on a `CancellationTokenSource` with a short deadline, assert it logged `mbproxy.startup.ready` and shut down without unhandled exceptions.
10. **`MbproxyOptionsBindingTests.cs`** — bind a hand-written `Dictionary<string,string>` config source into `MbproxyOptions`, assert all fields populate correctly (including a `Plcs` entry with `BcdTags.Add` and `BcdTags.Remove`).
## Public surface declared in this phase
```csharp
namespace Mbproxy.Options;
public sealed class MbproxyOptions {
public BcdTagListOptions BcdTags { get; init; } = new();
public IReadOnlyList<PlcOptions> Plcs { get; init; } = [];
public int AdminPort { get; init; } = 8080;
public ConnectionOptions Connection { get; init; } = new();
public ResilienceOptions Resilience { get; init; } = new();
}
public sealed class BcdTagListOptions {
public IReadOnlyList<BcdTagOptions> Global { get; init; } = [];
}
public sealed class BcdTagOptions {
public ushort Address { get; init; }
public byte Width { get; init; } // 16 or 32
}
public sealed class PlcOptions {
public string Name { get; init; } = "";
public int ListenPort { get; init; }
public string Host { get; init; } = "";
public PlcBcdOverrides? BcdTags { get; init; }
}
public sealed class PlcBcdOverrides {
public IReadOnlyList<BcdTagOptions> Add { get; init; } = [];
public IReadOnlyList<ushort> Remove { get; init; } = [];
}
public sealed class ConnectionOptions {
public int BackendConnectTimeoutMs { get; init; } = 3000;
public int BackendRequestTimeoutMs { get; init; } = 3000;
}
public sealed class ResilienceOptions {
public RetryProfile BackendConnect { get; init; } = new() { MaxAttempts = 3, BackoffMs = [100, 500, 2000] };
public RecoveryProfile ListenerRecovery { get; init; } = new() {
InitialBackoffMs = [1000, 2000, 5000, 15000, 30000],
SteadyStateMs = 30000,
};
}
public sealed class RetryProfile {
public int MaxAttempts { get; init; }
public IReadOnlyList<int> BackoffMs { get; init; } = [];
}
public sealed class RecoveryProfile {
public IReadOnlyList<int> InitialBackoffMs { get; init; } = [];
public int SteadyStateMs { get; init; }
}
```
```csharp
namespace Mbproxy;
internal static class HostingExtensions {
public static IHostApplicationBuilder AddMbproxyOptions(this IHostApplicationBuilder b);
public static IHostApplicationBuilder AddMbproxySerilog(this IHostApplicationBuilder b);
}
```
```csharp
namespace Mbproxy.Workers;
internal sealed class HeartbeatWorker : BackgroundService { /* logs mbproxy.startup.ready */ }
```
No other public types in this phase.
## Tests required
### Unit (`Category = Unit`, default)
1. `MbproxyOptionsBinding_BindsGlobalBcdTags_From_appsettings`
2. `MbproxyOptionsBinding_BindsPerPlcAddAndRemove`
3. `MbproxyOptionsBinding_DefaultsAreApplied_WhenSectionMissing` (AdminPort=8080, Resilience defaults)
4. `MbproxyOptionsBinding_RejectsInvalidWidth` — IValidateOptions returns Fail for `Width != 16 && Width != 32`. Schema-level only; address-overlap validation is phase 06.
5. `HostSmoke_StartsAndStops_Cleanly_AndLogs_StartupReady` — uses a Serilog sink that captures events to memory; asserts the `mbproxy.startup.ready` event fired at Information.
6. `HostSmoke_ShutdownIsOrdered` — host responds to `StopAsync` within 2 s.
### E2E (`Category = E2E`)
None in this phase. The simulator harness is phase 01.
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings.
- [ ] `dotnet test --filter Category!=E2E` — all green, ≥6 tests.
- [ ] `dotnet run --project src/Mbproxy` — service starts, logs `mbproxy.startup.ready` to console within 5 s, exits cleanly on Ctrl-C.
- [ ] `appsettings.json` is a valid JSON document and parses into a populated `MbproxyOptions` instance via the test harness.
- [ ] [`../design.md`](../design.md) is unchanged (this phase introduces no new design decisions).
- [ ] Resource index entry for `docs/plan/00-bootstrap.md` is not needed (the plan README routes there).
## Out of scope
- BCD encode/decode logic (phase 02).
- TcpListener / Modbus framing / byte forwarding (phase 03).
- Polly retry pipelines (referenced as a NuGet, used starting in phase 04/05).
- Address-overlap / duplicate-port validation (phase 06).
- AdminPort HTTP endpoint (phase 07).
- Service install / uninstall scripts (phase 08).
## Notes for the subagent
- Do not create `README.md` for the tool root yet — that's a phase 08 deliverable when there's something installable to document.
- If the `xunit` v3 vs v2 question is unclear at implementation time, prefer v3 if available on NuGet — record the choice in a single-line comment at the top of the test csproj. Future phases must not silently switch.
- Use `LoggerMessage`-source-generated logging (`[LoggerMessage]`) for the heartbeat event so phases that add more log events can follow the same pattern. Set `EventName = "mbproxy.startup.ready"` on the attribute; it surfaces as `EventId.Name` on the emitted event.
@@ -1,108 +0,0 @@
# Phase 01 — Simulator harness
Wrap the existing pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) as a managed lifecycle for xUnit tests. After this phase, any test class that declares `[Collection(nameof(DL205SimulatorCollection))]` gets a running pymodbus server on a known port, with skip-safe behaviour when Python is unavailable.
**Depends on:** Phase 00 (test project exists).
**Parallel-safe with:** Phase 02, Phase 03. (Touches only `tests/sim/` and `tests/Mbproxy.Tests/Sim/`. Disjoint from codec and proxy work.)
## Goal
Eliminate "did the simulator start?" as a source of flaky tests. Encode the launch / readiness-probe / shutdown / cleanup contract once, in a fixture, so phases 03 / 04 / 05 / 06 / 07 don't each reinvent it. Tests must be able to declare a dependency on the simulator and get a hot port back, OR get a clean skip if the environment can't provide one.
## Outputs
```
tests/sim/run-dl205-sim.ps1 # idempotent launcher; venv-provisioning
tests/sim/README.md # how to run the simulator standalone
tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs
tests/Mbproxy.Tests/Sim/DL205SimulatorCollection.cs
tests/Mbproxy.Tests/Sim/SimulatorSmokeTests.cs # connects, sends FC03, verifies a seeded BCD register
```
Modifications:
- `.gitignore` already has `tests/sim/.venv/` from phase 00 — verify it's present.
- `tests/Mbproxy.Tests/Mbproxy.Tests.csproj` — add `NModbus` PackageReference (chosen for its small footprint and net10.0 compatibility; record the choice as a top-of-csproj comment). This is the Modbus TCP client used by tests against the simulator from this phase forward.
No other files.
## Tasks
1. **`tests/sim/run-dl205-sim.ps1`** — pure PowerShell. Parameters: `-Profile <path>` (default `../DL260/dl205.json` relative to script), `-Port <int>` (default 5020). Behaviour:
- If `tests/sim/.venv` doesn't exist: `python -m venv tests/sim/.venv`, then `tests/sim/.venv/Scripts/pip.exe install "pymodbus[server]"` pinned to a known version (record version in the script + README).
- Activate the venv (`& tests/sim/.venv/Scripts/activate.ps1`).
- Exec `pymodbus.server run --modbus-config-path <Profile> --modbus-server tcp --port <Port>`. Output streams to stdout/stderr; on script termination, the child server dies with it.
- Exit codes: 0 on clean exit, 1 on venv provisioning failure, 2 on pymodbus launch failure, 3 if the profile file is missing.
2. **`DL205SimulatorFixture : IAsyncLifetime`** —
- `InitializeAsync`: pick a free local port (bind a `TcpListener` to `IPAddress.Loopback` on port 0, capture the assigned port, then stop the listener). Spawn `pwsh -NoProfile -File <run-dl205-sim.ps1> -Port <picked>` via `System.Diagnostics.Process` with `RedirectStandardOutput/Error`. Poll `new TcpClient().ConnectAsync("127.0.0.1", port)` at 100 ms intervals for up to 10 s. If the simulator never accepts a connection, capture stderr tail, set `SkipReason`, and dispose the process.
- `DisposeAsync`: send Ctrl-C to the process group (`Process.Kill(entireProcessTree: true)` on Windows is the pragmatic choice — pymodbus handles SIGTERM gracefully but Windows lacks proper signals; document the tradeoff in a comment). Wait up to 5 s for exit.
- Public surface: `string Host { get; }` (always `127.0.0.1`), `int Port { get; }`, `string? SkipReason { get; }`, `string LogTail { get; }` (last ~50 lines of stderr, for diagnosis).
3. **`DL205SimulatorCollection`** —
```csharp
[CollectionDefinition(nameof(DL205SimulatorCollection))]
public sealed class DL205SimulatorCollection : ICollectionFixture<DL205SimulatorFixture> { }
```
Tests that need the fixture declare `[Collection(nameof(DL205SimulatorCollection))]`.
4. **`SimulatorSmokeTests`** — `[Collection(nameof(DL205SimulatorCollection))] [Trait("Category", "E2E")]`. Three tests:
- `Simulator_AcceptsTcpConnection`
- `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — reads register 0, expects `0xCAFE` (the seeded marker from `dl205.json`). Uses NModbus directly. This proves the dl205.json profile is in fact loaded.
- `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — reads register 1072, expects raw `0x1234` (= 4660). This is the BCD register the proxy will rewrite later; phase 04's e2e test will read the SAME register through the proxy and assert 1234 instead.
5. **`tests/sim/README.md`** — a few lines: "Run `pwsh ./run-dl205-sim.ps1 -Port 5020` to launch the simulator standalone. Used by xUnit tests via `DL205SimulatorFixture`. Requires Python 3.10+; the script provisions a venv on first run."
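The port-picking and readiness-polling steps in Task 2 can be sketched as follows. This is a minimal illustration, not the fixture itself: `SimulatorProbe`, `PickFreePort`, and `WaitForListenerAsync` are hypothetical names, and the sketch binds to loopback as an assumption because the fixture only ever connects to `127.0.0.1`.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading.Tasks;

// Hypothetical helper; only the BCL calls are real APIs.
static class SimulatorProbe
{
    // Bind to port 0 so the OS assigns a free port, capture it, then release it.
    // TOCTOU: another process may claim the port before the child process binds it.
    public static int PickFreePort()
    {
        var listener = new TcpListener(IPAddress.Loopback, 0);
        listener.Start();
        try
        {
            return ((IPEndPoint)listener.LocalEndpoint).Port;
        }
        finally
        {
            listener.Stop();
        }
    }

    // Poll until something accepts a TCP connection or the timeout elapses.
    public static async Task<bool> WaitForListenerAsync(
        string host, int port, TimeSpan timeout, TimeSpan interval)
    {
        var deadline = DateTime.UtcNow + timeout;
        while (DateTime.UtcNow < deadline)
        {
            try
            {
                using var client = new TcpClient();
                await client.ConnectAsync(host, port);
                return true;
            }
            catch (SocketException)
            {
                // Not listening yet; wait and retry.
                await Task.Delay(interval);
            }
        }
        return false;
    }
}
```

With the values from Task 2, the fixture would call `WaitForListenerAsync("127.0.0.1", port, TimeSpan.FromSeconds(10), TimeSpan.FromMilliseconds(100))` and set `SkipReason` when it returns `false`.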
## Public surface declared in this phase
```csharp
namespace Mbproxy.Tests.Sim;
public sealed class DL205SimulatorFixture : IAsyncLifetime {
public string Host { get; }
public int Port { get; }
public string? SkipReason { get; }
public string LogTail { get; }
public Task InitializeAsync();
public Task DisposeAsync();
}
[CollectionDefinition(nameof(DL205SimulatorCollection))]
public sealed class DL205SimulatorCollection : ICollectionFixture<DL205SimulatorFixture> { }
```
No production code is added in this phase.
## Tests required
### Unit (Category = Unit)
None in this phase. The fixture itself is a test-infrastructure component; its correctness is verified by the e2e smoke tests below.
### E2E (Category = E2E)
1. `Simulator_AcceptsTcpConnection` — open a TCP socket to `fixture.Host:fixture.Port` within the fixture lifetime.
2. `Simulator_FC03_ReturnsSeededValue_AtHR0_0xCAFE` — NModbus FC03, asserts `0xCAFE`.
3. `Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234` — NModbus FC03, asserts raw `0x1234` (4660).
When `SkipReason` is set, all three skip with `Assert.Skip(fixture.SkipReason)`. The phase gate explicitly verifies that on a machine WITH Python+pymodbus, none of them skip — skips are an environment failure, not a test pass.
## Phase gate
- [ ] `pwsh tests/sim/run-dl205-sim.ps1 -Port 5020` standalone — script provisions a venv on first run, server logs "Modbus TCP server listening" within 10 s, Ctrl-C exits cleanly.
- [ ] On second run: venv exists, script skips provisioning, server starts in < 2 s.
- [ ] On a machine WITHOUT Python: `SkipReason` is non-null and tests skip rather than fail.
- [ ] On a machine WITH Python: `SkipReason` is null, all three e2e smoke tests pass.
- [ ] `dotnet test --filter Category=E2E` is green on the dev machine.
- [ ] `dotnet test --filter Category!=E2E` still green (no regression to phase 00's tests).
- [ ] Build zero-warnings.
- [ ] `tests/sim/README.md` documents the manual launch path.
## Out of scope
- Multiple simultaneous simulators (one fixture instance is enough for all e2e tests via `ICollectionFixture`).
- Alternate profiles selected via `MODBUS_SIM_PROFILE` env var — defer until phase 04 actually needs a partial-overlap scenario; add the env-var support then.
- A C# pymodbus replacement / in-process Modbus mock. The pymodbus profile is the source of truth for DL-series quirks and we're not duplicating it.
- pip-mirror or offline-install support. CI is expected to have network or a pre-warmed venv; if a customer site needs offline install, that's a deployment concern (phase 08).
## Notes for the subagent
- Capture the chosen `pymodbus` version pin in both `run-dl205-sim.ps1` and `tests/sim/README.md` so the version isn't lost across re-provisioning.
- The free-port-picker pattern (bind on `:0`, capture port, dispose, then hand the port to the child process) has an inherent TOCTOU race — another process could grab the port between dispose and pymodbus binding. In practice this is rare; acceptable for tests. Note the trade-off in a comment.
- Pymodbus log output is verbose. Pipe it through a line buffer; only the last ~50 lines need to be available via `LogTail` for diagnosis.
- Do not commit the `.venv/` directory.
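One way to keep only the last ~50 stderr lines for `LogTail` is a small bounded queue. The sketch below assumes a hypothetical `LogTailBuffer` type; the capacity default and the locking strategy are illustrative choices, not requirements from the plan.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical type: a bounded, thread-safe buffer of the most recent lines.
sealed class LogTailBuffer
{
    private readonly int _capacity;
    private readonly Queue<string> _lines = new();
    private readonly object _gate = new();

    public LogTailBuffer(int capacity = 50) => _capacity = capacity;

    // Process output events deliver null at end-of-stream; ignore it.
    public void Append(string? line)
    {
        if (line is null) return;
        lock (_gate)
        {
            if (_lines.Count == _capacity) _lines.Dequeue(); // drop the oldest line
            _lines.Enqueue(line);
        }
    }

    // Returns the retained lines, oldest first, joined with newlines.
    public string Snapshot()
    {
        lock (_gate) return string.Join(Environment.NewLine, _lines);
    }
}
```

The fixture could feed it with `process.ErrorDataReceived += (_, e) => tail.Append(e.Data);` followed by `process.BeginErrorReadLine();`, and expose `tail.Snapshot()` as `LogTail`.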
@@ -1,157 +0,0 @@
# Phase 02 — BCD codec
Pure logic for encoding integers as DirectLOGIC BCD nibbles and decoding nibbles back. No I/O, no network, no Modbus framing. The codec exposed by this phase is what phase 04 plugs into the proxy.
**Depends on:** Phase 00 (csproj + options POCOs).
**Parallel-safe with:** Phase 01, Phase 03. (All work lives under `src/Mbproxy/Bcd/` and `tests/Mbproxy.Tests/Bcd/` — disjoint from sim harness and proxy plumbing.)
## Goal
A tiny, allocation-free codec library that:
- Encodes a non-negative `int` (capped at the width's range) to either one 16-bit raw register value or a `(low, high)` register pair for 32-bit BCD per the design's CDAB digit-layout rule.
- Decodes one or two raw register values back to an `int`.
- Resolves `Global + per-PLC Add - per-PLC Remove` into an **immutable per-PLC `BcdTagMap`** that the rewriter looks up by Modbus address in O(1).
The codec is the single source of BCD-encoding correctness in the system. Phase 04 must not reimplement any nibble math.
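The `Global + per-PLC Add - per-PLC Remove` resolution can be sketched with plain tuples. Everything here is hypothetical: the real inputs are `BcdTagListOptions` and `PlcBcdOverrides`, whose shapes are not shown in this phase doc, and the evaluation order (Remove applied to Global first, then Add overriding) is an assumption.

```csharp
using System.Collections.Generic;

// Hypothetical sketch; tuple inputs stand in for the real options types.
static class TagResolutionSketch
{
    // Returns address -> width.
    // Assumed order: start from Global, drop Remove entries, then apply Add (Add wins on collision).
    public static Dictionary<ushort, byte> Resolve(
        IEnumerable<(ushort Address, byte Width)> global,
        IEnumerable<(ushort Address, byte Width)> add,
        IEnumerable<ushort> remove,
        List<string> warnings)
    {
        var resolved = new Dictionary<ushort, byte>();

        // Note: this silently overwrites duplicate Global addresses;
        // a full builder would report them as errors instead.
        foreach (var (address, width) in global) resolved[address] = width;

        foreach (var address in remove)
        {
            // A Remove that matches nothing is a warning, not a failure.
            if (!resolved.Remove(address))
                warnings.Add($"Remove {address}: no matching Global entry");
        }

        foreach (var (address, width) in add) resolved[address] = width;
        return resolved;
    }
}
```

The sketch deliberately omits the high-register overlap check and width validation so that the resolution order stays visible.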
## Outputs
```
src/Mbproxy/Bcd/BcdCodec.cs # static class: Encode16, Decode16, Encode32, Decode32
src/Mbproxy/Bcd/BcdTag.cs # the public record (mirrors design.md exactly)
src/Mbproxy/Bcd/BcdTagMap.cs # immutable, address-keyed lookup; describes per-PLC resolved tags
src/Mbproxy/Bcd/BcdTagMapBuilder.cs # resolves global + Add - Remove into a map; runs validation
src/Mbproxy/Bcd/BcdValidationError.cs # enum + ValidationResult record
tests/Mbproxy.Tests/Bcd/BcdCodecTests.cs
tests/Mbproxy.Tests/Bcd/BcdTagMapBuilderTests.cs
```
No other files. The proxy plumbing layer doesn't exist yet and isn't touched.
## Tasks
1. **`BcdTag.cs`** — `public sealed record BcdTag(ushort Address, byte Width)` with a static factory `Create(ushort, byte)` that throws on `Width != 16 && Width != 32`. This record is the type phases 04 / 06 / 07 will use.
2. **`BcdCodec.cs`** — `internal static class` with four pure methods. Internal because the proxy is the only consumer; nothing else in the assembly should call these.
- `static ushort Encode16(int value)` — value in `[0, 9999]`; produces the 16-bit BCD register, e.g. `1234 → 0x1234`. Throws `ArgumentOutOfRangeException` if value is out of range.
- `static int Decode16(ushort raw)` — inverse. If any nibble is `>= 0xA`, throw `FormatException` with the raw value in the message rather than returning a sentinel such as `int.MinValue`. The rewriter catches this and surfaces a `mbproxy.rewrite.invalid_bcd` event (event name added in phase 04).
- `static (ushort low, ushort high) Encode32(int value)` — value in `[0, 99_999_999]`; produces the CDAB pair, where `low` = low 4 BCD digits (least-significant) and `high` = high 4 BCD digits (most-significant). Decoded decimal = `Decode16(high) * 10000 + Decode16(low)`. Throws `ArgumentOutOfRangeException` if out of range.
- `static int Decode32(ushort low, ushort high)` — inverse. Throws `FormatException` if either word has a bad nibble.
3. **`BcdTagMap.cs`** — `public sealed class BcdTagMap` wrapping a frozen address-keyed dictionary. Methods:
- `static BcdTagMap Empty { get; }`
- `bool TryGet(ushort address, out BcdTag tag)` — O(1) lookup.
- `bool TryGetForRange(ushort startAddress, ushort qty, out IReadOnlyList<RangeHit> hits)` — returns every BCD tag whose register footprint intersects `[startAddress, startAddress+qty)`. Each `RangeHit.OffsetWords` is relative to `startAddress`. Used by the rewriter to know which slots in a multi-register PDU to touch.
- `int Count { get; }`, `IEnumerable<BcdTag> All { get; }` — for telemetry / status page.
4. **`BcdTagMapBuilder.cs`** — given `BcdTagListOptions Global` and `PlcBcdOverrides? perPlc`, produce a `ValidationResult` that carries the resolved `BcdTagMap` alongside any errors and warnings. Validation rules from design.md:
- Reject duplicate addresses within the resolved list (Add+Global after Remove).
- Reject 32-bit entries whose high register (`Address+1`) collides with any other entry's address (16-bit or 32-bit).
- Warn on `Remove` entries that don't match any address in Global (this is not a failure; the warning rides on `ValidationResult.Warnings`).
- Reject `Width` values other than 16/32 (defensive; phase 00's `IValidateOptions` should already have caught this, but the builder is the last line of defence).
5. **`BcdValidationError.cs`** — `public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth }`. `public sealed record ValidationResult(BcdTagMap Map, IReadOnlyList<BcdError> Errors, IReadOnlyList<BcdWarning> Warnings)`. Errors fail the build; warnings ride along.
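The four codec methods from Task 2 could look roughly like the sketch below. It is named `BcdCodecSketch` to make clear that it is an illustration under the rules stated above (range checks, `FormatException` on a bad nibble, low-word-first pair), not the shipped `BcdCodec`.

```csharp
using System;

// Illustrative sketch of the codec contract described in Task 2.
static class BcdCodecSketch
{
    // 1234 -> 0x1234: one decimal digit per nibble, least-significant digit first.
    public static ushort Encode16(int value)
    {
        if (value < 0 || value > 9999)
            throw new ArgumentOutOfRangeException(nameof(value), value, "Expected 0..9999.");

        int raw = 0;
        for (int shift = 0; shift < 16; shift += 4)
        {
            raw |= (value % 10) << shift;
            value /= 10;
        }
        return (ushort)raw;
    }

    // 0x1234 -> 1234; any nibble above 9 is not valid BCD.
    public static int Decode16(ushort raw)
    {
        int value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9)
                throw new FormatException($"Invalid BCD nibble in 0x{raw:X4}.");
            value = value * 10 + nibble;
        }
        return value;
    }

    // Low word carries the four least-significant digits, high word the rest.
    public static (ushort low, ushort high) Encode32(int value)
    {
        if (value < 0 || value > 99_999_999)
            throw new ArgumentOutOfRangeException(nameof(value), value, "Expected 0..99999999.");
        return (Encode16(value % 10000), Encode16(value / 10000));
    }

    public static int Decode32(ushort low, ushort high)
        => Decode16(high) * 10000 + Decode16(low);
}
```

For example, `Encode32(12345678)` yields `low = 0x5678` and `high = 0x1234`, which is the pair that `Encode32_12345678_Returns_LowHigh_5678_1234` asserts.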
## Public surface declared in this phase
```csharp
namespace Mbproxy.Bcd;
public sealed record BcdTag(ushort Address, byte Width) {
public static BcdTag Create(ushort address, byte width);
public bool IsThirtyTwoBit => Width == 32;
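public ushort HighRegister => IsThirtyTwoBit
    ? (ushort)(Address + 1)
    : throw new InvalidOperationException("HighRegister is only defined for 32-bit tags."); // throws if Width != 32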
}
public sealed class BcdTagMap {
public static BcdTagMap Empty { get; }
public int Count { get; }
public IEnumerable<BcdTag> All { get; }
public bool TryGet(ushort address, out BcdTag tag);
public bool TryGetForRange(ushort startAddress, ushort qty, out IReadOnlyList<RangeHit> hits);
}
public readonly record struct RangeHit(int OffsetWords, BcdTag Tag);
public static class BcdTagMapBuilder {
public static ValidationResult Build(BcdTagListOptions global, PlcBcdOverrides? perPlc);
}
public sealed record ValidationResult(
BcdTagMap Map,
IReadOnlyList<BcdError> Errors,
IReadOnlyList<BcdWarning> Warnings);
public sealed record BcdError(BcdValidationError Kind, string Message, ushort? Address);
public sealed record BcdWarning(string Message, ushort? Address);
public enum BcdValidationError { DuplicateAddress, OverlappingHighRegister, InvalidWidth }
```
```csharp
namespace Mbproxy.Bcd;
internal static class BcdCodec {
public static ushort Encode16(int value);
public static int Decode16(ushort raw);
public static (ushort low, ushort high) Encode32(int value);
public static int Decode32(ushort low, ushort high);
}
```
## Tests required
### Unit (`Category = Unit`)
`BcdCodecTests` (≥ 16 tests):
1. `Encode16_1234_Returns_0x1234`
2. `Encode16_0_Returns_0x0000`
3. `Encode16_9999_Returns_0x9999`
4. `Encode16_10000_Throws_OutOfRange`
5. `Encode16_Negative_Throws_OutOfRange`
6. `Decode16_0x1234_Returns_1234`
7. `Decode16_0x0000_Returns_0`
8. `Decode16_0x9999_Returns_9999`
9. `Decode16_0x123A_Throws_Format` — bad nibble `A`.
10. `Encode32_12345678_Returns_LowHigh_5678_1234` — verify `low = 0x5678`, `high = 0x1234`.
11. `Encode32_0_Returns_LowHigh_0_0`
12. `Encode32_99999999_Returns_LowHigh_9999_9999`
13. `Encode32_100000000_Throws_OutOfRange`
14. `Decode32_LowHigh_5678_1234_Returns_12345678`
15. `Decode32_BadNibble_InLow_Throws`
16. `Decode32_BadNibble_InHigh_Throws`
17. `RoundTrip16_AllValuesUnder10000` — `[Theory]` with `[InlineData]` for boundary values; for the dense check use `[Theory] [MemberData]` enumerating every 100th value. The codec must satisfy `Decode16(Encode16(v)) == v`.
`BcdTagMapBuilderTests` (≥ 10 tests):
1. `Build_EmptyGlobal_EmptyOverride_ReturnsEmptyMap`
2. `Build_GlobalOnly_PopulatesMap`
3. `Build_PerPlcAdd_AppendsToGlobal`
4. `Build_PerPlcRemove_DropsFromGlobal`
5. `Build_AddOverrideSameAddressAsGlobal_AddWidthWins`
6. `Build_DuplicateAddressInGlobal_ReturnsDuplicateAddressError`
7. `Build_32BitHighRegOverlaps16BitGlobal_ReturnsOverlappingHighRegisterError`
8. `Build_Remove_OfNonExistentAddress_ReturnsWarning_NotError`
9. `Build_InvalidWidth_ReturnsInvalidWidthError`
10. `Map_TryGetForRange_ReturnsAllHits_InOrder` — covers full overlap, partial overlap (low only, high only), and no overlap.
### E2E (Category = E2E)
None. The codec is pure logic.
## Phase gate
- [ ] Zero-warnings build.
- [ ] `dotnet test --filter Category=Unit` — all green, ≥ 26 new tests.
- [ ] `BcdCodec` is `internal`; nothing outside `Mbproxy.Bcd` calls it directly.
- [ ] `BcdTagMap` has zero allocations on `TryGet` and on the hot `TryGetForRange` path (verify via a microbench note in the test file's docstring; no benchmark project added).
- [ ] [`../design.md`](../design.md) → "BCD tag shape" matches the public record exactly; if the spec drifted during implementation, update design.md in this PR.
## Out of scope
- Signed BCD. Design explicitly excludes it.
- Half-byte / "BCD with sign nibble" variants used by some DL-family math instructions. Not in the design's tag shape.
- The actual PDU-byte-level rewriting (FC parsing, MBAP framing). That's phase 04.
- Telemetry counters. The codec exposes nothing to counters; phase 04 instruments the rewrite pipeline that USES the codec.
## Notes for the subagent
- The DirectLOGIC CDAB digit layout is the most-likely-to-confuse part of this phase. Re-read [`../design.md`](../design.md) → "BCD tag shape" and [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Word Order" before implementing `Encode32`/`Decode32`. The seeded marker in `dl205.json` for the float32 case (`HR[1056]=0x0000, HR[1057]=0x3FC0` for IEEE 1.5) confirms low-word-first; the BCD-32 case is the same word order with BCD nibble semantics inside each word.
- `BcdTagMapBuilder` is single-shot — given inputs, produce a map. There is NO `IObservable<BcdTagMap>` here. Phase 06 owns reload-driven rebuilds and just calls `Build` again.
- `TryGetForRange` is on the hot path for FC03/04 responses. Implementation should pre-bucket BCD tags by 256-register window if it makes the lookup faster, but only if a microbench shows a real win. Don't preoptimise.
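The footprint-intersection rule behind `TryGetForRange` can be sketched as below. `TagSketch`, `HitSketch`, and `RangeLookupSketch` are hypothetical stand-ins for `BcdTag`, `RangeHit`, and the map. This version allocates a list for readability, so it is not the zero-allocation variant the phase gate asks for.

```csharp
using System.Collections.Generic;

// Hypothetical stand-ins for BcdTag and RangeHit.
readonly record struct TagSketch(ushort Address, byte Width);
readonly record struct HitSketch(int OffsetWords, TagSketch Tag);

static class RangeLookupSketch
{
    // Returns every tag whose footprint intersects [start, start + qty), in address order.
    public static List<HitSketch> Intersect(
        IReadOnlyDictionary<ushort, TagSketch> byAddress, ushort start, ushort qty)
    {
        var hits = new List<HitSketch>();
        int end = start + qty; // exclusive; int arithmetic avoids ushort wrap-around

        // A 32-bit tag at start - 1 still touches the range through its high register.
        int first = start == 0 ? 0 : start - 1;
        for (int address = first; address < end && address <= ushort.MaxValue; address++)
        {
            if (!byAddress.TryGetValue((ushort)address, out var tag)) continue;

            int footprint = tag.Width == 32 ? 2 : 1;
            if (address + footprint > start)
                hits.Add(new HitSketch(address - start, tag)); // offset is -1 for a high-only overlap
        }
        return hits;
    }
}
```

A negative `OffsetWords` signals that only the high register of a 32-bit pair falls inside the requested range, which is one of the partial-overlap cases test 10 is meant to cover.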
@@ -1,129 +0,0 @@
# Phase 03 — Proxy plumbing
The minimum-viable proxy: one `TcpListener` per configured PLC, 1:1 upstream-client ↔ backend-socket, byte-for-byte forwarding both directions, transparent MBAP TxId / unit ID. No BCD rewriting yet — that's phase 04. No supervisor / auto-recovery — that's phase 05.
**Depends on:** Phase 00 (host, options).
**Parallel-safe with:** Phase 02 (BCD codec lives under `src/Mbproxy/Bcd/`; this phase lives under `src/Mbproxy/Proxy/`).
## Goal
Stand up the listener-and-forwarder pair so an e2e test can:
1. Configure the proxy with `Plcs: [{ Host: "127.0.0.1", Port: <simPort>, ListenPort: <proxyPort> }]`.
2. Start the host.
3. Drive NModbus against `127.0.0.1:<proxyPort>` and see the SAME bytes the simulator would return on a direct connection.
The proxy is transparent in this phase. The BCD rewrite hook point is reserved but not wired.
## Outputs
```
src/Mbproxy/Proxy/PlcListener.cs # owns one TcpListener; accepts loop
src/Mbproxy/Proxy/PlcConnectionPair.cs # one upstream socket + one backend socket; forwarder
src/Mbproxy/Proxy/IPduPipeline.cs # the rewrite hook contract (no-op impl in this phase)
src/Mbproxy/Proxy/NoopPduPipeline.cs # the no-op impl
src/Mbproxy/Proxy/ProxyWorker.cs # BackgroundService that owns all PlcListeners
src/Mbproxy/Proxy/MbapFrame.cs # MBAP header parse helpers (length, txid, unit)
tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs # e2e against the simulator
tests/Mbproxy.Tests/Proxy/MbapFrameTests.cs # unit tests for the MBAP parser
```
Modifications:
- `src/Mbproxy/Program.cs` — register `ProxyWorker` as a hosted service. The `HeartbeatWorker` from phase 00 is DELETED in this phase (its job is replaced by ProxyWorker logging `mbproxy.startup.ready` after all listeners are bound).
- `src/Mbproxy/Workers/HeartbeatWorker.cs` — DELETED.
## Tasks
1. **`MbapFrame.cs`** — pure helpers, no allocations. Static methods:
- `static bool TryParseHeader(ReadOnlySpan<byte> buffer, out ushort txId, out ushort protocolId, out ushort length, out byte unitId)` — returns false if buffer.Length < 7.
- `static int TotalFrameLength(ushort lengthField)` — `lengthField + 6` (7 header bytes minus the 1-byte unit ID which is counted in the length field).
2. **`IPduPipeline.cs`** — the rewrite hook. Single method:
```csharp
void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context);
```
`MbapDirection` is `RequestToBackend` or `ResponseToClient`. `PduContext` carries the per-pair state (counters, PLC name, configured tag map). In phase 03, the only implementation is `NoopPduPipeline` which does nothing.
3. **`NoopPduPipeline.cs`** — empty `Process` method. Registered as the default `IPduPipeline` in DI for this phase. Phase 04 replaces it with the real rewriter.
4. **`PlcConnectionPair.cs`** — owns the upstream `Socket` (or `TcpClient`) handed to it by `PlcListener.Accept`, opens a fresh backend socket to the configured PLC, and runs two `Task`s:
- **Upstream → backend**: read one full MBAP frame at a time (header → length → rest), call `pipeline.Process(RequestToBackend, header, pdu, ctx)`, write the frame to the backend.
- **Backend → upstream**: same shape, with `ResponseToClient`.
Either task ending (socket closed, exception, cancellation) tears down both sides cleanly. No retry loop; that's phase 05.
Backend connect is wrapped in a `try`/`catch` with the configured `BackendConnectTimeoutMs`. Connect failures close the upstream socket immediately and log `mbproxy.backend.failed`. Polly bounded retries on backend connect are **deferred to phase 05** to keep this phase scope tight — note the deferral in code with `// Phase 05: wrap in Polly pipeline`.
5. **`PlcListener.cs`** — owns one `TcpListener` for one PLC. `StartAsync` binds; on bind failure, throws (caller logs `mbproxy.startup.bind.failed` and decides what to do — phase 05 will introduce the supervisor that turns this into a recoverable state). On each accept, hands the socket to a fresh `PlcConnectionPair` and runs it on the thread-pool.
6. **`ProxyWorker.cs`** — `BackgroundService`. On start: enumerates `MbproxyOptions.Plcs`, instantiates one `PlcListener` per entry, starts them all. Each bind that succeeds logs `mbproxy.startup.bind`; each that fails logs `mbproxy.startup.bind.failed` and continues to the next PLC (matching the design's "eager, continue on per-port failure" posture). After all bind attempts, logs `mbproxy.startup.ready` with `{ ListenersBound, PlcsConfigured }`. On stop: cancels and disposes all listeners and their open pairs.
7. **`Program.cs`** — remove the HeartbeatWorker registration; register `ProxyWorker`. Also register `IPduPipeline` as a singleton `NoopPduPipeline` in DI.
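The MBAP helpers from Task 1 can be sketched as follows, assuming the standard big-endian header layout (transaction ID, protocol ID, length, unit ID). The class is named `MbapFrameSketch` to mark it as an illustration rather than the phase's `MbapFrame`.

```csharp
using System;
using System.Buffers.Binary;

// Illustrative sketch of the header helpers described in Task 1.
static class MbapFrameSketch
{
    public const int HeaderLength = 7;

    public static bool TryParseHeader(
        ReadOnlySpan<byte> buffer,
        out ushort txId, out ushort protocolId, out ushort length, out byte unitId)
    {
        if (buffer.Length < HeaderLength)
        {
            txId = 0; protocolId = 0; length = 0; unitId = 0;
            return false;
        }

        // All multi-byte MBAP fields are big-endian.
        txId = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(0, 2));
        protocolId = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(2, 2));
        length = BinaryPrimitives.ReadUInt16BigEndian(buffer.Slice(4, 2));
        unitId = buffer[6];
        return true;
    }

    // The length field counts the unit ID plus the PDU, so the six bytes before it are added.
    public static int TotalFrameLength(ushort lengthField) => lengthField + 6;
}
```

For an FC03 request such as `00 01 00 00 00 06 01 03 00 00 00 01`, the length field is 6 and `TotalFrameLength` returns 12, matching the 12 bytes on the wire.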
## Public surface declared in this phase
All proxy types are `internal` because they are not consumed outside this assembly. The only public surfaces are the `IPduPipeline` interface, the `MbapDirection` enum, and the `PduContext` placeholder that appears in the interface signature (so phase 04 can implement its own pipeline cleanly).
```csharp
namespace Mbproxy.Proxy;
public interface IPduPipeline {
void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context);
}
public enum MbapDirection { RequestToBackend, ResponseToClient }
public sealed class PduContext {
public string PlcName { get; init; } = "";
// Phase 04 adds: BcdTagMap, counters, logger
}
internal sealed class NoopPduPipeline : IPduPipeline { /* no-op */ }
internal static class MbapFrame { /* static helpers */ }
internal sealed class PlcListener : IAsyncDisposable { /* ... */ }
internal sealed class PlcConnectionPair : IAsyncDisposable { /* ... */ }
internal sealed class ProxyWorker : BackgroundService { /* ... */ }
```
## Tests required
### Unit (`Category = Unit`)
`MbapFrameTests` (≥ 8 tests):
1. `TryParseHeader_TooShort_ReturnsFalse`
2. `TryParseHeader_ValidFrame_ParsesAllFields`
3. `TryParseHeader_ProtocolId_NotZero_StillParses` — we don't reject non-zero protocol IDs; that's the PLC's job.
4. `TotalFrameLength_LengthField7_Returns13`
5. `TotalFrameLength_LengthFieldMax_Returns_LengthFieldPlus6`
6. Round-trip: parse a known good FC03 frame and assert each field.
7. Round-trip: parse a known good FC16 write-multiple frame.
8. Negative: a frame with `length < 2` returns the parsed value but is callers' responsibility to reject. Document in a test.
### E2E (`Category = E2E`)
`ProxyForwardingTests` (≥ 5 tests, `[Collection(nameof(DL205SimulatorCollection))]`):
1. `Forward_FC03_HR0_Returns_SimulatorRawValue_0xCAFE` — proxy is transparent; client sees the raw simulator value.
2. `Forward_FC03_HR1072_Returns_RawBCD_0x1234` — the BCD register is NOT rewritten in phase 03 (NoopPduPipeline). This test will be REPLACED in phase 04 with one that asserts `1234` instead. Document the planned replacement in a comment so phase 04's agent knows what to update.
3. `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips` — proves the write path forwards correctly.
4. `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`.
5. `MbapTxId_IsPreservedEndToEnd` — issue 20 back-to-back FC03 reads with monotonically increasing TxIds; assert every response carries the matching TxId.
6. `BackendConnectFailure_ClosesUpstreamCleanly` — point the proxy at an unreachable backend (`127.0.0.1:1`), assert the client's socket is closed within `BackendConnectTimeoutMs + 200ms`.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00, 02 tests still green.
- [ ] All new unit tests green (≥ 8 in MbapFrameTests).
- [ ] All new e2e tests green when the simulator is available; skip cleanly when it isn't.
- [ ] `dotnet run --project src/Mbproxy` with an appsettings.json pointing at the simulator: NModbus can read/write through the proxy and gets the simulator's raw values.
- [ ] On startup with one bad and one good PLC config, the good one binds and the bad one logs `mbproxy.startup.bind.failed`, and the service does NOT abort. (Hand the supervisor work to phase 05; this phase only proves the "continue on per-port failure" posture.)
- [ ] `mbproxy.startup.ready` is now logged by `ProxyWorker`, not by a heartbeat worker. The heartbeat worker file is deleted.
## Out of scope
- BCD rewriting (phase 04 replaces `NoopPduPipeline`).
- Polly retries on backend connect (phase 05 supervisor wraps this).
- Auto-recovery for failed listener binds (phase 05).
- Counter tracking / per-PLC telemetry (phase 04 starts adding counters via `PduContext`).
- Dedicated half-MBAP-frame handling (split TCP packets). The forwarder relies on the read call (`NetworkStream.ReadAsync` or the `Socket` equivalent) returning short reads: loop to fill the header (7 bytes), then loop to fill the body (`length - 1` more bytes). E2E test 5 above verifies this stays correct over 20 back-to-back requests.
## Notes for the subagent
- `Socket` vs `TcpClient`: prefer `Socket` directly so framing reads can receive into a `Memory<byte>` without `NetworkStream` allocation overhead. The performance difference is small but the byte-precise API matches what the rewriter in phase 04 will need.
- Frame reads use a per-pair pooled buffer of 260 bytes (MBAP header 7 + max PDU 253). Don't allocate per-frame.
- The "Phase 04 will replace test 2" pattern is intentional. Leave breadcrumbs so the next phase's agent knows exactly which test to update; do NOT silently make the test pass against a future rewriter.
- Both forwarder tasks run with the same `CancellationTokenSource`. Cancellation propagates from listener stop → pair stop → both task ends → socket dispose.
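The fill-the-header-then-fill-the-body read loop can be sketched against a `Stream`. This is an assumption made so the sketch runs against a `MemoryStream`; the notes above prefer a raw `Socket`, where the same loop would wrap `Socket.ReceiveAsync`. `FrameReaderSketch` and its method names are hypothetical.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of a short-read-tolerant MBAP frame reader.
static class FrameReaderSketch
{
    // Reads one full frame into buffer. Returns the total frame length,
    // or 0 if the peer closed the connection before a full frame arrived.
    public static async Task<int> ReadFrameAsync(Stream stream, byte[] buffer, CancellationToken ct)
    {
        if (!await FillAsync(stream, buffer, 0, 7, ct)) return 0;

        int length = (buffer[4] << 8) | buffer[5]; // big-endian MBAP length field
        int total = length + 6;
        if (length < 2 || total > buffer.Length)
            throw new InvalidDataException($"Implausible MBAP length field {length}.");

        // The unit ID (already read) is counted in length, so length - 1 bytes remain.
        if (!await FillAsync(stream, buffer, 7, length - 1, ct)) return 0;
        return total;
    }

    // Loops until exactly count bytes have been read; a single read may return fewer.
    private static async Task<bool> FillAsync(
        Stream stream, byte[] buffer, int offset, int count, CancellationToken ct)
    {
        while (count > 0)
        {
            int read = await stream.ReadAsync(buffer.AsMemory(offset, count), ct);
            if (read == 0) return false; // end of stream
            offset += read;
            count -= read;
        }
        return true;
    }
}
```

Passing the per-pair 260-byte buffer keeps the loop free of per-frame allocations, in line with the pooled-buffer note above.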
@@ -1,146 +0,0 @@
# Phase 04 — Rewriter integration
Replace `NoopPduPipeline` with the real BCD rewriter. After this phase, FC03/FC04 responses have their configured BCD slots decoded to binary integers on the way to the client, and FC06/FC16 requests have their configured BCD slots encoded to nibbles on the way to the PLC. Counters and warnings come online here.
**Depends on:** Phase 02 (codec + tag map), Phase 03 (plumbing + `IPduPipeline`).
**Parallel-safe with:** nothing (it integrates two prior phases' outputs).
## Goal
Wire `BcdTagMap` + `BcdCodec` into the proxy at the single hook point `IPduPipeline.Process(...)`. The rewriter is responsible for:
- FC03 / FC04 responses: decode every covered slot from raw BCD nibbles into a binary integer.
- FC06 / FC16 requests: re-encode every covered slot from binary integer into raw BCD nibbles.
- Partial-overlap of 32-bit pairs: pass through raw, emit `mbproxy.rewrite.partial_bcd` warning, increment partial-overlap counter.
- Bad BCD nibbles in a PLC response: pass through raw, emit `mbproxy.rewrite.invalid_bcd` (new event in this phase) at Warning, increment invalid-bcd counter. NEVER throw out of the pipeline.
- Increment per-pair counters for `pdus.forwarded`, `pdus.byFc`, `pdus.rewrittenSlots`, `pdus.partialBcdWarnings`, `pdus.invalidBcdWarnings`.
The transparency contract holds: MBAP header bytes are untouched, length field is unchanged (re-encoded slots are the same byte width), TxId / unit ID flow through.
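A minimal sketch of the in-place response rewrite shows why the length field never changes: each 16-bit slot is overwritten with another 16-bit value. The sketch handles 16-bit tags only and takes the start address as a parameter. `ResponseRewriteSketch` and its signature are hypothetical and do not match `IPduPipeline.Process`.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Hypothetical sketch: 16-bit BCD slots only, no counters or logging.
static class ResponseRewriteSketch
{
    // Rewrites covered slots of an FC03/FC04 response PDU in place.
    // Returns the number of slots rewritten.
    public static int RewriteReadResponse(Span<byte> pdu, ushort startAddress, ISet<ushort> bcd16Addresses)
    {
        // Exception responses carry fc | 0x80 and are skipped by this check.
        if (pdu.Length < 2 || (pdu[0] != 0x03 && pdu[0] != 0x04)) return 0;

        int registers = pdu[1] / 2;
        if (pdu.Length < 2 + registers * 2) return 0;

        int rewritten = 0;
        for (int i = 0; i < registers; i++)
        {
            if (!bcd16Addresses.Contains((ushort)(startAddress + i))) continue;

            Span<byte> slot = pdu.Slice(2 + i * 2, 2);
            ushort raw = BinaryPrimitives.ReadUInt16BigEndian(slot);
            if (!TryDecode16(raw, out int value)) continue; // bad nibble: pass through raw

            BinaryPrimitives.WriteUInt16BigEndian(slot, (ushort)value);
            rewritten++;
        }
        return rewritten;
    }

    private static bool TryDecode16(ushort raw, out int value)
    {
        value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9) return false;
            value = value * 10 + nibble;
        }
        return true;
    }
}
```

With address 1072 configured, the response PDU `03 02 12 34` becomes `03 02 04 D2` (decimal 1234), still four bytes long.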
## Outputs
```
src/Mbproxy/Proxy/BcdPduPipeline.cs # replaces NoopPduPipeline
src/Mbproxy/Proxy/PerPlcContext.cs # the per-PLC context (BcdTagMap + counters + logger)
src/Mbproxy/Proxy/ProxyCounters.cs # System.Threading.Interlocked counters
src/Mbproxy/Proxy/RewriterLogEvents.cs # [LoggerMessage] static partial methods
tests/Mbproxy.Tests/Proxy/BcdPduPipelineTests.cs # unit tests against synthetic PDU bytes
tests/Mbproxy.Tests/Proxy/RewriterE2ETests.cs # e2e against the simulator
```
Modifications:
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — replace `PduContext` (placeholder from phase 03) with `PerPlcContext`. Counters increment inline. The pipeline call site is unchanged in shape; only the context type and pipeline registration differ.
- `src/Mbproxy/Proxy/ProxyWorker.cs` — build one `PerPlcContext` per configured PLC at startup (calls `BcdTagMapBuilder.Build` and wraps the resulting map + a fresh `ProxyCounters` + a per-PLC logger). Stash the contexts in a `Dictionary<string, PerPlcContext>` keyed by PLC name.
- `src/Mbproxy/Program.cs` — register `BcdPduPipeline` as the `IPduPipeline` singleton; remove the `NoopPduPipeline` registration. The phase 03 `NoopPduPipeline.cs` file stays (it's useful in tests as a baseline) but is no longer wired in production.
- `tests/Mbproxy.Tests/Proxy/ProxyForwardingTests.cs` — update the test `Forward_FC03_HR1072_Returns_RawBCD_0x1234` (which was a phase-03 baseline) to a new test `Forward_FC03_HR1072_Returns_Decoded_1234` that asserts `1234`. The original raw-passthrough behaviour is preserved by configuring a PLC with NO BCD tags.
## Tasks
1. **`ProxyCounters.cs`** — `internal sealed class` holding `long` fields accessed via `Interlocked.Increment` / `Interlocked.Read`. Fields cover the per-PLC counter list from [`../design.md`](../design.md) → Status page → Per-PLC fields. Methods:
- `void IncrementPdusForwarded()`, `void IncrementFcCount(byte fc)`, `void AddRewrittenSlots(int n)`, `void IncrementPartialBcd()`, `void IncrementInvalidBcd()`, `void IncrementBackendException(byte code)`, `void AddBytes(long up, long down)`.
- `CounterSnapshot Snapshot()` — returns an immutable record with all the values; consumed by phase 07's status page.
2. **`PerPlcContext.cs`** — `internal sealed class` holding `string PlcName`, `BcdTagMap TagMap`, `ProxyCounters Counters`, `ILogger Logger`. Constructed once per PLC at startup; lifetime = lifetime of the listener.
3. **`BcdPduPipeline.cs`** — implements `IPduPipeline`. Behaviour per direction:
- **`RequestToBackend`**: inspect the PDU's function code byte (`pdu[0]`):
- FC06: read `(address, value)` from `pdu[1..]`. If `TagMap.TryGet(address)` and Width=16, replace value bytes with `BcdCodec.Encode16(value)`. If Width=32 and this is the LOW address, it's a single-register write to half a 32-bit tag — pass through raw + warn (the design's partial-overlap policy). If `address` is the HIGH register of a 32-bit pair, same partial-pass-through + warn. The PDU length is unchanged.
- FC16: `TryGetForRange(start, qty)`; for each hit, re-encode the relevant register-pair-or-singleton. Partial-overlap warnings emitted per offending slot.
- All other FCs: no-op.
- **`ResponseToClient`**: inspect `pdu[0]`:
- FC03 / FC04: `TryGetForRange(echoedStart, byteCount/2)`. The start address isn't in the response (Modbus FC03 response = `[fc, byteCount, ...data]`), so the rewriter needs the matching request — see Task 4.
- All other FCs: no-op.
- Exceptions from `BcdCodec.Decode*` are caught and turned into `mbproxy.rewrite.invalid_bcd` warnings; the affected slot's bytes are passed through unchanged.
4. **Request → response correlation.** The rewriter on a response needs the original request's start-address and quantity. Since the proxy is 1:1 per-client (no multiplexing), `PlcConnectionPair` keeps the last-issued request's `(fc, address, quantity)` in a per-pair slot. When the response arrives, the rewriter is invoked with that slot's contents as part of `PerPlcContext`. (We do NOT support pipelined multi-PDU requests on one socket in this phase; if a client tries, the slot is overwritten and the second response could mis-decode. Document the limitation; phase 08 may revisit if real clients pipeline.)
5. **`RewriterLogEvents.cs`** — `[LoggerMessage]` source-generated definitions:
- `mbproxy.rewrite.partial_bcd` — Warning, params: PlcName, Address, ClientStart, ClientQty.
- `mbproxy.rewrite.invalid_bcd` — Warning, params: PlcName, Address, RawValue, Direction.
- `mbproxy.exception.passthrough` — Information, params: PlcName, Fc, ExceptionCode. (Moved here from a phase-03 TODO.)
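The request-to-response correlation from Task 4 can be sketched as a single-entry slot. `PendingRequest` and `RequestSlotSketch` are hypothetical names; the plan only specifies that the pair remembers `(fc, address, quantity)`. The lock is an assumption made because the two forwarder tasks touch the slot from different threads.

```csharp
using System;
using System.Buffers.Binary;

// Hypothetical record for the remembered request fields.
readonly record struct PendingRequest(byte Fc, ushort Address, ushort Quantity);

sealed class RequestSlotSketch
{
    private readonly object _gate = new();
    private PendingRequest? _pending;

    // Called on the upstream -> backend path for every request PDU.
    public void OnRequest(ReadOnlySpan<byte> pdu)
    {
        PendingRequest? next = null;
        if (pdu.Length >= 5 && (pdu[0] == 0x03 || pdu[0] == 0x04))
        {
            next = new PendingRequest(
                pdu[0],
                BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(1, 2)),
                BinaryPrimitives.ReadUInt16BigEndian(pdu.Slice(3, 2)));
        }
        lock (_gate) _pending = next; // a pipelined second request overwrites the first
    }

    // Called on the backend -> upstream path; consumes the slot.
    public bool TryTake(ReadOnlySpan<byte> responsePdu, out PendingRequest request)
    {
        PendingRequest? pending;
        lock (_gate)
        {
            pending = _pending;
            _pending = null;
        }

        request = default;
        if (!pending.HasValue) return false;
        // An exception response carries fc | 0x80, so it fails this check.
        if (responsePdu.Length < 2 || responsePdu[0] != pending.Value.Fc) return false;
        // Byte count must match the requested quantity, otherwise the pairing is suspect.
        if (responsePdu[1] != pending.Value.Quantity * 2) return false;

        request = pending.Value;
        return true;
    }
}
```

The function-code and byte-count checks reduce, but do not remove, the mis-decode risk from pipelined clients described in Task 4.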
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy;
internal sealed class BcdPduPipeline : IPduPipeline { /* full impl */ }
internal sealed class PerPlcContext { public string PlcName; public BcdTagMap TagMap; public ProxyCounters Counters; public ILogger Logger; }
internal sealed class ProxyCounters {
public void IncrementPdusForwarded();
public void IncrementFcCount(byte fc);
public void AddRewrittenSlots(int n);
public void IncrementPartialBcd();
public void IncrementInvalidBcd();
public void IncrementBackendException(byte code);
public void AddBytes(long up, long down);
public CounterSnapshot Snapshot();
}
public sealed record CounterSnapshot(/* mirrors design.md per-PLC status fields */);
```
Nothing else becomes public.
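A reduced sketch of the `Interlocked`-based counters is shown below. It covers only a subset of the methods listed above, and `CounterSnapshotSketch` has hypothetical fields because the real `CounterSnapshot` mirrors design.md, which is not reproduced here.

```csharp
using System.Threading;

// Hypothetical snapshot shape; the real record mirrors design.md.
sealed record CounterSnapshotSketch(long PdusForwarded, long RewrittenSlots, long PartialBcd, long InvalidBcd);

sealed class ProxyCountersSketch
{
    private long _pdusForwarded;
    private long _rewrittenSlots;
    private long _partialBcd;
    private long _invalidBcd;
    private readonly long[] _byFc = new long[256]; // one slot per possible function code byte

    public void IncrementPdusForwarded() => Interlocked.Increment(ref _pdusForwarded);
    public void IncrementFcCount(byte fc) => Interlocked.Increment(ref _byFc[fc]);
    public void AddRewrittenSlots(int n) => Interlocked.Add(ref _rewrittenSlots, n);
    public void IncrementPartialBcd() => Interlocked.Increment(ref _partialBcd);
    public void IncrementInvalidBcd() => Interlocked.Increment(ref _invalidBcd);

    public long FcCount(byte fc) => Interlocked.Read(ref _byFc[fc]);

    // Each field is read atomically, but the snapshot as a whole is not
    // a single atomic view across fields.
    public CounterSnapshotSketch Snapshot() => new(
        Interlocked.Read(ref _pdusForwarded),
        Interlocked.Read(ref _rewrittenSlots),
        Interlocked.Read(ref _partialBcd),
        Interlocked.Read(ref _invalidBcd));
}
```

Lock-free increments keep the forwarding hot path cheap; a status page that reads `Snapshot()` may observe fields captured at slightly different instants.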
## Tests required
### Unit (`Category = Unit`)
`BcdPduPipelineTests` (≥ 20 tests). Each test builds a synthetic PDU byte array + a `PerPlcContext` with a hand-rolled `BcdTagMap`, calls `pipeline.Process`, and asserts the resulting bytes.
Coverage matrix:
| FC | Tag scenario | Expected | Counter delta |
|----|--------------|----------|---------------|
| 03 response | single 16-bit BCD at the read address | bytes replaced with binary-encoded value | `RewrittenSlots += 1` |
| 03 response | full 32-bit BCD pair within read range | both register-bytes replaced with binary-encoded 32-bit value | `RewrittenSlots += 2` |
| 03 response | partial 32-bit (low only, qty=1 at low addr) | bytes unchanged | `PartialBcd += 1` |
| 03 response | partial 32-bit (high only, qty=1 at high addr) | bytes unchanged | `PartialBcd += 1` |
| 03 response | mixed: 16-bit + non-BCD in same read | only the 16-bit slot rewritten | `RewrittenSlots += 1` |
| 03 response | bad nibble (0x12A4) at a 16-bit BCD slot | bytes unchanged | `InvalidBcd += 1` |
| 04 response | 16-bit BCD at the read address | same as FC03 | `RewrittenSlots += 1` |
| 06 request | write to 16-bit BCD address | binary integer in payload → BCD nibbles | `RewrittenSlots += 1` |
| 06 request | write to the LOW addr of a 32-bit pair (qty=1) | bytes unchanged (partial) | `PartialBcd += 1` |
| 06 request | write to the HIGH addr of a 32-bit pair | bytes unchanged (partial) | `PartialBcd += 1` |
| 06 request | write value outside `[0,9999]` for 16-bit | `mbproxy.rewrite.invalid_bcd` Warning; bytes unchanged | `InvalidBcd += 1` |
| 16 request | write multi covering one 16-bit BCD + 3 non-BCD | only the 16-bit slot re-encoded | `RewrittenSlots += 1` |
| 16 request | write multi covering one full 32-bit pair | both registers re-encoded as the CDAB pair | `RewrittenSlots += 2` |
| 16 request | write multi crossing into one half of a 32-bit pair | partial slot passed through; warn | `PartialBcd += 1` |
| 01 / 02 / 05 / 15 | any | no-op | none |
| 03 exception response | exception 02 returned by PLC | bytes unchanged, no rewriting attempted | `BackendExceptions[2] += 1`, `mbproxy.exception.passthrough` logged |
Additional:
- Counter snapshot reflects increments exactly (no off-by-one).
- Empty `BcdTagMap` produces zero rewrites for any FC.
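The 16-bit rows of the matrix come down to a nibble-wise conversion with two rejection cases: a nibble above 9 on the read path, and a value outside `[0,9999]` on the write path. The sketch below illustrates that conversion under stated assumptions. `Bcd16`, `TryDecode`, and `TryEncode` are hypothetical names for illustration, not the codec this project defines.

```csharp
using System;

// Hypothetical helper for illustration; the project's real codec may differ.
public static class Bcd16
{
    // Decodes four BCD nibbles (e.g. 0x1234) into the integer 1234.
    // Returns false when any nibble is above 9 (e.g. 0x12A4), which is the
    // case the matrix counts as InvalidBcd and passes through unchanged.
    public static bool TryDecode(ushort raw, out int value)
    {
        value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int nibble = (raw >> shift) & 0xF;
            if (nibble > 9)
            {
                value = 0;
                return false;
            }
            value = value * 10 + nibble;
        }
        return true;
    }

    // Encodes an integer in [0, 9999] into four BCD nibbles.
    // Returns false for out-of-range values (the FC06 invalid_bcd case).
    public static bool TryEncode(int value, out ushort raw)
    {
        raw = 0;
        if (value < 0 || value > 9999)
        {
            return false;
        }
        int result = 0;
        for (int shift = 0; shift <= 12; shift += 4)
        {
            result |= (value % 10) << shift; // lowest decimal digit first
            value /= 10;
        }
        raw = (ushort)result;
        return true;
    }
}
```

Both methods leave the caller to decide what to do on failure, which matches the matrix: bytes stay unchanged and only a counter moves.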
### E2E (`Category = E2E`, `[Collection(nameof(DL205SimulatorCollection))]`)
`RewriterE2ETests` (≥ 6 tests, all against the dl205.json simulator profile):
1. `Read_HR1072_AsBcd_ReturnsDecoded_1234` — configure the BCD tag at addr 1072 width 16; assert `1234`.
2. `Read_HR1072_AsRaw_WhenNotConfigured_Returns_0x1234` — no BCD tags configured; assert raw `4660`. (Verifies the pipeline is opt-in per tag.)
3. `Write_HR200_AsBcd_StoresEncoded_0x9876` — configure addr 200 width 16. Write decimal 9876 through proxy; read raw from sim, expect `0x9876` (39030).
4. `Read_HR1056_HR1057_AsBcd32_ReturnsDecoded_From_CDAB` — seed an alternate profile (or write via proxy first if the default profile's float32 markers aren't suitable BCD32 fixtures). Verify the CDAB layout end-to-end.
5. `Partial_FC03_OnHighRegisterOf_32BitPair_PassesThroughRaw_AndLogsWarning` — use the in-memory Serilog sink to verify `mbproxy.rewrite.partial_bcd` was logged.
6. `MbapTxId_StillPreserved_AfterRewriting_20Consecutive` — same as phase 03's test 5, but with BCD rewrite in the path. Proves rewriting doesn't tamper with the MBAP header.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00–03 tests still green (with the phase-03 placeholder test renamed/repurposed as described).
- [ ] All new unit tests green (≥ 16 in BcdPduPipelineTests + counter snapshot tests).
- [ ] All new e2e tests green when simulator is available.
- [ ] PDU rewriting NEVER changes the MBAP `length` field; verify in a unit test that re-encoded PDUs are exactly the same byte length as the originals.
- [ ] `ProxyCounters` is allocation-free per increment on the hot path. The `Snapshot()` call may allocate (it's used only by the status page, off the hot path).
- [ ] Log event names match [`../design.md`](../design.md) → Logging table exactly (including the new `mbproxy.rewrite.invalid_bcd` event added here — update design.md in this PR to add the row).
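One way to meet the allocation-free requirement on the hot path is to back every counter with a plain `long` field (or a fixed array slot) and update it with `Interlocked`. The following is a minimal sketch, not the project's `ProxyCounters`: `CountersSketch` is a hypothetical name, it covers only three of the counters, and it returns a value tuple where the plan specifies a `CounterSnapshot` record.

```csharp
using System.Threading;

// Hypothetical, reduced stand-in for ProxyCounters.
public sealed class CountersSketch
{
    private long _pdusForwarded;
    private long _rewrittenSlots;

    // One slot per possible function-code byte, allocated once up front so
    // that increments never allocate.
    private readonly long[] _fcCounts = new long[256];

    public void IncrementPdusForwarded() => Interlocked.Increment(ref _pdusForwarded);

    public void IncrementFcCount(byte fc) => Interlocked.Increment(ref _fcCounts[fc]);

    public void AddRewrittenSlots(int n) => Interlocked.Add(ref _rewrittenSlots, n);

    // Each field is read atomically, but the snapshot as a whole is not a
    // single atomic view; that is normally acceptable for a status page.
    public (long PdusForwarded, long RewrittenSlots, long Fc03) Snapshot() =>
        (Interlocked.Read(ref _pdusForwarded),
         Interlocked.Read(ref _rewrittenSlots),
         Interlocked.Read(ref _fcCounts[0x03]));
}
```

A dictionary keyed by function code would also work, but it risks allocating on first insert, which is why the sketch uses a fixed array.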
## Out of scope
- Auto-recovery of failed listener binds (phase 05).
- Backend-connect retry pipeline (phase 05).
- Counter exposure via HTTP (phase 07).
- Hot-reload of the per-PLC `BcdTagMap` (phase 06).
- Pipelined / multi-PDU-in-flight on a single client socket. The proxy serialises by the design's 1:1 model; if a real client pipelines, document as a known limitation.
## Notes for the subagent
- The Modbus FC03/04 response does NOT carry the start address — only the byte count and the register data. You must remember the last request's `(startAddress, quantity)` per `PlcConnectionPair`. This is fine because the proxy is 1:1 and one client = one in-flight request at a time.
- For FC16 requests, the wire format is `[fc, startHi, startLo, qtyHi, qtyLo, byteCount, ...data]`. The PDU passed to the pipeline starts at `fc`, so the register data begins at PDU offset 6. The register address of the slot at byte offset `offsetInData` within the data is `startAddress + (offsetInData / 2)`.
- Update [`../design.md`](../design.md) → Logging events table to add the new `mbproxy.rewrite.invalid_bcd` event. Do this in the same PR; the doc and the code stay in sync.
- The `mbproxy.exception.passthrough` event was specified in design.md but not wired in phase 03. This phase wires it. If during phase 03 it was already wired by mistake, leave it and remove the TODO comment.
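The FC16 addressing note can be made concrete with a short sketch that walks the request PDU and pairs each register address with the PDU offset of its two data bytes. `Fc16Layout` and `Slots` are hypothetical names chosen for illustration; the real pipeline may work on spans and avoid the list allocation.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper for illustration only.
public static class Fc16Layout
{
    // Returns (register address, offset of the register's high byte in the PDU)
    // for every register carried by an FC16 (0x10) request PDU.
    public static List<(ushort Address, int PduOffset)> Slots(byte[] pdu)
    {
        if (pdu.Length < 6 || pdu[0] != 0x10)
        {
            throw new ArgumentException("Not an FC16 request PDU.", nameof(pdu));
        }

        int startAddress = (pdu[1] << 8) | pdu[2]; // big-endian on the wire
        int quantity = (pdu[3] << 8) | pdu[4];
        int byteCount = pdu[5];

        if (byteCount != quantity * 2 || pdu.Length != 6 + byteCount)
        {
            throw new ArgumentException("Byte count does not match quantity.", nameof(pdu));
        }

        var slots = new List<(ushort Address, int PduOffset)>(quantity);
        for (int offsetInData = 0; offsetInData < byteCount; offsetInData += 2)
        {
            // Data starts at PDU offset 6; each register occupies two bytes.
            slots.Add(((ushort)(startAddress + offsetInData / 2), 6 + offsetInData));
        }
        return slots;
    }
}
```

For FC03/04 responses the same walk applies with the data starting at offset 2, except that `startAddress` has to come from the remembered request because the response does not carry it.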
@@ -1,125 +0,0 @@
# Phase 05 — Listener supervisor + auto-recovery
Wrap each `PlcListener` in a Polly-backed supervisor task. Failed binds (at startup or runtime) are retried per the design's recovery profile. Backend-connect Polly retries that were deferred from phase 03 land here too.
**Depends on:** Phase 03 (PlcListener, PlcConnectionPair).
**Parallel-safe with:** nothing (changes ProxyWorker, listener lifecycle, and connection-pair connect path simultaneously).
## Goal
Eliminate "startup race lost a port, service degraded for hours" as a real failure mode. After this phase, a port temporarily in use at boot will bind once it frees; a backend connect transient failure retries within a tight budget instead of immediately dropping the upstream client.
State per listener: `bound` / `recovering` / `stopped`. Reported on the status page (phase 07) via counters and a state field.
## Outputs
```
src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs # owns one PlcListener; retry pipeline
src/Mbproxy/Proxy/Supervision/SupervisorState.cs # enum + state-snapshot record
src/Mbproxy/Proxy/Supervision/PolicyFactory.cs # builds Polly ResiliencePipelines from ResilienceOptions
tests/Mbproxy.Tests/Proxy/Supervision/SupervisorTests.cs # port-conflict recovery, runtime-fault recovery
tests/Mbproxy.Tests/Proxy/Supervision/BackendConnectRetryTests.cs # Polly retry on backend connect
tests/Mbproxy.Tests/Proxy/Supervision/PolicyFactoryTests.cs # unit
```
Modifications:
- `src/Mbproxy/Proxy/ProxyWorker.cs` — owns a `Dictionary<string, PlcListenerSupervisor>` instead of raw `PlcListener` instances. Stop/start of an individual listener now flows through the supervisor.
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — backend connect now goes through a Polly pipeline built from `ResilienceOptions.BackendConnect`. Remove the `// Phase 05: wrap in Polly` TODO from phase 03.
- `src/Mbproxy/Proxy/ProxyCounters.cs` — add `RecoveryAttempts` counter and `LastBindError` (last failure message, up to 256 chars). Update `CounterSnapshot` to include them.
- `src/Mbproxy/Proxy/RewriterLogEvents.cs` (or a sibling `SupervisorLogEvents.cs`) — add `[LoggerMessage]` definitions for `mbproxy.listener.recovered` (Info, `Plc`, `Port`, `AttemptCount`) and `mbproxy.backend.failed` (Warning, `Plc`, `Reason`). The latter event name already exists in design.md.
## Tasks
1. **`PolicyFactory.cs`** — converts `ResilienceOptions.BackendConnect` and `ResilienceOptions.ListenerRecovery` into `Polly.ResiliencePipeline` instances. Pipelines use `RetryStrategyOptions<T>` with `DelayGenerator` reading from the configured `BackoffMs` arrays. Listener recovery uses a 5-step initial backoff then steady-state at `SteadyStateMs` indefinitely (model as a custom delay generator that returns the steady-state value once the attempt index exceeds the initial array length).
2. **`SupervisorState.cs`** — `enum SupervisorState { Bound, Recovering, Stopped }` and a `record SupervisorSnapshot(SupervisorState State, string? LastBindError, int RecoveryAttempts)`.
3. **`PlcListenerSupervisor.cs`** —
- Constructor: takes a `PlcOptions`, a `PerPlcContext`, the recovery `ResiliencePipeline`, and an `IPduPipeline`. Internally instantiates `PlcListener` lazily inside the retry loop.
- `StartAsync(CancellationToken)`: launches a supervisor task. Inside the task: call `_listener.StartAsync()`. On success, transition to `Bound`, log `mbproxy.startup.bind` (first attempt) or `mbproxy.listener.recovered` (subsequent), and `await _listener.RunAsync(ct)`, which returns when the listener's accept loop ends.
- On exception or normal-but-faulted return from the listener: transition to `Recovering`, log `mbproxy.startup.bind.failed`, increment `RecoveryAttempts`, dispose the failed listener, await Polly's next delay, retry.
- `StopAsync`: transition to `Stopped`, cancel the supervisor token, await the supervisor task.
- `Snapshot()`: returns `SupervisorSnapshot` for the status page.
4. **`PlcConnectionPair.cs` backend-connect retry** — wrap `Socket.ConnectAsync(host, port, ct)` in a `ResiliencePipeline.ExecuteAsync` built from `ResilienceOptions.BackendConnect`. After all attempts exhausted, close the upstream socket (as before) and log `mbproxy.backend.failed`. Crucial: backend-connect retries happen ONCE per upstream client connection (not per request); a connect failure terminates the pair.
5. **`ProxyWorker.cs`** — change to owning supervisors instead of raw listeners. Startup creates one supervisor per `PlcOptions`, starts them all in parallel (`await Task.WhenAll(...)` of their start tasks). The "ready" log event now fires after every supervisor has either reached `Bound` or entered `Recovering`. Shutdown stops all supervisors in parallel; clamp the total shutdown time at 5 s.
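The listener-recovery schedule in task 1 (an initial backoff array followed by an indefinite steady state) reduces to a small pure function, sketched here. `RecoveryBackoff.DelayFor` is a hypothetical name, and the backoff values in any usage are placeholders, not the design's numbers. The intent is that a Polly v8 `DelayGenerator` callback would delegate to something like this, passing its zero-based attempt number.

```csharp
using System;

// Hypothetical helper for illustration; values and names are assumptions.
public static class RecoveryBackoff
{
    // attemptIndex is zero-based: 0 is the delay before the first retry.
    // While the index is inside the initial array, that entry is used;
    // afterwards the steady-state delay is returned forever.
    public static TimeSpan DelayFor(int attemptIndex, int[] initialBackoffMs, int steadyStateMs)
    {
        if (attemptIndex < 0)
        {
            throw new ArgumentOutOfRangeException(nameof(attemptIndex));
        }

        int ms = attemptIndex < initialBackoffMs.Length
            ? initialBackoffMs[attemptIndex]
            : steadyStateMs;

        return TimeSpan.FromMilliseconds(ms);
    }
}
```

Keeping the schedule outside the Polly wiring makes `BuildListenerRecovery_InitialBackoffFollowedBySteadyState` easy to assert without a fake clock for this part of the logic.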
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Supervision;
internal sealed class PlcListenerSupervisor : IAsyncDisposable {
public string PlcName { get; }
public Task StartAsync(CancellationToken ct);
public Task StopAsync(CancellationToken ct);
public SupervisorSnapshot Snapshot();
}
public sealed record SupervisorSnapshot(SupervisorState State, string? LastBindError, int RecoveryAttempts);
public enum SupervisorState { Bound, Recovering, Stopped }
internal static class PolicyFactory {
public static ResiliencePipeline BuildBackendConnect(RetryProfile profile, ILogger logger);
public static ResiliencePipeline BuildListenerRecovery(RecoveryProfile profile, ILogger logger);
}
```
`SupervisorSnapshot` is `public` because phase 07 (status page) consumes it. Everything else stays `internal`.
## Tests required
### Unit (`Category = Unit`)
`PolicyFactoryTests` (≥ 4 tests):
1. `BuildBackendConnect_ProducesPipeline_With3Attempts_Default`
2. `BuildBackendConnect_Backoff_MatchesConfig` — fake `TimeProvider`, assert delay sequence.
3. `BuildListenerRecovery_InitialBackoffFollowedBySteadyState` — drive 10 attempts, assert delays match.
4. `BuildBackendConnect_NoRetry_OnNonTransientException`: `SocketException` with WSAECONNREFUSED is retried; `ArgumentException` is not.
### Integration (`Category = Unit`; uses real sockets but no simulator)
`SupervisorTests` (≥ 5 tests):
1. `Supervisor_StartsListener_AndTransitionsToBound`
2. `Supervisor_StartFails_WhenPortInUse_TransitionsToRecovering` — bind a `TcpListener` on a free port first, then start the supervisor on the same port; assert `State == Recovering` and `LastBindError` is populated within 100 ms.
3. `Supervisor_Recovers_WhenPortFrees` — same setup as test 2, then dispose the blocking listener; assert the supervisor transitions to `Bound` and emits `mbproxy.listener.recovered` within `InitialBackoffMs[0] + 500ms`. Use an in-memory Serilog sink to verify the log event.
4. `Supervisor_RuntimeFault_TriggersRecovery` — replace the listener implementation with a faulting fake (or use reflection to force `_listener` to be one) and assert recovery kicks in.
5. `Supervisor_Stop_CleanlyTransitionsTo_Stopped_AndCancelsRetry` — supervisor in `Recovering` state, call `StopAsync`, assert it returns within 1 s without waiting out the next backoff window.
`BackendConnectRetryTests` (≥ 3 tests):
1. `BackendConnect_RetriesPerPipeline_OnConnectionRefused` — point a `PlcConnectionPair` at `127.0.0.1:1`, assert it sees exactly 3 connect attempts with the configured delays.
2. `BackendConnect_Succeeds_OnSecondAttempt_WhenBackendBecomesReachable` — start the pair against a closed port, open a listener on that port mid-backoff, assert connect succeeds and the pair runs.
3. `BackendConnect_AllAttemptsFail_ClosesUpstream` — pair gets a fresh upstream socket, never reaches a backend, the upstream socket is closed within `BackoffMs.Sum() + tolerance`.
### E2E (`Category = E2E`)
`SupervisorE2ETests` (≥ 2 tests, against the simulator):
1. `E2E_Recovery_When_BlockingListenerReleasesPort` — same shape as the unit recovery test, but with the simulator on the backend; confirms the supervisor doesn't disrupt the simulator-facing path during recovery.
2. `E2E_RecoveryAttempts_CounterIncrements_Visible_OnSnapshot` — drives the supervisor into recovery and back, then asserts `counters.RecoveryAttempts > 0`. Phase 07 will surface this on the HTTP endpoint; here we just verify the counter snapshot.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00–04 tests still green.
- [ ] All new unit + integration tests green.
- [ ] E2E recovery test green when simulator is available.
- [ ] `mbproxy.listener.recovered` log event includes the `AttemptCount` field.
- [ ] No deadlocks under `StopAsync` while the supervisor is mid-backoff (verified by `Supervisor_Stop_CleanlyTransitionsTo_Stopped_AndCancelsRetry`).
- [ ] Backend-connect failures from phase 03 are now wrapped in Polly; the TODO comment from phase 03 is gone.
- [ ] [`../design.md`](../design.md) → "Listener auto-recovery" matches implementation. If during implementation the backoff arrays needed tweaking, update design.md in this PR.
## Out of scope
- Hot-reload-driven add/remove of supervisors (phase 06 owns reconcile).
- HTTP exposure of supervisor state (phase 07).
- Restart-from-crash diagnostics, Windows EventLog integration (phase 08).
- Adaptive backoff (e.g., jitter, exponential beyond the configured array). Stick to the configured schedule.
## Notes for the subagent
- Polly v8 (`Polly.Core`) is the target — `ResiliencePipeline` and `RetryStrategyOptions<T>`, not the v7 `Policy.Handle<>()` fluent API. If the package version pinned in phase 00 turns out to be v7, bump it in this phase and note the bump in the csproj comment.
- The supervisor task uses one `CancellationTokenSource` per supervisor instance. Cancelling it must cancel both the Polly delay AND the inner `_listener.RunAsync` cleanly. Polly's `ResiliencePipeline.ExecuteAsync(ct)` honours the token; double-check the listener does too.
- Do not introduce a generic "task supervisor" abstraction. `PlcListenerSupervisor` is the only thing supervising in this codebase; YAGNI on the framework.
- The supervisor must NOT swallow exceptions from `_listener.RunAsync` other than `OperationCanceledException`. Log them at Warning with the exception, then enter the recovery loop. Operators reading logs need to see WHY a listener died, not just that it was restarted.
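The cancellation and exception-handling notes above can be illustrated with a stripped-down retry loop. This is a sketch under assumptions, not the `PlcListenerSupervisor` implementation: it uses a plain `Task.Delay` where the plan calls for a Polly pipeline, and `SupervisorLoopSketch`, `runListener`, `delayFor`, and `logFault` are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical, simplified stand-in for the supervisor's retry loop.
public static class SupervisorLoopSketch
{
    // Runs the listener until the token is cancelled. Returns the number of
    // recovery attempts made.
    public static async Task<int> RunAsync(
        Func<CancellationToken, Task> runListener,
        Func<int, TimeSpan> delayFor,
        Action<Exception> logFault,
        CancellationToken ct)
    {
        int recoveryAttempts = 0;
        while (!ct.IsCancellationRequested)
        {
            try
            {
                await runListener(ct);
            }
            catch (OperationCanceledException) when (ct.IsCancellationRequested)
            {
                break; // clean stop: not a fault, nothing to log
            }
            catch (Exception ex)
            {
                logFault(ex); // surface WHY the listener died before retrying
            }

            recoveryAttempts++;
            try
            {
                // The same token cancels the backoff wait, so a stop request
                // does not have to sit out the remaining delay.
                await Task.Delay(delayFor(recoveryAttempts), ct);
            }
            catch (OperationCanceledException)
            {
                break;
            }
        }
        return recoveryAttempts;
    }
}
```

Because one token covers both the listener run and the backoff wait, cancelling it mid-backoff ends the loop promptly, which is the behaviour the stop test checks for.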
@@ -1,158 +0,0 @@
# Phase 06 — Configuration hot-reload
Subscribe to `IOptionsMonitor<MbproxyOptions>.OnChange` and reconcile the running supervisors + per-PLC tag maps + connection settings against the new config — without restarting the host.
**Depends on:** Phase 05 (supervisor lifecycle).
**Parallel-safe with:** nothing (touches the widest cross-cut: supervisors + tag maps + counters + DI options).
## Goal
An `appsettings.json` save propagates per the design's reconcile table:
| Change | Action |
|--------|--------|
| `BcdTags.Global` add/remove/width | Rebuild every PLC's `BcdTagMap`, swap atomically. Next PDU sees it. |
| `Plcs[i].BcdTags.{Add,Remove}` | Rebuild that PLC's `BcdTagMap` only. |
| New `Plcs[i]` | Create supervisor + context, start it. |
| Removed `Plcs[i]` | Stop supervisor, close all client connections to it. |
| Changed `ListenPort` / `Host` | Stop + start the supervisor (remove + add semantics). |
| `Connection.Backend*TimeoutMs` | Take effect on the next backend connect / request. |
| Invalid reload | Reject as a whole; keep current state; log `mbproxy.config.reload.rejected`. |
Validation runs FIRST. A reload that would produce duplicate `ListenPort` values, or a `BcdTagMapBuilder.Build` error for any PLC, is rejected atomically before any state mutates.
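A validate-before-mutate pass of this kind can be sketched as a pure function that collects every error and reports them together, so a rejected reload can log the full list. The sketch covers only the name and port rules; the per-PLC `BcdTagMapBuilder.Build` check is omitted. `PlcEntry` and `ReloadValidatorSketch` are hypothetical stand-ins for the real option types, and ordinal (case-sensitive) name comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, reduced stand-in for PlcOptions.
public sealed record PlcEntry(string Name, int ListenPort);

public static class ReloadValidatorSketch
{
    public static bool Validate(
        IReadOnlyList<PlcEntry> plcs, int adminPort, out IReadOnlyList<string> errors)
    {
        var found = new List<string>();
        var names = new HashSet<string>(StringComparer.Ordinal); // assumption: case-sensitive
        var ports = new HashSet<int>();

        if (adminPort < 1 || adminPort > 65535)
            found.Add($"AdminPort {adminPort} is out of range");

        foreach (var plc in plcs)
        {
            if (string.IsNullOrWhiteSpace(plc.Name))
                found.Add("PLC name is empty");
            else if (!names.Add(plc.Name))
                found.Add($"Duplicate PLC name '{plc.Name}'");

            if (plc.ListenPort < 1 || plc.ListenPort > 65535)
                found.Add($"ListenPort {plc.ListenPort} is out of range");
            if (!ports.Add(plc.ListenPort))
                found.Add($"Duplicate ListenPort {plc.ListenPort}");
            if (plc.ListenPort == adminPort)
                found.Add($"ListenPort {plc.ListenPort} collides with AdminPort");
        }

        errors = found;
        return found.Count == 0;
    }
}
```

Since nothing in the function touches running state, a `false` result leaves the current configuration untouched by construction.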
## Outputs
```
src/Mbproxy/Configuration/ConfigReconciler.cs # OnChange handler; orchestrates the apply
src/Mbproxy/Configuration/ReloadValidator.cs # cross-PLC validation (duplicate ports, etc.)
src/Mbproxy/Configuration/ReloadPlan.cs # immutable diff record between current and new
tests/Mbproxy.Tests/Configuration/ReloadValidatorTests.cs
tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs
tests/Mbproxy.Tests/Configuration/HotReloadE2ETests.cs # real appsettings.json mutation, real host
```
Modifications:
- `src/Mbproxy/Proxy/ProxyWorker.cs` — accept a `ConfigReconciler` and forward `IOptionsMonitor.OnChange` to it; on startup, also seed the reconciler with the initial snapshot.
- `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs` — expose a `Task ReplaceContextAsync(PerPlcContext newCtx, CancellationToken ct)` that atomically swaps the BCD tag map and counters without restarting the listener. Old in-flight connections finish on the old map; new connections use the new map. (Document the brief transition window in comments.)
- Add `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` `[LoggerMessage]` events.
- `src/Mbproxy/Options/MbproxyOptions.cs` — wire `IValidateOptions<MbproxyOptions>` to call the schema-level validator only. Cross-PLC validation (duplicate ports, etc.) is handled by `ReloadValidator` because it requires inspecting multiple `Plcs[i]` together, which `IValidateOptions` doesn't naturally express.
## Tasks
1. **`ReloadPlan.cs`** — immutable record describing the diff:
```csharp
public sealed record ReloadPlan(
IReadOnlyList<PlcOptions> ToAdd,
IReadOnlyList<string> ToRemove, // PLC names
IReadOnlyList<(string Name, PlcOptions New)> ToRestart, // port or host changed
IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat, // tag map changed
ConnectionOptions Connection);
```
Computed by a pure function `ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next)`; PLC identity is keyed on `Name` (NOT on `ListenPort`, which is mutable).
2. **`ReloadValidator.cs`** — single static method `Validate(MbproxyOptions next, out IReadOnlyList<string> errors)`:
- PLC names are unique and non-empty.
- `ListenPort` values are unique.
- For each PLC, `BcdTagMapBuilder.Build(global, perPlc).Errors` is empty.
- `AdminPort` doesn't collide with any `Plcs[i].ListenPort`.
- All ports are in `[1, 65535]`.
3. **`ConfigReconciler.cs`** — subscribes via constructor-injected `IOptionsMonitor<MbproxyOptions>.OnChange`. On change:
- Snapshot the new options.
- Run `ReloadValidator.Validate`. On failure: log `mbproxy.config.reload.rejected` with the error list; do nothing else.
- Compute `ReloadPlan` against the current snapshot.
- Apply the plan in order:
1. Stop supervisors in `ToRemove` (concurrently).
2. Stop+restart supervisors in `ToRestart` (concurrently).
3. Build new `PerPlcContext` for each `ToReseat` entry and call `supervisor.ReplaceContextAsync(newCtx)`.
4. Build supervisors for `ToAdd`, start them.
- On success: log `mbproxy.config.reload.applied` with summary (`PlcsAdded`, `PlcsRemoved`, `PlcsReseated`, `TagListDelta`). Record `lastReloadUtc` and bump `reloadCount` on a service-wide counter (consumed by phase 07).
- On any step throwing: best-effort log the partial-apply state at Error, then continue. The host stays up. (The validator should have caught most failure modes; a runtime failure here is a true bug.)
4. **`ProxyWorker.cs`** updates — register the reconciler with the host and wire startup to use it for the initial snapshot.
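The name-keyed diff described in task 1 can be sketched with two dictionary lookups. To stay self-contained, the sketch uses hypothetical reduced types: `PlcSketch` collapses a PLC's effective tag configuration into an opaque `TagSignature` string, and `PlanSketch` carries only PLC names. It also assumes that a restarted PLC picks up its new tag map as part of the restart, so a port or host change never appears in `ToReseat` as well.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical reduced types; the real ReloadPlan carries PlcOptions and BcdTagMap.
public sealed record PlcSketch(string Name, string Host, int ListenPort, string TagSignature);

public sealed record PlanSketch(
    IReadOnlyList<string> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<string> ToRestart,
    IReadOnlyList<string> ToReseat);

public static class PlanSketchBuilder
{
    public static PlanSketch Compute(IReadOnlyList<PlcSketch> current, IReadOnlyList<PlcSketch> next)
    {
        // Identity is the PLC name, never the (mutable) listen port.
        var before = current.ToDictionary(p => p.Name);
        var after = next.ToDictionary(p => p.Name);

        var toAdd = after.Keys.Where(n => !before.ContainsKey(n)).ToList();
        var toRemove = before.Keys.Where(n => !after.ContainsKey(n)).ToList();
        var toRestart = new List<string>();
        var toReseat = new List<string>();

        foreach (var (name, newPlc) in after)
        {
            if (!before.TryGetValue(name, out var oldPlc)) continue; // handled by toAdd

            if (oldPlc.Host != newPlc.Host || oldPlc.ListenPort != newPlc.ListenPort)
                toRestart.Add(name); // restart wins over reseat (assumption)
            else if (oldPlc.TagSignature != newPlc.TagSignature)
                toReseat.Add(name);
        }

        return new PlanSketch(toAdd, toRemove, toRestart, toReseat);
    }
}
```

`ToDictionary` throws on duplicate names, which is one reason the validator has to run before the plan is computed.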
## Public surface declared in this phase
```csharp
namespace Mbproxy.Configuration;
internal sealed class ConfigReconciler : IDisposable {
public ConfigReconciler(IOptionsMonitor<MbproxyOptions> monitor, /* dependencies */);
public Task ApplyAsync(MbproxyOptions next, CancellationToken ct); // exposed for tests
public void Dispose();
}
public sealed record ReloadPlan(
IReadOnlyList<PlcOptions> ToAdd,
IReadOnlyList<string> ToRemove,
IReadOnlyList<(string Name, PlcOptions New)> ToRestart,
IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat,
ConnectionOptions Connection) {
public static ReloadPlan Compute(MbproxyOptions current, MbproxyOptions next);
}
internal static class ReloadValidator {
public static bool Validate(MbproxyOptions next, out IReadOnlyList<string> errors);
}
```
## Tests required
### Unit (`Category = Unit`)
`ReloadValidatorTests` (≥ 6 tests):
1. `Validate_DuplicatePlcName_Fails`
2. `Validate_DuplicateListenPort_Fails`
3. `Validate_AdminPortCollidesWith_PlcListenPort_Fails`
4. `Validate_PerPlc_BcdMapBuildError_Fails`
5. `Validate_PortOutOfRange_Fails`
6. `Validate_HappyPath_Passes`
`ReloadPlanTests` (≥ 5 tests):
1. `Compute_AddOnePlc_OnlyToAddPopulated`
2. `Compute_RemoveOnePlc_OnlyToRemovePopulated`
3. `Compute_ChangePort_GoesToToRestart_NotToReseat`
4. `Compute_ChangePerPlcTagOverride_GoesToToReseat`
5. `Compute_ChangeGlobalTagList_AllPlcsReseat_NoRestart`
`ConfigReconcilerTests` (≥ 4 tests, using a fake `IOptionsMonitor` + fake supervisor factory):
1. `Apply_HappyPath_StartsAndStopsSupervisors_PerPlan`
2. `Apply_ValidationFails_NoMutationOccurs_AndLogsRejected`
3. `Apply_ReseatTagMap_DoesNotRestartSupervisor`
4. `Apply_ConcurrentReloads_Are_Serialised` — two rapid changes get processed in order, no interleaving.
### E2E (`Category = E2E`)
`HotReloadE2ETests` (≥ 4 tests, using a real `Host.CreateApplicationBuilder` + temp appsettings.json file):
1. `E2E_AddPlcAtRuntime_NewListenerBinds_AndIsReachable` — start the host with one PLC, write a new appsettings adding a second PLC pointing at the simulator on a fresh listen port, drive NModbus against the new proxy port within 2 s.
2. `E2E_RemovePlcAtRuntime_ClosesUpstreamConnections` — start with two PLCs and a connected client, write appsettings removing one; client's socket closes within 1 s.
3. `E2E_ChangeGlobalBcdTagList_RewriteReflectsImmediately` — start with addr 1072 NOT in BCD list, read raw 0x1234. Write appsettings adding it. Read again, get decoded 1234.
4. `E2E_InvalidReload_DoesNotMutateRunningState` — start happy, write a broken appsettings (duplicate ListenPort), assert the host keeps running with the OLD config and `mbproxy.config.reload.rejected` is logged.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00–05 tests still green.
- [ ] All new unit tests green.
- [ ] All e2e hot-reload tests green when the simulator is available.
- [ ] `mbproxy.config.reload.applied` / `.rejected` events match the design's properties list.
- [ ] A misconfigured reload (duplicate ports) is rejected atomically — the assertion in test E2E_4 verifies no partial mutation.
- [ ] The reconciler serializes concurrent `OnChange` notifications (`SemaphoreSlim` or equivalent) so two file saves in quick succession don't race.
- [ ] Counters `service.config.reloadCount` and `service.config.reloadRejectedCount` are bumped correctly.
## Out of scope
- Watching for files OTHER than `appsettings.json` (env files, dotnet user-secrets, etc.). The default config source set established in phase 00 is the contract.
- Reloading Serilog log levels at runtime. Possible but not in this phase.
- A reload audit log file. The accept/reject events are sufficient.
- Online schema migrations (e.g., renaming a key in an older config to a new one). Reject-the-whole-thing is the simpler contract.
## Notes for the subagent
- `IOptionsMonitor.OnChange` can fire MULTIPLE times for a single file save on some platforms (text editors saving via rename-and-replace can trigger 2-3 events). Debounce inside the reconciler — a 250 ms quiescent window after the last `OnChange` before computing the plan. Document the choice in code.
- The reconciler must NOT block the `OnChange` callback thread for I/O (`StopAsync` etc.). Use `Channel<ReloadRequest>` or a `Task.Run`-style hand-off so the callback returns immediately.
- When a supervisor restart is in progress (e.g., port changed), reject further reloads briefly with a queued "retry after current applies" — OR just serialise everything via a single semaphore and accept that a backed-up reload queue gets all changes eventually. Pick the simpler option (semaphore); document it.
- `BcdTagMapBuilder.Build` is the validator for tag-list well-formedness; do not duplicate that validation in `ReloadValidator`. The validator just calls `Build` and checks the `Errors` list.
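The debounce and hand-off notes above can be combined in one small component: each change notification restarts a quiet-window timer, and only the timer that survives runs the apply step behind a semaphore. This is a sketch under assumptions, not the reconciler itself. `ReloadDebouncer` and `Signal` are hypothetical names, and the sketch discards exceptions from `apply`, which real code would log.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper for illustration only.
public sealed class ReloadDebouncer : IDisposable
{
    private readonly TimeSpan _quietWindow;
    private readonly Func<Task> _apply;
    private readonly SemaphoreSlim _gate = new SemaphoreSlim(1, 1); // serialises applies
    private readonly object _sync = new object();
    private CancellationTokenSource _pending = new CancellationTokenSource();

    public ReloadDebouncer(TimeSpan quietWindow, Func<Task> apply)
    {
        _quietWindow = quietWindow;
        _apply = apply;
    }

    // Intended to be called from the OnChange callback. Returns immediately.
    public void Signal()
    {
        CancellationToken token;
        lock (_sync)
        {
            _pending.Cancel(); // a newer change supersedes the waiting one
            _pending = new CancellationTokenSource(); // old CTS is left to the GC in this sketch
            token = _pending.Token;
        }
        _ = RunAfterQuietAsync(token); // fire-and-forget hand-off
    }

    private async Task RunAfterQuietAsync(CancellationToken token)
    {
        try
        {
            await Task.Delay(_quietWindow, token);
        }
        catch (OperationCanceledException)
        {
            return; // superseded or disposed during the quiet window
        }

        await _gate.WaitAsync();
        try
        {
            await _apply();
        }
        finally
        {
            _gate.Release();
        }
    }

    public void Dispose()
    {
        lock (_sync) { _pending.Cancel(); }
    }
}
```

A `Channel<ReloadRequest>` with a single reader would give the same serialisation; the semaphore version is shown because the notes prefer it as the simpler option.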
@@ -1,147 +0,0 @@
# Phase 07 — Status page
Stand up the read-only Kestrel-hosted admin endpoint on `Mbproxy.AdminPort`. Two routes — `GET /` (self-contained HTML, meta-refresh 5 s) and `GET /status.json` (the same data as JSON). No admin actions, no auth.
**Depends on:** Phase 05 (supervisor snapshots), Phase 06 (config reload counters).
**Parallel-safe with:** nothing (touches DI registration + needs counters from both 05 and 06).
## Goal
A single port that an operator can open in a browser and see, at a glance:
- Service uptime, version, last-reload timestamp + counts.
- Every configured PLC's listener state (`bound` / `recovering` / `stopped`), last bind error, currently connected clients and their per-client PDU counts, PDU counts by function code, BCD slots rewritten, partial-overlap warnings, backend exception counts by code, last round-trip ms, bytes upstream/downstream.
Same data is exposed as `/status.json` for scraping (Prometheus textfile, custom Nagios check, etc.).
## Outputs
```
src/Mbproxy/Admin/AdminEndpointHost.cs # owns the Kestrel server lifecycle
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # composes per-PLC + service-wide snapshots
src/Mbproxy/Admin/StatusDto.cs # the wire DTOs for /status.json
src/Mbproxy/Admin/StatusHtmlRenderer.cs # builds the single-page HTML
src/Mbproxy/Admin/AssemblyVersionAccessor.cs # cached version string
tests/Mbproxy.Tests/Admin/StatusSnapshotBuilderTests.cs
tests/Mbproxy.Tests/Admin/AdminEndpointTests.cs # HTTP-level; live Kestrel + HttpClient
```
Modifications:
- `src/Mbproxy/Mbproxy.csproj` — add `Microsoft.AspNetCore.App` framework reference (the Worker SDK doesn't include ASP.NET Core by default).
- `src/Mbproxy/Program.cs` — register `AdminEndpointHost` as a hosted service; wire it through DI alongside the proxy worker. AdminPort comes from `IOptionsMonitor<MbproxyOptions>`.
- `src/Mbproxy/Proxy/ProxyCounters.cs` — extend with per-client counters: `IReadOnlyList<ClientCounterSnapshot> Snapshot()` includes connected clients with `Remote`, `ConnectedAtUtc`, `PdusForwarded`, `LastRoundTripMs`.
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — record connect time, expose `RemoteEndpoint`, track round-trip time per request (EWMA via `LastRoundTripMs` field).
- Service-wide counters introduced here: `ServiceCounters` with `UptimeStartedAtUtc`, `LastReloadUtc`, `ReloadCount`, `ReloadRejectedCount`. Wired into `ConfigReconciler` (bump on apply / reject) and the service start path (set started-at).
## Tasks
1. **`StatusDto.cs`** — record types matching the design's per-PLC + service-wide field tables verbatim. Use `System.Text.Json` source generation (`JsonSerializerContext`) to keep the response allocation-light:
```csharp
[JsonSerializable(typeof(StatusResponse))]
internal partial class StatusJsonContext : JsonSerializerContext;
```
2. **`StatusSnapshotBuilder.cs`** — pulls from injected `ProxyWorker` (or a slim view of it), `ConfigReconciler`, `ServiceCounters`, and each `PlcListenerSupervisor`. Builds a `StatusResponse` record. Pure logic; no I/O. The builder is a `sealed` class constructed once via DI; calling `Build()` is the only operation.
3. **`StatusHtmlRenderer.cs`** — pure function `string Render(StatusResponse status)`. Produces a single HTML document with:
- `<meta http-equiv="refresh" content="5">` for auto-refresh.
- A header line with service version + uptime + last-reload info.
- A table per PLC. Columns match the per-PLC field set; `listener.state` is colour-coded inline (CSS in a `<style>` block — no external assets).
- Total page weight under 50 KB for typical fleets; the design's 54-PLC count puts the table at ~54 rows.
4. **`AssemblyVersionAccessor.cs`** — reads `AssemblyInformationalVersionAttribute` once at startup, caches it as a string. Used for the `service.version` field.
5. **`AdminEndpointHost.cs`** — `IHostedService` that:
- On start: builds a `WebApplication` (Kestrel) configured to listen on `AdminPort`. Maps `GET /` to a handler that calls `StatusSnapshotBuilder.Build()` then `StatusHtmlRenderer.Render()`, returning `text/html`. Maps `GET /status.json` to a handler returning `JsonSerializer.Serialize(snapshot, StatusJsonContext.Default.StatusResponse)`. NO other routes.
- If `AdminPort` is in use at startup: log `mbproxy.admin.bind.failed` (new event) at Error, do not throw. The proxy listeners continue to run; only the admin endpoint is missing. Operators see this in logs.
- On hot-reload of `AdminPort`: stop and restart the Kestrel server bound to the new port.
- On stop: call `StopAsync` on the Kestrel app for a graceful shutdown with a 2 s deadline.
6. **`ServiceCounters.cs`** (under `src/Mbproxy/`) — a singleton DI service holding the service-wide counters. `Initialize(DateTimeOffset startedAtUtc)`; `RecordReloadApplied(DateTimeOffset)`; `RecordReloadRejected()`. Snapshot returns an immutable record.
## Public surface declared in this phase
```csharp
namespace Mbproxy.Admin;
internal sealed class AdminEndpointHost : IHostedService { /* ... */ }
public sealed record StatusResponse(
ServiceFields Service,
ListenersAggregate Listeners,
IReadOnlyList<PlcStatus> Plcs);
public sealed record ServiceFields(
long UptimeSeconds, string Version,
DateTimeOffset? ConfigLastReloadUtc, int ConfigReloadCount, int ConfigReloadRejectedCount);
public sealed record ListenersAggregate(int Bound, int Configured);
public sealed record PlcStatus(
string Name, string Host, int ListenPort,
PlcListenerStatus Listener,
PlcClientsStatus Clients,
PlcPdusStatus Pdus,
PlcBackendStatus Backend,
PlcBytesStatus Bytes);
public sealed record PlcListenerStatus(string State, string? LastBindError, int RecoveryAttempts);
public sealed record PlcClientsStatus(int Connected, IReadOnlyList<ClientSnapshot> RemoteEndpoints);
public sealed record ClientSnapshot(string Remote, DateTimeOffset ConnectedAtUtc, long PdusForwarded);
public sealed record PlcPdusStatus(long Forwarded, FcCounts ByFc, long RewrittenSlots, long PartialBcdWarnings);
public sealed record FcCounts(long Fc03, long Fc04, long Fc06, long Fc16, long Other);
public sealed record PlcBackendStatus(long ConnectsSuccess, long ConnectsFailed, ExceptionCounts ExceptionsByCode, double LastRoundTripMs);
public sealed record ExceptionCounts(long Code01, long Code02, long Code03, long Code04);
public sealed record PlcBytesStatus(long UpstreamIn, long UpstreamOut);
```
## Tests required
### Unit (`Category = Unit`)
`StatusSnapshotBuilderTests` (≥ 6 tests):
1. `Build_NoPlcsConfigured_ReturnsEmptyPlcList`
2. `Build_OnePlcBound_PopulatesListenerState_Bound`
3. `Build_PlcRecovering_PopulatesLastBindError_AndAttempts`
4. `Build_AggregatesListenersBoundAndConfigured`
5. `Build_PerClientSnapshot_Includes_RemoteAndConnectedAt_AndPduCount`
6. `Build_ServiceFields_IncludeUptime_Version_AndLastReload`
`StatusHtmlRendererTests` (≥ 3 tests):
1. `Render_OnePlc_ProducesValidHtml_WithMetaRefresh`
2. `Render_RecoveringPlc_HighlightsState`
3. `Render_PageWeightUnder50KB_For54Plcs` — assert character length.
### E2E (`Category = E2E`)
`AdminEndpointTests` (≥ 5 tests, against a live in-process Kestrel + simulator):
1. `Get_StatusJson_ReturnsValidShape`
2. `Get_StatusJson_AfterReadFC03_ShowsPduCountIncreased`
3. `Get_StatusJson_AfterPartialBcdWrite_ShowsPartialBcdWarning`
4. `Get_Root_ReturnsHtml_WithMetaRefresh`
5. `AdminPort_BindFailure_ServiceStaysUp_AndLogsBindFailed` — pre-bind the AdminPort, start the service, assert proxy listeners come up and the admin endpoint logs the failure.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00–06 tests still green.
- [ ] All new unit + e2e tests green.
- [ ] `/status.json` shape matches the field tables in [`../design.md`](../design.md) → "Status page" exactly (field names, casing, nesting).
- [ ] Counters on the read path (`PdusForwarded`, etc.) remain allocation-free; `Snapshot()` is the only allocating call and it's on the cold path.
- [ ] AdminPort collision is logged but does NOT take down the proxy.
- [ ] Hot-reload of `AdminPort` works (verified by adding a test in this phase or extending one of phase 06's e2e tests).
## Out of scope
- Authentication / authorisation on the admin port. Design explicitly defers to network-layer trust.
- Prometheus exposition format. The `/status.json` shape is the contract; downstream tools can transform.
- WebSocket push of counters. Meta-refresh is good enough at 54 PLCs.
- Historical counter retention (rolling windows, time series). Counters are cumulative since process start; restart resets.
- Per-tag-level telemetry (which BCD addresses got rewritten how often). The per-PLC `RewrittenSlots` total is enough; finer granularity goes in a future phase if needed.
## Notes for the subagent
- Use the minimal-API style for the two endpoints; no controllers. The whole admin endpoint is ~50 lines of map / handler code.
- `System.Text.Json` source generation needs `[JsonSerializable]` on the DTO chain. Don't use reflection-based serialization in this codebase — it adds AOT-unsafety and is slower for the simple shape.
- For the HTML page, embed CSS in a `<style>` block. Do not link external stylesheets — the admin endpoint must work over a firewalled network with no internet egress.
- Test 3 of `AdminEndpointTests` requires triggering a partial-BCD warning, which means configuring a 32-bit BCD tag and reading only one half of it through the proxy. This is the same scenario phase 04's e2e test 5 exercised; reuse the setup.
- The admin port collision test is important: an operator misconfiguration must not take down the proxy itself. Log Error, continue running.
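
The source-generation note above can be illustrated with a minimal sketch. It reuses the `ExceptionCounts` record declared in this phase; the context class name `StatusJsonContext` is an assumption for illustration, and the real DTO chain would list every status type rather than one record.

```csharp
using System;
using System.Text.Json;
using System.Text.Json.Serialization;

// Serialize through the generated type info instead of reflection.
var counts = new ExceptionCounts(Code01: 0, Code02: 3, Code03: 0, Code04: 1);
string json = JsonSerializer.Serialize(counts, StatusJsonContext.Default.ExceptionCounts);
Console.WriteLine(json);

// Record copied from the public surface of this phase.
public sealed record ExceptionCounts(long Code01, long Code02, long Code03, long Code04);

// Hypothetical context name. One [JsonSerializable] attribute is needed per root type;
// the camelCase policy matches the JSON casing convention the plan describes.
[JsonSourceGenerationOptions(PropertyNamingPolicy = JsonKnownNamingPolicy.CamelCase)]
[JsonSerializable(typeof(ExceptionCounts))]
internal partial class StatusJsonContext : JsonSerializerContext
{
}
```

Passing `StatusJsonContext.Default.ExceptionCounts` explicitly keeps the call on the source-generated path, so a type missing from the context shows up as a compile error rather than a runtime fallback.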
# Phase 08 — Windows service hardening
Install / uninstall scripts, graceful shutdown, Windows Event Log integration, and the public-facing `README.md` that the root `wwtools/CLAUDE.md` index points at. This is the "ship it" phase.
**Depends on:** Phase 04 (rewriter), Phase 07 (status page).
**Parallel-safe with:** nothing.
## Goal
After this phase, an operator can:
1. `dotnet publish` the service into a self-contained folder.
2. Run `install.ps1` to register it as a Windows service.
3. See it appear in `services.msc` running as `Local System` (default — overridable to a managed service account).
4. Stop it cleanly via `sc.exe stop mbproxy`; the service finishes all in-flight PDUs and exits within 10 s.
5. Read crash reasons from the Windows Event Log alongside the Serilog rolling-file output.
6. Read [`../../mbproxy/README.md`](../../mbproxy/README.md) to figure all of this out without needing to talk to a developer.
## Outputs
```
mbproxy/README.md # tool-level human entry point (per DOCS-GUIDE Layer 2)
mbproxy/install/install.ps1 # registers the service
mbproxy/install/uninstall.ps1 # removes it
mbproxy/install/mbproxy.config.template.json # commented appsettings.json for ops
mbproxy/docs/operations.md # ops runbook (install, upgrade, troubleshooting)
src/Mbproxy/Diagnostics/ShutdownCoordinator.cs # graceful-shutdown helper
src/Mbproxy/Diagnostics/EventLogBridge.cs # logs critical events to Windows Event Log
tests/Mbproxy.Tests/Diagnostics/ShutdownCoordinatorTests.cs
```
Modifications:
- `src/Mbproxy/Program.cs` — wire `ShutdownCoordinator` into the host-stop signal. Wire `EventLogBridge` as a Serilog sub-sink for events at Error and above when running under Windows Service (`WindowsServiceHelpers.IsWindowsService()` true).
- `mbproxy/Mbproxy.csproj`: set `<PublishSingleFile>true</PublishSingleFile>` and `<SelfContained>true</SelfContained>` for the publish profile.
- `../CLAUDE.md` (the root `wwtools/CLAUDE.md`) — update the `mbproxy` index row to point at the new `mbproxy/README.md` (per the maintenance note in `mbproxy/CLAUDE.md`).
- `mbproxy/CLAUDE.md` — update the "Current state" section to reflect the post-implementation state (no longer "no code yet"), and the Maintenance section to note that the README is now the canonical human entry point.
## Tasks
1. **`mbproxy/README.md`** — follows the DOCS-GUIDE Layer-2 template exactly. Required sections in order: one-sentence identification, hard constraints / prerequisites, layout, resource index, build & run, install. Cross-link to `docs/design.md`, `docs/plan/README.md`, `docs/operations.md`, `CLAUDE.md`. No deep prose tutorials; the README routes.
2. **`mbproxy/install/install.ps1`** — parameters: `-InstallPath <path>` (default `C:\Program Files\Mbproxy`), `-ServiceName <name>` (default `mbproxy`), `-DisplayName <text>`, `-Account <managed-service-account>` (default `LocalSystem`). Behaviour:
- Verifies admin rights; fails with a clear message if not elevated.
- Copies the publish output (passed via `-PublishOutput <path>`) to `InstallPath`.
- Runs `sc.exe create <ServiceName> binPath= "<InstallPath>\Mbproxy.exe" start= auto displayName= "<DisplayName>" obj= <Account>`.
- Sets the failure-action policy: restart after 60 s on first/second failure, no restart on subsequent (`sc.exe failure ...`).
- Creates `%ProgramData%\mbproxy\logs\` with appropriate ACLs.
- Copies `mbproxy.config.template.json` to `%ProgramData%\mbproxy\appsettings.json` if no config exists.
- Optionally starts the service if `-Start` flag is passed.
3. **`mbproxy/install/uninstall.ps1`** — stops the service if running, `sc.exe delete <ServiceName>`, removes `InstallPath` (with `-KeepConfig` flag to preserve `%ProgramData%\mbproxy\appsettings.json`).
4. **`mbproxy/install/mbproxy.config.template.json`** — a fully commented `appsettings.json` showing the full schema with example values and inline `//` comments describing every field. (Use `appsettings.jsonc` semantics; .NET's configuration loader tolerates `//` comments when configured to.)
5. **`ShutdownCoordinator.cs`** — orchestrates graceful shutdown on `IHostApplicationLifetime.ApplicationStopping`:
- Stop accepting new upstream connections on all `PlcListenerSupervisor`s.
- Wait for in-flight PDUs to complete with a `10 s` deadline (configurable via `Connection.GracefulShutdownTimeoutMs`, default 10000).
- Stop the admin endpoint.
- Cancel all remaining work. Log `mbproxy.shutdown.complete` with `InFlightAtCancel` count.
6. **`EventLogBridge.cs`** — adds a Serilog sub-sink that writes events with level >= Error to the Windows Event Log under source `mbproxy`. Only enabled when running as a Windows Service. The install script creates the event source.
7. **`mbproxy/docs/operations.md`** — operations runbook:
- Install / uninstall steps (mirror to `README.md`).
- Upgrade procedure (stop service, copy new binaries, start).
- Where logs live, how to roll them, retention defaults.
- Common failure modes (port already in use, PLC unreachable, BCD validation reject) with the relevant log event names and what to check.
- The `services.msc` / `sc.exe` / `Get-Service` commands operators will actually use.
- How to safely edit `appsettings.json` for hot-reload (with the rejection-keeps-old-config promise).
## Public surface declared in this phase
```csharp
namespace Mbproxy.Diagnostics;
internal sealed class ShutdownCoordinator {
public Task ShutdownAsync(int timeoutMs, CancellationToken hostCt);
}
internal sealed class EventLogBridge { /* Serilog sub-sink */ }
```
No additional public types are needed; all surfaces from previous phases remain stable.
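
The drain-with-deadline step of `ShutdownAsync` could look roughly like the sketch below. It is not the shipped implementation: `DrainSketch`, the `allInFlightDone` task, and the `inFlightCount` callback are hypothetical stand-ins for whatever the coordinator uses to track in-flight PDUs. `Task.WaitAsync` requires .NET 6 or later.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Usage: one request that never completes, with a 50 ms deadline.
var never = new TaskCompletionSource();
int left = await DrainSketch.DrainAsync(never.Task, () => 1, 50, CancellationToken.None);
Console.WriteLine($"InFlightAtCancel={left}");

internal static class DrainSketch
{
    // Waits for in-flight work up to timeoutMs. Returns 0 on a clean drain, otherwise
    // the number of requests still pending when the deadline (or host token) fired.
    public static async Task<int> DrainAsync(
        Task allInFlightDone, Func<int> inFlightCount, int timeoutMs, CancellationToken hostCt)
    {
        try
        {
            await allInFlightDone.WaitAsync(TimeSpan.FromMilliseconds(timeoutMs), hostCt);
            return 0;
        }
        catch (TimeoutException)
        {
            return inFlightCount(); // deadline hit: caller cancels remaining work
        }
        catch (OperationCanceledException)
        {
            return inFlightCount(); // host gave up before the deadline
        }
    }
}
```

The returned count maps onto the `InFlightAtCancel` field of the `mbproxy.shutdown.complete` log event.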
## Tests required
### Unit (`Category = Unit`)
`ShutdownCoordinatorTests` (≥ 4 tests):
1. `Shutdown_NoActiveConnections_CompletesImmediately`
2. `Shutdown_OneActiveConnection_WaitsForCompletion`
3. `Shutdown_TimeoutExceeded_CancelsRemainingWork_AndReportsCount`
4. `Shutdown_AdminEndpointStopped_AfterListenersStopped` — ordering test.
### E2E (`Category = E2E`)
`ShutdownE2ETests` (≥ 2 tests, against simulator):
1. `E2E_StopHost_WithConnectedClient_DrainsCleanlyWithin10s` — start host, connect NModbus, issue 5 back-to-back FC03 reads, signal host stop, assert all 5 complete and the client's TCP socket is closed cleanly.
2. `E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout` — same but with a `Connection.BackendRequestTimeoutMs` that exceeds the shutdown deadline; assert shutdown completes within the deadline and the in-flight request was cancelled.
### Manual / smoke
- Install the service via `install.ps1` on a clean test VM; confirm it appears in `services.msc` with `Local System` identity.
- `sc.exe start mbproxy` — service starts, admin endpoint at `http://localhost:8080/` shows the proxy is up.
- Send `sc.exe stop mbproxy` — service stops within 10 s.
- Trigger a crash by killing the process with Task Manager, then confirm an entry appears in Windows Event Log under source `mbproxy`. (Corrupting `appsettings.json` and reloading is not a crash trigger: a bad reload is rejected gracefully.)
- Run `uninstall.ps1`: the service is removed cleanly, and `%ProgramData%\mbproxy\` is preserved only if `-KeepConfig` was passed.
The manual smoke results go into `docs/operations.md` as a "first install" verification checklist.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-07 tests still green.
- [ ] All new unit tests green.
- [ ] All e2e shutdown tests green.
- [ ] `mbproxy/README.md` exists, follows the DOCS-GUIDE Layer-2 template, and routes into deep docs without duplicating their content.
- [ ] Root `wwtools/CLAUDE.md` index row for `mbproxy` points at `mbproxy/README.md` (was previously pointing into the design plan or the bare folder).
- [ ] `install.ps1` and `uninstall.ps1` are idempotent — re-running install when the service already exists is a clean no-op or update, not a hard error.
- [ ] Windows Event Log source is created during install and removed during uninstall.
- [ ] `dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true /p:PublishSingleFile=true` produces a single executable under 50 MB.
- [ ] Manual smoke checklist in `docs/operations.md` has been executed on at least one fresh VM and the result documented.
## Out of scope
- Linux / Docker packaging. The design fixes Windows Service as the deployment target.
- Centralised log aggregation (Splunk forwarder config, Elastic agent, etc.). Document where the logs are; let ops integrate.
- A signed installer (MSI / setup.exe). PowerShell-driven install is the contract; an MSI can be added later if procurement demands it.
- Metric exposition for Prometheus / OpenTelemetry. The status page's `/status.json` is sufficient for the operational needs declared in the design.
## Notes for the subagent
- The Windows Event Log source creation requires admin rights — that's already a precondition for `install.ps1`. Do not try to create the source at runtime from the service itself (it would fail when the service runs as a non-admin account).
- Single-file publish makes `Assembly.GetExecutingAssembly().Location` empty. If `AssemblyVersionAccessor` (phase 07) used that, swap to `Assembly.GetExecutingAssembly().GetCustomAttribute<AssemblyInformationalVersionAttribute>()`.
- The `mbproxy/README.md` is what an operator reads first. Be ruthless about length — aim for under 100 lines. The DOCS-GUIDE says routes, not tutorials.
- After this phase merges, the project is feature-complete against [`../design.md`](../design.md). Any further work belongs in a NEW design revision (dated, in the same doc) and a new phase plan.
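
For the single-file publish note above, a version lookup that avoids `Assembly.Location` might look like this. `VersionSketch` is a hypothetical helper name; the plan's `AssemblyVersionAccessor` may be shaped differently.

```csharp
using System;
using System.Reflection;

string version = VersionSketch.Read(Assembly.GetExecutingAssembly());
Console.WriteLine(version);

internal static class VersionSketch
{
    // Reads the informational version from assembly metadata, which stays available
    // in a single-file bundle, and falls back to the assembly version if it is absent.
    public static string Read(Assembly assembly) =>
        assembly.GetCustomAttribute<AssemblyInformationalVersionAttribute>()?.InformationalVersion
        ?? assembly.GetName().Version?.ToString()
        ?? "unknown";
}
```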
# Phase 09 — MBAP TxId multiplexing (single backend connection per PLC)
Replace the 1:1 upstream-client ↔ backend-socket model with a **single backend connection per PLC**, multiplexed across all upstream clients via MBAP transaction-ID rewriting and a correlation map. After this phase the H2-ECOM100's 4-simultaneous-TCP-client cap is no longer an operational ceiling — the proxy holds exactly one slot per PLC regardless of how many upstream clients are connected.
**Status:** shipped 2026-05-14. Phases 00-08 shipped the production-ready 1:1 model; this phase swapped connection management without changing the transparent-rewrite contract.
## Implementation clarifications discovered during 2026-05-14 ship
These notes capture decisions and surprises that surfaced during the actual implementation. They supplement (not replace) the Tasks section below.
1. **A per-request timeout watchdog is part of Phase 9, not deferred.** The 1:1 model collapsed missing-response handling onto the dedicated backend socket dying. The multiplexed model needs an explicit timer because a single lost or mis-routed response would otherwise leak a correlation entry forever and hang the upstream pipe indefinitely. The watchdog ticks at quarter-`BackendRequestTimeoutMs` (min 100 ms), scans the correlation map, and times out stale requests with **Modbus exception 0x0B (Gateway Target Device Failed To Respond)** delivered to the upstream party with the original TxId restored. Log event `mbproxy.multiplex.request.timeout` (Warning).
2. **PlcListener constructs a multiplexer unconditionally.** The Phase-9 draft had `PlcListener` conditionally construct the multiplexer only when a `PerPlcContext` was supplied; the no-context fallback dropped accepted upstream sockets. Tests (and any pre-Phase-6 startup path that lacked a context) hit a regression. The fix is to construct a minimal default `PerPlcContext` from the `PlcOptions` if the caller didn't supply one, and require `_multiplexer` to be non-null when `RunAsync` runs.
3. **`BackendConnectFailure_ClosesUpstreamCleanly` is now lazy.** The 1:1 model attempted a backend connect at upstream-accept time, so simply opening a TCP connection to a proxy with a bad backend triggered the close. The multiplexed model connects to the backend on the *first upstream frame*, so the test has to send a Modbus request before the proxy attempts the (failing) backend connect that causes the upstream close. Updated in-place.
4. **pymodbus 3.13.0 simulator is broken under multiplexed concurrent requests.** Its `ServerRequestHandler` keeps a single `last_pdu` per connection and schedules `handle_later` via `asyncio.call_soon`; two MBAP frames in one recv buffer overwrite `last_pdu` before the first handler runs, and both responses carry the later TxId. The real DL260 ECOM properly echoes per-request TxIds. Consequence for tests:
- **Mux correctness under truly concurrent backend traffic is proven against the stub backend in `PlcMultiplexerTests`**, which models the DL260's correct TxId-echo behaviour.
- **`MultiplexerE2ETests` paces requests** so pymodbus only ever sees one MBAP frame at a time on the shared backend connection. The headline test (`E2E_FiveSimultaneousClients_AllReadHR1072_AllGetDecoded_1234`) verifies the connection ceiling lift (5 simultaneous upstream connections, where Phase-08's 1:1 model would have refused the 5th) — *not* the under-concurrency multiplexing behaviour.
- **The watchdog is the production defence** if any real backend (or future simulator) ever mis-echoes a TxId: stale entries time out cleanly with exception 0x0B rather than hanging upstream clients.
5. **E2E timeouts.** Per `docs/plan/README.md`'s Test discipline, all E2E tests are 5 s by default. Hot-reload tests that genuinely need 5 s + 3 s of propagation windows carry a 10 s timeout with a one-line comment; `E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` carries 8 s for its sequential connects + Polly-paced reconnect path.
6. **`AsyncHostDispose` deadlock note.** Test fixtures that hold `IHost` via `await using` were originally written with a 5 s shutdown timeout; under Phase 9's drained-channel cleanup that occasionally exceeded the test's own `Timeout = 5000`. Reduced to 2-3 s where it doesn't materially affect the test's drain semantics.
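
The watchdog arithmetic and the exception frame from clarification 1 can be sketched as follows. `WatchdogSketch` and its method names are illustrative assumptions; the frame layout (7-byte MBAP header, function code with the high bit set, one exception-code byte) follows the Modbus TCP framing.

```csharp
using System;

byte[] frame = WatchdogSketch.BuildTimeoutResponse(originalTxId: 0x1234, unitId: 1, functionCode: 0x03);
Console.WriteLine(BitConverter.ToString(frame));   // 12-34-00-00-00-03-01-83-0B
Console.WriteLine(WatchdogSketch.TickPeriodMs(3000)); // 750

internal static class WatchdogSketch
{
    public const byte GatewayTargetFailedToRespond = 0x0B;

    // Quarter of the request timeout, floored at 100 ms.
    public static int TickPeriodMs(int backendRequestTimeoutMs) =>
        Math.Max(100, backendRequestTimeoutMs / 4);

    // An entry is stale once it has waited at least the full request timeout.
    public static bool IsStale(DateTimeOffset sentAtUtc, DateTimeOffset nowUtc, int backendRequestTimeoutMs) =>
        nowUtc - sentAtUtc >= TimeSpan.FromMilliseconds(backendRequestTimeoutMs);

    // MBAP header + exception PDU, carrying the upstream client's original TxId.
    public static byte[] BuildTimeoutResponse(ushort originalTxId, byte unitId, byte functionCode) => new byte[]
    {
        (byte)(originalTxId >> 8), (byte)(originalTxId & 0xFF), // transaction id, big-endian
        0x00, 0x00,                                             // protocol id (Modbus)
        0x00, 0x03,                                             // length: unit id + 2 PDU bytes
        unitId,
        (byte)(functionCode | 0x80),                            // exception flag on the function code
        GatewayTargetFailedToRespond,
    };
}
```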
**Depends on:** Phase 04 (rewriter), Phase 05 (supervisor + Polly), Phase 07 (status page DTO surface).
**Parallel-safe with:** nothing within itself. **Hard rule.** This phase deletes `PlcConnectionPair` and rewires the supervisor + rewriter correlation path simultaneously; the cross-cut is too broad for safe parallel work. The optional intra-phase slicing (below) is the closest thing to parallel.
## Goal
The H2-ECOM100 accepts 4 concurrent TCP clients per PLC; today's 1:1 model means the 5th upstream client to the same proxy port fails at backend connect. This phase eliminates that ceiling by making **one persistent backend socket per PLC**, with the proxy serving as a connection multiplexer that rewrites MBAP transaction IDs to keep concurrent in-flight requests from different upstream clients distinguishable on the single wire.
The wire-rate ceiling does not change — the H2-ECOM100 internally serializes requests (one per PLC scan, ~2-10 ms scan time) regardless of how many TCP connections it has. We're shifting where serialization happens (proxy outbound queue vs PLC accept queue), not adding throughput. The dashboard pay-off is that "PLC clients connected" can rise into the dozens without the proxy degrading.
## Intra-phase slicing (the closest thing to parallel-safe within this phase)
The phase is one merge but can be implemented as five small commits in this order:
| Slice | Output | Files touched | Hours | Parallelizable? |
|-------|--------|---------------|-------|-----------------|
| 9.1 | Pure data types (TxIdAllocator, CorrelationMap, InFlightRequest) + their unit tests | new files under `src/Mbproxy/Proxy/Multiplexing/` and `tests/...` | ~5 | Yes — pure logic, disjoint from rest. A second agent can write the E2E test scaffolding (slice 9.5) in parallel. |
| 9.2 | `PlcMultiplexer` + `UpstreamPipe` skeleton with backend reader/writer loops | new files in `Multiplexing/` | ~10 | No — depends on 9.1's data types. |
| 9.3 | Refactor `PlcListener` to own the multiplexer; delete `PlcConnectionPair`; rewire supervisor | modifies existing Proxy + Supervision files | ~8 | No — depends on 9.2. |
| 9.4 | Update `BcdPduPipeline` to use correlation entries (drop `PerPlcContextWithRequest`); counter additions; status DTO + HTML updates | modifies pipeline + admin files | ~6 | No — depends on 9.3. |
| 9.5 | Full E2E test suite + design.md + CLAUDE.md doc updates | new test file + doc edits | ~6 | Test-writing yes (slice 9.5 skeleton can land in parallel with 9.1); the doc edits at the end are sequential after 9.3. |
**Total:** ~35 hours. With one parallel agent producing slice 9.1's data types and another sketching the e2e test fixtures during slice 9.5-prep, calendar time can compress to ~28 hours.
## Outputs (new files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # single backend conn owner; mux logic
src/Mbproxy/Proxy/Multiplexing/UpstreamPipe.cs # per-upstream-client reader/writer
src/Mbproxy/Proxy/Multiplexing/TxIdAllocator.cs # 16-bit allocator with wrap tracking
src/Mbproxy/Proxy/Multiplexing/CorrelationMap.cs # proxyTxId → InFlightRequest
src/Mbproxy/Proxy/Multiplexing/InFlightRequest.cs # the correlation record
src/Mbproxy/Proxy/Multiplexing/MultiplexerLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Multiplexing/TxIdAllocatorTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/CorrelationMapTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/PlcMultiplexerTests.cs # integration, real sockets
tests/Mbproxy.Tests/Proxy/Multiplexing/RewriterCorrelationTests.cs # rewriter w/ multiplexed paths
tests/Mbproxy.Tests/Proxy/Multiplexing/MultiplexerE2ETests.cs # against pymodbus sim
```
## Files modified (existing files in this phase)
```
src/Mbproxy/Proxy/PlcListener.cs # owns PlcMultiplexer; accept loop hands sockets to it
src/Mbproxy/Proxy/PlcConnectionPair.cs # DELETED — replaced by UpstreamPipe + Multiplexer
src/Mbproxy/Proxy/IPduPipeline.cs # PduContext gains in-flight correlation entry
src/Mbproxy/Proxy/PerPlcContext.cs # delete PerPlcContextWithRequest; replaced by InFlightRequest passed per-call
src/Mbproxy/Proxy/BcdPduPipeline.cs # FC03/04 response decodes via InFlightRequest, not last-request slot
src/Mbproxy/Proxy/ProxyCounters.cs # new fields: InFlightCount, MaxInFlight, TxIdWraps, BackendDisconnectCascades, BackendQueueDepth
src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs # supervises mux lifecycle alongside listener
src/Mbproxy/Admin/StatusDto.cs # PlcBackendStatus gains the new mux fields
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate mux fields from counters
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show inFlight/max-in-flight in the per-PLC row
docs/design.md # rewrite Connection model + Failure modes for multiplexed reality
mbproxy/CLAUDE.md # flip Architecture summary's connection-model bullet
docs/kpi.md # update operational notes referring to 4-client cap
```
## Tasks
### 9.1 Data types (pure logic)
1. **`TxIdAllocator`** — `internal sealed class TxIdAllocator`. State: `_inUse` (`bool[65536]` for O(1) lookup; ~64 KB), `_next` (`ushort`), `_inFlightCount` (long), `_wrapCount` (long). Methods:
- `bool TryAllocate(out ushort id)` — atomic via `lock` (the allocator is per-PLC, contention is low). Scans forward from `_next` for the next free slot; sets `_inUse[id] = true`; bumps `_next`. Returns `false` if `_inFlightCount == 65536` (saturated; emit `mbproxy.multiplex.saturated` Error and let caller decide to drop or queue).
- `void Release(ushort id)` — clears `_inUse[id]`; decrements `_inFlightCount`.
- `int InFlightCount { get; }`, `long WrapCount { get; }` — for telemetry.
- **Wrap counter:** increment whenever `_next` rolls over `0xFFFF → 0x0000`.
2. **`InFlightRequest` + `InterestedParty`** — `InterestedParty` is `internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId)`. `InFlightRequest` is `internal sealed record InFlightRequest(byte UnitId, byte Fc, ushort StartAddress, ushort Qty, IReadOnlyList<InterestedParty> InterestedParties, DateTimeOffset SentAtUtc)`. Carries enough state for: (a) restoring each party's original TxId on the way back, (b) the FC03/04 correlation the rewriter needs (start/qty), (c) routing the response to each interested upstream socket, (d) round-trip-time measurement.
**In Phase 9 `InterestedParties` always contains exactly one element.** The list shape is forward-compat with [Phase 10 — read coalescing](10-read-coalescing.md), which extends the same record to fan-out responses to multiple upstream clients without further refactor of the multiplexer's data model. Resist any reviewer suggestion to simplify it back to a single `UpstreamPipe Upstream` field — the list shape is the load-bearing foundation for Phase 10.
3. **`CorrelationMap`** — wraps a `ConcurrentDictionary<ushort, InFlightRequest>`. Methods: `bool TryAdd(ushort, InFlightRequest)`, `bool TryRemove(ushort, out InFlightRequest)`, `int Count { get; }`, `IReadOnlyCollection<InFlightRequest> Snapshot()` (for diagnostics; allocates a list). The dict is correct-by-construction for the mux's single-writer-add / single-reader-remove pattern; `ConcurrentDictionary` keeps it safe if/when we add upstream-side cancellation.
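
A minimal sketch of the allocator described in task 1, under the stated design (`bool[65536]`, `lock`, forward scan from `_next`). It is an illustration, not the shipped class: the private `Advance` helper is an assumption, the in-flight counter is held as an `int` to match the `InFlightCount` property, and the saturation log event is left to the caller.

```csharp
using System;

var alloc = new TxIdAllocator();
alloc.TryAllocate(out ushort a);
alloc.TryAllocate(out ushort b);
alloc.Release(a);
alloc.TryAllocate(out ushort c); // scans forward from _next, so the freed id is not reused yet
Console.WriteLine($"{a} {b} {c} inFlight={alloc.InFlightCount}");

internal sealed class TxIdAllocator
{
    private readonly object _gate = new();
    private readonly bool[] _inUse = new bool[65536];
    private ushort _next;
    private int _inFlightCount;
    private long _wrapCount;

    public bool TryAllocate(out ushort id)
    {
        lock (_gate)
        {
            if (_inFlightCount == 65536) { id = 0; return false; } // saturated
            while (_inUse[_next]) Advance();                       // terminates: a free slot exists
            id = _next;
            _inUse[id] = true;
            _inFlightCount++;
            Advance();
            return true;
        }
    }

    public void Release(ushort id)
    {
        lock (_gate)
        {
            if (!_inUse[id]) return; // releasing a non-allocated id is a no-op
            _inUse[id] = false;
            _inFlightCount--;
        }
    }

    public int InFlightCount { get { lock (_gate) return _inFlightCount; } }
    public long WrapCount { get { lock (_gate) return _wrapCount; } }

    // Moves _next forward by one, counting each 0xFFFF -> 0x0000 rollover.
    private void Advance()
    {
        if (_next == ushort.MaxValue) { _next = 0; _wrapCount++; }
        else _next++;
    }
}
```

Because allocation always scans forward, a released ID is only handed out again after `_next` comes back around, which keeps recently used TxIds off the wire for as long as possible.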
### 9.2 Multiplexer + UpstreamPipe
4. **`UpstreamPipe`** — `internal sealed class UpstreamPipe : IAsyncDisposable`. One instance per accepted upstream socket. Fields: `Socket _upstream`, `Guid _id`, `IPEndPoint _remoteEp`, `DateTimeOffset _connectedAtUtc`, `volatile bool _alive`, `Channel<byte[]> _responseChannel` (capacity 16). Two tasks:
- **Read task**: pumps inbound MBAP frames from `_upstream` to a per-pipe `OnFrame` callback (registered by the multiplexer).
- **Write task**: drains `_responseChannel` and writes each frame back to `_upstream`.
On fault: sets `_alive = false`, closes the socket, the multiplexer notices on next correlation lookup and drops responses bound for this pipe.
5. **`PlcMultiplexer`** — `internal sealed class PlcMultiplexer : IAsyncDisposable`. One instance per PLC. Fields: backend `Socket`, `TxIdAllocator`, `CorrelationMap`, `Channel<byte[]> _outboundChannel` (cap 256), `PerPlcContext _ctx` (tag map + counters + logger), list of attached `UpstreamPipe`s. Two backend tasks plus a fan-in:
- **Backend writer task**: drains `_outboundChannel` → writes to backend socket. Single writer; no synchronization on the socket needed.
- **Backend reader task**: reads MBAP frames from backend → looks up `proxyTxId` in `CorrelationMap` → calls `pipeline.Process(ResponseToClient, header, pdu, ctx with InFlight)` → for each `InterestedParty` in `InFlightRequest.InterestedParties` (always exactly one in Phase 9; list-of-N once Phase 10 ships): writes a copy of the frame with that party's `OriginalTxId` restored in the MBAP header to the party's `UpstreamPipe._responseChannel` (or drops silently for that party if its pipe is `_alive = false`) → `CorrelationMap.TryRemove(proxyTxId)` + `TxIdAllocator.Release(proxyTxId)`.
- **Per-upstream `OnFrame`**: invoked by each `UpstreamPipe`'s read task. Steps:
1. Parse MBAP: original TxId, length, unitId, PDU.
2. `TryAllocate` a proxyTxId. If saturated, write a Modbus exception response (Slave Device Failure, code 04) back to upstream and continue.
3. Build `InFlightRequest` (parse FC/start/qty from PDU if FC03/04 — needed for FC06 too if we want the symmetric correlation later).
4. `TryAdd` to correlation map.
5. Call `pipeline.Process(RequestToBackend, ...)` to apply BCD rewriting.
6. Overwrite MBAP TxId bytes with proxyTxId.
7. Enqueue the modified frame into `_outboundChannel`.
6. **Backend disconnect handling** — when the backend reader/writer task throws (socket closed, network reset, etc.):
- Stop both tasks; close the backend socket.
- Walk the correlation map; for each entry, close that entry's `UpstreamPipe` (cascade). Increment `BackendDisconnectCascades` by the upstream-pipe count.
- Clear correlation map and TxIdAllocator.
- The supervisor's Polly pipeline takes over for backend reconnect — when the next upstream request arrives, the multiplexer attempts a fresh backend connection through the Polly pipeline.
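
Steps 1, 3, and 6 of `OnFrame`, and the TxId restore on the response path, come down to a few big-endian reads and writes on the MBAP frame. The sketch below assumes a complete frame is already in a buffer; `MbapSketch` and its method names are hypothetical.

```csharp
using System;
using System.Buffers.Binary;

// FC03 read of 2 registers at address 1072 (0x0430), upstream TxId 0x0001.
byte[] frame = { 0x00, 0x01, 0x00, 0x00, 0x00, 0x06, 0x01, 0x03, 0x04, 0x30, 0x00, 0x02 };
var (fc, start, qty) = MbapSketch.ParseReadRequest(frame);
ushort original = MbapSketch.SwapTxId(frame, 0xBEEF); // request path: stamp the proxy TxId
Console.WriteLine($"fc={fc} start={start} qty={qty} original={original}");
MbapSketch.SwapTxId(frame, original);                 // response path: restore the client's TxId

internal static class MbapSketch
{
    // Overwrites bytes 0-1 of the MBAP header and returns the TxId that was there.
    public static ushort SwapTxId(Span<byte> frame, ushort newTxId)
    {
        ushort old = BinaryPrimitives.ReadUInt16BigEndian(frame);
        BinaryPrimitives.WriteUInt16BigEndian(frame, newTxId);
        return old;
    }

    // For FC03/04 requests: function code at offset 7, start address at 8, quantity at 10.
    public static (byte Fc, ushort Start, ushort Qty) ParseReadRequest(ReadOnlySpan<byte> frame) =>
        (frame[7],
         BinaryPrimitives.ReadUInt16BigEndian(frame.Slice(8, 2)),
         BinaryPrimitives.ReadUInt16BigEndian(frame.Slice(10, 2)));
}
```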
### 9.3 Listener + supervisor refactor
7. **`PlcListener.RunAsync`** — accept loop changes:
- One `PlcMultiplexer` per listener (constructed in `PlcListenerSupervisor` and handed in).
- On accept: wrap the socket in `UpstreamPipe`, register with the multiplexer via `mux.Attach(pipe)`.
- On listener stop: dispose the multiplexer (which closes the backend + all attached pipes).
- `ActivePairs` property → renamed `ActiveUpstreams` returning the multiplexer's list of attached `UpstreamPipe`s. Status page consumes this.
8. **Delete `PlcConnectionPair.cs`** — entire file. The replacement is `UpstreamPipe` + `PlcMultiplexer`. No backwards-compat shims; we're moving cleanly.
9. **`PlcListenerSupervisor`** — gains ownership of `PlcMultiplexer` alongside the listener. The Polly listener-recovery pipeline is unchanged; the multiplexer has its own internal Polly backend-connect pipeline (same `ResilienceOptions.BackendConnect` shape as today, just owned by the mux instead of the pair).
### 9.4 Rewriter + counters + status page
10. **`BcdPduPipeline`** — the FC03/04 response path stops reading `PerPlcContextWithRequest.LastRequestStart/Qty`. Instead, the multiplexer attaches an `InFlightRequest` to the `PduContext` for each response call:
```csharp
public sealed class PerPlcContext : PduContext {
public BcdTagMap TagMap { get; init; }
public ProxyCounters Counters { get; init; }
public ILogger Logger { get; init; }
public InFlightRequest? CurrentRequest { get; init; } // NEW — non-null on response, null on request
}
```
Concurrency: each backend response is handled on the backend reader task; the request path is handled by the per-upstream read task. Different `InFlightRequest` instances → no contention.
11. **Drop `PerPlcContextWithRequest`** entirely. The last-request-slot pattern was a 1:1-model workaround; the correlation map subsumes it.
12. **`ProxyCounters` additions:**
- `InFlightCount` (`long` snapshot of `CorrelationMap.Count`)
- `MaxInFlight` (`long`, peak observed; .NET has no `Interlocked.Max`, so maintain it with an `Interlocked.CompareExchange` loop)
- `TxIdWraps` (`long` from `TxIdAllocator.WrapCount`)
- `BackendDisconnectCascades` (`long`)
- `BackendQueueDepth` (snapshot of `_outboundChannel.Reader.Count`)
13. **Status page**: `StatusDto.PlcBackendStatus` gains `InFlight`, `MaxInFlight`, `TxIdWraps`, `DisconnectCascades`, `QueueDepth`. `StatusSnapshotBuilder` populates them. `StatusHtmlRenderer` adds a column or compact `[3/256]` indicator per PLC row. The JSON field names land in camelCase per the existing source-gen convention.
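
A lock-free peak counter such as `MaxInFlight` is usually maintained with a compare-and-swap loop, since the `Interlocked` class offers no max operation. The sketch below shows one way to do it; `PeakGauge` is a hypothetical name, not a type from this plan.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

var gauge = new PeakGauge();
Parallel.For(1, 1001, i => gauge.Observe(i)); // concurrent observations of 1..1000
Console.WriteLine(gauge.Max);

internal sealed class PeakGauge
{
    private long _max;

    public long Max => Interlocked.Read(ref _max);

    // Raises _max to candidate if candidate is larger; retries when another thread wins the race.
    public void Observe(long candidate)
    {
        long current = Interlocked.Read(ref _max);
        while (candidate > current)
        {
            long prior = Interlocked.CompareExchange(ref _max, candidate, current);
            if (prior == current) return; // our write landed
            current = prior;              // lost the race; re-check against the newer value
        }
    }
}
```

The loop allocates nothing, so it fits the allocation-free requirement on the hot path.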
### 9.5 Tests + docs
14. **Unit + integration test suites** (see Tests required below).
15. **`docs/design.md` updates:**
- **Connection model** section: rewrite. The diagram changes from "many clients → many backend sockets" to "many clients → one backend socket per PLC, multiplexed by proxy TxId rewriting." The operational consequence warning flips: instead of "5th client fails," it becomes "if backend disconnects, all attached upstream clients are cascaded closed; they reconnect on their own next request."
- **Failure modes** section: amend to describe the cascade behaviour.
- **Rewriter** section: amend to note the rewriter consumes `InFlightRequest` for response correlation (no architectural change, just an update to the description of how correlation flows).
16. **`mbproxy/CLAUDE.md`** Architecture summary: first bullet flips from "1:1 upstream-client ↔ backend-socket" to "single backend socket per PLC, multiplexed via MBAP TxId rewriting."
17. **`docs/kpi.md`** — the "Tier 2 → Connection-cap saturation warning" KPI loses its meaning (4-client cap no longer relevant on the upstream side). Either remove it or repurpose to track in-flight saturation against the 16-bit TxId space (which never realistically saturates but is the new equivalent ceiling).
## Public surface declared in this phase
All `internal sealed` — the multiplexer types are not consumed outside the assembly.
```csharp
namespace Mbproxy.Proxy.Multiplexing;
internal sealed class TxIdAllocator {
public bool TryAllocate(out ushort id);
public void Release(ushort id);
public int InFlightCount { get; }
public long WrapCount { get; }
}
internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);
internal sealed record InFlightRequest(
byte UnitId, byte Fc,
ushort StartAddress, ushort Qty,
IReadOnlyList<InterestedParty> InterestedParties,
DateTimeOffset SentAtUtc);
// Phase 9: InterestedParties.Count is always 1.
// Phase 10 (read coalescing): the same record fans out to N parties without further refactor.
internal sealed class CorrelationMap {
public bool TryAdd(ushort proxyTxId, InFlightRequest req);
public bool TryRemove(ushort proxyTxId, out InFlightRequest req);
public int Count { get; }
public IReadOnlyCollection<InFlightRequest> Snapshot();
}
internal sealed class UpstreamPipe : IAsyncDisposable {
public Guid Id { get; }
public IPEndPoint RemoteEp { get; }
public DateTimeOffset ConnectedAtUtc { get; }
public long PdusForwardedCount { get; }
public bool IsAlive { get; }
public Task RunReadLoopAsync(Func<byte[], Task> onFrame, CancellationToken ct);
public ValueTask SendResponseAsync(byte[] frame, CancellationToken ct);
public ValueTask DisposeAsync();
}
internal sealed class PlcMultiplexer : IAsyncDisposable {
public void Attach(UpstreamPipe pipe);
public IReadOnlyCollection<UpstreamPipe> AttachedPipes { get; }
public Task RunAsync(CancellationToken ct);
public ValueTask DisposeAsync();
}
```
`PerPlcContext` gains a nullable `CurrentRequest` property. `PerPlcContextWithRequest` is removed (along with its `LastRequest*` slots).
## Tests required
### Unit (`Category = Unit`)
**`TxIdAllocatorTests`** (≥ 8 tests):
1. `Allocate_FromEmpty_Returns_NextSequential`
2. `Allocate_AfterRelease_Reuses_FreedId`
3. `Allocate_AllocatesEveryUshort_BeforeWrapping`
4. `Allocate_WrapsCorrectly_After0xFFFF`
5. `Allocate_WhenSaturated_ReturnsFalse_DoesNotThrow`
6. `Release_OfNonAllocated_IsNoOp`
7. `Concurrent_AllocateRelease_NoDuplicateIds_Under_Parallel_Stress` (100 tasks, 1000 ops each)
8. `WrapCount_IncrementsOnEachFullWrap`
**`CorrelationMapTests`** (≥ 5 tests):
1. `TryAdd_Then_TryRemove_RoundTrips`
2. `TryAdd_DuplicateKey_Fails`
3. `TryRemove_OfMissing_ReturnsFalse`
4. `Snapshot_ReflectsCurrentState`
5. `Concurrent_AddRemove_NoDataLoss_Under_Parallel_Stress`
**`PlcMultiplexerTests`** (≥ 7 tests, real sockets, no simulator):
1. `SingleUpstream_RoundTripsFC03_Through_Multiplexer`
2. `SingleUpstream_RoundTripsFC06_Through_Multiplexer`
3. `TwoUpstreams_ConcurrentFC03_BothGetCorrectResponses` — proves TxId rewriting works end-to-end against a stub backend
4. `TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire` — sniff the backend socket; verify per-request TxIds are unique even when upstream TxIds collide
5. `UpstreamDisconnect_DoesNotAffectOtherUpstreams` — drop one client mid-flight; other client's response still arrives
6. `BackendDisconnect_CascadesToAllUpstreams` — kill backend; verify all upstream sockets close within 500 ms, `BackendDisconnectCascades` increments by N
7. `BackendReconnect_AfterCascade_NextUpstreamRequest_Succeeds`
**`RewriterCorrelationTests`** (≥ 4 tests):
1. `FC03Response_DecodedViaInFlightRequest_NotPerPairSlot`
2. `ConcurrentFC03_FromTwoUpstreams_DecodeCorrectly_NoCrossTalk` — set up two `InFlightRequest`s with different start addresses, deliver responses out of order; verify each decodes against its own request
3. `ConcurrentFC06_FromTwoUpstreams_EncodeCorrectly`
4. `ResponseForDeadUpstream_IsDropped_NoExceptionPropagates`
### Integration (`Category = Unit`, no simulator)
These use real `TcpListener` + `Socket` against a stub backend (a `TcpListener` that just echoes or canned-responds). They live in `PlcMultiplexerTests`.
### E2E (`Category = E2E`)
**`MultiplexerE2ETests`** (≥ 5 tests, against pymodbus simulator):
1. `E2E_FiveConcurrentClients_AllReadHR1072_AllGetDecoded_1234` — the headline test. Five NModbus clients connected to the proxy in parallel; pymodbus sim has the BCD register at 1072. All five get `1234`. With Phase 08's 1:1 model, the 5th client would fail at backend connect.
2. `E2E_TwentyConcurrent_FC03_Requests_AcrossThreeClients_AllSucceed`
3. `E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` — kill the sim mid-flight (simulate by closing on its side); verify upstream clients see clean socket close; relaunch sim; new upstream connection succeeds.
4. `E2E_RewriterStillWorks_UnderMultiplexedThreeClients` — three clients each writing different decimal values to different BCD-configured addresses via FC06; verify sim's register state.
5. `E2E_StatusPage_Shows_InFlightAndMaxInFlight` — drive 4 concurrent reads, verify `/status.json` reports `inFlight >= 1` during the burst and `maxInFlight >= 4`.
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All 271 prior tests still green. Specifically: `Forward_FC03_HR1072_Returns_Decoded_1234`, `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`, `MbapTxId_IsPreservedEndToEnd`, and `MbapTxId_StillPreserved_AfterRewriting_20Consecutive` continue to pass against the multiplexed implementation. The MBAP-TxId-preserved tests are the **critical regression guard** — if multiplexing leaks proxy TxIds back to the client, these fail.
- [ ] All new unit tests pass (≥ 24 new in slices 9.1-9.2 alone).
- [ ] All new E2E tests pass (≥ 5).
- [ ] `Forward_FC03_HR1072_Returns_Decoded_1234` PASSES with 5 concurrent NModbus clients connected to the same proxy port. **This is THE phase test.**
- [ ] `PlcConnectionPair.cs` is gone. Grep for the type name across the solution returns zero hits.
- [ ] `PerPlcContextWithRequest` is gone. Grep returns zero hits.
- [ ] `docs/design.md` "Connection model" section is rewritten; the 1:1 model description is gone or moved into a "Historical: pre-Phase-09 model" footnote.
- [ ] `mbproxy/CLAUDE.md` Architecture summary's connection-model bullet is updated.
- [ ] Backend disconnect with N upstream clients in-flight: all N close within 500 ms; counter `BackendDisconnectCascades += N`.
- [ ] `mbproxy.multiplex.saturated` Error event fires if TxId allocator hits 65,536 in-flight. (Stress-test acceptable; manufacture by holding 65,536 pending responses against a stub backend.)
- [ ] Shutdown semantics still work: `ShutdownCoordinator` drains in-flight requests (now visible via `InFlightCount`, not `IsProcessing`).
- [ ] Status page renders the new fields; HTML page weight remains under 50 KB for 54 PLCs.
- [ ] CounterSnapshot's existing field set is preserved — only **added** fields, no renames or removals. Backwards-compat per the policy in `docs/kpi.md`.
## Out of scope
- **Foundation for future caching, not caching itself.** This phase establishes the chokepoint where any future caching or coalescing layer plugs in, but implements no caching of any kind. `InFlightRequest.InterestedParties` is shaped as a list specifically to make [Phase 10 — read coalescing](10-read-coalescing.md) additive without refactor; do not infer caching behavior from the list shape alone. Tier C-2 (short-TTL response cache) and Tier C-3 (periodic poll + cache) remain explicitly out of scope until their own design discussions and `design.md` updates land.
- **Per-tag read coalescing** — if two clients read the same register at the same time, Phase 9's multiplexer sends both requests. Coalescing them into one backend round-trip is the explicit goal of [Phase 10](10-read-coalescing.md), which plugs into the `InterestedParties` seam created here.
- **Backend keepalive / heartbeat** — the design's current "no keepalive" position stands. An idle backend with no upstream activity will die after middlebox timeouts; the next upstream request triggers a fresh connect via Polly. Multiplexing doesn't change this.
- **TxId fairness scheduling** — FIFO order in the `_outboundChannel` is the contract. No round-robin per upstream, no priority. If a single upstream client floods the channel, others queue behind. This is a stated trade-off and matches the ECOM's internal serialization anyway.
- **Pipelined multi-PDU-in-flight per single upstream client** — still unsupported. One in-flight request per upstream pipe at a time. Multiplexing across DIFFERENT upstream clients works fully; multiplexing across multiple in-flight requests from the SAME upstream client does not. Document the constraint.
- **Linux / cross-platform packaging** — still Windows Service only.
## Subagent briefing
If you're the agent picking up this phase, here's the executive summary you need in your head:
1. **You are deleting `PlcConnectionPair`.** Everything that file did is now split between `UpstreamPipe` (the per-client half) and `PlcMultiplexer` (the per-PLC half). Read `PlcConnectionPair.cs` once before you delete it — every behavior in there has a destination in one of the two new classes.
2. **Single-writer / single-reader on the backend socket.** Two tasks share the backend socket: one writes (drained from `_outboundChannel`), one reads (decodes MBAP frames). No third task touches the socket. This invariant is what makes the channel + dictionary design correct without locks.
3. **The rewriter doesn't know about MBAP framing or correlation.** It still receives `(direction, mbapHeader span, pdu span, PerPlcContext ctx)`. The only addition is `ctx.CurrentRequest` (nullable, non-null on response). The rewriter is otherwise unchanged. Resist refactoring it.
4. **`InFlightRequest.SentAtUtc` powers `lastRoundTripMs` correctly across multiplexed clients.** Today's EWMA is per-pair; under multiplexing, the timestamp moves to per-request. The status counter stays the same.
5. **Cascade-on-backend-disconnect is the most subtle behavior.** Get the test for it right early (`BackendDisconnect_CascadesToAllUpstreams`). It's the difference between "graceful failure" and "leaked upstream sockets that hold connections open until OS timeout."
6. **TxId allocator saturation is a real-world impossibility but a stress-test reality.** Hold 65,536 responses in a stub backend; the allocator must refuse the 65,537th cleanly with an exception response code 04, not crash.
7. **Update the docs in the SAME PR as the code.** `design.md` Connection model, `mbproxy/CLAUDE.md` Architecture summary, and `docs/kpi.md` connection-cap KPI either get rewritten or removed. Doc drift is a gate fail.
8. **Do NOT introduce parallel agents within this phase.** The cross-cut is too broad. If you have spare agent budget, slice 9.1 (data types + their unit tests) can run alongside slice 9.5 (e2e test scaffolding writing against the unchanged outer-shape contract) but the middle slices are sequential.
9. **The 4 critical regression tests** that must stay green:
- `Forward_FC03_HR1072_Returns_Decoded_1234`
- `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`
- `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`
- `MbapTxId_IsPreservedEndToEnd` ← THIS is the one that proves multiplexing is transparent.
10. **When in doubt, re-read `BcdPduPipeline.ProcessResponse`.** The FC03/04 correlation logic there is the most subtle existing code that you're touching. Walk through it with one upstream client in mind first, then mentally replay with two; both must work without code change to the pipeline (only the way `PerPlcContext.CurrentRequest` gets populated changes).
## Cross-references
- Today's 1:1 model: [`../design.md`](../design.md) → "Connection model" (will be rewritten by this phase).
- DL260 4-client cap source: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Behavioral Oddities".
- Existing rewriter request→response correlation: `src/Mbproxy/Proxy/BcdPduPipeline.cs` `ProcessResponse` (lines reading `PerPlcContextWithRequest.LastRequest*`).
- Polly pipelines this phase reuses without modification: `src/Mbproxy/Proxy/Supervision/PolicyFactory.cs`.
- Counter-snapshot backwards-compat policy: [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
@@ -1,326 +0,0 @@
# Phase 10 — Read coalescing (in-flight only, zero staleness)
When two or more upstream clients send the same FC03/FC04 request to the same PLC while a matching request is already in flight, attach the late arrivals to the existing in-flight entry and fan out the single backend response to all attached clients. Operates entirely within the in-flight window (microseconds to ~10 ms typical) — no post-response caching, no TTL, no staleness contract change.
**Status:** shipped (2026-05-14). All gate items green.
**Depends on:** Phase 09 (multiplexer + `InFlightRequest` with `InterestedParties` list shape).
**Parallel-safe with:** nothing. The phase modifies `PlcMultiplexer.OnFrame` and the backend reader fan-out path; both are tightly coupled.
## Goal
Phase 9's multiplexer routes every upstream request individually, even when two upstream clients are asking for identical data. In a fleet of 54 PLCs where the HMI, historian, and engineering workstation all poll the same screen tags every second, that's up to 3× redundant backend traffic per overlapping read — and the H2-ECOM100's single-request-per-scan internal serialization means redundant traffic compounds into measurable backend latency.
Phase 10 detects same-key reads within the in-flight window and serves them from a single backend response. Coalescing operates entirely between "first request sent to backend" and "response received from backend." Once the response is fanned out, the coalescing entry dies. No values are held past the response arrival; no invalidation logic; no design-doc change to the "not a polling/cache layer" stance.
## Why this is safe — the zero-staleness argument
A coalesced response is a value the backend was going to return to the first request anyway. By the time the second client's request arrives, the first request is already on the wire to the PLC. The PLC's response represents the register values at the moment the PLC serviced the request. Even if the second request had been sent separately on its own backend round-trip, the H2-ECOM100's internal serialization would have queued it behind the first, returning either the same value or one that differs by at most one extra PLC scan (≈ 2-10 ms).
In other words: the only thing Phase 10 changes is whether the proxy sends one or two requests to the PLC. The answer the upstream clients see is identical (or fresher than the "two requests" alternative, since coalescing means the second client doesn't wait for a second backend round-trip).
## Outputs (new files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs # readonly record struct
src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs # ConcurrentDictionary wrapper with atomic attach-or-create
src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Multiplexing/CoalescingKeyTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/InFlightByKeyMapTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingTests.cs
tests/Mbproxy.Tests/Proxy/Multiplexing/ReadCoalescingE2ETests.cs
```
## Files modified (existing files in this phase)
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # OnFrame learns coalescing path; reader fans out
src/Mbproxy/Proxy/ProxyCounters.cs # new: CoalescedHitCount, CoalescedMissCount, CoalescedResponseToDeadUpstream
src/Mbproxy/Options/ResilienceOptions.cs # new: ReadCoalescing sub-options
src/Mbproxy/Admin/StatusDto.cs # PlcBackendStatus gains coalescing fields
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate new fields
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show coalescing ratio in per-PLC row
docs/design.md # Rewriter section: note FC03/04 may be coalesced before reaching backend
docs/kpi.md # graduate "coalescing ratio" KPI from future to supported
install/mbproxy.config.template.json # add the new Resilience.ReadCoalescing section with comments
```
`InFlightRequest.cs` does **not** change — the `InterestedParties` list shape was specifically introduced in Phase 9 to make this phase additive.
## Tasks
### 10.1 Data types
1. **`CoalescingKey`** — `readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty)`. Hash key for the in-flight-by-key map. Auto-generated record-struct equality. Verify hashcode distribution is reasonable for typical V-memory address ranges (smoke-test in unit tests).
2. **`InFlightByKeyMap`** — wraps `ConcurrentDictionary<CoalescingKey, InFlightRequest>` plus a small lock for atomic attach-or-create. Methods:
- `bool TryAttachOrCreate(CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory, int maxParties, out InFlightRequest req, out bool wasNew)` — atomic: if the key exists and `req.InterestedParties.Count < maxParties`, append the party to a freshly-built `IReadOnlyList<InterestedParty>` (since the record is immutable, we substitute a new `InFlightRequest` with the extended list in the map) and return `(wasNew=false)`; else call factory to build a new entry, store it, return `(wasNew=true)`.
- `bool TryRemove(CoalescingKey key, out InFlightRequest req)` — called by the backend reader after fan-out completes.
- The "attach to existing" path is the load-bearing concurrency primitive of this phase. The simpler implementation: small `lock` around the attach branch. The lock-free implementation uses `AddOrUpdate` with a comparand check. Pick the simpler one; document the choice in code.
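A minimal sketch of the lock-based option, assuming the mutable-list variant recorded under "Clarifications discovered during implementation" (the party list is appended in place under the lock). `SketchInFlightByKeyMap` and the trimmed `InterestedParty` / `InFlightRequest` records are hypothetical stand-ins; the real types carry more state.

```csharp
using System;
using System.Collections.Generic;

internal readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

// Hypothetical stand-ins for the Phase 9 types.
internal sealed record InterestedParty(Guid PipeId, ushort OriginalTxId);
internal sealed record InFlightRequest(ushort ProxyTxId, IReadOnlyList<InterestedParty> InterestedParties);

internal sealed class SketchInFlightByKeyMap
{
    private readonly object _gate = new();
    private readonly Dictionary<CoalescingKey, InFlightRequest> _map = new();

    public int Count { get { lock (_gate) return _map.Count; } }

    public bool TryAttachOrCreate(
        CoalescingKey key, InterestedParty party, Func<InFlightRequest> factory,
        int maxParties, out InFlightRequest req, out bool wasNew)
    {
        lock (_gate)
        {
            // Attach only if the entry exists, is below the cap, and is backed by a mutable list.
            if (_map.TryGetValue(key, out var existing)
                && existing.InterestedParties.Count < maxParties
                && existing.InterestedParties is List<InterestedParty> parties)
            {
                parties.Add(party);
                req = existing;
                wasNew = false;
                return true;
            }
            // Missing key, or entry already at the cap: start a fresh entry for this key.
            req = factory();
            _map[key] = req;
            wasNew = true;
            return true;
        }
    }

    public bool TryRemove(CoalescingKey key, out InFlightRequest req)
    {
        lock (_gate) { return _map.Remove(key, out req!); }
    }
}
```

Because every access goes through the one lock, a plain `Dictionary` is enough here. The factory runs inside the lock, so it should stay short and must not await.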
### 10.2 Multiplexer integration
3. **Request path** in `PlcMultiplexer.OnFrame`:
```csharp
bool coalesceCandidate = (fc is 0x03 or 0x04)
    && resilienceOptions.CurrentValue.ReadCoalescing.Enabled;
if (coalesceCandidate)
{
    var key = new CoalescingKey(unitId, fc, startAddr, qty);
    var party = new InterestedParty(upstreamPipe, originalTxId);
    InFlightRequest? req;
    bool wasNew;
    inFlightByKey.TryAttachOrCreate(
        key, party,
        factory: () => BuildAndRegisterNew(unitId, fc, startAddr, qty, party),
        maxParties: resilienceOptions.CurrentValue.ReadCoalescing.MaxParties,
        out req, out wasNew);
    if (!wasNew)
    {
        counters.IncrementCoalescedHit();
        return; // do NOT send to backend; the first request will get the response
    }
    counters.IncrementCoalescedMiss();
    // fall through: factory already allocated proxyTxId + added to correlation map + sent
    return;
}
// FC06/FC16 or coalescing disabled: existing Phase 9 path (allocate, register, send).
```
The factory closure does the existing Phase 9 work (TxId allocate, correlation map add, MBAP rewrite, send to outbound channel). The new code only adds the "is this already in-flight?" check before that work.
4. **Response fan-out** in the backend reader task — already shaped correctly by Phase 9; this phase just makes sure the `CoalescingKey` matching the response is also removed from `InFlightByKeyMap` alongside the `CorrelationMap` removal:
```csharp
if (correlationMap.TryRemove(proxyTxId, out var req))
{
    txIdAllocator.Release(proxyTxId);
    // Also clear the coalescing key so a new identical request after this point starts fresh.
    var key = new CoalescingKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
    inFlightByKey.TryRemove(key, out _);
    // Phase 9's fan-out loop: already iterates InterestedParties.
    foreach (var party in req.InterestedParties)
    {
        if (!party.Pipe.IsAlive)
        {
            counters.IncrementCoalescedResponseToDeadUpstream();
            continue;
        }
        var partyFrame = WithTxId(responseFrame, party.OriginalTxId);
        // UpstreamPipe declares SendResponseAsync(byte[], CancellationToken) in Phase 9.
        await party.Pipe.SendResponseAsync(partyFrame, ct);
    }
}
```
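The fan-out above relies on a `WithTxId` helper that the plan does not define. One way it could look, assuming the frame is a complete MBAP frame (7-byte header, transaction id in bytes 0-1, big-endian per Modbus TCP) and that each party should receive its own copy; the class name `MbapSketch` is hypothetical:

```csharp
using System;
using System.Buffers.Binary;

internal static class MbapSketch
{
    // Returns a copy of the frame with the MBAP transaction id (bytes 0-1, big-endian) replaced.
    // The input is left untouched so the same response can be re-stamped for each party.
    public static byte[] WithTxId(ReadOnlySpan<byte> frame, ushort txId)
    {
        if (frame.Length < 7)
            throw new ArgumentException("Frame is shorter than the 7-byte MBAP header.", nameof(frame));

        byte[] copy = frame.ToArray();
        BinaryPrimitives.WriteUInt16BigEndian(copy.AsSpan(0, 2), txId);
        return copy;
    }
}
```

Copying per party costs one allocation per attached client, but it avoids sharing a mutable buffer between sends that may still be in progress.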
### 10.3 Configuration
5. **Extend `ResilienceOptions`:**
```csharp
public sealed class ReadCoalescingOptions
{
    public bool Enabled { get; init; } = true;
    public int MaxParties { get; init; } = 32;
}
public sealed class ResilienceOptions
{
    public RetryProfile BackendConnect { get; init; } = new();
    public RecoveryProfile ListenerRecovery { get; init; } = new();
    public ReadCoalescingOptions ReadCoalescing { get; init; } = new(); // ← new
}
```
Hot-reloadable via the existing `IOptionsMonitor<MbproxyOptions>` wiring. Disabling `Enabled` at runtime means new requests take the non-coalescing path; existing in-flight coalesced entries drain naturally.
6. **`mbproxy.config.template.json` update** — add a commented `ReadCoalescing` block to the install template under `Resilience` with the two new keys, default values, and a one-paragraph explanation.
### 10.4 Counters and status surfacing
7. **`ProxyCounters` additions:**
```csharp
public void IncrementCoalescedHit();
public void IncrementCoalescedMiss();
public void IncrementCoalescedResponseToDeadUpstream();
```
`CounterSnapshot` gains `CoalescedHitCount`, `CoalescedMissCount`, `CoalescedResponseToDeadUpstream` — all `long`, all Interlocked. The status page derives `coalescingRatio = Hit / (Hit + Miss)` for display; the raw counts are exposed in JSON for downstream tooling.
8. **`/status.json` per-PLC fields** — extend `PlcBackendStatus`:
```csharp
public sealed record PlcBackendStatus(
    long ConnectsSuccess, long ConnectsFailed,
    ExceptionCounts ExceptionsByCode,
    double LastRoundTripMs,
    long CoalescedHitCount,                 // ← new
    long CoalescedMissCount,                // ← new
    long CoalescedResponseToDeadUpstream);  // ← new
```
9. **HTML page** — extend the per-PLC row with a compact `Coal: 73%` cell (`hit / (hit+miss) * 100`, rounded). Page-weight assertion (under 50 KB for 54 PLCs) must continue to pass.
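The ratio derivation and the compact cell can be sketched as below. Treating "no reads yet" as `0%` is an assumption (the plan does not say what to render before the first FC03/FC04 request), and `SketchCoalescingCounters` is a hypothetical stand-in for the relevant slice of `ProxyCounters`.

```csharp
using System;
using System.Threading;

// Hypothetical stand-in for the coalescing slice of ProxyCounters.
internal sealed class SketchCoalescingCounters
{
    private long _hit;
    private long _miss;

    public void IncrementCoalescedHit() => Interlocked.Increment(ref _hit);
    public void IncrementCoalescedMiss() => Interlocked.Increment(ref _miss);

    // Ratio in [0, 1]. Assumption: report 0 when nothing has been counted yet.
    public double CoalescingRatio()
    {
        long hit = Interlocked.Read(ref _hit);
        long miss = Interlocked.Read(ref _miss);
        long total = hit + miss;
        return total == 0 ? 0.0 : (double)hit / total;
    }

    // Renders the compact per-PLC cell as a rounded whole percentage.
    public string RenderCell()
    {
        int percent = (int)Math.Round(CoalescingRatio() * 100, MidpointRounding.AwayFromZero);
        return $"Coal: {percent}%";
    }
}
```

The two reads are not taken as one atomic snapshot, so a ratio computed during a burst can be off by a request or two; for a dashboard cell that is usually acceptable.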
### 10.5 Documentation
10. **`docs/design.md` Rewriter section:** add a paragraph clarifying that FC03/FC04 requests may be coalesced with other in-flight requests of the same `(unitId, fc, start, qty)` before reaching the backend. Emphasize that the transparency contract holds — each client sees its own original TxId restored on the response, and the response value is identical to what an uncoalesced request would have returned (within the PLC's scan-time precision).
11. **`docs/kpi.md` Tier 1:** the new `coalescedHitCount`, `coalescedMissCount`, derived `coalescingRatio` graduate from "future" to "supported" Tier 1 fields. Mention the `coalescedResponseToDeadUpstream` counter as a low-priority Tier 2 informational metric.
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Multiplexing;
internal readonly record struct CoalescingKey(
    byte UnitId, byte Fc, ushort StartAddress, ushort Qty);
internal sealed class InFlightByKeyMap
{
    public bool TryAttachOrCreate(
        CoalescingKey key,
        InterestedParty party,
        Func<InFlightRequest> factory,
        int maxParties,
        out InFlightRequest req,
        out bool wasNew);
    public bool TryRemove(CoalescingKey key, out InFlightRequest req);
    public int Count { get; }
}
```
```csharp
namespace Mbproxy.Options;
public sealed class ReadCoalescingOptions
{
    public bool Enabled { get; init; } = true;
    public int MaxParties { get; init; } = 32;
}
// Added field on existing ResilienceOptions:
public ReadCoalescingOptions ReadCoalescing { get; init; } = new();
```
`ProxyCounters` and `CounterSnapshot` gain three new `long` fields. No public-surface removals, no renames.
## Tests required
### Unit (`Category = Unit`)
**`CoalescingKeyTests`** (≥ 4 tests):
1. `Equality_OnIdenticalKeys_ReturnsTrue`
2. `Equality_OnDifferentFc_ReturnsFalse` — FC03 vs FC04 with same start/qty/unit are NOT equal (different Modbus tables).
3. `Equality_OnDifferentUnitId_ReturnsFalse`
4. `HashCode_DistributionSanity` — build 10,000 randomly-generated keys, bucket by `Key.GetHashCode() & 0xFF`, assert no bucket has > 5 % of total (rough uniformity check).
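The distribution check in test 4 might be shaped roughly like this. The helper name `HashDistributionSketch`, the fixed seed, and the chosen value ranges (FC 3 or 4, quantity 1 to 125) are assumptions made for the illustration, not part of the plan.

```csharp
using System;
using System.Linq;

internal readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

internal static class HashDistributionSketch
{
    // Builds keyCount pseudo-random keys, buckets them by the low 8 bits of the hash,
    // and returns the largest share of keys that landed in any single bucket.
    public static double MaxBucketShare(int keyCount, int seed)
    {
        var rng = new Random(seed);
        var buckets = new int[256];
        for (int i = 0; i < keyCount; i++)
        {
            var key = new CoalescingKey(
                UnitId: (byte)rng.Next(256),
                Fc: (byte)rng.Next(3, 5),              // 3 or 4
                StartAddress: (ushort)rng.Next(65536),
                Qty: (ushort)rng.Next(1, 126));        // 1..125 registers
            buckets[key.GetHashCode() & 0xFF]++;
        }
        return buckets.Max() / (double)keyCount;
    }
}
```

A test would then assert something like `MaxBucketShare(10_000, seed) <= 0.05`. Seeding the generator keeps the check repeatable, which matters for a threshold assertion in CI.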
**`InFlightByKeyMapTests`** (≥ 6 tests):
1. `TryAttachOrCreate_NewKey_CallsFactory_ReturnsTrue_WasNewTrue`
2. `TryAttachOrCreate_ExistingKey_AppendsParty_ReturnsTrue_WasNewFalse`
3. `TryAttachOrCreate_ExistingKey_AtMaxParties_CreatesFreshEntry_NotAppend` — refuses to fan out beyond the cap; preserves backend-load-shedding guarantee.
4. `TryRemove_AfterAttach_AllPartiesPresent_InRetrievedEntry`
5. `TryRemove_OfMissing_ReturnsFalse`
6. `Concurrent_AttachOrCreate_From_Two_Threads_NoLostParties_AndNoDuplicateEntries` — 100 tasks × 1000 ops each.
**`ReadCoalescingTests`** (≥ 7 tests, real sockets, stub backend):
1. `TwoClients_SameRequest_OnlyOneBackendRoundTrip` — stub backend counts received requests; assert 1.
2. `TwoClients_DifferentRequests_BothHitBackend` — different start addresses; assert 2.
3. `FiveClients_SameRequest_OneBackendRoundTrip_FiveResponses` — fan-out works correctly with 5 attached parties.
4. `FC03_And_FC04_SameAddress_NOT_Coalesced` — different tables.
5. `FC06_Write_NeverCoalesced` — writes always allocate their own TxId.
6. `OneClient_DisconnectsMidFlight_OthersStillGetResponse_AndDeadUpstreamCounterIncrements`
7. `AtMaxParties_NextRequest_StartsFreshBackendRoundTrip` — verify the cap behaviour: when `MaxParties = 2` and 3 simultaneous clients send the same request, the third opens a new in-flight entry rather than joining the first.
### E2E (`Category = E2E`)
**`ReadCoalescingE2ETests`** (≥ 5 tests, against pymodbus simulator, `[Collection(nameof(DL205SimulatorCollection))]`):
1. `E2E_FiveConcurrentClients_SameReadHR1072_CoalescedHitCount_AtLeast_3` — five NModbus clients connect to the proxy, simultaneously read HR1072 (BCD-configured). Assert `coalescedHitCount >= 3` (race wiggle room — perfect coalescing would give 4 hits, but the racy first-arrivals can both miss).
2. `E2E_RewriterStillWorks_ForAllCoalescedParties` — same setup, but with BCD tag at 1072. All five clients receive decoded `1234`. Proves the rewriter sees a coalesced response correctly and the TxId restoration doesn't perturb the BCD bytes.
3. `E2E_DifferentRegisters_NotCoalesced_CoalescedHitCount_Zero` — five clients reading five different addresses; assert no coalescing happened.
4. `E2E_StatusPage_Shows_CoalescingRatio`: `/status.json` for the test PLC has populated `coalescedHitCount` and `coalescedMissCount` after the burst.
5. `E2E_DisableViaHotReload_RevertToPhase9Behaviour` — write a temp appsettings with `ReadCoalescing.Enabled = false`, hot-reload, verify subsequent identical reads each hit the backend separately (counter doesn't increment).
## Phase gate
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All prior tests still green — specifically the **4 critical Phase-9 regression guards**:
- `Forward_FC03_HR1072_Returns_Decoded_1234`
- `Forward_FC06_WriteHR200_ThenReadBack_RoundTrips`
- `Forward_FC16_WriteMultipleHR201_203_ThenReadBack_RoundTrips`
- `MbapTxId_IsPreservedEndToEnd`
- [ ] All new unit + e2e tests pass (≥ 17 new).
- [ ] **Headline assertion:** 5 concurrent FC03 reads of the same register through the proxy produce **at most 2** backend round-trips (allowing one race for the initial pair). Verifiable via stub-backend's request counter in `ReadCoalescingTests`.
- [ ] FC04 reads of the same address as a coexisting FC03 stream do NOT coalesce together. Verified by an explicit test.
- [ ] FC06 / FC16 writes are NEVER on the coalescing path. Verified by setting `MaxParties = 1` and confirming write throughput is unaffected.
- [ ] Coalescing-ratio counter ≥ 50 % under the headline stress test (5 simultaneous identical reads).
- [ ] Disabling coalescing via `Mbproxy.Resilience.ReadCoalescing.Enabled = false` hot-reloads cleanly; running coalesced entries drain naturally without errors.
- [ ] `docs/design.md` Rewriter section mentions the coalescing path; `docs/kpi.md` Tier 1 includes the new fields; `install/mbproxy.config.template.json` includes the new commented `Resilience.ReadCoalescing` block.
- [ ] HTML page weight under 50 KB for 54 PLCs (verify with the existing renderer test).
## Out of scope
- **Post-response caching** — no TTL, no staleness window beyond "while the request is in flight." This phase is strictly in-flight. A response-cache phase would be a separate plan (Phase 11+) and would require the design.md "not a cache layer" stance to be revisited and rewritten.
- **Range-overlap coalescing** — request A reading [100..110], request B reading [105..115]. Different keys; no coalescing. Range-overlap detection is a separate optimisation with its own algorithmic complexity (interval trees, etc.) and its own staleness questions (request B's response would include reg 100..104 from A's perspective, but those weren't in B's response).
- **Cross-PLC coalescing** — each PLC's multiplexer has its own key map. No optimization across PLCs (their backend connections are independent anyway).
- **Write coalescing / batching** — different problem with non-idempotency concerns. The design doc's "no mid-request retry on writes" principle extends to "no write coalescing."
- **Predictive batching** — combining a single client's likely-next read into the current request. Out of scope; speculative reads are a different optimization category.
- **Adaptive `MaxParties`** — staying at the configured value. Auto-tuning is interesting but speculative.
## Subagent briefing
If you're the agent picking up this phase:
1. **Phase 9's `InterestedParties` list is the seam.** This phase only adds the "look up the key, attach a new party to an existing entry" logic. The fan-out side already iterates the list correctly. If you find yourself rewriting Phase 9's response path, you've drifted out of scope.
2. **`CoalescingKey` includes `UnitId`.** DL260 fleets typically use unit 1, but we don't assume — different unit IDs are different PLC personalities behind the same TCP socket and must not coalesce.
3. **FC03 and FC04 are different tables.** Same register address space in DL series, but Modbus treats them separately. Different `CoalescingKey` for the same address; no coalescing across them.
4. **Coalescing is best-effort under races.** Two simultaneous identical requests can both miss the map and create separate entries — counter just shows a lower ratio. Not a bug; documented behaviour. Do not over-engineer with double-checked locking.
5. **`MaxParties` is the load-shedding safety valve.** If a thousand HMI panels all attach to one in-flight request, the response fan-out cost goes linear with attachment count and stalls the backend reader task. Cap at 32 by default. Past the cap, route through a fresh entry — fan-out cost per entry is bounded.
6. **The attach-or-create operation MUST be atomic per key.** Two simultaneous arrivals must not both create new entries for the same key (would defeat coalescing). The simpler implementation: a `lock` on a private gate object inside `InFlightByKeyMap` around the attach branch (`ConcurrentDictionary` does not expose a usable `SyncRoot`). The lock-free implementation uses `AddOrUpdate` with the updateFactory checking the count cap. Pick whichever you can write correctly in 30 minutes; document the choice.
7. **Response fan-out must check `Pipe.IsAlive` per party.** An upstream client that disconnects between attaching and the response arriving — count it as `CoalescedResponseToDeadUpstream` and continue with the others. Do not throw, do not log per-occurrence at Information (would be too noisy under client churn).
8. **Hot-reload of `Enabled` doesn't disrupt in-flight entries.** Disabling the feature mid-flight just means subsequent requests take the non-coalescing path. Existing coalesced entries drain when their response arrives. Don't try to "flush" them on the reload event.
9. **`CoalescedHit + CoalescedMiss = total FC03+FC04 requests`.** The math has to balance per snapshot. Use `Interlocked.Increment` exclusively. Disabling coalescing means every FC03/04 request becomes a Miss (which is fine — the metric still tracks total reads).
10. **Update `design.md` AND `kpi.md` AND the install template in the same PR as the code.** Doc drift is a gate failure. The coalescing-ratio KPI specifically graduates from "future" to "Tier 1 supported" — make that promotion explicit in `kpi.md`.
## Cross-references
- Phase 9's multiplexer is the foundation. The `InterestedParty` and `InterestedParties` types live there: [`09-txid-multiplexing.md`](09-txid-multiplexing.md).
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (rates / percentiles / availability — coalescing-ratio joins this tier).
- Modbus unit-ID semantics that make coalescing-key uniqueness load-bearing: [`../../DL260/dl205.md`](../../DL260/dl205.md) → "Function Code Support" and "Coils and Discrete Inputs".
- Counter snapshot backwards-compat policy that this phase respects (additive only): [`../kpi.md`](../kpi.md) → "Backwards-compat policy".
## Clarifications discovered during implementation
These are the implementation details that the original phase doc did not pin down; recorded here so the next reader doesn't relearn them.
1. **`InterestedParties` is a `List<InterestedParty>` cast to `IReadOnlyList`.** Phase 9 typed the field as `IReadOnlyList<InterestedParty>` to leave room for any implementation; Phase 10 specifically requires a mutable list so the map can append parties under its lock. The list is mutated only under `InFlightByKeyMap`'s lock, and the reader's fan-out iterates the list ONLY after the entry has been removed from the map — by that point no further appends are possible. There is no separate snapshot copy.
2. **The factory closure performs the Phase-9 work (allocate TxId + add to CorrelationMap) but does NOT enqueue to the outbound channel.** The channel send happens AFTER returning from `TryAttachOrCreate` so the InFlightByKey lock is not held across a potentially-async send. The factory communicates its allocated proxy TxId and InFlightRequest back to the caller through closure-captured locals. If the allocator is saturated, the factory returns a "stub" InFlightRequest with no CorrelationMap entry; the caller detects this and delivers a Modbus exception 04.
3. **`coalescedHitCount + coalescedMissCount` = total FC03/FC04 requests (always).** Even when coalescing is disabled, every FC03/04 request bumps `coalescedMissCount` from the non-coalescing path. This keeps the math balanced for dashboard consumers regardless of feature state. Writes (FC06/FC16) are NOT in this accounting — they never touch the coalescing path.
4. **Cascade and watchdog paths drain `InFlightByKeyMap` too.** On backend disconnect, `TearDownBackendAsync` calls `_inFlightByKey.DrainAll()` so a brand-new identical request through the freshly-reconnected backend is treated as a miss. On per-request watchdog timeout, `_inFlightByKey.TryRemove(key)` runs alongside the CorrelationMap removal so subsequent identical requests start fresh.
5. **Live config accessor, not `IOptionsMonitor`-by-value.** The multiplexer takes a `Func<ReadCoalescingOptions>` accessor that resolves to `optionsMonitor.CurrentValue.Resilience.ReadCoalescing` per PDU. This keeps the constructor surface lightweight (no DI on `IOptionsMonitor<MbproxyOptions>`) and gives tests a clean way to pin a fixed config. Hot-reload of `Enabled` propagates because the accessor is read on every incoming FC03/FC04 request.
6. **Phase 9's `TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire` test required a one-line edit.** It asserted ≥2 distinct backend TxIds from two identical FC03 reads — exactly the case Phase 10 now coalesces. The test was patched to use DIFFERENT start addresses so the two reads remain non-coalescable while still proving distinct proxy TxIds. The rest of Phase 9's tests are unaffected.
7. **pymodbus simulator and coalescing.** The simulator's `last_pdu`-overwrite bug (documented in design.md) means we cannot E2E-verify "five concurrent identical reads → 1 backend round-trip" against pymodbus. The headline-stress correctness claim is therefore proven against the stub backend in `ReadCoalescingTests` (real loopback sockets, deterministic 200 to 400 ms response delay so the in-flight window is wide enough for racing requests to actually overlap). The E2E suite verifies counter accounting, status-page surfacing, and the rewriter integration on serialised reads, i.e. the integration boundary, not the concurrency proof.
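Clarifications 1 to 3 above can be illustrated with a minimal sketch. It is not the real `InFlightByKeyMap`: `InFlightSketch`, its integer upstream ids, and the `onCreate` callback are hypothetical stand-ins, and `CoalescingKey` is re-declared locally with the shape this plan gives it. The sketch shows the three properties being described: parties are appended only under the lock, the factory work runs under the lock but the send does not, and hit + miss always equals the number of requests seen.

```csharp
using System;
using System.Collections.Generic;

// Re-declared locally for the sketch; same four fields the plan lists.
public readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

// Hypothetical stand-in for the in-flight map. Upstreams are plain ints here.
public sealed class InFlightSketch
{
    private readonly object _gate = new();
    private readonly Dictionary<CoalescingKey, List<int>> _parties = new();

    public long HitCount { get; private set; }
    public long MissCount { get; private set; }

    // Returns true when the caller created the entry and therefore owns the backend send.
    // onCreate models the factory: it may allocate a TxId, but it must NOT send.
    public bool TryAttachOrCreate(CoalescingKey key, int upstreamId, Action onCreate)
    {
        lock (_gate)
        {
            if (_parties.TryGetValue(key, out var list))
            {
                list.Add(upstreamId); // append happens only under the lock
                HitCount++;
                return false;
            }

            _parties[key] = new List<int> { upstreamId };
            MissCount++;
            onCreate();
            return true;
        }
    }

    // Removes the entry first. After removal no further appends are possible,
    // so the returned list can be iterated without a snapshot copy.
    public IReadOnlyList<int> Complete(CoalescingKey key)
    {
        lock (_gate)
        {
            if (_parties.Remove(key, out var list)) return list;
            return Array.Empty<int>();
        }
    }
}
```

The caller performs the channel send only after `TryAttachOrCreate` has returned `true`, which keeps a potentially asynchronous send outside the lock.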
-414
@@ -1,414 +0,0 @@
# Phase 11 — Short-TTL response cache (bounded staleness)
Cache FC03/FC04 responses with a per-tag TTL. Subsequent same-key reads within the TTL window are served from the cache without backend traffic. FC06/FC16 writes invalidate overlapping cache entries on the response side. **This phase is a deliberate design-contract change** — the proxy gains an opt-in cache layer with explicit bounded staleness.
**Status:** post-1.0 follow-on, depends on Phase 10. **Architectural pivot — read the "Design pivot" section below before scoping.**
**Depends on:** Phase 09 (multiplexer chokepoint), Phase 10 (`CoalescingKey` is reused as `CacheKey` — same shape).
**Parallel-safe with:** nothing.
## Design pivot — do NOT skip this section
Phases 09 and 10 were additive performance optimisations that preserved the design's "transparent inline proxy" contract. **Phase 11 is different.** It changes the load-bearing claim in `docs/design.md`:
- **Today's contract** (lines 12-20 of `design.md`): *"The service is not a polling/cache layer. It is a transparent Modbus TCP proxy whose job is to rewrite the configured BCD tags in real time, in both directions, while proxying every other byte of the MBTCP connection untouched."*
- **Post-Phase-11 contract:** the proxy is *optionally* a cache layer within a bounded TTL. The TTL is per-tag, default 0 (no caching), opt-in by operator action.
Implication: **Task 1 of this phase is rewriting the relevant `design.md` sections.** The contract update is a code commit too — review, land first, then build the implementation against the new contract. Shipping cache code while design.md still says "not a cache layer" is a gate failure, not a merge-it-and-fix-later situation.
The cache is **OFF by default**. A fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. The opt-in shape (per-tag `CacheTtlMs` configuration) means a deployment can adopt Phase 11 without changing semantics until an operator explicitly opts a tag in.
## Goal
Reduce backend Modbus traffic for the common SCADA case where many clients poll the same registers at near-identical cadences. Phase 10 already coalesces within the in-flight window (~10 ms). Phase 11 extends the "served without backend traffic" window from those in-flight milliseconds to operator-configurable seconds.
Concretely: with `CacheTtlMs = 1000` on a frequently-read BCD tag, the backend sees at most one read of that tag per second per PLC regardless of how many upstream clients are polling.
## What it does NOT do
- **No active polling.** Cache entries are populated on demand by upstream reads, not by proactive polling. (Active polling is Tier C-3 from the conversation history — a separate phase if ever wanted.)
- **No predictive prefetching.**
- **No SCADA-style subscription/notification model.**
- **No write-back caching.** Writes always go straight through to the backend; cache invalidation happens on the write-response side, not by intercepting the write.
- **No cross-PLC caching.** Each PLC's cache is independent.
- **No persistence.** Process restart wipes the cache. Cache survives backend disconnects (the cached data was fresh when stored; disconnects don't retroactively invalidate it).
## Outputs (new files)
```
src/Mbproxy/Proxy/Cache/CacheKey.cs # reuses CoalescingKey shape; type-aliased or reflected
src/Mbproxy/Proxy/Cache/CacheEntry.cs # response bytes + expiry + lastFetched
src/Mbproxy/Proxy/Cache/ResponseCache.cs # the cache itself; TTL-based eviction, LRU under cap
src/Mbproxy/Proxy/Cache/CacheInvalidator.cs # address-range-overlap matcher for write invalidation
src/Mbproxy/Proxy/Cache/CacheLogEvents.cs # [LoggerMessage] vocab for this phase
tests/Mbproxy.Tests/Proxy/Cache/CacheKeyTests.cs
tests/Mbproxy.Tests/Proxy/Cache/CacheEntryTests.cs
tests/Mbproxy.Tests/Proxy/Cache/ResponseCacheTests.cs
tests/Mbproxy.Tests/Proxy/Cache/CacheInvalidatorTests.cs
tests/Mbproxy.Tests/Proxy/Cache/ResponseCacheE2ETests.cs
```
## Files modified
```
src/Mbproxy/Proxy/Multiplexing/PlcMultiplexer.cs # OnFrame: cache check BEFORE coalescing; OnResponse: cache store + write invalidation
src/Mbproxy/Options/BcdTagOptions.cs # add CacheTtlMs (default 0 = no caching)
src/Mbproxy/Options/PlcOptions.cs # add DefaultCacheTtlMs
src/Mbproxy/Options/MbproxyOptions.cs # add Cache section (AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs)
src/Mbproxy/Bcd/BcdTag.cs # carry CacheTtlMs on the record
src/Mbproxy/Bcd/BcdTagMapBuilder.cs # resolve per-tag TTL with per-PLC default fallback
src/Mbproxy/Proxy/ProxyCounters.cs # new: CacheHit, CacheMiss, CacheInvalidations, CacheEntryCount, CacheBytes
src/Mbproxy/Admin/StatusDto.cs # surface cache KPIs in PlcBackendStatus
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # populate
src/Mbproxy/Admin/StatusHtmlRenderer.cs # show cache-hit ratio per PLC row
src/Mbproxy/Configuration/ReloadValidator.cs # validate CacheTtlMs bounds; require AllowLongTtl=true for > 60s
docs/design.md # SUBSTANTIAL — see Task 1
docs/kpi.md # graduate cache KPIs from future to Tier 1
install/mbproxy.config.template.json # add CacheTtlMs examples + staleness commentary
mbproxy/CLAUDE.md # Architecture summary: add the cache-layer bullet
```
## Tasks
### 11.1 Design contract update — **DO THIS FIRST**
1. **`docs/design.md` updates** (review and land before writing implementation code):
**a. "What this is" section** — add the cache disclosure paragraph:
> As of Phase 11, the proxy gains an *optional* per-tag response cache with a bounded staleness window (`CacheTtlMs`). The cache is OFF by default (`CacheTtlMs = 0`) and must be opt-in per tag. With caching enabled, the proxy is no longer purely transparent — upstream reads may return a value up to `CacheTtlMs` milliseconds old. The 1:1 read-to-backend-request guarantee no longer holds; operators opting tags into caching MUST acknowledge the staleness bound.
**b. New section "Cache contract"** between "Rewriter" and "Failure modes":
- Cache populates on demand only. No polling.
- Cache entries carry their TTL with them. Hits older than TTL are evicted on access.
- FC06/FC16 successful responses invalidate cache entries whose address range overlaps the write.
- Cache survives backend disconnects (cached data was valid at cache time).
- Cache does NOT survive process restart.
- Multi-tag read range: effective TTL is the minimum of all configured tags in the range. Any tag with TTL = 0 in the range disables caching for the whole read.
- Cache stores POST-rewriter bytes (BCD already decoded). Hits bypass the rewriter entirely.
**c. "Failure modes" section** — add bullet on cache behaviour during backend recovery:
- Cache hits remain valid during a `recovering` listener state. Data was fresh when cached; recovery only affects future requests.
- Invalidations during recovery: writes that arrive cannot reach the backend, so the invalidation never happens. This is consistent — the write didn't take effect either. Cache entries remain valid until their TTL expires.
**d. "Rewriter" section** — clarify that the rewriter runs on the cache-miss path (decode on store), and that cache hits return pre-decoded bytes without re-invoking the rewriter.
Treat (a)-(d) as one atomic change. Get them reviewed, land them, then implement against the new contract.
### 11.2 Cache key
2. **`CacheKey`** — same shape as Phase 10's `CoalescingKey`: `readonly record struct CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty)`. If Phase 10 is already merged, prefer **a `using CacheKey = CoalescingKey;` alias** over a redefinition — same data, same hashing, single source of truth. If the two phases land together (Phase 10 + 11 in a coordinated release), consider renaming `CoalescingKey` to `ReadKey` to make the shared use site neutral.
### 11.3 Cache entry and storage
3. **`CacheEntry`** — `internal sealed record CacheEntry(byte[] PduBytes, DateTimeOffset CachedAtUtc, DateTimeOffset ExpiresAtUtc, int Length, ushort LastUsedTick)`. `LastUsedTick` is a monotonic counter for LRU ordering (avoids `DateTimeOffset.UtcNow` calls on every cache access).
4. **`ResponseCache`** — `internal sealed class ResponseCache : IDisposable`. Methods:
- `bool TryGet(CacheKey key, out CacheEntry entry)` — returns true ONLY if entry exists and `entry.ExpiresAtUtc > DateTimeOffset.UtcNow`. Updates `LastUsedTick` on hit. Expired entries removed lazily.
- `void Set(CacheKey key, CacheEntry entry)` — replaces any existing entry. If `Count >= MaxEntriesPerPlc`, evict the LRU entry first.
- `int Invalidate(byte unitId, ushort startAddress, ushort qty)` — delegates to `CacheInvalidator`. Returns count invalidated.
- `int Count { get; }`, `long ApproximateBytes { get; }`
- Background eviction loop (started in constructor, stopped in `Dispose`): every `EvictionIntervalMs` (default 5000), scans the map and removes entries past `ExpiresAtUtc`.
5. **`CacheInvalidator`** — pure logic: `static IEnumerable<CacheKey> FindOverlapping(IReadOnlyCollection<CacheKey> haystack, byte unitId, ushort writeStart, ushort writeQty)`. Returns keys whose range `[StartAddress, StartAddress + Qty)` intersects `[writeStart, writeStart + writeQty)`. Limit scope to keys matching `unitId` and `Fc in {3, 4}` (we never cache writes; invalidation only applies to read entries).
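The `TryGet` / `Set` contract in item 4 can be sketched as below. This is a simplified illustration under stated assumptions, not the planned class: `ResponseCacheSketch` stores raw PDU bytes rather than a `CacheEntry`, takes an injected clock (`utcNow`) so expiry can be tested without `Task.Delay`, has no background eviction loop, and evicts only when inserting a key that is not already present (a small deviation from "if `Count >= MaxEntriesPerPlc`, evict first").

```csharp
using System;
using System.Collections.Generic;

public readonly record struct CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

// Hypothetical simplified cache: TTL check on read, LRU eviction on insert.
public sealed class ResponseCacheSketch
{
    private sealed class Slot
    {
        public byte[] PduBytes = Array.Empty<byte>();
        public DateTimeOffset ExpiresAtUtc;
        public long LastUsedTick;
    }

    private readonly object _gate = new();
    private readonly Dictionary<CacheKey, Slot> _map = new();
    private readonly int _maxEntries;
    private readonly Func<DateTimeOffset> _utcNow; // injected clock (assumption, for testability)
    private long _tick;                            // monotonic LRU counter

    public ResponseCacheSketch(int maxEntries, Func<DateTimeOffset> utcNow)
    {
        _maxEntries = maxEntries;
        _utcNow = utcNow;
    }

    public int Count { get { lock (_gate) return _map.Count; } }

    public bool TryGet(CacheKey key, out byte[] pduBytes)
    {
        lock (_gate)
        {
            if (_map.TryGetValue(key, out var slot))
            {
                if (slot.ExpiresAtUtc > _utcNow())
                {
                    slot.LastUsedTick = ++_tick; // a hit refreshes LRU position
                    pduBytes = slot.PduBytes;
                    return true;
                }
                _map.Remove(key); // expired entries are removed lazily
            }
            pduBytes = Array.Empty<byte>();
            return false;
        }
    }

    public void Set(CacheKey key, byte[] pduBytes, TimeSpan ttl)
    {
        lock (_gate)
        {
            if (!_map.ContainsKey(key) && _map.Count >= _maxEntries)
                EvictLeastRecentlyUsed();

            _map[key] = new Slot
            {
                PduBytes = pduBytes,
                ExpiresAtUtc = _utcNow() + ttl,
                LastUsedTick = ++_tick,
            };
        }
    }

    // Linear scan for the smallest tick; acceptable for a per-PLC cap around 1000.
    private void EvictLeastRecentlyUsed()
    {
        CacheKey? victim = null;
        long oldest = long.MaxValue;
        foreach (var (key, slot) in _map)
        {
            if (slot.LastUsedTick < oldest) { oldest = slot.LastUsedTick; victim = key; }
        }
        if (victim.HasValue) _map.Remove(victim.Value);
    }
}
```

A linear scan keeps the sketch short; a linked-list LRU would avoid the O(n) eviction cost if the cap were raised substantially.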
### 11.4 Multiplexer integration
6. **Cache lookup in `PlcMultiplexer.OnFrame`** — for FC03/04 requests when the read range has a non-zero resolved TTL:
```csharp
if (fc is 0x03 or 0x04 && resolvedTtlMs > 0) {
var key = new CacheKey(unitId, fc, startAddr, qty);
if (cache.TryGet(key, out var entry)) {
counters.IncrementCacheHit();
// Build a fresh MBAP wrapper for this client and send.
var hitFrame = BuildResponseFrame(entry.PduBytes, originalTxId, unitId);
upstreamPipe.SendResponse(hitFrame);
return; // no coalescing check, no backend round-trip
}
counters.IncrementCacheMiss();
}
// Fall through to Phase 10 coalescing path → Phase 9 send path
```
**Order matters:** cache check FIRST, then coalescing. A cache hit short-circuits everything; only on a miss do we engage Phase 10's coalescing logic.
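The `BuildResponseFrame` call above is not defined in this plan. A minimal sketch of what such a helper could look like follows, assuming the standard Modbus TCP MBAP layout (transaction id, protocol id 0, length, unit id, then the PDU). The class name `MbapSketch` and the exact signature are illustrative only.

```csharp
using System;
using System.Buffers.Binary;

public static class MbapSketch
{
    // Wraps a cached PDU in a fresh 7-byte MBAP header for the requesting client.
    public static byte[] BuildResponseFrame(ReadOnlySpan<byte> pdu, ushort originalTxId, byte unitId)
    {
        var frame = new byte[7 + pdu.Length];

        // Bytes 0-1: the requesting client's own transaction id, big-endian.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), originalTxId);
        // Bytes 2-3: protocol id, always 0 for Modbus.
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(2, 2), 0);
        // Bytes 4-5: length of everything that follows (unit id + PDU).
        BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), (ushort)(pdu.Length + 1));
        // Byte 6: unit id.
        frame[6] = unitId;
        // Bytes 7..: the cached post-rewriter PDU, copied so the cache entry is never aliased.
        pdu.CopyTo(frame.AsSpan(7));
        return frame;
    }
}
```

Because only the PDU is cached, two clients hitting the same entry each receive a frame carrying their own transaction id.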
7. **Cache store on response** — in the backend reader fan-out path, AFTER the rewriter has run on the response:
```csharp
if (req.Fc is 0x03 or 0x04 && req.ResolvedCacheTtlMs > 0) {
var key = new CacheKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
var now = DateTimeOffset.UtcNow;
var entry = new CacheEntry(
PduBytes: rewrittenPduBytes.ToArray(), // defensive copy
CachedAtUtc: now,
ExpiresAtUtc: now.AddMilliseconds(req.ResolvedCacheTtlMs),
Length: rewrittenPduBytes.Length,
LastUsedTick: NextLruTick());
cache.Set(key, entry);
}
```
Note: `req.ResolvedCacheTtlMs` is computed at request-receive time by walking the BcdTagMap for tags in `[StartAddress, StartAddress + Qty)` and taking `min(CacheTtlMs)`. If any tag has TTL = 0, `ResolvedCacheTtlMs = 0` and the whole read is uncached.
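The min-of-TTLs rule in the note above can be sketched as follows. Assumptions to flag: `TagTtl` and its `RegisterCount` field are hypothetical (the real `BcdTag` record and the meaning of its `Width` are not shown here), the tag list is a flat enumerable rather than the real `BcdTagMap`, and a read range containing no configured tags resolves to 0 (uncached), which the plan does not state explicitly.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical tag shape for the sketch: start register, span in registers, resolved TTL.
public sealed record TagTtl(ushort Address, int RegisterCount, int CacheTtlMs);

public static class TtlResolverSketch
{
    // Effective TTL for a read of [startAddress, startAddress + qty).
    public static int ResolveCacheTtlMs(IEnumerable<TagTtl> tags, ushort startAddress, ushort qty)
    {
        int readEnd = startAddress + qty; // int arithmetic avoids ushort wrap-around
        int? min = null;

        foreach (var tag in tags)
        {
            int tagEnd = tag.Address + tag.RegisterCount;
            bool overlaps = tag.Address < readEnd && startAddress < tagEnd;
            if (!overlaps) continue;

            // Any TTL = 0 tag in the range disables caching for the whole read.
            if (tag.CacheTtlMs <= 0) return 0;

            min = min is null ? tag.CacheTtlMs : Math.Min(min.Value, tag.CacheTtlMs);
        }

        // Assumption: no configured tags in range means the read is not cached.
        return min ?? 0;
    }
}
```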
8. **Cache invalidation on write response** — FC06 / FC16 successful response (NOT exception response):
```csharp
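// `fc` is the function-code byte of the backend RESPONSE; bit 0x80 set means a Modbus exception response.
if (req.Fc is 0x06 or 0x10 && (fc & 0x80) == 0) {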
int invalidated = cache.Invalidate(req.UnitId, req.StartAddress, req.Qty);
if (invalidated > 0) {
counters.AddCacheInvalidations(invalidated);
CacheLogEvents.WriteInvalidatedEntries(logger, req.UnitId,
req.StartAddress, req.Qty, invalidated);
}
}
```
Invalidation is by ADDRESS RANGE OVERLAP, not by exact key match. A write to register 105 invalidates a cached read of [100..110] and a cached read of [105..115] but NOT a cached read of [200..210].
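The overlap rule can be sketched with half-open intervals, reading `[100..110]` as start 100, quantity 10. `OverlapSketch` is an illustrative name; `FindOverlapping` mirrors the signature declared for `CacheInvalidator`, and `CacheKey` is re-declared locally so the block stands alone.

```csharp
using System.Collections.Generic;
using System.Linq;

public readonly record struct CacheKey(byte UnitId, byte Fc, ushort StartAddress, ushort Qty);

public static class OverlapSketch
{
    // Half-open test: entry [s, s+n) overlaps write [w, w+q) iff w < s+n && s < w+q.
    public static bool Overlaps(ushort entryStart, ushort entryQty, ushort writeStart, ushort writeQty)
    {
        int entryEnd = entryStart + entryQty; // ushort + ushort promotes to int, so no wrap near 65535
        int writeEnd = writeStart + writeQty;
        return writeStart < entryEnd && entryStart < writeEnd;
    }

    // Only read entries (FC03/FC04) for the same unit id are candidates.
    public static IEnumerable<CacheKey> FindOverlapping(
        IReadOnlyCollection<CacheKey> haystack, byte unitId, ushort writeStart, ushort writeQty) =>
        haystack.Where(k =>
            k.UnitId == unitId &&
            k.Fc is 3 or 4 &&
            Overlaps(k.StartAddress, k.Qty, writeStart, writeQty));
}
```

With entries at (100, 10), (105, 10) and (200, 10) on unit 1, a single-register write to 105 selects the first two and leaves the third untouched.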
### 11.5 Per-tag TTL configuration
9. **`BcdTagOptions` extension:**
```csharp
public sealed class BcdTagOptions {
public ushort Address { get; init; }
public byte Width { get; init; }
public int CacheTtlMs { get; init; } = 0; // 0 = no caching (default)
}
```
10. **`PlcOptions.DefaultCacheTtlMs`** — applies to any tag whose explicit `CacheTtlMs` was not set (use a nullable `int?` on `BcdTagOptions` instead of `int = 0` to distinguish "explicitly zero" from "unset"). Default for the PLC default itself is 0.
11. **`MbproxyOptions.Cache` section:**
```csharp
public sealed class CacheOptions {
public bool AllowLongTtl { get; init; } = false; // gate for TTL > 60_000
public int MaxEntriesPerPlc { get; init; } = 1000;
public int EvictionIntervalMs { get; init; } = 5000;
}
```
12. **Validation** in `ReloadValidator`: `CacheTtlMs >= 0` always; `CacheTtlMs > 60_000` requires `Cache.AllowLongTtl = true`. Reject reloads that violate. Prevents "left at 1 hour by accident" deployments.
13. **`BcdTagMapBuilder.Build` resolution**: returns each `BcdTag` with `CacheTtlMs` resolved per fallback rules: explicit per-tag → per-PLC default → 0.
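Items 10, 12 and 13 reduce to two small rules, sketched here. `TtlConfigSketch` and its method names are hypothetical; the real logic is planned to live in `BcdTagMapBuilder` and `ReloadValidator`, whose shapes are not shown in this document.

```csharp
public static class TtlConfigSketch
{
    public const int LongTtlThresholdMs = 60_000;

    // Fallback chain: explicit per-tag value, else per-PLC default (which itself defaults to 0).
    // A nullable per-tag value lets "explicitly 0" override a non-zero PLC default.
    public static int ResolveTagTtlMs(int? tagCacheTtlMs, int plcDefaultCacheTtlMs) =>
        tagCacheTtlMs ?? plcDefaultCacheTtlMs;

    // Validation rule: never negative; above 60 s only with the AllowLongTtl gate.
    public static bool IsValid(int cacheTtlMs, bool allowLongTtl, out string error)
    {
        if (cacheTtlMs < 0)
        {
            error = "CacheTtlMs must be >= 0.";
            return false;
        }
        if (cacheTtlMs > LongTtlThresholdMs && !allowLongTtl)
        {
            error = "CacheTtlMs > 60000 requires Cache.AllowLongTtl = true.";
            return false;
        }
        error = "";
        return true;
    }
}
```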
### 11.6 Counters and status surfacing
14. **`ProxyCounters` additions:**
- `CacheHitCount` (Interlocked long)
- `CacheMissCount` (Interlocked long)
- `CacheInvalidations` (Interlocked long)
- `CacheEntryCount` (snapshot from `ResponseCache.Count` — read-time)
- `CacheBytes` (snapshot from `ResponseCache.ApproximateBytes` — read-time)
15. **`StatusDto.PlcBackendStatus` extension:**
```csharp
public sealed record PlcBackendStatus(
long ConnectsSuccess, long ConnectsFailed,
ExceptionCounts ExceptionsByCode,
double LastRoundTripMs,
long CoalescedHitCount, long CoalescedMissCount, long CoalescedResponseToDeadUpstream, // Phase 10
long CacheHitCount, long CacheMissCount, // Phase 11
long CacheInvalidations, long CacheEntryCount, long CacheBytes); // Phase 11
```
16. **HTML page** — add a compact `Cache: 73%` cell per PLC row. Page-weight assertion (under 50 KB for 54 PLCs) must continue to pass.
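The `Cache: 73%` cell is a plain hit ratio over the two counters. A minimal sketch, assuming whole-percent rounding and an `n/a` placeholder when no cacheable reads have been seen yet (neither is specified in this plan; `CacheCellSketch` is an illustrative name):

```csharp
using System;
using System.Globalization;

public static class CacheCellSketch
{
    public static string FormatCacheCell(long hits, long misses)
    {
        long total = hits + misses;
        if (total == 0) return "Cache: n/a"; // assumption: avoid divide-by-zero, show a placeholder

        double percent = 100.0 * hits / total;
        // Invariant culture keeps the status page output stable across host locales.
        return "Cache: " + Math.Round(percent).ToString(CultureInfo.InvariantCulture) + "%";
    }
}
```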
### 11.7 Documentation and template
17. **`docs/kpi.md`** — graduate cache-hit-ratio KPIs from "deferred / future" to Tier 1 supported. Add `cacheEntryCount` and `cacheBytes` as Tier 2 memory-watch KPIs.
18. **`install/mbproxy.config.template.json`** — add a fully-commented `Mbproxy.Cache` section showing `AllowLongTtl`, `MaxEntriesPerPlc`, `EvictionIntervalMs`. Show example per-tag `CacheTtlMs: 1000` and per-PLC `DefaultCacheTtlMs: 500` entries. Include a prominent comment explaining the staleness contract: "**clients reading these tags will see values up to `CacheTtlMs` milliseconds old**".
19. **`mbproxy/CLAUDE.md` Architecture summary** — add a bullet:
> - **Optional response cache** with per-tag TTL (default 0 = off). Cached FC03/04 responses serve subsequent same-key reads without backend traffic; FC06/FC16 write responses invalidate overlapping entries by address range.
## Public surface declared in this phase
```csharp
namespace Mbproxy.Proxy.Cache;
internal readonly record struct CacheKey(
byte UnitId, byte Fc, ushort StartAddress, ushort Qty);
internal sealed record CacheEntry(
byte[] PduBytes,
DateTimeOffset CachedAtUtc, DateTimeOffset ExpiresAtUtc,
int Length, ushort LastUsedTick);
internal sealed class ResponseCache : IDisposable {
public bool TryGet(CacheKey key, out CacheEntry entry);
public void Set(CacheKey key, CacheEntry entry);
public int Invalidate(byte unitId, ushort startAddress, ushort qty);
public int Count { get; }
public long ApproximateBytes { get; }
public void Dispose();
}
internal static class CacheInvalidator {
public static IEnumerable<CacheKey> FindOverlapping(
IReadOnlyCollection<CacheKey> haystack,
byte unitId, ushort writeStart, ushort writeQty);
}
```
```csharp
namespace Mbproxy.Options;
public sealed class CacheOptions {
public bool AllowLongTtl { get; init; } = false;
public int MaxEntriesPerPlc { get; init; } = 1000;
public int EvictionIntervalMs { get; init; } = 5000;
}
// Added field on MbproxyOptions:
public CacheOptions Cache { get; init; } = new();
// Added field on BcdTagOptions (nullable to distinguish "unset" from "explicitly 0"):
public int? CacheTtlMs { get; init; }
// Added field on PlcOptions:
public int DefaultCacheTtlMs { get; init; } = 0;
```
`ProxyCounters` and `CounterSnapshot` gain 5 new long fields. No public-surface removals or renames.
## Tests required
### Unit (`Category = Unit`)
**`CacheKeyTests`** (≥ 3 tests): equality across identical keys; FC03 vs FC04 differs; UnitId differs.
**`CacheEntryTests`** (≥ 3 tests): expired detection at boundary; immutability of `PduBytes`; LRU tick monotonicity.
**`CacheInvalidatorTests`** (≥ 5 tests, range-overlap math):
1. `FullOverlap_WriteCoversEntryRange_Invalidates`
2. `PartialOverlap_WriteStartsBeforeEntry_Invalidates`
3. `PartialOverlap_WriteEndsAfterEntry_Invalidates`
4. `Adjacent_NotOverlapping_DoesNotInvalidate` — write to `[10..15]` does NOT invalidate cached `[15..20]` (half-open intervals — `15` is not in the entry's range).
5. `NoOverlap_DoesNotInvalidate`
6. `DifferentUnitId_DoesNotInvalidate`
**`ResponseCacheTests`** (≥ 8 tests):
1. `SetThenGet_RoundTrips`
2. `GetExpiredEntry_ReturnsFalse_AndRemoves` — uses a small TTL + `Task.Delay`
3. `Invalidate_OverlappingRange_RemovesMatching` — set 3 entries, invalidate a range overlapping 2 of them, verify Count drops by 2
4. `Invalidate_OnlyAffectsFc03Fc04_KeysWithFcOther_NotTouched` — there shouldn't be FC06/FC16 entries in cache, but a defensive test
5. `Set_AtMaxEntries_EvictsLRU`
6. `LRU_TracksAccessOrder_Across_Get_And_Set`
7. `Concurrent_GetSet_NoDataRace` — 100 tasks, 1000 ops each
8. `Dispose_StopsEvictionLoop`
### E2E (`Category = E2E`)
**`ResponseCacheE2ETests`** (≥ 6 tests, against pymodbus simulator):
1. `E2E_CacheHit_AfterFirstRead_NoBackendTraffic` — configure tag at HR1072 with `CacheTtlMs = 5000`; first read goes to backend; second read within 5s hits cache. Verify via the simulator's HTTP introspection or by timing (cache hits return ~ms; backend reads return ~10ms).
2. `E2E_CacheExpires_AfterTtl_NextReadHitsBackend` — short TTL (e.g., 200 ms); after delay, second read goes to backend.
3. `E2E_WriteInvalidatesOverlappingCacheEntries` — read HR1072 (cache it), write to HR1072 with FC06, next read MUST miss cache and re-fetch.
4. `E2E_NonOverlappingWrite_DoesNotInvalidate` — read HR1072 (cache it), write to HR1080, next read of HR1072 still hits cache.
5. `E2E_BcdDecodedBytesAreCached_NotRawBcd` — cache hit returns the decoded `1234`, not `0x1234`. Proves the cache stores post-rewriter bytes.
6. `E2E_DisablingCache_ViaHotReload_FlushesEntries` — set `CacheTtlMs = 1000` on a tag, do a read (cached), hot-reload with `CacheTtlMs = 0`, next read must hit the backend even though the old entry is still within its TTL window.
7. `E2E_MultiTagRead_RangeWithZeroTtlTag_DisablesCaching` — read [100..110] where one tag in the range has `CacheTtlMs = 0`; verify no caching of the whole read.
## Phase gate
- [ ] **`docs/design.md` updates from Task 1 are merged FIRST** (or in the same PR). The contract change is not optional and not deferrable. Gate fail otherwise.
- [ ] `dotnet build Mbproxy.slnx -c Debug` — zero warnings, zero errors.
- [ ] All prior tests still green — the **4 critical Phase-9 regression guards** + **Phase 10's coalescing tests**.
- [ ] All new unit + e2e tests pass (≥ 25 new).
- [ ] **Default TTL = 0 → no observable behavior change vs Phase 10.** Verify: run the full Phase 10 test suite with the Phase 11 build; everything green.
- [ ] **Headline assertion (E2E):** configure `CacheTtlMs = 1000` on HR1072; issue 10 reads at 100 ms intervals; backend (stub or sim with introspection) sees exactly 1 backend round-trip.
- [ ] Write invalidation correctly handles all 6 range-overlap cases (full, two partial, adjacent, none, different-unit-id).
- [ ] Memory cap enforced: with `MaxEntriesPerPlc = 5`, 6 distinct cache inserts produce 5 entries (one LRU eviction observed).
- [ ] Validation rejects `CacheTtlMs > 60_000` unless `Cache.AllowLongTtl = true`.
- [ ] Hot-reload of `CacheTtlMs` flushes entries for the affected tag (or, simpler: flushes the entire cache for the PLC). Pick the simpler option (PLC-wide flush) and document.
- [ ] HTML page weight under 50 KB for 54 PLCs (verify with the existing renderer test).
- [ ] `docs/kpi.md` Tier 1 includes cache-hit-ratio.
- [ ] `install/mbproxy.config.template.json` includes the new `Mbproxy.Cache` block with the staleness commentary.
## Out of scope
- **Active polling** — cache populates on demand only. No background poll loop.
- **Predictive prefetching** — no speculative reads.
- **Range-overlap coalescing of cache entries** — if reads `[100..110]` and `[105..115]` are both cached, no attempt to merge them into one `[100..115]` entry. Same-key only.
- **Cross-PLC caching** — each PLC's cache is independent. No optimisation across PLCs.
- **Persistence** — process restart wipes the cache. No file/Redis backing store.
- **Cache warming** — no pre-populating the cache from a snapshot, last-known-good file, etc.
- **TTL > 60 seconds without explicit `AllowLongTtl` opt-in** — refused at validation.
- **Adaptive TTL** — operator-configured only. No auto-tuning.
## Subagent briefing
If you're the agent picking up this phase:
1. **Task 1 is design.md, not code.** The contract update is the gate. Do not write the cache code until the design changes have been reviewed and merged (or are in the same PR with explicit reviewer attention). A reviewer who lands the code without the design update has failed the gate, and so have you.
2. **Default TTL = 0 means default behavior = Phase 10 unchanged.** Critical for backwards-compat. Every existing test that doesn't set `CacheTtlMs` must continue to pass without modification.
3. **Cache stores POST-rewriter bytes.** The rewriter runs once on the cache-miss path; subsequent hits return cached decoded bytes directly. Do not re-invoke the rewriter on hits — wastes CPU and changes nothing.
4. **Write-invalidation is by ADDRESS RANGE OVERLAP, not by exact key match.** A write to register 105 invalidates a cached read of `[100..110]`. Use half-open interval math: write `[w, w+q)` overlaps entry `[s, s+n)` iff `w < s+n && s < w+q`.
5. **Multi-tag read range: effective TTL is `min(TTLs)`.** If any tag in the read range has TTL = 0, the whole read is uncached. Conservative-by-design.
6. **Cache lookup happens BEFORE coalescing.** Order: cache check → cache miss → coalescing check (Phase 10) → backend send (Phase 9). A cache hit short-circuits everything.
7. **`CacheKey` is structurally identical to `CoalescingKey`.** Prefer aliasing over redefinition. If the two phases land together, rename the shared type to `ReadKey` to make the joint use site neutral.
8. **MBAP TxId restoration on cache-hit responses.** The cache stores the PDU bytes (post-rewriter); on hit, build a fresh MBAP wrapper with the requesting client's `OriginalTxId`. There's no cached MBAP — the per-request TxId is supplied by the upstream pipe's request.
9. **Hot-reload of `CacheTtlMs`: flush the whole PLC cache on any tag-list change.** Tag-level granularity is technically possible but complicates the reload code path. The simple correctness move is "any tag-list change to this PLC → drop all cached entries for this PLC and let them re-populate." Document the choice.
10. **Eviction loop: `PeriodicTimer` + cancellation token.** Not `System.Timers.Timer`. The cache is `IDisposable`; the loop honours `Dispose`.
11. **Update `docs/design.md` AND `docs/kpi.md` AND `mbproxy/CLAUDE.md` AND `install/mbproxy.config.template.json` IN THE SAME PR AS THE CODE.** Doc drift is a gate fail. The architectural pivot must be visible across all reader-facing surfaces.
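Briefing item 10 (eviction loop on `PeriodicTimer` with cancellation) can be sketched as below. `EvictionLoopSketch` is a hypothetical standalone class; in the plan the loop lives inside `ResponseCache`. The `evictExpired` callback stands in for the scan that removes entries past `ExpiresAtUtc`, and the 100 ms floor is taken from the implementation clarifications later in this document.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class EvictionLoopSketch : IDisposable
{
    private readonly CancellationTokenSource _cts = new();
    private readonly PeriodicTimer _timer;
    private readonly Task _loop;

    public EvictionLoopSketch(int evictionIntervalMs, Action evictExpired)
    {
        // Clamp so a misconfigured interval of 0 cannot become a tight loop.
        int clampedMs = Math.Max(evictionIntervalMs, 100);
        _timer = new PeriodicTimer(TimeSpan.FromMilliseconds(clampedMs));
        _loop = RunAsync(evictExpired, _cts.Token);
    }

    // Exposed so callers (and tests) can observe that the loop has actually stopped.
    public Task Completion => _loop;

    private async Task RunAsync(Action evictExpired, CancellationToken ct)
    {
        try
        {
            // WaitForNextTickAsync returns false once the timer is disposed.
            while (await _timer.WaitForNextTickAsync(ct))
            {
                evictExpired();
            }
        }
        catch (OperationCanceledException)
        {
            // Expected on Dispose; nothing to clean up.
        }
    }

    public void Dispose()
    {
        _cts.Cancel();
        _timer.Dispose();
        _cts.Dispose();
    }
}
```

`Dispose` does not block on the loop task; a caller that needs a hard stop guarantee can wait on `Completion` afterwards.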
## Implementation clarifications discovered during this phase
The following clarifications were resolved while implementing Phase 11 — recorded here so
the next agent doesn't re-derive them:
- **`CacheKey` vs `CoalescingKey` — kept SEPARATE (no aliasing).** The two records carry
the same dimensions but live in different namespaces (`Mbproxy.Proxy.Cache` vs
`Mbproxy.Proxy.Multiplexing`). Aliasing them would couple the two phases' evolution; a
duplicate 4-field record-struct is cheap enough to justify keeping them independent.
Per-key equality is record-struct value equality; the two types are never compared.
- **`CacheEntry.LastUsedTick` is a `long`, not `ushort`.** The phase doc proposed `ushort`
but the LRU comparison needs to survive >65K touches in a long-running process. The
signed-long ticker stamp suffices for the lifetime of any reasonable deployment.
- **No-cacheable-tag PLCs skip the cache entirely.** When a PLC's resolved tag map has no
entry with `CacheTtlMs > 0`, `ProxyWorker` (and `ConfigReconciler` on reseat/add)
builds the `PerPlcContext` with `Cache = null`. The multiplexer's cache check is a
no-op on a null cache, and no eviction timer is started. The "default OFF =
byte-identical to Phase 10" regression test (`Cache_DisabledByDefault_*`) lands on this
code path.
- **Cache check runs BEFORE `EnsureBackendConnectedAsync`.** A cache hit serves the
upstream client even when the backend is currently unreachable. This is intentional and
matches the design contract bullet "cache survives backend disconnects." Verified by the
unit-level `FailedBackendConnect_OnFirstRead_DoesNotPreventLaterCacheHits_*` test.
- **FC06 / FC16 invalidation requires startAddr/qty parsing.** The multiplexer's request
parser previously only extracted start/qty for FC03/FC04. Phase 11 extends it to
FC06 (qty = 1) and FC16 (qty from request) so the InFlightRequest carries the write
span; the response path then invalidates by overlap using those values.
- **Cache eviction loop uses `PeriodicTimer`.** Per the phase doc; clamps the interval
to a 100 ms floor (operator-configurable down to that) so a misconfigured
`EvictionIntervalMs = 0` doesn't become a tight loop.
- **Write invalidation only fires on SUCCESSFUL responses.** The post-rewriter check at
the backend reader inspects the response FC byte for the exception-bit (`& 0x80`). An
exception response on FC06 / FC16 (e.g. PLC in PROGRAM mode → code 04) does NOT
invalidate — consistent with "the write didn't take effect."
- **Pre-existing flake in `BackendDisconnect_CascadesToAllUpstreams`** hardened with a
poll loop. The race window between "upstream EOF observed" and "BackendDisconnectCascades
counter incremented in `TearDownBackendAsync`" is inherent to the multiplexer's
serial-pipe-dispose loop; the test now polls for up to 1 s for the counter to reach 3.
Behaviour is unchanged.
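The FC06 / FC16 write-span parsing mentioned above can be sketched against the standard Modbus request PDU layout (function code, then big-endian start address, then either the value for FC06 or the register quantity for FC16). `WriteSpanSketch` and `TryGetWriteSpan` are illustrative names, not the multiplexer's actual parser.

```csharp
using System;
using System.Buffers.Binary;

public static class WriteSpanSketch
{
    // Extracts the register span a write request touches, for later overlap invalidation.
    public static bool TryGetWriteSpan(ReadOnlySpan<byte> requestPdu, out ushort startAddress, out ushort qty)
    {
        startAddress = 0;
        qty = 0;

        // Both FC06 and FC16 carry at least: fc(1) + address(2) + value-or-quantity(2).
        if (requestPdu.Length < 5) return false;

        switch (requestPdu[0])
        {
            case 0x06: // Write Single Register: always exactly one register
                startAddress = BinaryPrimitives.ReadUInt16BigEndian(requestPdu.Slice(1, 2));
                qty = 1;
                return true;

            case 0x10: // Write Multiple Registers: quantity is bytes 3-4 of the request
                startAddress = BinaryPrimitives.ReadUInt16BigEndian(requestPdu.Slice(1, 2));
                qty = BinaryPrimitives.ReadUInt16BigEndian(requestPdu.Slice(3, 2));
                return true;

            default:
                return false; // not a write this cache cares about
        }
    }
}
```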
## Cross-references
- Phase 9's multiplexer is the chokepoint that hosts the cache check: [`09-txid-multiplexing.md`](09-txid-multiplexing.md).
- Phase 10's `CoalescingKey` is the same shape as Phase 11's `CacheKey`: [`10-read-coalescing.md`](10-read-coalescing.md).
- The "not a polling/cache layer" stance that this phase pivots away from: [`../design.md`](../design.md) → "What this is" + "Purpose".
- KPI graduation target: [`../kpi.md`](../kpi.md) → Tier 1 (cache-hit-ratio joins this tier).
- Resolution rules for per-tag `CacheTtlMs` (Global / Add / Remove fallback + per-PLC default): [`../design.md`](../design.md) → "Hybrid tag resolution".
-107
@@ -1,107 +0,0 @@
# mbproxy — implementation plan
Phase-by-phase implementation plan for the `mbproxy` service. Each phase is a self-contained work spec with explicit deliverables, tests, and a gate checklist that must be green before the next phase begins. Settled against the design plan in [`../design.md`](../design.md) on 2026-05-13.
**Briefing a subagent for a phase:** hand it exactly three documents — the phase doc, [`../design.md`](../design.md), and [`../../DL260/dl205.md`](../../DL260/dl205.md). Tell it not to read other phase docs unless its own doc lists them under "Cross-references". The phase doc IS the contract.
## Phase graph
| # | Phase | Depends on | Parallel-safe with |
|---|-------|------------|--------------------|
| 00 | [Bootstrap](00-bootstrap.md) — host + DI + Serilog + options POCOs | — | (must run first, alone) |
| 01 | [Simulator harness](01-simulator-harness.md) — pymodbus xUnit fixture | 00 | 02 |
| 02 | [BCD codec](02-bcd-codec.md) — pure encode/decode logic | 00 | 01, 03 |
| 03 | [Proxy plumbing](03-proxy-plumbing.md) — TcpListener + 1:1 byte forwarder | 00 | 02 |
| 04 | [Rewriter integration](04-rewriter-integration.md) — wire codec into proxy | 02, 03 | — |
| 05 | [Listener supervisor](05-listener-supervisor.md) — Polly auto-recovery | 03 | — |
| 06 | [Hot-reload](06-hot-reload.md) — `IOptionsMonitor` reconcile | 05 | — |
| 07 | [Status page](07-status-page.md) — Kestrel admin endpoint | 05, 06 | — |
| 08 | [Service hardening](08-service-hardening.md) — Windows service + shutdown | 04, 07 | — |
| 09 | [TxId multiplexing](09-txid-multiplexing.md) — single backend connection per PLC (post-1.0 follow-on) | 04, 05, 07 | — |
| 10 | [Read coalescing](10-read-coalescing.md) — in-flight FC03/04 dedup (post-1.0 follow-on) | 09 | — |
| 11 | [Response cache](11-response-cache.md) — short-TTL post-response cache, bounded staleness (post-1.0; **design-contract pivot**) | 10 | — |
```
┌── 01 (sim) ──┐
00 ─────┼── 02 (codec) ─┼──── 04 ───┐
└── 03 (plumbing)┴── 05 ─── 06 ─── 07 ─── 08
└─────────────────→ 09 ───→ 10 ───→ 11 (post-1.0)
```
**Phases 09, 10, and 11 are post-1.0 follow-ons**, not part of the initial 1.0 release.
- **Phase 09** rewires the connection layer to lift the H2-ECOM100's 4-concurrent-client cap as an operational ceiling. Pick it up only after Phase 08 has shipped and field experience confirms the 4-client cap is a real production problem (not just a theoretical one).
- **Phase 10** plugs into Phase 09's `InterestedParties` seam to coalesce same-key FC03/04 reads within the in-flight window. Zero post-response staleness. Worth doing only if field telemetry shows meaningful read overlap (≥ 2× duplicate-read traffic from concurrent HMIs / historians).
- **Phase 11** extends the "served without backend traffic" window from in-flight milliseconds (Phase 10) to operator-configurable seconds via a per-tag TTL response cache. **This is a deliberate design-contract pivot**: the proxy stops being purely transparent and becomes an opt-in cache layer with bounded staleness. The cache is OFF by default; opting tags in is the operator's explicit acknowledgement of the staleness window. Pick up only if Phase 10's coalescing-ratio under real load reveals enough cross-poll overlap to justify staleness as a trade.
## Working with subagents
### Default: one subagent per phase, sequential
Spawn one Agent (Sonnet or Opus) per phase in order. Each agent reads exactly:
- Its own phase doc (under this directory).
- [`../design.md`](../design.md) — architecture, the source of truth.
- [`../../DL260/dl205.md`](../../DL260/dl205.md) — device quirks.
That is sufficient context. The agent must NOT invent scope beyond the phase doc's "Outputs" section. If it discovers a design-affecting issue, it must STOP and surface the issue rather than improvise — designs change in [`../design.md`](../design.md), not silently in code.
### Advanced: parallel subagents within a single phase boundary
Two phases marked "Parallel-safe with" each other can be picked up by independent subagents at the same time. The only safe parallel windows in this plan are:
- **Phase 01 ∥ Phase 02** (sim harness lives in `tests/sim/`, codec lives in `src/Mbproxy/Bcd/` — fully disjoint).
- **Phase 02 ∥ Phase 03** (codec is pure logic in `src/Mbproxy/Bcd/`; plumbing is in `src/Mbproxy/Proxy/` — disjoint).
- **Phase 01 ∥ Phase 02 ∥ Phase 03** (all three at once) is also safe, because each phase touches a different directory.
**Required pattern:**
1. Spawn each parallel agent with `isolation: "worktree"` (Agent tool's worktree mode creates an isolated git checkout).
2. Each agent gets ONE phase doc + design.md + dl205.md.
3. Each agent runs its phase gate locally before its worktree is committed.
4. Merge order: lower phase number first. Resolve conflicts manually if the agents drifted outside their declared output scope (which they shouldn't).
5. After merge, re-run the phase 00 smoke test plus both merged phases' tests to confirm no integration regression.
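The required pattern above could be driven by hand from a shell roughly as follows. This is a sketch under assumptions, not part of the plan: the `phase-NN` branch names, the worktree base directory, and the three helper function names are hypothetical, and the Agent tool's worktree mode may manage its checkouts differently.

```shell
#!/usr/bin/env bash
# Sketch only: branch names (phase-NN) and worktree paths are hypothetical.
set -euo pipefail

# Print the given phase numbers in merge order, lowest first (step 4).
phase_merge_order() {
  printf '%s\n' "$@" | sort -n
}

# Step 1: create one isolated worktree and branch per phase.
# Usage: make_phase_worktrees <repo> <worktree-base-dir> <phase>...
make_phase_worktrees() {
  local repo="$1" base="$2"; shift 2
  local phase
  for phase in "$@"; do
    git -C "$repo" worktree add -b "phase-${phase}" "${base}/phase-${phase}"
  done
}

# Step 4: merge the phase branches into the current branch, lowest phase first.
# Usage: merge_phases <repo> <phase>...
merge_phases() {
  local repo="$1"; shift
  local phase
  for phase in $(phase_merge_order "$@"); do
    git -C "$repo" merge --no-ff --no-edit "phase-${phase}"
  done
}
```

For example, `make_phase_worktrees "$PWD" /tmp/mbproxy-wt 01 02` followed later by `merge_phases "$PWD" 02 01` merges `phase-01` before `phase-02` regardless of the order the arguments were given in. Conflict resolution (step 4) and the post-merge test run (step 5) stay manual.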
**Hard rules — anti-patterns that break parallel work:**
- ❌ Any two phases editing the same `.csproj` PackageReference list at the same time. Phase 00 owns the initial csproj; later phases append PackageReferences atomically and a parallel pair must coordinate via separate `<ItemGroup>` blocks or sequential merges.
- ❌ Running phase 04 in parallel with anything (it integrates two prior phases — by definition it touches their outputs).
- ❌ Running phase 06 in parallel with anything (the hot-reload reconcile inspects state from listener supervisor + rewriter + counters; it has the widest cross-cut).
- ❌ Spawning more than 3 concurrent worktree agents (review/merge overhead grows superlinearly and the value disappears).
## Phase gate template
Every phase MUST be green on all of these before its branch is merged:
1. **Build is clean.** `dotnet build src/Mbproxy/Mbproxy.csproj -c Debug` with **zero warnings**. `<TreatWarningsAsErrors>true</TreatWarningsAsErrors>` is set in phase 00 and stays set forever.
2. **All unit tests pass.** `dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category!=E2E` is green.
3. **E2E tests pass when the simulator is available.** `dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter Category=E2E --blame-hang-timeout 2m` is green on a machine with Python + pymodbus installed. The `--blame-hang-timeout` is mandatory — never run E2E without it. Skipped tests (due to missing simulator) don't count as failures, but ANY test added in this phase must NOT skip when the sim IS available, and every E2E test MUST carry a `[Fact(Timeout = …)]` per the Test discipline rules below.
4. **No regressions in any prior phase's tests.** The full suite stays green.
5. **No new public types beyond what the phase doc declares.** Scope creep is a gate fail. If a needed type is missing from the doc, update the doc first.
6. **No `TODO` / `FIXME` / `HACK` comments committed.** Either resolve or file in the [Deferred](#deferred) section below.
7. **Design / docs are in sync.** If a design decision changed during the phase, [`../design.md`](../design.md) is updated in the same PR. Mirror the change to [`../../CLAUDE.md`](../../CLAUDE.md)'s Architecture summary only if it shifts one of the headline bullets.
8. **Phase doc itself is updated** to reflect any clarifications discovered during implementation, so the next subagent picking up the project doesn't relearn what this one learned.
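The mechanical gate items can be collected into one script. A minimal sketch, assuming it is run from the repository root: the function names, the use of `grep` for item 6, and the restriction to `*.cs` files are illustrative choices rather than anything the plan prescribes. Items 5, 7, and 8 still need human review.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the *.cs restriction are assumptions.
set -euo pipefail

# Item 6: succeed only if no TODO / FIXME / HACK marker exists under a directory.
no_forbidden_markers() {
  local dir="$1"
  # grep exits 0 when it finds a match, so a match means the gate fails.
  if grep -rnwE 'TODO|FIXME|HACK' --include='*.cs' "$dir"; then
    return 1
  fi
  return 0
}

# Items 1-4 and 6: build, unit tests, E2E tests with the hang backstop, marker scan.
run_phase_gate() {
  dotnet build src/Mbproxy/Mbproxy.csproj -c Debug
  dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter 'Category!=E2E'
  dotnet test tests/Mbproxy.Tests/Mbproxy.Tests.csproj --filter 'Category=E2E' --blame-hang-timeout 2m
  no_forbidden_markers src
  no_forbidden_markers tests
}
```

Calling `run_phase_gate` stops at the first failing step because of `set -e`. The zero-warnings requirement needs no extra flag here, since `<TreatWarningsAsErrors>` already turns any warning into a build failure.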
## Test discipline
- **Framework:** xUnit (v3 if available, v2 otherwise) + **Shouldly** for assertions. Never `Assert.Equal(x, y)` — always `y.ShouldBe(x)`. Never `Assert.True(p)` — always `p.ShouldBeTrue("reason")`.
- **Categories:** `[Trait("Category", "Unit")]` (default; no traits needed), `[Trait("Category", "E2E")]` (needs simulator), `[Trait("Category", "Stress")]` (slow / load-bearing — opt-in only).
- **No mocks for code we own.** Exercise our types directly. Mock only at the network/file/process boundary — and prefer a real local socket / real temp file over a mock when feasible.
- **Test naming:** `MethodOrScenario_Condition_ExpectedOutcome`. Example: `BcdCodec_Decode16_Returns1234_For0x1234`.
- **One assertion per test where reasonable.** Multi-assertion tests are acceptable when they assert facets of the same scenario; never when they're really separate tests glued together.
- **Every `[Trait("Category","E2E")]` test MUST declare a hard timeout** via `[Fact(Timeout = N)]` (xUnit v3, milliseconds). **Default: `5_000` ms.** Expand per-test only when the test genuinely needs longer (concurrent bursts > 100 ops, reload-propagation debounce, graceful-shutdown drain) — and add a one-line comment explaining why. Start tight; raise only when a real test fails with a non-deadlock reason. Reason this matters: the existing fixtures use synchronous NModbus calls and stub TCP servers that **do not honor `TestContext.Current.CancellationToken`** — without `[Fact(Timeout=…)]`, a deadlock in the proxy hangs the runner indefinitely. The same rule applies to `[Trait("Category","Stress")]`. Unit tests are exempt unless they touch real sockets or processes.
- **Run E2E with a hang backstop.** The phase gate's E2E command is `dotnet test ... --filter Category=E2E --blame-hang-timeout 2m`. The `--blame-hang-timeout` is a process-level safety net in case a test's individual `Timeout` somehow doesn't fire (e.g. an unmanaged thread blocking finalization).
## Deferred
A running list of things explicitly NOT done in any current phase. When a phase reveals one, add it here so it isn't forgotten and so the deferral is visible at review time:
- *(none yet)*
## Cross-references
- Architecture and load-bearing decisions: [`../design.md`](../design.md)
- Device quirks the proxy must respect: [`../../DL260/dl205.md`](../../DL260/dl205.md)
- pymodbus simulator profile that backs e2e tests: [`../../DL260/dl205.json`](../../DL260/dl205.json)
- As-deployed PLC parameters (port 502, BCD-by-default, swap bytes, etc.): [`../../DL260/mbtcp_settings.JPG`](../../DL260/mbtcp_settings.JPG)
+4 -1
@@ -165,7 +165,10 @@ if (-not (Test-Path $configDest)) {
 if (-not [System.Diagnostics.EventLog]::SourceExists('mbproxy')) {
     Write-Host "Registering Windows Event Log source 'mbproxy'..."
-    New-EventLog -Source 'mbproxy' -LogName 'Application'
+    # .NET API, not New-EventLog: the *-EventLog cmdlets exist only in Windows
+    # PowerShell 5.1, not PowerShell 7+. This call is symmetric with the
+    # SourceExists check above and works on every PowerShell edition.
+    [System.Diagnostics.EventLog]::CreateEventSource('mbproxy', 'Application')
 } else {
     Write-Host "Windows Event Log source 'mbproxy' already registered."
 }
+134
@@ -0,0 +1,134 @@
#!/usr/bin/env bash
#
# install.sh — install the mbproxy service on a Linux / systemd host.
#
# The Linux counterpart of install.ps1. Copies the published binary to
# /opt/mbproxy, seeds the config at /etc/mbproxy/appsettings.json (preserving any
# existing one), creates the log and bundle-cache directories and the mbproxy
# service account, installs the systemd unit, and enables + starts the service.
#
# Re-running on an already-installed service is safe (idempotent): the binary is
# refreshed, an existing /etc/mbproxy/appsettings.json is preserved, and the
# service is restarted.
#
# Usage:
# sudo ./install.sh [--publish-dir DIR] [--no-start]
#
# --publish-dir DIR directory containing the published Mbproxy binary.
# Default: <repo>/publish-out/self-contained
# --no-start install and enable the unit but do not start it.
#
set -euo pipefail
# ── 0. Settings ──────────────────────────────────────────────────────────────
SERVICE_NAME="mbproxy"
SERVICE_USER="mbproxy"
INSTALL_DIR="/opt/mbproxy"
CONFIG_DIR="/etc/mbproxy"
LOG_DIR="/var/log/mbproxy"
CACHE_DIR="/var/cache/mbproxy"
UNIT_DEST="/etc/systemd/system/${SERVICE_NAME}.service"
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
repo_root="$(dirname "$script_dir")"
publish_dir="${repo_root}/publish-out/self-contained"
start_service=1
while [[ $# -gt 0 ]]; do
case "$1" in
--publish-dir) publish_dir="${2:?--publish-dir requires a directory argument}"; shift 2 ;;
--no-start) start_service=0; shift ;;
*) echo "Unknown argument: $1" >&2; exit 2 ;;
esac
done
# ── 1. Pre-flight checks ─────────────────────────────────────────────────────
if [[ "$(id -u)" -ne 0 ]]; then
echo "install.sh must run as root (use sudo)." >&2
exit 1
fi
binary_src="${publish_dir}/Mbproxy"
if [[ ! -f "$binary_src" ]]; then
echo "Mbproxy binary not found at '${binary_src}'." >&2
echo "Run install/publish.sh first, or pass --publish-dir." >&2
exit 1
fi
unit_src="${script_dir}/mbproxy.service"
config_src="${publish_dir}/appsettings.json"
if [[ ! -f "$unit_src" ]]; then
echo "Unit file not found at '${unit_src}'." >&2
exit 1
fi
echo "Installing ${SERVICE_NAME} service..."
echo " Publish dir : ${publish_dir}"
echo " Install dir : ${INSTALL_DIR}"
echo " Config dir : ${CONFIG_DIR}"
# ── 2. Service account ───────────────────────────────────────────────────────
if ! id -u "$SERVICE_USER" >/dev/null 2>&1; then
echo "Creating service account '${SERVICE_USER}'..."
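# --user-group: the unit's Group= and the directory ownership below need a matching group.
useradd --system --user-group --no-create-home --shell /usr/sbin/nologin "$SERVICE_USER"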
else
echo "Service account '${SERVICE_USER}' already exists."
fi
# ── 3. Stop the service if running (so the binary can be replaced) ───────────
if systemctl is-active --quiet "$SERVICE_NAME" 2>/dev/null; then
echo "Stopping running service '${SERVICE_NAME}'..."
systemctl stop "$SERVICE_NAME"
fi
# ── 4. Directories ───────────────────────────────────────────────────────────
install -d -m 0755 "$INSTALL_DIR"
install -d -m 0755 "$CONFIG_DIR"
install -d -m 0750 -o "$SERVICE_USER" -g "$SERVICE_USER" "$LOG_DIR"
install -d -m 0750 -o "$SERVICE_USER" -g "$SERVICE_USER" "$CACHE_DIR"
# ── 5. Binary ────────────────────────────────────────────────────────────────
echo "Copying binary to '${INSTALL_DIR}/Mbproxy'..."
install -m 0755 "$binary_src" "${INSTALL_DIR}/Mbproxy"
# ── 6. Config (preserve an existing one) ─────────────────────────────────────
config_dest="${CONFIG_DIR}/appsettings.json"
if [[ -f "$config_dest" ]]; then
echo "Preserving existing config at '${config_dest}'."
elif [[ -f "$config_src" ]]; then
echo "Seeding config template to '${config_dest}'..."
install -m 0644 "$config_src" "$config_dest"
else
echo "WARNING: no appsettings.json in '${publish_dir}' — create '${config_dest}' manually." >&2
fi
# ── 7. systemd unit ──────────────────────────────────────────────────────────
echo "Installing systemd unit to '${UNIT_DEST}'..."
install -m 0644 "$unit_src" "$UNIT_DEST"
systemctl daemon-reload
systemctl enable "$SERVICE_NAME" >/dev/null
# ── 8. Start ─────────────────────────────────────────────────────────────────
if [[ "$start_service" -eq 1 ]]; then
echo "Starting service '${SERVICE_NAME}'..."
systemctl start "$SERVICE_NAME"
sleep 1
if systemctl is-active --quiet "$SERVICE_NAME"; then
echo "Service '${SERVICE_NAME}' is running."
else
echo "WARNING: service '${SERVICE_NAME}' did not reach active state." >&2
echo "Check: journalctl -u ${SERVICE_NAME} -e" >&2
fi
fi
echo ""
echo "Install complete."
echo " Config : ${config_dest}"
echo " Logs : ${LOG_DIR}"
echo " Binary : ${INSTALL_DIR}/Mbproxy"
echo ""
echo "Next steps:"
echo " 1. Edit '${config_dest}' to configure your PLC list and BCD tags."
echo " 2. Restart: sudo systemctl restart ${SERVICE_NAME}"
echo " 3. Logs: journalctl -u ${SERVICE_NAME} -f"
echo " 4. Status: http://localhost:8080/"
+29 -2
@@ -99,7 +99,34 @@
 // Max time (ms) to wait for in-flight PDUs to complete during graceful shutdown
 // (sc.exe stop / Windows Service stop signal). After this deadline the coordinator
 // cancels remaining work and proceeds. Keep at or below the SCM wait-hint (30 s).
-"GracefulShutdownTimeoutMs": 10000
+"GracefulShutdownTimeoutMs": 10000,
+// Keepalive / connection monitoring
+// The DL205/DL260 ECOM does not emit TCP keepalives, so an idle backend
+// socket can be silently dropped by a middlebox (switch, firewall, NAT)
+// after 2-5 minutes. This section enables OS-level SO_KEEPALIVE on both
+// backend and upstream sockets, and drives a periodic Modbus FC03 heartbeat
+// on each idle backend socket so a dead path is detected before a real
+// client request hits it. See docs/Architecture/Keepalive.md.
+"Keepalive": {
+// Master switch. false means no SO_KEEPALIVE and no heartbeat; the proxy
+// behaves exactly as a pre-keepalive build.
+"Enabled": true,
+// SO_KEEPALIVE: idle time (ms) before the OS sends its first probe.
+"TcpIdleTimeMs": 30000,
+// SO_KEEPALIVE: interval (ms) between probes once the idle time elapses.
+"TcpProbeIntervalMs": 5000,
+// SO_KEEPALIVE: unanswered probes before the OS declares the socket dead.
+"TcpProbeCount": 4,
+// Backend heartbeat: after this much backend idle (ms) the proxy issues a
+// synthetic FC03 qty=1 read to keep the path warm and prove the ECOM is
+// still answering Modbus. Must be greater than BackendRequestTimeoutMs.
+"BackendHeartbeatIdleMs": 30000,
+// FC03 PDU address the heartbeat reads. 0 = V0, valid on DL205/DL260.
+"BackendHeartbeatProbeAddress": 0
+}
 },
 // Resilience policies
@@ -170,7 +197,7 @@
 // EvictionIntervalMs: background eviction tick. Scans each PLC's cache and
 // removes entries past their TTL. Defaults to 5000.
 //
-// Properties (full text in docs/design.md "Response cache"):
+// Properties (full text in docs/Architecture/ResponseCache.md):
 // * Cache hits SHORT-CIRCUIT coalescing entirely (cache -> coalesce -> backend).
 // * Successful FC06/FC16 write responses invalidate every cached FC03/FC04 entry
 // whose address range OVERLAPS the write, not just exact-key match.
@@ -0,0 +1,255 @@
// mbproxy configuration template (Linux / systemd): copy to /etc/mbproxy/appsettings.json
// and edit before starting the service.
//
// The .NET configuration loader accepts // and /* */ comments in JSON files
// (JSONC semantics) when using the default Host.CreateApplicationBuilder path.
//
// IMPORTANT: install.sh copies this file to the destination ONLY if no
// appsettings.json already exists there. An existing file is always preserved.
//
// This is the Linux counterpart of mbproxy.config.template.json: identical except
// for the rolling-log path (/var/log/mbproxy) and a few platform notes. It is shipped
// as appsettings.json by a `dotnet publish -r linux-*` build.
{
"Mbproxy": {
// Global BCD tag list
// These tags apply to EVERY PLC by default.
// Each entry: Address (Modbus PDU address, decimal), Width (16 or 32 bits).
//
// Width 16: one register holds 4 BCD digits (0-9999).
// Wire value 0x1234 decodes to decimal 1234.
//
// Width 32: a CDAB-ordered register pair (Address = low word, Address+1 = high word).
// Decoded decimal = high * 10000 + low (DirectLOGIC CDAB word order).
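//
// Worked example for Width 32 (illustrative register values, not taken from a real PLC):
// register[Address] = 0x5678 (low word, decodes to 5678)
// register[Address+1] = 0x1234 (high word, decodes to 1234)
// decoded decimal = 1234 * 10000 + 5678 = 12345678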
//
// Per-PLC overrides (see Plcs[].BcdTags below):
// Add: appends extra tags beyond what Global defines, or overrides a
// Global entry's Width when the same Address appears in both.
// Remove: removes specific addresses from the effective set for that PLC.
// Effective set = (Global union Add) minus Remove, resolved per PDU.
"BcdTags": {
"Global": [
// V2000 (octal) = decimal address 1024. 16-bit BCD counter.
{ "Address": 1024, "Width": 16 },
// V2040 (octal) = decimal address 1056. 32-bit BCD total at 1056/1057.
{ "Address": 1056, "Width": 32 },
// V2100 (octal) = decimal address 1088. 16-bit BCD setpoint.
//
// Phase 11: CacheTtlMs (optional) opts this tag into the response cache. With
// CacheTtlMs > 0 set, upstream clients reading this register will see values up
// to CacheTtlMs MILLISECONDS OLD; setting it is the explicit acknowledgement of
// the staleness window. Default (omitted or 0) = cache disabled for this tag.
// The cache is OFF by default for every tag.
{ "Address": 1088, "Width": 16 /* , "CacheTtlMs": 1000 */ }
]
},
// PLC list
// Each entry maps one upstream proxy port to one backend PLC.
// Upstream clients connect to ListenPort; the proxy forwards to Host:Port.
//
// IMPORTANT: H2-ECOM100 modules accept at most 4 simultaneous TCP connections.
// With the 1:1 upstream-to-backend model, a fifth upstream client to the same proxy
// port will cause a backend connect failure and an immediate upstream disconnect.
"Plcs": [
{
"Name": "Line1-Mixer", // Human-readable name (shown on status page and in logs)
"ListenPort": 5020, // Port the proxy listens on (upstream clients connect here)
"Host": "10.0.1.1", // PLC IP address or hostname
"Port": 502, // PLC Modbus TCP port (almost always 502)
"BcdTags": {
// Additional 32-bit tag specific to this PLC only.
"Add": [
{ "Address": 1200, "Width": 32 }
],
// Remove address 1056 from the Global list for this PLC
// (this mixer doesn't use the 32-bit BCD total).
"Remove": [ 1056 ]
}
},
{
"Name": "Line1-Conveyor",
"ListenPort": 5021,
"Host": "10.0.1.2",
"Port": 502
// No BcdTags override: uses the Global set as-is.
}
// Add one entry per PLC. Ports must be unique per host. Typical fleet: 54 PLCs.
],
// Admin port
// Read-only HTTP status page.
// GET /             returns self-contained HTML (auto-refreshes every 5 s)
// GET /status.json  returns the same data as JSON for monitoring scrapers
//
// Authentication is assumed at the network layer (trusted internal segment).
// Set to 0 to disable the admin endpoint.
"AdminPort": 8080,
// Connection timeouts
"Connection": {
// Max time (ms) to wait for a TCP connect to the PLC backend.
// Each Polly retry attempt gets its own copy of this timeout.
"BackendConnectTimeoutMs": 3000,
// Max time (ms) to wait for the PLC to respond to a forwarded PDU.
// Non-idempotent FC06/FC16 writes are one-shot: the upstream client
// is disconnected immediately on timeout (no retry).
"BackendRequestTimeoutMs": 3000,
// Max time (ms) to wait for in-flight PDUs to complete during graceful shutdown
// (systemctl stop sends SIGTERM). After this deadline the coordinator cancels
// remaining work and proceeds. Keep at or below the unit's TimeoutStopSec.
"GracefulShutdownTimeoutMs": 10000,
// Keepalive / connection monitoring
// The DL205/DL260 ECOM does not emit TCP keepalives, so an idle backend
// socket can be silently dropped by a middlebox (switch, firewall, NAT)
// after 2-5 minutes. This section enables OS-level SO_KEEPALIVE on both
// backend and upstream sockets, and drives a periodic Modbus FC03 heartbeat
// on each idle backend socket so a dead path is detected before a real
// client request hits it. See docs/Architecture/Keepalive.md.
"Keepalive": {
// Master switch. false means no SO_KEEPALIVE and no heartbeat; the proxy
// behaves exactly as a pre-keepalive build.
"Enabled": true,
// SO_KEEPALIVE: idle time (ms) before the OS sends its first probe.
"TcpIdleTimeMs": 30000,
// SO_KEEPALIVE: interval (ms) between probes once the idle time elapses.
"TcpProbeIntervalMs": 5000,
// SO_KEEPALIVE: unanswered probes before the OS declares the socket dead.
"TcpProbeCount": 4,
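// Worked example (approximate; exact timing depends on the OS TCP stack):
// with the three SO_KEEPALIVE values above, a silently dropped peer is declared
// dead after about TcpIdleTimeMs + TcpProbeCount * TcpProbeIntervalMs
// = 30000 + 4 * 5000 = 50000 ms, well inside the 2-5 minute middlebox window.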
// Backend heartbeat: after this much backend idle (ms) the proxy issues a
// synthetic FC03 qty=1 read to keep the path warm and prove the ECOM is
// still answering Modbus. Must be greater than BackendRequestTimeoutMs.
"BackendHeartbeatIdleMs": 30000,
// FC03 PDU address the heartbeat reads. 0 = V0, valid on DL205/DL260.
"BackendHeartbeatProbeAddress": 0
}
},
// Resilience policies
"Resilience": {
// Polly retry policy for backend TCP connect attempts.
// MaxAttempts: total connect tries (including the first).
// BackoffMs: delay between each attempt (must have MaxAttempts-1 entries).
"BackendConnect": {
"MaxAttempts": 3,
"BackoffMs": [ 100, 500, 2000 ]
},
// Polly recovery policy for listener bind failures.
// If a PLC's listen port can't be bound (in-use, bad IP, transient OS error),
// the supervisor retries according to this schedule.
// InitialBackoffMs: backoff per step (first N retries).
// SteadyStateMs: backoff for all subsequent retries (runs indefinitely).
"ListenerRecovery": {
"InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ],
"SteadyStateMs": 30000
},
// Phase 10: in-flight read coalescing.
//
// When two or more upstream clients (HMI / historian / engineering workstation /
// gateway) issue the SAME FC03 or FC04 read while a matching backend round-trip is
// already in flight, the proxy attaches the late arrivals to the existing in-flight
// entry and fans the single PLC response out to every attached client, saving the
// ECOM's per-scan PDU budget on duplicated reads.
//
// Zero post-response staleness: coalescing operates ONLY between "first request
// sent to PLC" and "response received from PLC" (microseconds to ~10 ms typical).
// Each upstream client still sees its own MBAP transaction ID echoed correctly;
// the proxy is transparent.
//
// FC06 / FC16 writes are NEVER coalesced (non-idempotent). FC03 vs FC04 are
// separate Modbus tables and never share a coalescing key. Different unit IDs
// (multi-drop / gateway-backed setups) never coalesce.
//
// Enabled: master switch. Hot-reloadable; flipping to false leaves running
// coalesced entries to drain naturally.
// MaxParties: per-entry cap on attached parties. Past the cap, the next
// identical request opens a fresh backend round-trip (load-shedding
// safety valve for very fan-out-heavy fleets).
"ReadCoalescing": {
"Enabled": true,
"MaxParties": 32
}
},
// Response cache (Phase 11): opt-in bounded-staleness cache
//
// DESIGN-CONTRACT PIVOT: with caching enabled the proxy is no longer purely
// transparent. Upstream FC03/FC04 reads for cache-enabled tags may return values
// up to CacheTtlMs MILLISECONDS OLD. Operators opt tags in by setting a non-zero
// CacheTtlMs on a BcdTagOptions entry (or DefaultCacheTtlMs on a PlcOptions entry).
//
// The cache is OFF BY DEFAULT for every tag. A deployment with NO TTL config (this
// section entirely absent and no BcdTags.*.CacheTtlMs / Plcs[i].DefaultCacheTtlMs)
// behaves IDENTICALLY to a pre-Phase-11 deployment: no behaviour change.
//
// AllowLongTtl: gate for any CacheTtlMs > 60_000. Reload validation
// rejects configs that exceed 60 s without this opt-in,
// to prevent accidentally-stale-for-an-hour deployments.
// MaxEntriesPerPlc: LRU cap per-PLC. Past this cap, the next insert evicts
// the least-recently-used entry. Defaults to 1000.
// EvictionIntervalMs: background eviction tick. Scans each PLC's cache and
// removes entries past their TTL. Defaults to 5000.
//
// Properties (full text in docs/Architecture/ResponseCache.md):
// * Cache hits SHORT-CIRCUIT coalescing entirely (cache -> coalesce -> backend).
// * Successful FC06/FC16 write responses invalidate every cached FC03/FC04 entry
// whose address range OVERLAPS the write, not just exact-key match.
// * Multi-tag read range: effective TTL = min(TTLs). Any tag with TTL=0 in the
// range disables caching for the whole read.
// * Cache stores POST-rewriter bytes; hits never re-invoke the BCD rewriter.
// * Tag-list hot-reload flushes the affected PLC's whole cache.
// * No persistence: a process restart wipes the cache.
"Cache": {
"AllowLongTtl": false,
"MaxEntriesPerPlc": 1000,
"EvictionIntervalMs": 5000
}
},
// Serilog
// Structured log output. Default: Information level, console + rolling-file.
// The console sink is captured by systemd-journald (view with `journalctl -u mbproxy`).
// In addition, when mbproxy runs as a systemd service the SyslogBridge writes Error+
// events to the local syslog with proper RFC5424 severity (wired in code, not here).
"Serilog": {
"Using": [ "Serilog.Sinks.Console", "Serilog.Sinks.File" ],
"MinimumLevel": {
"Default": "Information",
"Override": {
"Microsoft": "Warning",
"System": "Warning"
}
},
"WriteTo": [
{
"Name": "Console",
"Args": {
"outputTemplate": "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}"
}
},
{
"Name": "File",
"Args": {
// Rolling log: one file per day, kept for 30 days, under /var/log/mbproxy
// (created by install.sh and owned by the mbproxy service account).
// Survives uninstall: uninstall.sh archives logs to /var/log/mbproxy.archived-<ts>.
"path": "/var/log/mbproxy/mbproxy-.log",
"rollingInterval": "Day",
"retainedFileCountLimit": 30,
"outputTemplate": "[{Timestamp:yyyy-MM-dd HH:mm:ss.fff zzz} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}"
}
}
]
}
}
+45
@@ -0,0 +1,45 @@
# systemd unit for mbproxy — the Modbus TCP BCD proxy.
#
# Installed to /etc/systemd/system/mbproxy.service by install.sh.
# The Linux counterpart of the Windows Service registered by install.ps1.
#
# Type=exec (not Type=notify): mbproxy is a leaf service that nothing orders
# against, so systemd's readiness signal is unnecessary. Type=exec marks the
# unit active once the binary is exec'd; graceful stop still works because the
# .NET generic host handles SIGTERM directly (drains in-flight requests within
# Connection.GracefulShutdownTimeoutMs).
[Unit]
Description=mbproxy — Modbus TCP BCD proxy
After=network-online.target
Wants=network-online.target
[Service]
Type=exec
ExecStart=/opt/mbproxy/Mbproxy
WorkingDirectory=/etc/mbproxy
User=mbproxy
Group=mbproxy
# Restart on crash, but not on a clean SIGTERM stop.
Restart=on-failure
RestartSec=5
# Keep above Connection.GracefulShutdownTimeoutMs (default 10 s) so the drain
# completes before systemd escalates to SIGKILL.
TimeoutStopSec=30
# Self-contained single-file publish: pin native-library extraction to a stable,
# writable directory (install.sh creates it and grants the mbproxy account access).
Environment=DOTNET_BUNDLE_EXTRACT_BASE_DIR=/var/cache/mbproxy
# Hardening. The service only needs to write its log and bundle-cache directories.
NoNewPrivileges=true
ProtectSystem=strict
ProtectHome=true
PrivateTmp=true
ReadWritePaths=/var/log/mbproxy /var/cache/mbproxy
# If any configured ListenPort is below 1024, also add:
# AmbientCapabilities=CAP_NET_BIND_SERVICE
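# A drop-in keeps that change out of this file, which install.sh overwrites on
# every run. A sketch, assuming the default systemd override location:
#   sudo systemctl edit mbproxy
#   # then, in the editor that opens:
#   #   [Service]
#   #   AmbientCapabilities=CAP_NET_BIND_SERVICE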
[Install]
WantedBy=multi-user.target
+89
@@ -0,0 +1,89 @@
<#
.SYNOPSIS
Publishes the Mbproxy binary in two flavours: self-contained and framework-dependent.
.DESCRIPTION
Produces two single-file builds for the requested runtime under <repo>\publish-out\:
self-contained\ ~100 MB: bundles the .NET 10 + ASP.NET Core runtime;
no .NET install needed on the target.
framework-dependent\ ~1.6 MB: requires the .NET 10 + ASP.NET Core runtime
preinstalled on the target.
The runtime is selected with -Rid (default win-x64). The binary is Mbproxy.exe on
Windows RIDs and Mbproxy on Linux/macOS RIDs.
Both builds use the Release configuration and inherit the publish settings declared
in src\Mbproxy\Mbproxy.csproj (PublishSingleFile=true, SelfContained=true,
IncludeNativeLibrariesForSelfExtract=true; those settings are gated on an explicit
RID, which is supplied here). The framework-dependent build overrides
SelfContained=false on the command line.
.PARAMETER Rid
.NET runtime identifier to publish for. Examples: win-x64, linux-x64.
Default: win-x64
.PARAMETER OutputDir
Root output directory. Two subfolders are created beneath it.
Default: <repo>\publish-out
.PARAMETER Clean
Delete OutputDir before publishing.
.EXAMPLE
.\publish.ps1
.\publish.ps1 -Rid linux-x64
.\publish.ps1 -Rid win-x64 -Clean
#>
[CmdletBinding()]
param(
[string]$Rid = 'win-x64',
[string]$OutputDir = (Join-Path (Split-Path -Parent $PSScriptRoot) 'publish-out'),
[switch]$Clean
)
$ErrorActionPreference = 'Stop'
$repoRoot = Split-Path -Parent $PSScriptRoot
$csproj = Join-Path $repoRoot 'src\Mbproxy\Mbproxy.csproj'
if (-not (Test-Path $csproj)) {
throw "Cannot find $csproj"
}
if ($Clean -and (Test-Path $OutputDir)) {
Write-Host "Cleaning $OutputDir" -ForegroundColor Yellow
Remove-Item -Recurse -Force $OutputDir
}
# Binary name: Windows RIDs produce an .exe, every other RID produces an extensionless ELF/Mach-O.
$exeName = if ($Rid -like 'win-*') { 'Mbproxy.exe' } else { 'Mbproxy' }
$selfContainedOut = Join-Path $OutputDir 'self-contained'
$frameworkDependentOut = Join-Path $OutputDir 'framework-dependent'
Write-Host "`n=== Publishing self-contained ($Rid, ~100 MB) ===" -ForegroundColor Cyan
& dotnet publish $csproj -c Release -r $Rid -o $selfContainedOut --nologo
if ($LASTEXITCODE -ne 0) { throw "self-contained publish failed (exit $LASTEXITCODE)" }
Write-Host "`n=== Publishing framework-dependent ($Rid, ~1.6 MB) ===" -ForegroundColor Cyan
& dotnet publish $csproj -c Release -r $Rid -p:SelfContained=false -p:PublishSingleFile=true -o $frameworkDependentOut --nologo
if ($LASTEXITCODE -ne 0) { throw "framework-dependent publish failed (exit $LASTEXITCODE)" }
function Format-Size {
param([long]$Bytes)
if ($Bytes -ge 1MB) { '{0:N1} MB' -f ($Bytes / 1MB) }
else { '{0:N1} KB' -f ($Bytes / 1KB) }
}
Write-Host "`n=== Result ($Rid) ===" -ForegroundColor Green
foreach ($flavour in 'self-contained','framework-dependent') {
$bin = Join-Path $OutputDir "$flavour\$exeName"
if (Test-Path $bin) {
$size = (Get-Item $bin).Length
Write-Host (" {0,-22} {1,10} {2}" -f $flavour, (Format-Size $size), $bin)
} else {
Write-Warning "Missing: $bin"
}
}
Write-Host ""
+82
@@ -0,0 +1,82 @@
#!/usr/bin/env bash
#
# publish.sh — Linux/macOS counterpart of publish.ps1.
#
# Publishes the Mbproxy binary in two flavours for the requested runtime under
# <repo>/publish-out/:
#
# self-contained/ ~100 MB — bundles the .NET 10 + ASP.NET Core runtime;
# no .NET install needed on the target.
# framework-dependent/ ~1.6 MB — requires the .NET 10 + ASP.NET Core runtime
# preinstalled on the target.
#
# Both builds use the Release configuration and inherit the publish settings in
# src/Mbproxy/Mbproxy.csproj (those settings are gated on an explicit RID, which
# is supplied here). The framework-dependent build overrides SelfContained=false.
#
# Usage:
# ./publish.sh [-r RID] [-o OUTPUT_DIR] [--clean]
#
# -r RID .NET runtime identifier (default: linux-x64)
# -o OUTPUT_DIR root output directory (default: <repo>/publish-out)
# --clean delete OUTPUT_DIR before publishing
#
# Examples:
# ./publish.sh
# ./publish.sh -r linux-x64 --clean
#
set -euo pipefail
rid="linux-x64"
script_dir="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
repo_root="$(dirname "$script_dir")"
output_dir="$repo_root/publish-out"
clean=0
while [[ $# -gt 0 ]]; do
case "$1" in
-r) rid="${2:?-r requires a runtime identifier}"; shift 2 ;;
-o) output_dir="${2:?-o requires a directory argument}"; shift 2 ;;
--clean) clean=1; shift ;;
*) echo "Unknown argument: $1" >&2; exit 2 ;;
esac
done
csproj="$repo_root/src/Mbproxy/Mbproxy.csproj"
if [[ ! -f "$csproj" ]]; then
echo "Cannot find $csproj" >&2
exit 1
fi
if [[ "$clean" -eq 1 && -d "$output_dir" ]]; then
echo "Cleaning $output_dir"
rm -rf "$output_dir"
fi
# Binary name: Windows RIDs produce an .exe, every other RID an extensionless binary.
if [[ "$rid" == win-* ]]; then bin_name="Mbproxy.exe"; else bin_name="Mbproxy"; fi
self_contained_out="$output_dir/self-contained"
framework_dependent_out="$output_dir/framework-dependent"
echo
echo "=== Publishing self-contained ($rid, ~100 MB) ==="
dotnet publish "$csproj" -c Release -r "$rid" -o "$self_contained_out" --nologo
echo
echo "=== Publishing framework-dependent ($rid, ~1.6 MB) ==="
dotnet publish "$csproj" -c Release -r "$rid" \
-p:SelfContained=false -p:PublishSingleFile=true -o "$framework_dependent_out" --nologo
echo
echo "=== Result ($rid) ==="
for flavour in self-contained framework-dependent; do
bin="$output_dir/$flavour/$bin_name"
if [[ -f "$bin" ]]; then
size="$(du -h "$bin" | cut -f1)"
printf ' %-22s %8s %s\n' "$flavour" "$size" "$bin"
else
echo " WARNING: missing $bin" >&2
fi
done
echo
+4 -1
@@ -122,7 +122,10 @@ if (Test-Path $InstallPath) {
 if ([System.Diagnostics.EventLog]::SourceExists('mbproxy')) {
 Write-Host "Removing Windows Event Log source 'mbproxy'..."
 try {
-Remove-EventLog -Source 'mbproxy'
+# .NET API, not Remove-EventLog: the *-EventLog cmdlets exist only in
+# Windows PowerShell 5.1, not PowerShell 7+. Symmetric with the
+# SourceExists check above.
+[System.Diagnostics.EventLog]::DeleteEventSource('mbproxy')
 } catch {
 Write-Warning "Could not remove Event Log source: $_"
 }
+85
@@ -0,0 +1,85 @@
#!/usr/bin/env bash
#
# uninstall.sh — remove the mbproxy service from a Linux / systemd host.
#
# The Linux counterpart of uninstall.ps1. Stops and disables the service,
# removes the systemd unit and installed files, and (unless --keep-config)
# removes the config directory. Log files are always preserved: they are moved
# to a timestamped archive so post-uninstall diagnostics remain accessible.
#
# Usage:
#   sudo ./uninstall.sh [--keep-config] [--keep-user]
#
#   --keep-config   leave /etc/mbproxy/appsettings.json in place.
#   --keep-user     leave the mbproxy service account in place.
#
set -euo pipefail
SERVICE_NAME="mbproxy"
SERVICE_USER="mbproxy"
INSTALL_DIR="/opt/mbproxy"
CONFIG_DIR="/etc/mbproxy"
LOG_DIR="/var/log/mbproxy"
CACHE_DIR="/var/cache/mbproxy"
UNIT_DEST="/etc/systemd/system/${SERVICE_NAME}.service"
keep_config=0
keep_user=0
while [[ $# -gt 0 ]]; do
    case "$1" in
        --keep-config) keep_config=1; shift ;;
        --keep-user) keep_user=1; shift ;;
        *) echo "Unknown argument: $1" >&2; exit 2 ;;
    esac
done
if [[ "$(id -u)" -ne 0 ]]; then
    echo "uninstall.sh must run as root (use sudo)." >&2
    exit 1
fi
echo "Uninstalling ${SERVICE_NAME} service..."
# ── 1. Stop + disable the service ────────────────────────────────────────────
if systemctl list-unit-files "${SERVICE_NAME}.service" >/dev/null 2>&1 \
    && [[ -n "$(systemctl list-unit-files "${SERVICE_NAME}.service" --no-legend 2>/dev/null)" ]]; then
    echo "Stopping and disabling '${SERVICE_NAME}'..."
    systemctl disable --now "$SERVICE_NAME" >/dev/null 2>&1 || true
fi
# ── 2. Remove the systemd unit ───────────────────────────────────────────────
if [[ -f "$UNIT_DEST" ]]; then
    echo "Removing systemd unit '${UNIT_DEST}'..."
    rm -f "$UNIT_DEST"
fi
systemctl daemon-reload
systemctl reset-failed "$SERVICE_NAME" >/dev/null 2>&1 || true
# ── 3. Archive logs (always preserved, never deleted) ────────────────────────
if [[ -d "$LOG_DIR" ]]; then
    timestamp="$(date -u +%Y%m%dT%H%M%SZ)"
    archive_dir="${LOG_DIR}.archived-${timestamp}"
    echo "Archiving logs to '${archive_dir}'..."
    mv "$LOG_DIR" "$archive_dir"
fi
# ── 4. Remove installed files ────────────────────────────────────────────────
rm -rf "$INSTALL_DIR" "$CACHE_DIR"
if [[ "$keep_config" -eq 1 ]]; then
    echo "Keeping config at '${CONFIG_DIR}/appsettings.json' (--keep-config)."
else
    rm -rf "$CONFIG_DIR"
fi
# ── 5. Remove the service account ────────────────────────────────────────────
if [[ "$keep_user" -eq 0 ]] && id -u "$SERVICE_USER" >/dev/null 2>&1; then
    echo "Removing service account '${SERVICE_USER}'..."
    userdel "$SERVICE_USER" 2>/dev/null || true
fi
echo ""
echo "Uninstall complete."
if compgen -G "${LOG_DIR}.archived-*" >/dev/null; then
    echo "Archived logs: ${LOG_DIR}.archived-*"
fi
+576
@@ -0,0 +1,576 @@
# mbproxy Multiplatform Implementation Plan
**Created:** 2026-05-15
**Status:** All six phases implemented. 413 tests green on Windows; Windows Service and
Linux systemd install E2E both green. Two findings (pymodbus-sim-on-Linux, `AddSystemd()`
notify) logged as orthogonal follow-ups. Working tree only — nothing committed.
**Working artifact** — not part of the `docs/` source-of-truth tree (per `../DOCS-GUIDE.md`).
Delete or archive once the work lands.
### Progress log
- **2026-05-15 — Phase 1 done, Gate 1 green.** RID removed from `csproj`
(single-file settings now gated on `'$(RuntimeIdentifier)' != ''`);
`publish.ps1` gained `-Rid`; `publish.sh` added. `dotnet build -c Debug` 0
warnings; `dotnet test` **398 passed / 0 failed** (baseline 325 → 398, the
Keepalive feature added tests); `win-x64` → `Mbproxy.exe` 100.1 MB,
`linux-x64` → `Mbproxy` ELF 97.2 MB. ELF launch-smoked on `10.100.0.35`:
full startup, listeners bound, `mbproxy.startup.ready` + admin endpoint up,
no errors. Box prep done (.NET SDK 10.0.300, shellcheck 0.10.0 installed).
- **2026-05-15 — Phases 2 + 3 code done (combined integrator pass).** Packages
added: `Microsoft.Extensions.Hosting.Systemd` 10.0.8,
`Serilog.Sinks.SyslogMessages` 4.1.0 (the maintained IonxSolutions package —
the bare `Serilog.Sinks.Syslog` ID is a near-abandoned 0.2.0 package; same
approved intent). New `DiagnosticSink` enum + `DiagnosticSinkSelector` (pure);
new `SyslogBridge`; `EventLogBridge` truncation extracted to a non-annotated
`EventLogMessage` type (testable cross-OS). `AddMbproxySerilog` now selects
the sink internally; `Program.cs` calls `AddSystemd()` + `AddWindowsService()`.
13 new tests. **411 passed / 0 failed on Windows**; on `10.100.0.35`
**372 passed / 39 skipped / 0 failed** — all 39 skips are simulator-backed
E2E (see finding below), every host/diagnostic/smoke test green on Linux.
- **2026-05-15 — Two cross-platform bugs found and fixed in install tooling.**
(1) `tests/sim/run-dl205-sim.ps1` was Windows-only — hardcoded venv paths
`Scripts\*.exe`; now branches `Scripts`/`.exe` vs `bin`/no extension on `$IsWindows`
and adds `python3` to the interpreter candidates. (2) `install.ps1` /
`uninstall.ps1` used `New-EventLog` / `Remove-EventLog`, which exist only in
Windows PowerShell 5.1 — they fail under PowerShell 7+. Switched to the .NET
API (`[EventLog]::CreateEventSource` / `DeleteEventSource`), symmetric with
the `SourceExists` calls already in those scripts.
- **2026-05-15 — Windows Service E2E green (local, admin).** Republished
`win-x64`; `install.ps1 -Start` installs + starts the service; verified
Running/Automatic, `status.json` served, listeners bound,
`mbproxy.startup.ready` logged, Event Log source registered,
`WindowsServiceLifetime` wrote "Service started successfully" (proves the
process runs under the SCM). `uninstall.ps1` stopped/deleted the service,
archived logs, removed the Event Log source. Box left clean. (A forced
`EventLogBridge` Error+ write was not pursued — `Emit` is unchanged code,
covered by `EventLogMessageTests`; sink selection is covered by
`DiagnosticSinkSelectorTests`.)
- **2026-05-15 — Linux systemd E2E done.** The `linux-x64` ELF runs under a
real systemd unit on `10.100.0.35`: starts, binds listeners, serves the
admin endpoint, and `systemctl stop` → graceful SIGTERM drain
(`mbproxy.shutdown.complete` in the journal). `Type=notify` does not work
(see Findings) → Phase 5 will ship `Type=exec`. Box prep this session:
`dotnet-sdk-10.0`, `shellcheck`, `python3-venv`, pwsh 7.6.1 (dotnet global
tool), pymodbus 3.13.0 venv.
- **2026-05-15 — Phases 4–6 done.** Phase 4: new `install/mbproxy.linux.config.template.json`
(Unix log path `/var/log/mbproxy`, systemd-oriented comments); `csproj` links the
platform-correct template into the published `appsettings.json` by RID
(`win-*`/RID-less → Windows, else Unix) — verified by publishing both RIDs;
`MbproxyOptionsBindingTests` extended to load + schema-validate both templates
(now 413 tests on Windows). Phase 5: `install/mbproxy.service` (`Type=exec`,
hardened, `mbproxy` service account), `install/install.sh`, `install/uninstall.sh`
(`shellcheck` clean); install→active→`status.json` served→uninstall→clean E2E
passed on `10.100.0.35`. Phase 6: `README.md`, `mbproxy/CLAUDE.md`,
`../CLAUDE.md`, `docs/Operations/Configuration.md`, `docs/Reference/LogEvents.md`,
`docs/Operations/Troubleshooting.md`, `docs/Architecture/Overview.md`,
`docs/Features/HotReload.md` updated for the dual-platform reality.
### Findings
- **Linux full run: 374 passed / 37 failed / 0 skipped.** With the simulator
launcher fixed and pymodbus provisioned, the simulator-backed E2E tests now
*run* on Linux (0 skipped) but **37 fail** with `IOException: Broken pipe`
(`SocketException`) when the NModbus client writes through the proxy. The
failures are broad across all simulator-backed E2E (cache, forwarding,
rewriter, supervision). **Not a Phases 1–3 regression:** the multiplatform
work touches only build config, diagnostic sinks, and host registration —
none of the Modbus proxy data path. The same 37 tests pass on Windows
(411/411), and every non-E2E test — including all 13 new diagnostic tests —
passes on Linux. **Root cause isolated:** the `SimulatorSmokeTests` — which
connect *directly to the pymodbus simulator with no proxy in the path* — also
fail (TCP connect error). So the fault is the pymodbus 3.13.0 simulator
itself on this box, not mbproxy's proxy code. Likely pymodbus 3.13.0 vs
Python 3.13.5 (both very new), or the box's Docker-host networking. Treated
as a **separate investigation** (pymodbus-simulator-on-Linux), entirely
orthogonal to the multiplatform service work — see the session report.
- The `run-dl205-sim.ps1` idempotency check keys on `Test-Path $venvDir` only;
a venv left structurally broken by a killed run (no `bin/`) is not detected
and re-created. Pre-existing latent gap, not platform-specific — noted, not
fixed (out of scope; a clean run is unaffected).
- **`AddSystemd()` does not deliver `sd_notify(READY=1)` here → Phase 5 uses
`Type=exec`.** mbproxy runs correctly under systemd (starts, binds, serves,
and SIGTERM → graceful drain all work — verified in the journal), but a
`Type=notify` unit never receives `READY=1` and times out. Isolated step by
step: `SystemdHelpers.IsSystemdService()` correctly returns `True` under
systemd; a *minimal* `Host.CreateApplicationBuilder()` + `AddSystemd()` host
reproduces the failure; both a `systemd-run` transient unit and a real
`Type=notify` unit file fail identically. So it is **not an mbproxy bug**;
it is a `HostApplicationBuilder` + `Microsoft.Extensions.Hosting.Systemd`
10.0.8 (minimal-hosting) issue. **Resolution:** the Phase 5 unit uses
`Type=exec` — mbproxy is a leaf service that nothing orders against, so the
readiness signal is unnecessary; `Type=exec` + the generic host's built-in
POSIX `SIGTERM` handling (independent of `SystemdLifetime`) gives a fully
working unit with `Restart=on-failure`. `AddSystemd()` stays in `Program.cs`
(correct, documented, forward-compatible, harmless). Root-causing the .NET
notify gap is logged as a separate follow-up.
A plan to make mbproxy run on Linux (and incidentally macOS) as a first-class
target while keeping the Windows Service + Event Log behavior intact and adding
systemd + journald/syslog equivalents.
The hosting model (`Host.CreateApplicationBuilder` + `IHostedService` + Kestrel)
is already portable, so the work is narrow: generalize the build, abstract one
diagnostic sink, add one package + one call, and add Linux tooling/docs.
---
## 0. Test Environments
Both platforms can be exercised fully — no environment is simulated or
deferred.
### 0.1 Windows (the dev box — local)
The dev box runs **with administrator rights**, so every Windows gate runs
locally with no separate test machine:
- `install.ps1` (requires elevation) installs the real Windows Service.
- The Event Log source `mbproxy` can be registered and `EventLogBridge` writes
verified against the Application log.
- Install → start → stop → uninstall is a full local round-trip.
> Windows Service E2E mutates machine state (a registered service + Event Log
> source). It is **integrator-only** and the integrator always runs
> `uninstall.ps1` to leave the box clean after each gate.
### 0.2 Linux
**Host:** `dohertj2@10.100.0.35` — Debian 13 (trixie), amd64, kernel 6.12,
hostname `DOCKER`. systemd 257.
- **Access:** passwordless SSH from the Windows dev box; passwordless `sudo`
(verified 2026-05-15).
- **Reachable** on `10.100.0.35` (also `10.50.0.35`, `10.200.0.35`).
- **One-time prep** (run once before Wave 1 gates):
```
ssh dohertj2@10.100.0.35 'sudo apt-get update && \
sudo apt-get install -y dotnet-sdk-10.0 shellcheck'
```
`dotnet-sdk-10.0` candidate is `10.0.203` — matches the `net10.0` target.
- **Docker is installed** on the box (the user is in the `docker` group). Use
ephemeral Debian containers to isolate per-subagent E2E runs so parallel
Wave-4 agents don't collide on the host's systemd / ports (see section 3,
rule 8).
**How the integrator uses the box per gate:**
- Push the integration branch (or `rsync` the worktree) to the box, then run
`dotnet build` / `dotnet test` / `dotnet publish -r linux-x64` over SSH.
- Run the *actual* `linux-x64` ELF binary, the systemd unit, and `shellcheck`
here — Windows can cross-*publish* a `linux-x64` binary but cannot *run* or
service-host it.
> The box is a **shared mutable resource**. Host-level mutations (apt installs,
> `systemctl` on the real host, privileged-port binds) are integrator-only and
> run serially between waves. Subagents that need Linux E2E use throwaway
> Docker containers, never the host's init system directly.
---
## 1. Scope
**In scope**
- Linux (`linux-x64`) as a supported runtime target alongside `win-x64`.
- systemd integration (`Type=notify`, sd_notify readiness, SIGTERM drain).
- A Linux-appropriate error-event diagnostic sink (syslog, severity-mapped).
- RID-agnostic build + dual-RID publish tooling.
- Linux install tooling (systemd unit + shell scripts).
- Docs/README/CLAUDE.md updates.
**Out of scope (state explicitly in docs)**
- macOS `launchd` integration — mbproxy will *run* on macOS as a console
process but ships no service-manager integration.
- ARM RIDs (`linux-arm64`) — the build will not *forbid* them, but they are
untested.
- Container/Docker packaging — separate future effort.
**Locked design decisions**
- Reference `Microsoft.Extensions.Hosting.WindowsServices` *and*
`Microsoft.Extensions.Hosting.Systemd` unconditionally; both packages are
portable and both helpers self-detect their host. No conditional
`<PackageReference>`.
- All Windows API calls (`System.Diagnostics.EventLog`) stay behind
`OperatingSystem.IsWindows()` + `[SupportedOSPlatform("windows")]`; CA1416
(already enforced via `TreatWarningsAsErrors`) is the safety net.
- Diagnostic sink selection happens **once**, at the composition root
(`AddMbproxySerilog`). No OS branching anywhere else.
- Prefer **new files** over editing shared files, to keep parallel work
conflict-free.
- **Linux error-event sink: `Serilog.Sinks.Syslog`** (decided 2026-05-15).
Error+ events get RFC5424 severity mapping on Linux, mirroring the Windows
Event Log behavior where Error+ is surfaced distinctly.
`DiagnosticSinkSelector` returns `EventLog | Syslog | None`.
---
## 2. Phase Breakdown
Each phase lists its **owned file set** (the parallel-safety contract),
changes, tests, and a **gate** that must be green before the next phase starts.
### Phase 1 — Build & publish generalization (foundation)
**Objective:** Remove the hardcoded RID so the project builds/publishes for any
runtime; keep the Windows output byte-identical.
**Owned files**
- `src/Mbproxy/Mbproxy.csproj`
- `install/publish.ps1`
- `install/publish.sh` *(new)*
**Changes**
- `Mbproxy.csproj`: delete `<RuntimeIdentifier>win-x64</RuntimeIdentifier>`
from the Release `PropertyGroup`; keep `PublishSingleFile` / `SelfContained`
/ `IncludeNativeLibrariesForSelfExtract`. RID becomes a publish-time `-r`
argument.
- `publish.ps1`: add a `-Rid` parameter (default `win-x64`), keep the
two-flavor logic.
- `publish.sh`: Linux counterpart producing `linux-x64` self-contained +
framework-dependent builds.
- (The RID-conditioned `appsettings.json` content item is Phase 4; in Phase 1
just confirm the build works without a baked RID.)
**Tests**
- No xunit tests (build-config change). Gate is publish success on both RIDs.
**Gate 1**
- `dotnet build -c Debug` green; `dotnet test` full suite green (unchanged
count).
- `dotnet publish -c Release -r win-x64` produces a single-file `Mbproxy.exe`
(same size class as before).
- `dotnet publish -c Release -r linux-x64` produces a single-file `Mbproxy`
ELF binary. Cross-published from the Windows dev box; the ELF is then copied
to `10.100.0.35` and confirmed to launch (`./Mbproxy --version`-class smoke).
- Zero new analyzer warnings.
---
### Phase 2 — Diagnostic sink abstraction
**Objective:** Make error-event delivery a platform-selected sink. Windows
keeps `EventLogBridge`; Linux gets a syslog sink.
**Owned files**
- `src/Mbproxy/Diagnostics/DiagnosticSinkSelector.cs` *(new — pure selection
logic)*
- `src/Mbproxy/Diagnostics/SyslogBridge.cs` *(new)*
- `src/Mbproxy/Diagnostics/EventLogBridge.cs` *(minor: extract the 32 KB
truncation helper into a testable static method)*
- `src/Mbproxy/HostingExtensions.cs` *(only `AddMbproxySerilog`)*
- `src/Mbproxy/Mbproxy.csproj` *(add `Serilog.Sinks.Syslog` package)*
- New test files (see below)
> `HostingExtensions.cs` and `Mbproxy.csproj` are also touched by Phase 3.
> **Phases 2 and 3 must not run in parallel** (see section 3). They are
> sequential.
**Changes**
- `DiagnosticSinkSelector` — a pure function taking
`(bool isWindows, bool isWindowsService, bool isSystemd)` and returning an
enum (`EventLog | Syslog | None`). No I/O, fully unit-testable.
- `SyslogBridge`: Serilog `ILogEventSink` wrapping `Serilog.Sinks.Syslog`,
active for Error+ only, mirroring `EventLogBridge`'s contract (silent no-op
if syslog unavailable).
- `AddMbproxySerilog`: replace the `addEventLogBridge` bool parameter with a
`DiagnosticSinkSelector` result; wire the chosen sink. Keep the
`OperatingSystem.IsWindows()` guard around `EventLogBridge`.
- Extract `EventLogBridge`'s message-truncation into
`internal static string TruncateToEventLogLimit(string)` so it can be tested
OS-independently.
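
The selection rule described above can be sketched as a small shell function. This is only an illustration of the decision table, not the C# `DiagnosticSinkSelector` itself; the function name `select_sink` and the `1`/`0` flag convention are assumptions made for this example.

```shell
# Hypothetical sketch of the sink-selection table; not the project's C# code.
# $1 = is_windows, $2 = is_windows_service, $3 = is_systemd (each 1 or 0).
select_sink() {
    if [ "$1" = 1 ] && [ "$2" = 1 ]; then
        echo "EventLog"   # Windows, launched by the Service Control Manager
    elif [ "$1" = 0 ] && [ "$3" = 1 ]; then
        echo "Syslog"     # non-Windows, launched by systemd
    else
        echo "None"       # interactive/console run on any OS
    fi
}
```

Because the function takes plain flags and performs no I/O, every row of the table can be checked without being hosted by a service manager, which is the same property the plan relies on for the C# version.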
**Tests** (`tests/Mbproxy.Tests/Diagnostics/`)
- `DiagnosticSinkSelectorTests` — table-driven: Windows+service→`EventLog`;
Windows console→`None`; Linux+systemd→`Syslog`; Linux console→`None`;
macOS→`None`.
- `EventLogBridgeTests`: `[Trait("Category","Unit")]`, Windows-guarded facts:
source-missing → silent no-op; truncation helper caps at 32 KB and appends
`...` (this fact runs on all OSes since the helper is pure).
- `SyslogBridgeTests` — Error+ filter; no-throw when transport unavailable.
**Gate 2**
- Full test suite green on Windows (local); full suite green on Linux —
integrator runs `dotnet test` over SSH on `10.100.0.35`.
- `EventLogBridge` emits to the Application log — verified locally via a real
Windows Service install (`install.ps1`, admin rights available), then
`uninstall.ps1` to clean up.
- CA1416: zero warnings.
---
### Phase 3 — Service host integration (systemd)
**Objective:** Register both init-system integrations; the host correctly
reports readiness to whichever launched it.
**Owned files**
- `src/Mbproxy/Program.cs`
- `src/Mbproxy/HostingExtensions.cs` *(call-site update only)*
- `src/Mbproxy/Mbproxy.csproj` *(add `Microsoft.Extensions.Hosting.Systemd`)*
**Changes**
- `csproj`: add
`<PackageReference Include="Microsoft.Extensions.Hosting.Systemd" />` (pin to
the 10.0.x line matching the existing Windows-services package).
- `Program.cs`: call `builder.Services.AddSystemd();` alongside
`AddWindowsService();`. Compute `isSystemd` via
`SystemdHelpers.IsSystemdService()` and feed `DiagnosticSinkSelector`
together with `isWindowsService`.
- Confirm SIGTERM → host shutdown → existing
`Connection.GracefulShutdownTimeoutMs` drain path works (it does — POSIX
signal handling is built into the generic host; just verify).
**Tests** (`tests/Mbproxy.Tests/HostSmokeTests.cs` — extend existing file)
- `HostSmoke_RegistersBothServiceIntegrations_StartsAndStops` — builds the host
with both `AddWindowsService` + `AddSystemd`, asserts no throw, asserts
`mbproxy.startup.ready` still logged.
- Existing two smoke tests must remain green.
**Gate 3**
- Full suite green on Windows (local) and Linux (`10.100.0.35` via SSH).
- Windows Service E2E, run locally with admin rights: `install.ps1` → service
starts, logs `mbproxy.startup.ready` + writes to Event Log, `Stop-Service`
drains cleanly, `uninstall.ps1` removes it. **No regression** in Windows
behavior is the hard requirement of this gate.
- Linux systemd E2E on `10.100.0.35`: **done.** The `linux-x64` binary runs
under a real systemd unit: it starts, binds listeners, serves the admin
endpoint, and `systemctl stop` (SIGTERM) drains gracefully
(`mbproxy.shutdown.complete` in the journal). `Type=notify` was found not to
deliver `READY=1` (Findings) → the Phase 5 unit uses `Type=exec`, under which
the service is fully functional.
---
### Phase 4 — Config & filesystem portability
**Objective:** No Windows-only paths in the shipped/installed config.
**Owned files**
- `install/mbproxy.config.template.json` *(Windows — keep `C:\ProgramData\...`
path)*
- `install/mbproxy.linux.config.template.json` *(new — `/var/log/mbproxy/...`,
Linux syslog `Using` entry)*
- `src/Mbproxy/Mbproxy.csproj` *(condition the linked `appsettings.json`
content item by `$(RuntimeIdentifier)`)*
> Touches `csproj`. Must run after Phase 3's csproj edit is merged (sequential
> w.r.t. csproj), but is otherwise independent of Phase 5/6.
**Changes**
- New Linux template: log path `/var/log/mbproxy/mbproxy-.log`; Serilog
`Using` array includes the syslog sink; comment header points at
`/etc/mbproxy/appsettings.json`.
- `csproj`: link the win template for `win-*` RIDs and the linux template for
`linux-*` RIDs into the published `appsettings.json` (RID-conditioned
`<Content>` items).
**Tests** (`tests/Mbproxy.Tests/Options/`)
- Extend `MbproxyOptionsBindingTests`: load **each** shipped template through
the config binder + `MbproxyOptionsValidator`; assert both bind and validate
cleanly. Catches a malformed Linux template at build time.
**Gate 4**
- Both templates bind + validate (new test green).
- `dotnet publish -r linux-x64` ships the Linux template as `appsettings.json`;
`-r win-x64` ships the Windows one. Verify by inspecting publish output.
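
One way to make the "inspect publish output" step repeatable is a small helper that names the template a given RID should ship. The sketch below follows the rule recorded in the progress log (`win-*` or no RID selects the Windows template, anything else the Unix one); the function name `template_for_rid` is hypothetical and not part of the repository.

```shell
# Hypothetical helper: which template should become appsettings.json for a RID.
# An empty RID or win-* selects the Windows template, anything else the Unix one.
template_for_rid() {
    case "$1" in
        ""|win-*) echo "install/mbproxy.config.template.json" ;;
        *)        echo "install/mbproxy.linux.config.template.json" ;;
    esac
}
```

Assuming the template is copied into the publish directory unmodified, the result could then be compared against the published file with something like `cmp -s "$(template_for_rid linux-x64)" <publish-dir>/appsettings.json`.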
---
### Phase 5 — Linux install tooling
**Objective:** Parity with `install.ps1` for systemd hosts.
**Owned files** (all new, fully disjoint from all other phases)
- `install/mbproxy.service` — systemd unit, **`Type=exec`** (not `Type=notify`;
see Findings: `AddSystemd()` does not deliver `READY=1` for the minimal
hosting model), `Restart=on-failure`, `User=mbproxy`, `ExecStart` pointing at
the installed binary; sets `DOTNET_BUNDLE_EXTRACT_BASE_DIR`.
- `install/install.sh` — creates `mbproxy` service account, lays down binary +
`/etc/mbproxy/appsettings.json` (preserve-if-exists, matching `install.ps1`
semantics), creates `/var/log/mbproxy`, installs + `systemctl enable --now`.
- `install/uninstall.sh`: `systemctl disable --now`, archives logs (mirror the
`.archived-<ts>` convention), removes unit.
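
As a rough illustration of the unit described above, the sketch below prints a unit file with the listed properties. It is not the shipped `install/mbproxy.service`: the function name `render_unit` is hypothetical, and the paths `/opt/mbproxy` and `/var/cache/mbproxy` are assumptions borrowed from the directory names used in `uninstall.sh`.

```shell
# Hypothetical sketch: print a unit file along the lines described in Phase 5.
# Paths are assumptions taken from uninstall.sh; the shipped unit may differ.
render_unit() {
    cat <<'EOF'
[Unit]
Description=mbproxy
After=network-online.target
Wants=network-online.target

[Service]
Type=exec
User=mbproxy
ExecStart=/opt/mbproxy/Mbproxy
Restart=on-failure
Environment=DOTNET_BUNDLE_EXTRACT_BASE_DIR=/var/cache/mbproxy

[Install]
WantedBy=multi-user.target
EOF
}
```

With `Type=exec`, systemd considers the service started once the binary has been executed, so no readiness notification from the process is needed.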
**Tests**
- Not xunit. Gate = `shellcheck` clean + a dry-run inside a throwaway Debian
container on `10.100.0.35`.
**Gate 5**
- `shellcheck install/*.sh` clean — run on `10.100.0.35` (shellcheck installed
in the one-time prep).
- End-to-end on `10.100.0.35`, inside a throwaway Debian container:
`install.sh` → service active → proxy answers Modbus on a configured port →
`uninstall.sh` → service gone, logs archived. Container isolation keeps the
`mbproxy` service account / unit off the real host.
---
### Phase 6 — Documentation
**Objective:** Docs reflect dual-platform reality; doctrine in `DOCS-GUIDE.md`
respected.
**Owned files**
- `README.md` — rewrite "Hard constraints / prerequisites" (drop "No Linux or
Docker support"); add Linux install path; document both publish flavors ×
both RIDs.
- `docs/Operations/Configuration.md` — both config templates, log-path
differences, syslog vs Event Log.
- `docs/Operations/Troubleshooting.md`: `journalctl` guidance alongside Event
Viewer.
- `docs/Architecture/Overview.md` — note dual init-system hosting (only if it
shifts a headline bullet).
- `docs/Reference/LogEvents.md` — note Error+ events route to Event Log
(Windows) / syslog (Linux).
- `mbproxy/CLAUDE.md` — correct the implied Windows-only framing.
- `wwtools/CLAUDE.md` — broaden the mbproxy index row if the task→tool mapping
changed.
**Tests**
- Markdown link-check across touched files.
**Gate 6**
- All internal doc links resolve.
- README "Hard constraints" no longer contradicts the shipped tooling.
---
## 3. Parallel Subagent Execution Plan
### Dependency graph
```
Phase 1 (build) ──> Phase 2 (diagnostics) ──> Phase 3 (host) ──┬─> Phase 4 (config)
                                                               ├─> Phase 5 (install)
                                                               └─> Phase 6 (docs)
```
Phases 2 and 3 are **strictly sequential**: Phase 3 calls the new
`AddMbproxySerilog` signature Phase 2 defines, and both edit
`HostingExtensions.cs` + `csproj`. Phases 4, 5, 6 are **mutually independent**
and parallelizable once Phase 3 is merged.
### Wave plan
| Wave | Phases | Agents | Mode |
| ---- | --------- | ------------------- | ----------------------------------------------- |
| W1 | Phase 1 | 1 agent | Single — touches `csproj` |
| W2 | Phase 2 | 1 agent | Single — touches `csproj` + `HostingExtensions` |
| W3 | Phase 3 | 1 agent | Single — touches `csproj` + `HostingExtensions` + `Program.cs` |
| W4 | 4, 5, 6 | 3 agents (parallel) | Parallel — disjoint file sets |
> Phase 4 touches `csproj` but no other W4 phase does, so within W4 the file
> sets are still disjoint. Safe.
### File-ownership matrix (the parallel-safety contract)
| File | P1 | P2 | P3 | P4 | P5 | P6 |
| --------------------------------------------- | -- | -- | -- | -- | -- | -- |
| `Mbproxy.csproj` | x | x | x | x | | |
| `HostingExtensions.cs` | | x | x | | | |
| `Program.cs` | | | x | | | |
| `Diagnostics/*` (new + EventLogBridge) | | x | | | | |
| `install/publish.*` | x | | | | | |
| `install/*.config.template.json` | | | | x | | |
| `install/install.sh`, `uninstall.sh`, `.service` | | | | | x | |
| `tests/**` | | x | x | x | | |
| docs / READMEs / CLAUDE.md | | | | | | x |
No column in W4 (P4/P5/P6) shares a row. Confirmed conflict-free.
### Subagent rules (enforce in every dispatch prompt)
1. **One git worktree per subagent** — dispatch each `Agent` call with
`isolation: "worktree"`. Physical isolation means even a stray edit can't
corrupt a sibling's tree.
2. **Owned-file contract** — each subagent is told its exact owned file set
from the matrix and instructed to edit nothing outside it. A subagent that
discovers it needs an out-of-set file must stop and report, not edit.
3. **No intra-wave API coupling** — subagents in the same wave may only depend
on public APIs from *already-merged* prior waves, never on a sibling's
in-progress work. (This is why P2→P3 are separate waves, not parallel.)
4. **Tests ship with code** — the subagent that writes a phase's code also
writes that phase's tests and runs `dotnet test` green *in its own
worktree* before reporting done. No separate "test agent."
5. **Integrator merges in declared order** — the main agent merges each
worktree, runs the full build + test suite, and only then declares the
phase gate met. A failed gate blocks the next wave.
6. **High-contention files are single-agent-only**: `csproj`,
`HostingExtensions.cs`, `Program.cs`, `CLAUDE.md` are never edited by two
agents in the same wave (the matrix guarantees this).
7. **Prefer new files**: `DiagnosticSinkSelector.cs`, `SyslogBridge.cs`,
`mbproxy.linux.config.template.json`, the shell scripts, the unit file are
all new — new files can't merge-conflict, maximizing safe parallelism.
8. **Shared test hosts are integrator-only for mutations** — subagents may run
`dotnet build` / `dotnet test` (read-mostly) but must **not** install a
Windows Service, register an Event Log source, or `systemctl` against the
real `10.100.0.35` host. Service-level E2E is the integrator's job at gate
time; if a subagent needs Linux E2E it spins an ephemeral Docker container
on the box (named per-agent, `--rm`) so parallel agents never collide on
ports, the init system, or service accounts.
### Merge protocol per wave
```
for each wave:
    dispatch agent(s) with isolation: worktree + owned-file list
    on completion:
        integrator: merge worktree(s) in matrix order
        integrator: dotnet build -c Debug (must be green)
        integrator: dotnet test (green, count >= prior)
        integrator: dotnet publish -r win-x64 AND -r linux-x64 (must succeed)
        integrator: verify phase-specific gate checklist
    gate green? -> next wave. gate red? -> fix in a single-agent pass, re-gate.
```
---
## 4. Cross-Cutting Test Strategy
- **Existing baseline (325 = 282 unit + 43 E2E) must never regress.** Every
gate re-runs the full suite.
- **New tests target pure logic**: `DiagnosticSinkSelector` is a pure function
precisely so platform-selection is testable without being a service. Highest-
value new test.
- **OS-conditional tests** use `[Trait]` + a runtime `OperatingSystem.IsWindows()`
skip so the suite is green on both Windows and Linux.
- **Both platforms are exercised every gate, no simulation.** Windows runs
locally (admin rights → real Windows Service install). Linux runs on
`dohertj2@10.100.0.35` (Debian 13, systemd 257) — the integrator drives
`dotnet build` / `dotnet test` / publish / systemd E2E over SSH.
- **CI** (if/when a pipeline exists): add a `linux-x64` build+test leg, ideally
pointed at the same box or an equivalent image. Until then the integrator's
per-gate SSH run on `10.100.0.35` is the Linux leg.
- **CA1416 platform analyzer** is treated as a test — `TreatWarningsAsErrors`
already fails the build if a Windows API escapes its guard.
---
## 5. Risk Register
| Risk | Phase | Mitigation |
| --------------------------------------------- | ----- | -------------------------------------------------------------------------- |
| Windows Service behavior regresses unnoticed | P3 | Gate 3 mandates a real Windows Service install/start/stop smoke check |
| `Serilog.Sinks.Syslog` version drift | P2 | Pin the version; `SyslogBridge` is isolated behind `DiagnosticSinkSelector` |
| Linux publish ships Windows config path | P4 | RID-conditioned `<Content>` item + `MbproxyOptionsBindingTests` on both templates |
| Self-extracting single-file temp-dir perms | P1/P5 | Document + set `DOTNET_BUNDLE_EXTRACT_BASE_DIR` in the systemd unit |
| Two agents racing `csproj` | all | Matrix forbids it — `csproj` edited only in single-agent waves W1–W3 + lone P4 |
| Hidden Windows path elsewhere in code | all | `Grep` sweep for `C:\\`, `ProgramData`, `\\\\` before Gate 6 |
| Parallel Wave-4 agents collide on the shared `10.100.0.35` host | W4 | Rule 8 — service-level E2E is integrator-only and serial; subagent E2E uses per-agent `--rm` Docker containers |
| Windows Service E2E leaves stale service/Event Log source | P2/P3 | Integrator always runs `uninstall.ps1` after each Windows gate |
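
The "hidden Windows path" sweep in the register could be run as a small shell function. This is a sketch only: the function name `sweep_windows_paths` and the default `src` directory are assumptions, and the pattern simply encodes the three markers listed in the table (`C:\`, `ProgramData`, and a doubled backslash).

```shell
# Hypothetical sweep: print lines under a directory that look like hardcoded
# Windows paths. Exit status is 0 when something is found, 1 when the tree is clean.
sweep_windows_paths() {
    grep -rnE 'C:\\|ProgramData|\\\\' "${1:-src}"
}
```

A hit is not automatically a defect (the Windows config template legitimately contains such paths), so the output is meant for review rather than as a hard gate.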
---
## 6. Deliverable Summary
- **3 modified source files** (`csproj`, `HostingExtensions.cs`, `Program.cs`)
+ **3 new** (`DiagnosticSinkSelector.cs`, `SyslogBridge.cs`, and the
truncation-helper extraction in `EventLogBridge.cs`).
- **2 new packages** (`Microsoft.Extensions.Hosting.Systemd`,
`Serilog.Sinks.Syslog`).
- **5 new install/tooling files** (`publish.sh`, Linux config template,
`mbproxy.service`, `install.sh`, `uninstall.sh`).
- **~6–8 new tests** across 3 new/extended test files; baseline 325 preserved.
- **7 doc files** updated.
- **4 waves**, max 3 concurrent subagents, conflict-free by construction.
+41 -8
@@ -25,11 +25,11 @@ namespace Mbproxy.Admin;
 ///
 /// <para>Routes: exactly two — <c>GET /</c> (HTML) and <c>GET /status.json</c> (JSON).</para>
 ///
-/// <para><b>Phase 12 (W1.5)</b> — was previously also registered as <see cref="IHostedService"/>,
-/// but the host's automatic stop ordering (reverse of registration) ran admin.StopAsync
-/// BEFORE ProxyWorker.StopAsync, which broke the design's "drain THEN stop admin" guarantee
-/// and caused a double-stop with the now-deleted <c>ShutdownCoordinator</c>. Now a plain
-/// singleton with explicit lifecycle calls from ProxyWorker.</para>
+/// <para>Registered as a plain singleton (not <see cref="IHostedService"/>) so
+/// <see cref="Proxy.ProxyWorker"/> can drive its lifecycle explicitly. This is required to
+/// honour the design contract that the in-flight drain finishes BEFORE admin stops; an
+/// IHostedService registration would let the host stop admin in reverse-registration order
+/// and break that ordering.</para>
 /// </summary>
 internal sealed partial class AdminEndpointHost : IAsyncDisposable
 {
@@ -44,6 +44,13 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
// Protects concurrent Start/Stop calls (hot-reload + StopAsync racing). // Protects concurrent Start/Stop calls (hot-reload + StopAsync racing).
private readonly SemaphoreSlim _lock = new(1, 1); private readonly SemaphoreSlim _lock = new(1, 1);
// Idempotency flag for DisposeAsync. ProxyWorker.StopAsync calls our StopAsync
// explicitly; the DI container then disposes the singleton on host shutdown. Without
// this flag the second pass would Dispose `_lock` twice and re-dispose the change
// registration (both currently safe but symmetry with PlcMultiplexer prevents future
// regression).
private volatile bool _disposed;
// Current configured port — used to detect changes on hot-reload. // Current configured port — used to detect changes on hot-reload.
private int _currentPort; private int _currentPort;
@@ -70,15 +77,35 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
// Subscribe to config changes: if AdminPort changes, re-bind. // Subscribe to config changes: if AdminPort changes, re-bind.
_optionsChangeRegistration = _optionsMonitor.OnChange(opts => _optionsChangeRegistration = _optionsMonitor.OnChange(opts =>
{ {
// Short-circuit if disposal has already started. The OnChange callback can
// fire (and the Task.Run can be queued) AFTER StopAsync disposed the change
// registration but BEFORE DI ran DisposeAsync; without this guard the lambda
// would resurrect a fresh Kestrel app on the new port after the host already
// considered admin shut down.
if (_disposed) return;
int newPort = opts.AdminPort; int newPort = opts.AdminPort;
if (newPort == _currentPort) return; // Only care about AdminPort changes. if (newPort == _currentPort) return; // Only care about AdminPort changes.
// Fire-and-forget: re-bind is async; we can't await in OnChange. // Fire-and-forget: re-bind is async; we can't await in OnChange.
_ = Task.Run(async () => _ = Task.Run(async () =>
{ {
await _lock.WaitAsync().ConfigureAwait(false); // Re-check after the queue: a concurrent StopAsync/DisposeAsync may have
// landed between OnChange firing and the threadpool picking us up.
if (_disposed) return;
try try
{ {
await _lock.WaitAsync().ConfigureAwait(false);
}
catch (ObjectDisposedException)
{
// _lock disposed mid-queue — host is shutting down. Drop silently.
return;
}
try
{
if (_disposed) return;
if (newPort == _currentPort) return; // double-check under lock if (newPort == _currentPort) return; // double-check under lock
// Stop the old app. // Stop the old app.
@@ -91,7 +118,8 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
} }
finally finally
{ {
_lock.Release(); try { _lock.Release(); }
catch (ObjectDisposedException) { /* dispose race */ }
} }
}); });
}); });
@@ -199,14 +227,19 @@ internal sealed partial class AdminEndpointHost : IAsyncDisposable
public async ValueTask DisposeAsync() public async ValueTask DisposeAsync()
{ {
if (_disposed) return;
_disposed = true;
_optionsChangeRegistration?.Dispose(); _optionsChangeRegistration?.Dispose();
_lock.Dispose(); _optionsChangeRegistration = null;
if (_app is { } app) if (_app is { } app)
{ {
_app = null; _app = null;
await app.DisposeAsync().ConfigureAwait(false); await app.DisposeAsync().ConfigureAwait(false);
} }
try { _lock.Dispose(); } catch (ObjectDisposedException) { /* race-safe */ }
} }
// ── Logging ────────────────────────────────────────────────────────────── // ── Logging ──────────────────────────────────────────────────────────────
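The disposal guard in the hunks above can be sketched in isolation. This is a minimal illustration under stated assumptions, not the actual `AdminEndpointHost`: `GuardedResource`, `TryDoWorkAsync`, and `WorkCount` are hypothetical names, and the Kestrel re-bind is replaced by a counter increment. It shows the same three steps: check the flag, take the lock while tolerating `ObjectDisposedException`, then re-check the flag under the lock.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for a host with an explicit Stop plus a DI-driven Dispose.
public sealed class GuardedResource : IAsyncDisposable
{
    private readonly SemaphoreSlim _lock = new SemaphoreSlim(1, 1);
    private volatile bool _disposed;
    private int _workCount;

    public int WorkCount => _workCount;

    // Mirrors the queued re-bind: returns false instead of doing work once disposal has begun.
    public async Task<bool> TryDoWorkAsync()
    {
        if (_disposed) return false;
        try
        {
            await _lock.WaitAsync().ConfigureAwait(false);
        }
        catch (ObjectDisposedException)
        {
            return false; // lock was disposed while this call was queued
        }
        try
        {
            if (_disposed) return false; // re-check under the lock
            _workCount++;
            return true;
        }
        finally
        {
            try { _lock.Release(); }
            catch (ObjectDisposedException) { /* dispose race */ }
        }
    }

    // Idempotent: a second call is a no-op.
    public ValueTask DisposeAsync()
    {
        if (_disposed) return default;
        _disposed = true;
        _lock.Dispose();
        return default;
    }
}
```

Calling `DisposeAsync` twice is harmless here, and any `TryDoWorkAsync` call that starts after disposal reports `false` rather than throwing.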
@@ -6,9 +6,9 @@ namespace Mbproxy.Admin;
/// Reads <see cref="AssemblyInformationalVersionAttribute"/> once at startup and caches the /// Reads <see cref="AssemblyInformationalVersionAttribute"/> once at startup and caches the
/// result as a string. Used for the <c>service.version</c> field on the status page. /// result as a string. Used for the <c>service.version</c> field on the status page.
/// ///
/// <para>Note: <see cref="Assembly.Location"/> is unreliable under single-file publish /// <para>Note: <see cref="Assembly.Location"/> is unreliable under single-file publish.
/// (Phase 08). We use <c>Assembly.GetExecutingAssembly().GetCustomAttribute&lt;&gt;()</c> /// We use <c>Assembly.GetExecutingAssembly().GetCustomAttribute&lt;&gt;()</c> which works
/// which works correctly regardless of publish mode.</para> /// correctly regardless of publish mode.</para>
/// </summary> /// </summary>
internal sealed class AssemblyVersionAccessor internal sealed class AssemblyVersionAccessor
{ {
+31 -13
@@ -3,7 +3,7 @@ using System.Text.Json.Serialization;
namespace Mbproxy.Admin; namespace Mbproxy.Admin;
// ── Wire DTOs for GET /status.json ─────────────────────────────────────────── // ── Wire DTOs for GET /status.json ───────────────────────────────────────────
// Field names must match design.md "Status page" tables EXACTLY (camelCase via // Field names must match docs/Operations/StatusPage.md tables EXACTLY (camelCase via
// JsonKnownNamingPolicy.CamelCase on the source-gen context). // JsonKnownNamingPolicy.CamelCase on the source-gen context).
/// <summary> /// <summary>
@@ -58,7 +58,13 @@ public sealed record PlcPdusStatus(
long Forwarded, long Forwarded,
FcCounts ByFc, FcCounts ByFc,
long RewrittenSlots, long RewrittenSlots,
long PartialBcdWarnings); long PartialBcdWarnings,
/// <summary>
/// Count of BCD-rewriter slot decisions where the wire value was not a valid BCD
/// nibble pattern (e.g. <c>0xABCD</c> at a tag address). The slot passes through
/// unrewritten and this counter increments.
/// </summary>
long InvalidBcdWarnings);
/// <summary>Per-function-code request counts.</summary> /// <summary>Per-function-code request counts.</summary>
public sealed record FcCounts( public sealed record FcCounts(
@@ -69,15 +75,16 @@ public sealed record FcCounts(
long Other); long Other);
/// <summary> /// <summary>
/// Backend connect, exception, and multiplexer telemetry. Phase 9 added /// Backend connect, exception, and multiplexer telemetry, including the in-flight
/// <c>InFlight</c>, <c>MaxInFlight</c>, <c>TxIdWraps</c>, <c>DisconnectCascades</c>, and /// multiplexer fields (<c>InFlight</c>, <c>MaxInFlight</c>, <c>TxIdWraps</c>,
/// <c>QueueDepth</c>. Phase 10 added the three coalescing counters /// <c>DisconnectCascades</c>, <c>QueueDepth</c>), the read-coalescing counters
/// (<c>CoalescedHitCount</c>, <c>CoalescedMissCount</c>, <c>CoalescedResponseToDeadUpstream</c>); /// (<c>CoalescedHitCount</c>, <c>CoalescedMissCount</c>, <c>CoalescedResponseToDeadUpstream</c>),
/// the dashboard-side derived <c>coalescingRatio</c> is intentionally NOT carried on the wire /// and the response-cache counters (<c>CacheHitCount</c>, <c>CacheMissCount</c>,
/// — consumers compute <c>Hit / (Hit + Miss)</c>. Phase 11 added the five cache counters /// <c>CacheInvalidations</c>, <c>CacheEntryCount</c>, <c>CacheBytes</c>).
/// (<c>CacheHitCount</c>, <c>CacheMissCount</c>, <c>CacheInvalidations</c>, ///
/// <c>CacheEntryCount</c>, <c>CacheBytes</c>); the dashboard-side derived /// <para>The dashboard-side derived ratios <c>coalescingRatio</c> and <c>cacheHitRatio</c>
/// <c>cacheHitRatio</c> is intentionally NOT carried on the wire. /// are intentionally NOT carried on the wire — consumers compute <c>Hit / (Hit + Miss)</c>
/// from the raw counters.</para>
/// </summary> /// </summary>
public sealed record PlcBackendStatus( public sealed record PlcBackendStatus(
long ConnectsSuccess, long ConnectsSuccess,
@@ -96,14 +103,25 @@ public sealed record PlcBackendStatus(
long CacheMissCount, long CacheMissCount,
long CacheInvalidations, long CacheInvalidations,
long CacheEntryCount, long CacheEntryCount,
long CacheBytes); long CacheBytes,
/// <summary>Backend keepalive heartbeat probes issued on idle backend sockets.</summary>
long BackendHeartbeatsSent,
/// <summary>Keepalive heartbeat probes that timed out (backend not answering).</summary>
long BackendHeartbeatsFailed,
/// <summary>Backend teardowns triggered by a failed keepalive heartbeat.</summary>
long BackendIdleDisconnects);
/// <summary>Modbus exception counts by code.</summary> /// <summary>Modbus exception counts by code.</summary>
public sealed record ExceptionCounts( public sealed record ExceptionCounts(
long Code01, long Code01,
long Code02, long Code02,
long Code03, long Code03,
long Code04); long Code04,
/// <summary>
/// Backend exceptions whose response code is not 01-04 (e.g. 0x06 Server Device
/// Busy, 0x0B Gateway Target Device Failed To Respond, vendor-specific codes).
/// </summary>
long CodeOther);
/// <summary>Byte-transfer counters.</summary> /// <summary>Byte-transfer counters.</summary>
public sealed record PlcBytesStatus( public sealed record PlcBytesStatus(
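Because `coalescingRatio` and `cacheHitRatio` are not carried on the wire, a consumer of `/status.json` derives them from the raw counters. A minimal sketch of that computation follows; `RatioHelper` and `HitRatio` are hypothetical names, and returning `null` when no eligible reads have occurred is an assumption chosen to line up with the em-dash the HTML page renders in that case.

```csharp
public static class RatioHelper
{
    // Computes Hit / (Hit + Miss); null means "no eligible reads yet".
    public static double? HitRatio(long hit, long miss)
    {
        long total = hit + miss;
        if (total == 0) return null;
        return (double)hit / total;
    }
}
```

The same helper would serve both pairs: `CoalescedHitCount` with `CoalescedMissCount`, and `CacheHitCount` with `CacheMissCount`.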
+35 -12
@@ -5,7 +5,7 @@ namespace Mbproxy.Admin;
/// <summary> /// <summary>
/// Renders a <see cref="StatusResponse"/> as a self-contained HTML page. /// Renders a <see cref="StatusResponse"/> as a self-contained HTML page.
/// ///
/// <para>Constraints (from design.md Phase 07):</para> /// <para>Constraints (see <c>docs/Operations/StatusPage.md</c>):</para>
/// <list type="bullet"> /// <list type="bullet">
/// <item>No external assets (CSS/JS/fonts/favicons) — firewalled networks only.</item> /// <item>No external assets (CSS/JS/fonts/favicons) — firewalled networks only.</item>
/// <item><c>&lt;meta http-equiv="refresh" content="5"&gt;</c> for auto-refresh.</item> /// <item><c>&lt;meta http-equiv="refresh" content="5"&gt;</c> for auto-refresh.</item>
@@ -75,19 +75,22 @@ internal static class StatusHtmlRenderer
sb.Append("<th>Name</th><th>Host</th><th>Port</th><th>State</th>"); sb.Append("<th>Name</th><th>Host</th><th>Port</th><th>State</th>");
sb.Append("<th>Clients</th><th>PDUs fwd</th><th>FC03</th><th>FC04</th>"); sb.Append("<th>Clients</th><th>PDUs fwd</th><th>FC03</th><th>FC04</th>");
sb.Append("<th>FC06</th><th>FC16</th><th>FC?</th><th>BCD slots</th>"); sb.Append("<th>FC06</th><th>FC16</th><th>FC?</th><th>BCD slots</th>");
sb.Append("<th>Partial BCD</th><th>Ex 01</th><th>Ex 02</th><th>Ex 03</th><th>Ex 04</th>"); sb.Append("<th>Partial BCD</th><th>Invalid BCD</th><th>Ex 01</th><th>Ex 02</th><th>Ex 03</th><th>Ex 04</th><th>Ex ?</th>");
sb.Append("<th>RTT ms</th><th>Bytes in</th><th>Bytes out</th>"); sb.Append("<th>RTT ms</th><th>Bytes in</th><th>Bytes out</th>");
// Phase 9: multiplexer telemetry columns. // Multiplexer telemetry columns.
sb.Append("<th>In-flight</th><th>Max in-flight</th><th>TxId wraps</th>"); sb.Append("<th>In-flight</th><th>Max in-flight</th><th>TxId wraps</th>");
sb.Append("<th>Cascades</th><th>Queue</th>"); sb.Append("<th>Cascades</th><th>Queue</th>");
// Phase 10: coalescing column. Single cell carries hit / (hit + miss) ratio as // Coalescing column. Single cell carries hit / (hit + miss) ratio as a
// a percentage plus the raw hit count for context. Kept compact (one cell) to // percentage plus the raw hit count for context. Kept compact (one cell) to
// stay under the 50 KB page-weight budget. // stay under the 50 KB page-weight budget.
sb.Append("<th>Coal</th>"); sb.Append("<th>Coal</th>");
// Phase 11: cache column. Single cell carries hit-ratio percent plus raw hit // Cache column. Single cell carries hit-ratio percent plus raw hit count;
// count; an em-dash when no cache-eligible reads have occurred. Page-weight // an em-dash when no cache-eligible reads have occurred. Page-weight budget
// budget assertion stays under 50 KB for the 54-PLC fleet. // assertion stays under 50 KB for the 54-PLC fleet.
sb.Append("<th>Cache</th>"); sb.Append("<th>Cache</th>");
// Keepalive column — heartbeats sent, with failure / idle-disconnect counts
// shown only when non-zero.
sb.Append("<th>Keepalive</th>");
sb.Append("</tr></thead><tbody>"); sb.Append("</tr></thead><tbody>");
foreach (var plc in status.Plcs) foreach (var plc in status.Plcs)
@@ -141,21 +144,23 @@ internal static class StatusHtmlRenderer
sb.Append("<td>").Append(plc.Pdus.ByFc.Other).Append("</td>"); sb.Append("<td>").Append(plc.Pdus.ByFc.Other).Append("</td>");
sb.Append("<td>").Append(plc.Pdus.RewrittenSlots).Append("</td>"); sb.Append("<td>").Append(plc.Pdus.RewrittenSlots).Append("</td>");
sb.Append("<td>").Append(plc.Pdus.PartialBcdWarnings).Append("</td>"); sb.Append("<td>").Append(plc.Pdus.PartialBcdWarnings).Append("</td>");
sb.Append("<td>").Append(plc.Pdus.InvalidBcdWarnings).Append("</td>");
sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code01).Append("</td>"); sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code01).Append("</td>");
sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code02).Append("</td>"); sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code02).Append("</td>");
sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code03).Append("</td>"); sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code03).Append("</td>");
sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code04).Append("</td>"); sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.Code04).Append("</td>");
sb.Append("<td>").Append(plc.Backend.ExceptionsByCode.CodeOther).Append("</td>");
sb.Append("<td>").Append(plc.Backend.LastRoundTripMs.ToString("F1")).Append("</td>"); sb.Append("<td>").Append(plc.Backend.LastRoundTripMs.ToString("F1")).Append("</td>");
sb.Append("<td>").Append(plc.Bytes.UpstreamIn).Append("</td>"); sb.Append("<td>").Append(plc.Bytes.UpstreamIn).Append("</td>");
sb.Append("<td>").Append(plc.Bytes.UpstreamOut).Append("</td>"); sb.Append("<td>").Append(plc.Bytes.UpstreamOut).Append("</td>");
// Phase 9: multiplexer telemetry cells. // Multiplexer telemetry cells.
sb.Append("<td>").Append(plc.Backend.InFlight).Append("</td>"); sb.Append("<td>").Append(plc.Backend.InFlight).Append("</td>");
sb.Append("<td>").Append(plc.Backend.MaxInFlight).Append("</td>"); sb.Append("<td>").Append(plc.Backend.MaxInFlight).Append("</td>");
sb.Append("<td>").Append(plc.Backend.TxIdWraps).Append("</td>"); sb.Append("<td>").Append(plc.Backend.TxIdWraps).Append("</td>");
sb.Append("<td>").Append(plc.Backend.DisconnectCascades).Append("</td>"); sb.Append("<td>").Append(plc.Backend.DisconnectCascades).Append("</td>");
sb.Append("<td>").Append(plc.Backend.QueueDepth).Append("</td>"); sb.Append("<td>").Append(plc.Backend.QueueDepth).Append("</td>");
// Phase 10: coalescing ratio cell — "<pct>% (<hit>)". When no coalesced reads // Coalescing ratio cell — "<pct>% (<hit>)". When no coalesced reads have
// have been seen, render an em-dash to keep the cell narrow. // been seen, render an em-dash to keep the cell narrow.
long coalHit = plc.Backend.CoalescedHitCount; long coalHit = plc.Backend.CoalescedHitCount;
long coalMiss = plc.Backend.CoalescedMissCount; long coalMiss = plc.Backend.CoalescedMissCount;
sb.Append("<td>"); sb.Append("<td>");
@@ -169,7 +174,7 @@ internal static class StatusHtmlRenderer
sb.Append(pct).Append("% (").Append(coalHit).Append(')'); sb.Append(pct).Append("% (").Append(coalHit).Append(')');
} }
sb.Append("</td>"); sb.Append("</td>");
// Phase 11: cache ratio cell — same pattern as coalescing. // Cache ratio cell — same pattern as coalescing.
long cacheHit = plc.Backend.CacheHitCount; long cacheHit = plc.Backend.CacheHitCount;
long cacheMiss = plc.Backend.CacheMissCount; long cacheMiss = plc.Backend.CacheMissCount;
sb.Append("<td>"); sb.Append("<td>");
@@ -183,6 +188,24 @@ internal static class StatusHtmlRenderer
sb.Append(pct).Append("% (").Append(cacheHit).Append(')'); sb.Append(pct).Append("% (").Append(cacheHit).Append(')');
} }
sb.Append("</td>"); sb.Append("</td>");
// Keepalive cell — heartbeats sent; failures + idle-disconnects appended
// only when non-zero to keep the cell narrow.
long hbSent = plc.Backend.BackendHeartbeatsSent;
long hbFailed = plc.Backend.BackendHeartbeatsFailed;
long hbIdle = plc.Backend.BackendIdleDisconnects;
sb.Append("<td>");
if (hbSent == 0 && hbFailed == 0 && hbIdle == 0)
{
sb.Append("&mdash;");
}
else
{
sb.Append(hbSent);
if (hbFailed > 0 || hbIdle > 0)
sb.Append(" (fail ").Append(hbFailed)
.Append(", idle-disc ").Append(hbIdle).Append(')');
}
sb.Append("</td>");
sb.Append("</tr>"); sb.Append("</tr>");
} }
@@ -66,7 +66,7 @@ internal sealed class StatusSnapshotBuilder
var activeUpstreams = supervisor?.ActiveUpstreams ?? Array.Empty<UpstreamPipe>(); var activeUpstreams = supervisor?.ActiveUpstreams ?? Array.Empty<UpstreamPipe>();
var clientSnapshots = activeUpstreams var clientSnapshots = activeUpstreams
.Select(p => new ClientSnapshot( .Select(p => new ClientSnapshot(
Remote: p.RemoteEp?.ToString() ?? p.RemoteEp?.Address.ToString() ?? "?", Remote: p.RemoteEp?.ToString() ?? "?",
ConnectedAtUtc: p.ConnectedAtUtc, ConnectedAtUtc: p.ConnectedAtUtc,
PdusForwarded: p.PdusForwardedCount)) PdusForwarded: p.PdusForwardedCount))
.ToList(); .ToList();
@@ -108,9 +108,11 @@ internal sealed class StatusSnapshotBuilder
CacheInvalidations: 0, CacheInvalidations: 0,
CacheEntryCount: 0, CacheEntryCount: 0,
CacheBytes: 0, CacheBytes: 0,
ResponseDropForFullUpstream: 0); ResponseDropForFullUpstream: 0,
BackendHeartbeatsSent: 0,
BackendHeartbeatsFailed: 0,
BackendIdleDisconnects: 0);
// Phase 08: ConnectsSuccess / ConnectsFailed are now tracked in ProxyCounters.
long connectsSuccess = counters.ConnectsSuccess; long connectsSuccess = counters.ConnectsSuccess;
long connectsFailed = counters.ConnectsFailed; long connectsFailed = counters.ConnectsFailed;
@@ -129,7 +131,8 @@ internal sealed class StatusSnapshotBuilder
Forwarded: counters.PdusForwarded, Forwarded: counters.PdusForwarded,
ByFc: new FcCounts(counters.Fc03, counters.Fc04, counters.Fc06, counters.Fc16, counters.FcOther), ByFc: new FcCounts(counters.Fc03, counters.Fc04, counters.Fc06, counters.Fc16, counters.FcOther),
RewrittenSlots: counters.RewrittenSlots, RewrittenSlots: counters.RewrittenSlots,
PartialBcdWarnings: counters.PartialBcdWarnings), PartialBcdWarnings: counters.PartialBcdWarnings,
InvalidBcdWarnings: counters.InvalidBcdWarnings),
Backend: new PlcBackendStatus( Backend: new PlcBackendStatus(
ConnectsSuccess: connectsSuccess, ConnectsSuccess: connectsSuccess,
ConnectsFailed: connectsFailed, ConnectsFailed: connectsFailed,
@@ -137,7 +140,8 @@ internal sealed class StatusSnapshotBuilder
counters.BackendException01, counters.BackendException01,
counters.BackendException02, counters.BackendException02,
counters.BackendException03, counters.BackendException03,
counters.BackendException04), counters.BackendException04,
counters.BackendExceptionOther),
LastRoundTripMs: counters.LastRoundTripMs, LastRoundTripMs: counters.LastRoundTripMs,
InFlight: counters.InFlightCount, InFlight: counters.InFlightCount,
MaxInFlight: counters.MaxInFlight, MaxInFlight: counters.MaxInFlight,
@@ -151,7 +155,10 @@ internal sealed class StatusSnapshotBuilder
CacheMissCount: counters.CacheMissCount, CacheMissCount: counters.CacheMissCount,
CacheInvalidations: counters.CacheInvalidations, CacheInvalidations: counters.CacheInvalidations,
CacheEntryCount: counters.CacheEntryCount, CacheEntryCount: counters.CacheEntryCount,
CacheBytes: counters.CacheBytes), CacheBytes: counters.CacheBytes,
BackendHeartbeatsSent: counters.BackendHeartbeatsSent,
BackendHeartbeatsFailed: counters.BackendHeartbeatsFailed,
BackendIdleDisconnects: counters.BackendIdleDisconnects),
Bytes: new PlcBytesStatus( Bytes: new PlcBytesStatus(
UpstreamIn: counters.BytesUpstreamIn, UpstreamIn: counters.BytesUpstreamIn,
UpstreamOut: counters.BytesUpstreamOut))); UpstreamOut: counters.BytesUpstreamOut)));
+12 -9
@@ -13,8 +13,8 @@ namespace Mbproxy.Bcd;
/// Example: 12_345_678 → low=0x5678, high=0x1234. /// Example: 12_345_678 → low=0x5678, high=0x1234.
/// ///
/// Bad-nibble policy: Decode16/Decode32 throw <see cref="FormatException"/> /// Bad-nibble policy: Decode16/Decode32 throw <see cref="FormatException"/>
/// (not a sentinel). The Phase 04 rewrite pipeline catches and surfaces the /// (not a sentinel). The rewrite pipeline catches and surfaces the exception as an
/// exception as an mbproxy.rewrite.invalid_bcd warning event. /// mbproxy.rewrite.invalid_bcd warning event.
/// </summary> /// </summary>
internal static class BcdCodec internal static class BcdCodec
{ {
@@ -97,15 +97,18 @@ internal static class BcdCodec
return hiVal * 10_000 + loVal; return hiVal * 10_000 + loVal;
} }
// ── Private helpers ───────────────────────────────────────────────────── // ── Helpers ─────────────────────────────────────────────────────────────
/// <summary>Returns true if any nibble in <paramref name="raw"/> is >= 0xA.</summary> /// <summary>
private static bool HasBadNibble(ushort raw) /// Returns true if any nibble in <paramref name="raw"/> is &gt;= 0xA (i.e. a non-BCD
/// digit). Internal so <see cref="Mbproxy.Proxy.BcdPduPipeline"/> can call it from
/// the response-rewrite path's per-word check without re-implementing the same logic.
/// </summary>
internal static bool HasBadNibble(ushort raw)
{ {
// Check each nibble independently.
return ((raw >> 12) & 0xF) >= 0xA return ((raw >> 12) & 0xF) >= 0xA
|| ((raw >> 8) & 0xF) >= 0xA || ((raw >> 8) & 0xF) >= 0xA
|| ((raw >> 4) & 0xF) >= 0xA || ((raw >> 4) & 0xF) >= 0xA
|| (raw & 0xF) >= 0xA; || (raw & 0xF) >= 0xA;
} }
} }
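To make the bad-nibble policy concrete, here is a small self-contained sketch of the nibble check and the decode path. `BcdSketch` is a hypothetical class, not the real `BcdCodec`; the `HasBadNibble` body, the `FormatException` policy, and the `hi * 10_000 + lo` combination follow the diff, while the `Decode16` digit weighting is the standard packed-BCD interpretation and is assumed to match.

```csharp
using System;

public static class BcdSketch
{
    // True if any of the four nibbles is outside the decimal range 0-9.
    public static bool HasBadNibble(ushort raw)
    {
        return ((raw >> 12) & 0xF) >= 0xA
            || ((raw >> 8) & 0xF) >= 0xA
            || ((raw >> 4) & 0xF) >= 0xA
            || (raw & 0xF) >= 0xA;
    }

    // Each nibble is one decimal digit, most significant first.
    public static int Decode16(ushort raw)
    {
        if (HasBadNibble(raw))
            throw new FormatException("Value 0x" + raw.ToString("X4") + " is not valid BCD.");
        return ((raw >> 12) & 0xF) * 1000
             + ((raw >> 8) & 0xF) * 100
             + ((raw >> 4) & 0xF) * 10
             + (raw & 0xF);
    }

    // CDAB order: the low word arrives first on the wire, the high word second.
    public static int Decode32(ushort low, ushort high)
    {
        return Decode16(high) * 10_000 + Decode16(low);
    }
}
```

With the example from the class comment, `Decode32(0x5678, 0x1234)` reassembles 12_345_678, and `HasBadNibble(0xABCD)` flags the kind of word that increments `InvalidBcdWarnings` instead of being rewritten.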
+4 -4
@@ -4,9 +4,9 @@ namespace Mbproxy.Bcd;
/// Immutable description of a single BCD-encoded V-memory tag as seen on the Modbus wire. /// Immutable description of a single BCD-encoded V-memory tag as seen on the Modbus wire.
/// Width is 16 (one register) or 32 (two registers, CDAB low-word-first). /// Width is 16 (one register) or 32 (two registers, CDAB low-word-first).
/// ///
/// <para><b>Phase 11 — <see cref="CacheTtlMs"/></b> is the resolved per-tag response-cache /// <para><b><see cref="CacheTtlMs"/></b> is the resolved per-tag response-cache TTL in
/// TTL in milliseconds. 0 (the default) means caching is disabled for this tag. Positive /// milliseconds. 0 (the default) means caching is disabled for this tag. Positive values
/// values cap upstream staleness; the multi-tag-range read uses <c>min(TTLs)</c> across all /// cap upstream staleness; the multi-tag-range read uses <c>min(TTLs)</c> across all
/// matched tags and treats any 0 in the range as "uncached for the whole read."</para> /// matched tags and treats any 0 in the range as "uncached for the whole read."</para>
/// </summary> /// </summary>
public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0) public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0)
@@ -33,7 +33,7 @@ public sealed record BcdTag(ushort Address, byte Width, int CacheTtlMs = 0)
/// <summary>True when this tag occupies two registers (32-bit BCD).</summary> /// <summary>True when this tag occupies two registers (32-bit BCD).</summary>
public bool IsThirtyTwoBit => Width == 32; public bool IsThirtyTwoBit => Width == 32;
/// <summary>True when this tag opts into the Phase-11 response cache.</summary> /// <summary>True when this tag opts into the response cache.</summary>
public bool IsCacheable => CacheTtlMs > 0; public bool IsCacheable => CacheTtlMs > 0;
/// <summary> /// <summary>
+2 -2
@@ -47,7 +47,7 @@ public sealed class BcdTagMap
=> _map.TryGetValue(address, out tag!); => _map.TryGetValue(address, out tag!);
/// <summary> /// <summary>
/// Phase 11 — resolves the effective cache TTL for an FC03/FC04 read over the range /// Resolves the effective cache TTL for an FC03/FC04 read over the range
/// [<paramref name="startAddress"/>, <paramref name="startAddress"/> + <paramref name="qty"/>). /// [<paramref name="startAddress"/>, <paramref name="startAddress"/> + <paramref name="qty"/>).
/// ///
/// <para>Returns 0 (uncached) when:</para> /// <para>Returns 0 (uncached) when:</para>
@@ -135,7 +135,7 @@ public sealed class BcdTagMap
return false; return false;
} }
// Sort ascending by offset so Phase 04 can iterate in wire order. // Sort ascending by offset so the rewrite pipeline can iterate in wire order.
result.Sort(static (a, b) => a.OffsetWords.CompareTo(b.OffsetWords)); result.Sort(static (a, b) => a.OffsetWords.CompareTo(b.OffsetWords));
hits = result; hits = result;
return true; return true;
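The range-level TTL rule described on `BcdTag` (use `min(TTLs)` across matched tags, and treat any 0 as "uncached for the whole read") can be sketched as below. `CacheTtlSketch` and `ResolveRangeTtl` are hypothetical names, and returning 0 when no tags match is an assumption, since the full list of uncached conditions is not visible in this diff.

```csharp
using System.Collections.Generic;

public static class CacheTtlSketch
{
    // Input: the resolved TTLs (ms) of every tag matched by one FC03/FC04 read.
    public static int ResolveRangeTtl(IReadOnlyList<int> tagTtlsMs)
    {
        if (tagTtlsMs.Count == 0) return 0; // assumption: nothing matched means uncached

        int min = int.MaxValue;
        foreach (int ttl in tagTtlsMs)
        {
            if (ttl <= 0) return 0;      // one uncached tag disables caching for the read
            if (ttl < min) min = ttl;    // otherwise the shortest TTL bounds staleness
        }
        return min;
    }
}
```

Taking the minimum keeps the most staleness-sensitive tag in the range within its own limit.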
+39 -13
@@ -6,7 +6,7 @@ namespace Mbproxy.Bcd;
/// <summary> /// <summary>
/// Builds an immutable <see cref="BcdTagMap"/> from global options and optional per-PLC overrides. /// Builds an immutable <see cref="BcdTagMap"/> from global options and optional per-PLC overrides.
/// ///
/// Resolution algorithm (per design.md): /// Resolution algorithm (per docs/Features/BcdRewriting.md):
/// 1. Start with the global tag list. /// 1. Start with the global tag list.
/// 2. Remove any address present in perPlc.Remove. /// 2. Remove any address present in perPlc.Remove.
/// 3. Merge in perPlc.Add entries — if an address exists in the working set the Add entry wins /// 3. Merge in perPlc.Add entries — if an address exists in the working set the Add entry wins
@@ -34,8 +34,8 @@ public static class BcdTagMapBuilder
=> Build(global, perPlc, perPlcDefaultCacheTtlMs: 0); => Build(global, perPlc, perPlcDefaultCacheTtlMs: 0);
/// <summary> /// <summary>
/// Phase 11 overload resolves the effective BCD tag list for one PLC and validates /// Overload that resolves the effective BCD tag list for one PLC and validates it,
/// it, additionally folding the per-PLC <paramref name="perPlcDefaultCacheTtlMs"/> into /// additionally folding the per-PLC <paramref name="perPlcDefaultCacheTtlMs"/> into
/// any tag whose explicit <see cref="BcdTagOptions.CacheTtlMs"/> is null. /// any tag whose explicit <see cref="BcdTagOptions.CacheTtlMs"/> is null.
/// ///
/// <para>Resolution order per tag:</para> /// <para>Resolution order per tag:</para>
@@ -53,6 +53,30 @@ public static class BcdTagMapBuilder
var errors = new List<BcdError>(); var errors = new List<BcdError>();
var warnings = new List<BcdWarning>(); var warnings = new List<BcdWarning>();
// Duplicate-address detection happens BEFORE the working dictionary collapses
// keys. Iterating each list independently catches duplicates that would otherwise
// be silently last-write-wins'd by the dictionary indexer. Cross-list collisions
// (same address in BOTH Global and Add) are the documented width-override pattern
// and must NOT be flagged — only intra-list duplicates fail.
static void DetectIntraListDuplicates(
IEnumerable<BcdTagOptions> source, string sourceName, List<BcdError> errors)
{
var seen = new HashSet<ushort>();
foreach (var tag in source)
{
if (!seen.Add(tag.Address))
{
errors.Add(new BcdError(BcdValidationError.DuplicateAddress,
$"Address {tag.Address} appears more than once in {sourceName}.",
tag.Address));
}
}
}
DetectIntraListDuplicates(global.Global, "Global", errors);
if (perPlc?.Add is { } addListForDup)
DetectIntraListDuplicates(addListForDup, "PerPlc.Add", errors);
// ── Step 1: collect the working set keyed by address ───────────────── // ── Step 1: collect the working set keyed by address ─────────────────
// Dictionary preserves last-write-wins semantics for the Add override. // Dictionary preserves last-write-wins semantics for the Add override.
var working = new Dictionary<ushort, BcdTagOptions>(global.Global.Count); var working = new Dictionary<ushort, BcdTagOptions>(global.Global.Count);
@@ -82,7 +106,6 @@ public static class BcdTagMapBuilder
// ── Step 4: validate the resolved list ─────────────────────────────── // ── Step 4: validate the resolved list ───────────────────────────────
// We build a validated-entries list; only clean entries go into the map. // We build a validated-entries list; only clean entries go into the map.
var validated = new Dictionary<ushort, BcdTag>(working.Count); var validated = new Dictionary<ushort, BcdTag>(working.Count);
var seenAddresses = new HashSet<ushort>(working.Count);
foreach (var (addr, opt) in working) foreach (var (addr, opt) in working)
{ {
@@ -94,15 +117,7 @@ public static class BcdTagMapBuilder
continue; continue;
} }
// Duplicate address check. // Resolve the effective per-tag cache TTL:
if (!seenAddresses.Add(addr))
{
errors.Add(new BcdError(BcdValidationError.DuplicateAddress,
$"Address {addr} appears more than once in the resolved tag list.", addr));
continue;
}
// Phase 11 — resolve the effective per-tag cache TTL:
// explicit per-tag (including 0) wins; otherwise fall back to per-PLC default. // explicit per-tag (including 0) wins; otherwise fall back to per-PLC default.
int resolvedTtl = opt.CacheTtlMs ?? perPlcDefaultCacheTtlMs; int resolvedTtl = opt.CacheTtlMs ?? perPlcDefaultCacheTtlMs;
if (resolvedTtl < 0) resolvedTtl = 0; if (resolvedTtl < 0) resolvedTtl = 0;
@@ -111,6 +126,10 @@ public static class BcdTagMapBuilder
} }
// High-register collision check (only meaningful for 32-bit entries). // High-register collision check (only meaningful for 32-bit entries).
// Dedupe symmetric reports. Two 32-bit tags whose pairs collide (e.g. (100,W=32)
// and (101,W=32)) would otherwise produce two BcdErrors — one from each
// direction. Track reported (low,high) pairs so each collision logs once.
var reportedCollisions = new HashSet<(ushort, ushort)>();
foreach (var tag in validated.Values) foreach (var tag in validated.Values)
{ {
if (!tag.IsThirtyTwoBit) if (!tag.IsThirtyTwoBit)
@@ -119,6 +138,13 @@ public static class BcdTagMapBuilder
ushort highReg = tag.HighRegister; ushort highReg = tag.HighRegister;
if (validated.TryGetValue(highReg, out var collision)) if (validated.TryGetValue(highReg, out var collision))
{ {
// Canonicalise the pair so (a,b) and (b,a) collapse.
var pair = tag.Address < collision.Address
? (tag.Address, collision.Address)
: (collision.Address, tag.Address);
if (!reportedCollisions.Add(pair))
continue;
errors.Add(new BcdError(BcdValidationError.OverlappingHighRegister, errors.Add(new BcdError(BcdValidationError.OverlappingHighRegister,
$"32-bit BCD tag at address {tag.Address} has its high register " + $"32-bit BCD tag at address {tag.Address} has its high register " +
$"({highReg}) colliding with the entry at address {collision.Address}.", $"({highReg}) colliding with the entry at address {collision.Address}.",
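The two validation changes in this file, per-list duplicate detection and symmetric-collision dedupe, both rest on `HashSet.Add` returning `false` for an item already present. A minimal sketch, using bare addresses instead of `BcdTagOptions` and the hypothetical names `TagValidationSketch`, `FindIntraListDuplicates`, and `CanonicalPair`:

```csharp
using System.Collections.Generic;

public static class TagValidationSketch
{
    // Returns each address that repeats within ONE source list.
    // Run it per list, before any dictionary collapses keys last-write-wins.
    public static List<ushort> FindIntraListDuplicates(IEnumerable<ushort> addresses)
    {
        var seen = new HashSet<ushort>();
        var duplicates = new List<ushort>();
        foreach (ushort address in addresses)
        {
            if (!seen.Add(address))
                duplicates.Add(address);
        }
        return duplicates;
    }

    // Orders the pair so (a, b) and (b, a) map to the same key.
    public static (ushort, ushort) CanonicalPair(ushort a, ushort b)
    {
        return a < b ? (a, b) : (b, a);
    }
}
```

Adding `CanonicalPair(100, 101)` and then `CanonicalPair(101, 100)` to one `HashSet<(ushort, ushort)>` succeeds only the first time, which is how a collision discovered from both directions ends up reported once.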
@@ -1,3 +1,4 @@
using System.Collections.Concurrent;
using System.Threading.Channels; using System.Threading.Channels;
using Mbproxy.Bcd; using Mbproxy.Bcd;
using Mbproxy.Options; using Mbproxy.Options;
@@ -47,10 +48,23 @@ internal sealed partial class ConfigReconciler : IDisposable
  private readonly ServiceCounters _serviceCounters;
  // The supervisor dictionary is set by ProxyWorker after initial startup.
- // All mutations happen inside ApplyAsync which is serialised by the semaphore.
- private Dictionary<string, PlcListenerSupervisor>? _supervisors;
+ // ConcurrentDictionary so the per-PLC Add/Remove/Restart task continuations inside
+ // ApplyUnderLockAsync can mutate it concurrently. The outer Apply is serialised by
+ // _applySemaphore but the inner Task.WhenAll runs in parallel.
+ private ConcurrentDictionary<string, PlcListenerSupervisor>? _supervisors;
  private MbproxyOptions? _currentOptions;
+ // Live accessor for ReadCoalescingOptions, threaded through Attach so PLCs added or
+ // restarted via hot-reload honour the current
+ // `Mbproxy.Resilience.ReadCoalescing.{Enabled,MaxParties}` values. Without this,
+ // reconciler-built supervisors would use the default `new ReadCoalescingOptions()`
+ // and a hot-reload of `Enabled = false` would not propagate to them.
+ private Func<ReadCoalescingOptions>? _coalescingAccessor;
+ // Live accessor for KeepaliveOptions, threaded through Attach so PLCs added or
+ // restarted via hot-reload honour the current `Connection.Keepalive` values.
+ private Func<KeepaliveOptions>? _keepaliveAccessor;
  // ── Debounce + serialisation machinery ───────────────────────────────────────────────
  // Channel carries Unit to signal "something changed — please check".
@@ -100,18 +114,24 @@ internal sealed partial class ConfigReconciler : IDisposable
  // ── Wire-up called by ProxyWorker after initial startup ──────────────────────────────
  /// <summary>
- /// Provides the reconciler with the supervisor dictionary and the initial options
- /// snapshot. Must be called exactly once by <see cref="Proxy.ProxyWorker"/> before
- /// any <c>OnChange</c> events can arrive (i.e. immediately after the supervisors are
- /// created). Thread-safe: the reconciler hasn't started processing changes yet at this
- /// point.
+ /// Provides the reconciler with the supervisor dictionary, the initial options snapshot,
+ /// and the live <see cref="ReadCoalescingOptions"/> accessor that add/restart
+ /// supervisors must use so a hot-reloaded
+ /// <c>Mbproxy.Resilience.ReadCoalescing.Enabled</c> propagates to them. Must be called
+ /// exactly once by <see cref="Proxy.ProxyWorker"/> before any <c>OnChange</c> events
+ /// can arrive (i.e. immediately after the supervisors are created). Thread-safe: the
+ /// reconciler hasn't started processing changes yet at this point.
  /// </summary>
  public void Attach(
- Dictionary<string, PlcListenerSupervisor> supervisors,
- MbproxyOptions initialOptions)
+ ConcurrentDictionary<string, PlcListenerSupervisor> supervisors,
+ MbproxyOptions initialOptions,
+ Func<ReadCoalescingOptions>? coalescingAccessor = null,
+ Func<KeepaliveOptions>? keepaliveAccessor = null)
  {
  _supervisors = supervisors;
  _currentOptions = initialOptions;
+ _coalescingAccessor = coalescingAccessor;
+ _keepaliveAccessor = keepaliveAccessor;
  }
  // ── ApplyAsync (exposed for tests) ───────────────────────────────────────────────────
@@ -229,13 +249,12 @@ internal sealed partial class ConfigReconciler : IDisposable
  if (plan.ToRemove.Count > 0)
  {
  var removeTasks = plan.ToRemove
- .Where(name => _supervisors.ContainsKey(name))
  .Select(async name =>
  {
+ if (!_supervisors.TryRemove(name, out var s))
+ return;
  try
  {
- var s = _supervisors[name];
- _supervisors.Remove(name);
  using var stopCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
  stopCts.CancelAfter(TimeSpan.FromSeconds(10));
  await s.StopAsync(stopCts.Token).ConfigureAwait(false);
@@ -266,16 +285,16 @@ internal sealed partial class ConfigReconciler : IDisposable
  try
  {
  // Stop old supervisor.
- if (_supervisors.TryGetValue(name, out var old))
+ if (_supervisors.TryRemove(name, out var old))
  {
- _supervisors.Remove(name);
  using var stopCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
  stopCts.CancelAfter(TimeSpan.FromSeconds(10));
  await old.StopAsync(stopCts.Token).ConfigureAwait(false);
  await old.DisposeAsync().ConfigureAwait(false);
  }
- // Build fresh context. Phase 11: pass DefaultCacheTtlMs.
+ // Build fresh context. Pass DefaultCacheTtlMs so per-PLC default
+ // caching folds into the resolved tag map.
  var result = BcdTagMapBuilder.Build(next.BcdTags, plcNew.BcdTags, plcNew.DefaultCacheTtlMs);
  var newCtx = new PerPlcContext
  {
@@ -301,7 +320,9 @@ internal sealed partial class ConfigReconciler : IDisposable
  newCtx,
  recoveryPipeline,
  _loggerFactory.CreateLogger<PlcListenerSupervisor>(),
- backendPipeline);
+ backendPipeline,
+ _coalescingAccessor,
+ _keepaliveAccessor);
  _supervisors[name] = newSupervisor;
  await newSupervisor.StartAsync(ct).ConfigureAwait(false);
@@ -332,8 +353,8 @@ internal sealed partial class ConfigReconciler : IDisposable
  // Preserve existing counters so operators see real history.
  Counters = supervisor.CurrentCounters,
  Logger = _loggerFactory.CreateLogger($"Mbproxy.Proxy.BcdRewriter.{name}"),
- // Phase 11: any reseat (tag-map change) constructs a fresh cache.
- // The supervisor disposes the old one inside ReplaceContextAsync.
+ // Any reseat (tag-map change) constructs a fresh cache. The
+ // supervisor disposes the old one inside ReplaceContextAsync.
  Cache = BuildCacheIfNeeded(newMap, next.Cache),
  };
@@ -360,7 +381,8 @@ internal sealed partial class ConfigReconciler : IDisposable
  {
  try
  {
- // Phase 11: pass DefaultCacheTtlMs.
+ // Pass DefaultCacheTtlMs so per-PLC default caching folds into the
+ // resolved tag map.
  var result = BcdTagMapBuilder.Build(next.BcdTags, plcNew.BcdTags, plcNew.DefaultCacheTtlMs);
  var newCtx = new PerPlcContext
  {
@@ -385,7 +407,9 @@ internal sealed partial class ConfigReconciler : IDisposable
  newCtx,
  recoveryPipeline,
  _loggerFactory.CreateLogger<PlcListenerSupervisor>(),
- backendPipeline);
+ backendPipeline,
+ _coalescingAccessor,
+ _keepaliveAccessor);
  _supervisors[plcNew.Name] = newSupervisor;
  await newSupervisor.StartAsync(ct).ConfigureAwait(false);
@@ -413,9 +437,9 @@ internal sealed partial class ConfigReconciler : IDisposable
  // ── Helpers ───────────────────────────────────────────────────────────────────────────
  /// <summary>
- /// Phase 11 — constructs a <see cref="ResponseCache"/> only when at least one resolved
- /// tag in <paramref name="map"/> opts in (<see cref="BcdTag.CacheTtlMs"/> &gt; 0).
- /// Returns <c>null</c> otherwise so the no-cache path is byte-identical to Phase 10.
+ /// Constructs a <see cref="ResponseCache"/> only when at least one resolved tag in
+ /// <paramref name="map"/> opts in (<see cref="BcdTag.CacheTtlMs"/> &gt; 0). Returns
+ /// <c>null</c> otherwise so the no-cache path bypasses cache logic entirely.
  /// </summary>
  private static ResponseCache? BuildCacheIfNeeded(BcdTagMap map, CacheOptions opts)
  {
@@ -78,8 +78,8 @@ public sealed record ReloadPlan(
  // Tag-map change → reseat (swap context, keep socket).
  // We must build both maps to compare them structurally.
  // Compute happens after validation so Build should never return errors here.
- // Phase 11: include DefaultCacheTtlMs in the build so a per-PLC default change
- // is detected by TagMapsEqual via the per-tag CacheTtlMs delta.
+ // Include DefaultCacheTtlMs in the build so a per-PLC default change is
+ // detected by TagMapsEqual via the per-tag CacheTtlMs delta.
  var oldMap = BcdTagMapBuilder.Build(current.BcdTags, plcOld.BcdTags, plcOld.DefaultCacheTtlMs).Map;
  var newMap = BcdTagMapBuilder.Build(next.BcdTags, plcNew.BcdTags, plcNew.DefaultCacheTtlMs).Map;
@@ -97,8 +97,8 @@ public sealed record ReloadPlan(
  /// <summary>
  /// Structural equality between two <see cref="BcdTagMap"/> instances: same set of
  /// (Address, Width, CacheTtlMs) triples. Order doesn't matter — we compare as sets.
- /// Phase 11 includes <see cref="BcdTag.CacheTtlMs"/> in the comparison so a per-tag
- /// or per-PLC default TTL change reseats the context (which flushes the cache).
+ /// Includes <see cref="BcdTag.CacheTtlMs"/> in the comparison so a per-tag or per-PLC
+ /// default TTL change reseats the context (which flushes the cache).
  /// </summary>
  private static bool TagMapsEqual(BcdTagMap a, BcdTagMap b)
  {
@@ -75,14 +75,30 @@ internal static class ReloadValidator
  // ── 4. Per-PLC tag-map build ──────────────────────────────────────────
  // BcdTagMapBuilder.Build is the single source of truth for tag-list
  // well-formedness; we must not duplicate its validation logic here.
+ // Also re-check the RESOLVED per-tag CacheTtlMs against AllowLongTtl. The raw-
+ // input check at section 5 covers explicit per-tag and per-PLC-default values,
+ // but defensively re-validating the post-fold values catches any future fold
+ // logic that produces a value above the gate.
+ bool allowLongTtlForResolved = next.Cache.AllowLongTtl;
  foreach (var plc in next.Plcs)
  {
  var result = BcdTagMapBuilder.Build(next.BcdTags, plc.BcdTags, plc.DefaultCacheTtlMs);
  foreach (var err in result.Errors)
  errs.Add($"Plc '{plc.Name}': BCD tag map error ({err.Kind}): {err.Message}");
+ if (!allowLongTtlForResolved)
+ {
+ foreach (var tag in result.Map.All)
+ {
+ if (tag.CacheTtlMs > 60_000)
+ errs.Add(
+ $"Plc '{plc.Name}': resolved CacheTtlMs for Address {tag.Address} is " +
+ $"{tag.CacheTtlMs} ms (exceeds 60_000) without Cache.AllowLongTtl=true.");
+ }
+ }
  }
- // ── 5. Cache TTL bounds (Phase 11) ────────────────────────────────────
+ // ── 5. Cache TTL bounds ───────────────────────────────────────────────
  // The MbproxyOptionsValidator catches these at schema time too, but ReloadValidator
  // is the gate that the hot-reload path consults directly so re-checking here keeps
  // both paths internally consistent (and the validator runs against tag-map-resolved
@@ -113,6 +129,39 @@ internal static class ReloadValidator
  if (next.Cache.EvictionIntervalMs < 0)
  errs.Add($"Cache.EvictionIntervalMs must be >= 0; got {next.Cache.EvictionIntervalMs}.");
+ // Connection timeouts must be > 0. A reload that sets any of these to 0 or
+ // negative would break the runtime; reject the reload as a whole.
+ if (next.Connection.BackendConnectTimeoutMs <= 0)
+ errs.Add(
+ $"Connection.BackendConnectTimeoutMs must be > 0; got {next.Connection.BackendConnectTimeoutMs}.");
+ if (next.Connection.BackendRequestTimeoutMs <= 0)
+ errs.Add(
+ $"Connection.BackendRequestTimeoutMs must be > 0; got {next.Connection.BackendRequestTimeoutMs}.");
+ if (next.Connection.GracefulShutdownTimeoutMs <= 0)
+ errs.Add(
+ $"Connection.GracefulShutdownTimeoutMs must be > 0; got {next.Connection.GracefulShutdownTimeoutMs}.");
+ // ── 6. Keepalive section ──────────────────────────────────────────────
+ // Schema bounds are also checked in MbproxyOptionsValidator; re-checking here keeps
+ // the hot-reload gate self-contained. The cross-field rule (heartbeat interval must
+ // sit above the request timeout, or it would fire continuously) lives only here.
+ var ka = next.Connection.Keepalive;
+ if (ka.TcpIdleTimeMs <= 0)
+ errs.Add($"Connection.Keepalive.TcpIdleTimeMs must be > 0; got {ka.TcpIdleTimeMs}.");
+ if (ka.TcpProbeIntervalMs <= 0)
+ errs.Add($"Connection.Keepalive.TcpProbeIntervalMs must be > 0; got {ka.TcpProbeIntervalMs}.");
+ if (ka.TcpProbeCount <= 0)
+ errs.Add($"Connection.Keepalive.TcpProbeCount must be > 0; got {ka.TcpProbeCount}.");
+ if (ka.BackendHeartbeatProbeAddress is < 0 or > 65535)
+ errs.Add(
+ $"Connection.Keepalive.BackendHeartbeatProbeAddress must be in [0, 65535]; " +
+ $"got {ka.BackendHeartbeatProbeAddress}.");
+ if (ka.BackendHeartbeatIdleMs <= next.Connection.BackendRequestTimeoutMs)
+ errs.Add(
+ $"Connection.Keepalive.BackendHeartbeatIdleMs ({ka.BackendHeartbeatIdleMs}) must be greater " +
+ $"than Connection.BackendRequestTimeoutMs ({next.Connection.BackendRequestTimeoutMs}); " +
+ "a heartbeat interval at or below the request timeout would fire continuously.");
  errors = errs;
  return errs.Count == 0;
  }
@@ -0,0 +1,60 @@
namespace Mbproxy.Diagnostics;
/// <summary>
/// The platform diagnostic sink to wire for <c>Error</c>+ events — picked once,
/// at the composition root, by <see cref="DiagnosticSinkSelector"/>.
/// </summary>
internal enum DiagnosticSink
{
/// <summary>
/// No platform diagnostic sink — console (and rolling-file) sinks only. Used
/// for interactive / dev runs on every OS.
/// </summary>
None,
/// <summary>
/// Windows Application Event Log, via <see cref="EventLogBridge"/>. Selected
/// only when the process is hosted as a Windows Service.
/// </summary>
EventLog,
/// <summary>
/// Local syslog, via <see cref="SyslogBridge"/>. Selected only when the
/// process is hosted as a systemd service on Linux.
/// </summary>
Syslog,
}
/// <summary>
/// Pure platform-selection logic for the <c>Error</c>+ diagnostic sink. Holds no
/// I/O and no host APIs so it is unit-testable for every OS / host combination;
/// the host detection itself happens in <see cref="HostingExtensions.AddMbproxySerilog"/>.
/// </summary>
internal static class DiagnosticSinkSelector
{
/// <summary>
/// Picks the diagnostic sink for the current host:
/// <list type="bullet">
/// <item>Windows hosted as a Windows Service → <see cref="DiagnosticSink.EventLog"/>.</item>
/// <item>Linux hosted as a systemd service → <see cref="DiagnosticSink.Syslog"/>.</item>
/// <item>Everything else — interactive / dev runs, macOS, launches not owned
/// by an init system → <see cref="DiagnosticSink.None"/>.</item>
/// </list>
/// The managed-service gate mirrors the original <see cref="EventLogBridge"/>
/// contract: a diagnostic sink is wired only when an init system actually owns
/// the process, so dev / console runs never need an Event Log source registered
/// or a syslog socket reachable.
/// </summary>
/// <param name="isWindows">Running on Windows.</param>
/// <param name="isWindowsService">Hosted by the Windows Service Control Manager.</param>
/// <param name="isSystemd">Hosted by systemd.</param>
public static DiagnosticSink Select(bool isWindows, bool isWindowsService, bool isSystemd)
{
// Windows takes precedence: isSystemd is meaningless there, and on
// non-Windows isWindowsService is always false.
if (isWindows)
return isWindowsService ? DiagnosticSink.EventLog : DiagnosticSink.None;
return isSystemd ? DiagnosticSink.Syslog : DiagnosticSink.None;
}
}
@@ -5,6 +5,32 @@ using Serilog.Events;
  namespace Mbproxy.Diagnostics;
+ /// <summary>
+ /// Pure message-shaping helpers for the Windows Event Log. Kept on a separate,
+ /// non-platform-annotated type — <em>not</em> on <see cref="EventLogBridge"/>,
+ /// which is <c>[SupportedOSPlatform("windows")]</c> — so the truncation logic is
+ /// unit-testable on any OS without tripping the platform-compatibility analyzer.
+ /// </summary>
+ internal static class EventLogMessage
+ {
+ /// <summary>The Windows Event Log single-entry limit, in bytes (32 KB).</summary>
+ public const int MaxBytes = 32 * 1024;
+ /// <summary>
+ /// Truncates <paramref name="message"/> so its UTF-16 byte length stays within
+ /// <see cref="MaxBytes"/>, appending an ellipsis when truncation occurs. Shorter
+ /// messages are returned unchanged.
+ /// </summary>
+ public static string TruncateToLimit(string message)
+ {
+ // Rough UTF-16 upper bound: 2 bytes per char.
+ if (message.Length * 2 <= MaxBytes) return message;
+ int charLimit = MaxBytes / 2 - 3; // leave room for the "..." suffix
+ return message[..charLimit] + "...";
+ }
+ }
  /// <summary>
  /// Serilog sink that writes events at level Error and above to the Windows Event Log
  /// under source <c>mbproxy</c>.
@@ -26,13 +52,21 @@ internal sealed class EventLogBridge : ILogEventSink
  {
  private const string Source = "mbproxy";
  private const string LogName = "Application";
- private const int MaxMessageBytes = 32 * 1024; // 32 KB Event Log limit
  private readonly bool _enabled;
+ // Cache the source-exists check at construction so Emit doesn't hit the registry on
+ // every Error+ log line. A missing source after start requires a service restart to
+ // pick up; in practice install.ps1 registers it once at install.
+ private readonly bool _sourceExists;
  public EventLogBridge(bool enabled)
  {
  _enabled = enabled;
+ if (_enabled)
+ {
+ try { _sourceExists = EventLog.SourceExists(Source); }
+ catch { _sourceExists = false; }
+ }
  }
  /// <inheritdoc/>
@@ -41,9 +75,9 @@ internal sealed class EventLogBridge : ILogEventSink
  if (!_enabled) return;
  if (logEvent.Level < LogEventLevel.Error) return;
- // Check that the source exists; if not, silently swallow the service
- // account may not be able to create it and we must not crash the logger.
- if (!EventLog.SourceExists(Source)) return;
+ // Cached at construction — silently swallow if the source isn't registered.
+ // The service account may not be able to create it and we must not crash the logger.
+ if (!_sourceExists) return;
  string message = logEvent.RenderMessage();
@@ -54,11 +88,7 @@ internal sealed class EventLogBridge : ILogEventSink
  }
  // Truncate to the Event Log single-entry limit.
- if (message.Length * 2 > MaxMessageBytes) // rough UTF-16 upper bound
- {
- int charLimit = MaxMessageBytes / 2 - 3;
- message = message[..charLimit] + "...";
- }
+ message = EventLogMessage.TruncateToLimit(message);
  var type = logEvent.Level switch
  {
@@ -0,0 +1,50 @@
using Serilog;
using Serilog.Debugging;
using Serilog.Events;
namespace Mbproxy.Diagnostics;
/// <summary>
/// Wires the local-syslog sink for <c>Error</c>+ events when mbproxy runs as a
/// systemd service on Linux — the cross-platform counterpart of
/// <see cref="EventLogBridge"/>.
///
/// <para>Events at <see cref="LogEventLevel.Error"/> and above are written to the
/// local syslog socket (<c>/dev/log</c>) under the application name
/// <see cref="AppName"/>, with Serilog levels mapped to syslog severities by the
/// sink. On a systemd host the local syslog socket is provided by
/// <c>systemd-journald</c>, so these events land in the journal at
/// <c>err</c>/<c>crit</c> priority — distinct from the process's stdout, which
/// journald captures at <c>info</c>.</para>
///
/// <para>If the local syslog socket is unavailable the bridge degrades silently
/// to the console (and rolling-file) sinks rather than failing logger
/// construction, mirroring <see cref="EventLogBridge"/>'s no-op-when-unavailable
/// contract.</para>
/// </summary>
internal static class SyslogBridge
{
/// <summary>syslog application name — the <c>TAG</c> field of each entry.</summary>
internal const string AppName = "mbproxy";
/// <summary>
/// Attaches the <c>Error</c>+ local-syslog sink to <paramref name="cfg"/> and
/// returns it for fluent chaining. Never throws: a host where the syslog sink
/// cannot be configured degrades to <paramref name="cfg"/> unchanged.
/// </summary>
public static LoggerConfiguration AttachTo(LoggerConfiguration cfg)
{
try
{
return cfg.WriteTo.LocalSyslog(
appName: AppName,
restrictedToMinimumLevel: LogEventLevel.Error);
}
catch (Exception ex)
{
// Degrade to console-only rather than crash logger construction.
SelfLog.WriteLine("SyslogBridge: local syslog unavailable, console-only: {0}", ex);
return cfg;
}
}
}
+41 -26
@@ -2,7 +2,10 @@ using Mbproxy.Admin;
  using Mbproxy.Configuration;
  using Mbproxy.Diagnostics;
  using Mbproxy.Options;
+ using Microsoft.Extensions.Hosting.Systemd;
+ using Microsoft.Extensions.Hosting.WindowsServices;
  using Serilog;
+ using Serilog.Events;
  namespace Mbproxy;
@@ -10,12 +13,10 @@ internal static class HostingExtensions
  {
  /// <summary>
  /// Registers the <c>"Mbproxy"</c> configuration section, binds it to
- /// <see cref="MbproxyOptions"/> via <c>IOptionsMonitor</c>, and registers
- /// the schema-level <see cref="MbproxyOptionsValidator"/>.
- ///
- /// Phase 06: also registers <see cref="ServiceCounters"/> (singleton) and
- /// <see cref="ConfigReconciler"/> (singleton) so they can be injected into
- /// <see cref="Proxy.ProxyWorker"/>.
+ /// <see cref="MbproxyOptions"/> via <c>IOptionsMonitor</c>, registers the schema-
+ /// level <see cref="MbproxyOptionsValidator"/>, and registers the singleton
+ /// <see cref="ServiceCounters"/> and <see cref="ConfigReconciler"/> so they can be
+ /// injected into <see cref="Proxy.ProxyWorker"/>.
  /// </summary>
  public static IHostApplicationBuilder AddMbproxyOptions(this IHostApplicationBuilder builder)
  {
@@ -28,17 +29,17 @@ internal static class HostingExtensions
  Microsoft.Extensions.Options.IValidateOptions<MbproxyOptions>,
  MbproxyOptionsValidator>();
- // Phase 06: service-wide counters (read by Phase 07 status page).
+ // Service-wide counters (read by the status page).
  builder.Services.AddSingleton<ServiceCounters>();
- // Phase 06: hot-reload reconciler (singleton; subscribes to IOptionsMonitor.OnChange).
+ // Hot-reload reconciler (singleton; subscribes to IOptionsMonitor.OnChange).
  builder.Services.AddSingleton<ConfigReconciler>();
  return builder;
  }
  /// <summary>
- /// Registers Phase 07 admin endpoint services:
+ /// Registers the admin endpoint services:
  /// <list type="bullet">
  /// <item><see cref="AssemblyVersionAccessor"/> (singleton — reads version attribute once).</item>
  /// <item><see cref="StatusSnapshotBuilder"/> (singleton — pure orchestration).</item>
@@ -47,8 +48,8 @@ internal static class HostingExtensions
  /// Must be called after <see cref="AddMbproxyOptions"/> and after
  /// <c>AddHostedService&lt;ProxyWorker&gt;</c> (so ProxyWorker is available via DI).
  ///
- /// <para><b>Phase 12 (W1.5)</b> — <see cref="AdminEndpointHost"/> is no longer registered
- /// via <c>AddHostedService</c>. <see cref="Proxy.ProxyWorker"/> drives its lifecycle
+ /// <para><see cref="AdminEndpointHost"/> is intentionally NOT registered via
+ /// <c>AddHostedService</c>. <see cref="Proxy.ProxyWorker"/> drives its lifecycle
  /// directly so admin start/stop ordering matches the design contract (admin starts
  /// after listeners are up; admin stops AFTER the in-flight drain).</para>
  /// </summary>
@@ -61,28 +62,42 @@ internal static class HostingExtensions
  }
  /// <summary>
- /// Configures Serilog from the <c>"Serilog"</c> configuration section,
- /// with console and rolling-file sinks as defaults.
+ /// Configures Serilog from the <c>"Serilog"</c> configuration section, with console
+ /// and rolling-file sinks as defaults.
  ///
- /// <para>Phase 08: when <paramref name="addEventLogBridge"/> is <c>true</c>, the
- /// <see cref="Diagnostics.EventLogBridge"/> is added as a sub-sink for events at
- /// <see cref="Serilog.Events.LogEventLevel.Error"/> and above. This flag should only be
- /// set when the service is running as a Windows Service — the bridge silently ignores
- /// events when the Event Log source is not registered.</para>
+ /// <para>This is the single composition-root point where the platform diagnostic
+ /// sink for <c>Error</c>+ events is chosen. <see cref="DiagnosticSinkSelector"/>
+ /// picks it from the current host:
+ /// <list type="bullet">
+ /// <item>Windows Service → <see cref="Diagnostics.EventLogBridge"/> (Application
+ /// Event Log).</item>
+ /// <item>systemd service → <see cref="Diagnostics.SyslogBridge"/> (local syslog).</item>
+ /// <item>interactive / dev runs (any OS) → no platform sink.</item>
+ /// </list>
+ /// Both bridges silently no-op when their backing facility is unavailable, so a
+ /// dev run never needs an Event Log source registered or a syslog socket.</para>
  /// </summary>
- public static IHostApplicationBuilder AddMbproxySerilog(
- this IHostApplicationBuilder builder,
- bool addEventLogBridge = false)
+ public static IHostApplicationBuilder AddMbproxySerilog(this IHostApplicationBuilder builder)
  {
  var cfg = new LoggerConfiguration()
  .ReadFrom.Configuration(builder.Configuration);
- if (addEventLogBridge && OperatingSystem.IsWindows())
+ var sink = DiagnosticSinkSelector.Select(
+ isWindows: OperatingSystem.IsWindows(),
+ isWindowsService: WindowsServiceHelpers.IsWindowsService(),
+ isSystemd: SystemdHelpers.IsSystemdService());
+ cfg = sink switch
  {
- cfg = cfg.WriteTo.Sink(
- new EventLogBridge(enabled: true),
- Serilog.Events.LogEventLevel.Error);
- }
+ // EventLogBridge is [SupportedOSPlatform("windows")]; the extra
+ // OperatingSystem.IsWindows() guard satisfies the platform analyzer
+ // (DiagnosticSinkSelector already guarantees Windows for this case).
+ DiagnosticSink.EventLog when OperatingSystem.IsWindows()
+ => cfg.WriteTo.Sink(new EventLogBridge(enabled: true), LogEventLevel.Error),
+ DiagnosticSink.Syslog
+ => SyslogBridge.AttachTo(cfg),
+ _ => cfg,
+ };
  Log.Logger = cfg.CreateLogger();
+43 -14
@@ -8,38 +8,48 @@
  <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
  <RootNamespace>Mbproxy</RootNamespace>
  <AssemblyName>Mbproxy</AssemblyName>
- <!-- Phase 08: Assembly version. CI can override via /p:InformationalVersion=... -->
+ <!-- Assembly version. CI can override via /p:InformationalVersion=... -->
  <InformationalVersion>1.0.0</InformationalVersion>
  </PropertyGroup>
- <!-- Phase 08: single-file self-contained publish (Release only; Debug stays normal for fast iteration).
- NOTE: the resulting Mbproxy.exe is ~100 MB because the self-contained publish bundles the full
- .NET 10 + ASP.NET Core runtime. This exceeds the original 50 MB target in the phase spec;
- the runtime size is a fixed cost of self-contained deployment on .NET 10 with ASP.NET Core.
- Operators who need a smaller footprint can use a framework-dependent publish (~1.6 MB)
- (dotnet publish -c Release -r win-x64 - -self-contained false /p:PublishSingleFile=true)
- if the target machine has .NET 10 installed. -->
- <PropertyGroup Condition="'$(Configuration)' == 'Release'">
+ <!-- Single-file publish settings — apply only to a Release publish with an explicit RID.
+ Publishing with -r <rid> produces a single-file binary, self-contained by default
+ (bundles the .NET 10 + ASP.NET Core runtime, ~100 MB) so no .NET install is needed on
+ the target. Override with -p:SelfContained=false for a framework-dependent build
+ when the target already has the .NET 10 + ASP.NET Core runtime.
+ The RID is supplied per publish (win-x64, linux-x64, ...) and is deliberately NOT
+ hardcoded here — see install/publish.ps1 / install/publish.sh. The
+ '$(RuntimeIdentifier)' != '' guard means a plain `dotnet build -c Release` with no RID
+ stays an ordinary framework build (SelfContained without a RID is an SDK error). -->
+ <PropertyGroup Condition="'$(Configuration)' == 'Release' and '$(RuntimeIdentifier)' != ''">
  <PublishSingleFile>true</PublishSingleFile>
  <SelfContained>true</SelfContained>
- <RuntimeIdentifier>win-x64</RuntimeIdentifier>
  <IncludeNativeLibrariesForSelfExtract>true</IncludeNativeLibrariesForSelfExtract>
  </PropertyGroup>
  <ItemGroup>
- <!-- ASP.NET Core for the Phase 07 Kestrel-hosted admin endpoint. -->
+ <!-- ASP.NET Core for the Kestrel-hosted admin endpoint. -->
  <FrameworkReference Include="Microsoft.AspNetCore.App" />
  </ItemGroup>
  <ItemGroup>
  <!-- Microsoft.Extensions.Hosting is already included transitively via
- Microsoft.AspNetCore.App — do not re-add it explicitly. -->
+ Microsoft.AspNetCore.App — do not re-add it explicitly.
+ The two init-system integration packages are both portable: each is
+ safe to reference and call on any OS (the helper self-detects its host
+ and no-ops otherwise), so no conditional reference is needed. -->
  <PackageReference Include="Microsoft.Extensions.Hosting.WindowsServices" Version="10.0.8" />
+ <PackageReference Include="Microsoft.Extensions.Hosting.Systemd" Version="10.0.8" />
  <PackageReference Include="Serilog.Extensions.Hosting" Version="10.0.0" />
  <PackageReference Include="Serilog.Settings.Configuration" Version="10.0.0" />
  <PackageReference Include="Serilog.Sinks.Console" Version="6.1.1" />
  <PackageReference Include="Serilog.Sinks.File" Version="7.0.0" />
- <!-- Referenced now so phase 04/05 don't need to touch this csproj; usage is deferred -->
+ <!-- Local-syslog sink for the Linux diagnostic bridge (Error+ events).
+ Serilog.Sinks.SyslogMessages is the maintained IonxSolutions package. -->
+ <PackageReference Include="Serilog.Sinks.SyslogMessages" Version="4.1.0" />
+ <!-- Polly: backend-connect retry pipeline (PolicyFactory.BuildBackendConnect) and
+ listener-recovery pipeline (PolicyFactory.BuildListenerRecovery). -->
  <PackageReference Include="Polly" Version="8.6.6" />
  </ItemGroup>
@@ -48,8 +58,27 @@
<InternalsVisibleTo Include="Mbproxy.Tests" /> <InternalsVisibleTo Include="Mbproxy.Tests" />
</ItemGroup> </ItemGroup>
<!-- Link the platform-appropriate install template as the published appsettings.json so
the binary ships with a fully-commented, usable example config (PLCs, BCD tags, all
sections present) instead of an empty stub. The .NET configuration loader supports
JSONC (comments) under the default Host.CreateApplicationBuilder path, so the comments
in the template are valid at runtime.
The two templates differ only in OS-specific paths (log directory) and platform
notes. A `dotnet publish -r linux-*` (or any non-win RID) ships the Linux template;
win-* and a plain RID-less dev build ship the Windows one. -->
<ItemGroup> <ItemGroup>
<Content Update="appsettings.json"> <None Remove="appsettings.json" />
</ItemGroup>
<ItemGroup Condition="'$(RuntimeIdentifier)' == '' or $(RuntimeIdentifier.StartsWith('win'))">
<Content Include="..\..\install\mbproxy.config.template.json"
Link="appsettings.json">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content>
</ItemGroup>
<ItemGroup Condition="'$(RuntimeIdentifier)' != '' and !$(RuntimeIdentifier.StartsWith('win'))">
<Content Include="..\..\install\mbproxy.linux.config.template.json"
Link="appsettings.json">
<CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory> <CopyToOutputDirectory>PreserveNewest</CopyToOutputDirectory>
</Content> </Content>
</ItemGroup> </ItemGroup>
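The two RID conditions above split every build into exactly one template: an empty RID or a `win*` RID ships the Windows file, anything else ships the Linux file. The real selection happens in MSBuild. As a hedged illustration only, the same decision can be written as a pure C# function (`TemplateSelectorSketch` and `SelectTemplate` are hypothetical names, not part of mbproxy):

```csharp
using System;

// Hypothetical helper that mirrors the csproj conditions; illustration only.
public static class TemplateSelectorSketch
{
    public static string SelectTemplate(string? rid)
    {
        // Mirrors: '$(RuntimeIdentifier)' == '' or $(RuntimeIdentifier.StartsWith('win'))
        bool windowsOrDevBuild =
            string.IsNullOrEmpty(rid) || rid.StartsWith("win", StringComparison.Ordinal);

        return windowsOrDevBuild
            ? "mbproxy.config.template.json"
            : "mbproxy.linux.config.template.json";
    }
}
```

Because the second condition is the exact negation of the first, no RID can match both item groups or neither, so exactly one file is linked as appsettings.json.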
+4 -4
@@ -6,10 +6,10 @@ public sealed class BcdTagOptions
public byte Width { get; init; } // 16 or 32 public byte Width { get; init; } // 16 or 32
/// <summary> /// <summary>
/// Phase 11 — optional opt-in to the response cache. Null (default) means /// Optional opt-in to the response cache. Null (default) means "unset" and falls
/// "unset" and falls back to the per-PLC <see cref="PlcOptions.DefaultCacheTtlMs"/>; /// back to the per-PLC <see cref="PlcOptions.DefaultCacheTtlMs"/>; 0 explicitly
/// 0 explicitly disables caching for this tag even when the PLC default is non-zero. /// disables caching for this tag even when the PLC default is non-zero. Positive
/// Positive values cap the staleness window in milliseconds. /// values cap the staleness window in milliseconds.
/// </summary> /// </summary>
public int? CacheTtlMs { get; init; } public int? CacheTtlMs { get; init; }
} }
@@ -9,4 +9,9 @@ public sealed class ConnectionOptions
/// graceful shutdown before cancelling them. Default: 10000 (10 s). /// graceful shutdown before cancelling them. Default: 10000 (10 s).
/// </summary> /// </summary>
public int GracefulShutdownTimeoutMs { get; init; } = 10000; public int GracefulShutdownTimeoutMs { get; init; } = 10000;
/// <summary>
/// TCP keepalive and backend-heartbeat connection-monitoring settings. Enabled by default.
/// </summary>
public KeepaliveOptions Keepalive { get; init; } = new();
} }
@@ -0,0 +1,52 @@
namespace Mbproxy.Options;
/// <summary>
/// TCP keepalive and application-level connection-monitoring settings.
///
/// <para>The DL205/DL260 ECOM does not emit TCP keepalives, so an idle backend socket can be
/// silently dropped by a middlebox (switch, firewall, NAT) after 2-5 minutes. These knobs
/// (a) enable OS-level <c>SO_KEEPALIVE</c> on both backend and accepted upstream sockets and
/// (b) drive a periodic Modbus FC03 heartbeat on each idle backend socket so the path stays
/// warm and a dead ECOM is detected before a real client request hits it.</para>
/// </summary>
public sealed class KeepaliveOptions
{
/// <summary>
/// Master switch. When <c>false</c>, neither <c>SO_KEEPALIVE</c> nor the backend heartbeat
/// is applied and the proxy behaves exactly as a pre-keepalive build. Default: <c>true</c>.
/// </summary>
public bool Enabled { get; init; } = true;
/// <summary>
/// <c>SO_KEEPALIVE</c> idle time in milliseconds — how long a socket may be idle before the
/// OS sends its first keepalive probe. Applied to backend and accepted upstream sockets.
/// Default: 30000 (30 s).
/// </summary>
public int TcpIdleTimeMs { get; init; } = 30000;
/// <summary>
/// <c>SO_KEEPALIVE</c> interval in milliseconds between keepalive probes once the idle time
/// has elapsed. Default: 5000 (5 s).
/// </summary>
public int TcpProbeIntervalMs { get; init; } = 5000;
/// <summary>
/// <c>SO_KEEPALIVE</c> probe count — unanswered probes before the OS declares the socket
/// dead. Default: 4.
/// </summary>
public int TcpProbeCount { get; init; } = 4;
/// <summary>
/// Backend application heartbeat: after this many milliseconds with no backend traffic, the
/// multiplexer issues a synthetic FC03 qty=1 read to keep the socket warm and prove the ECOM
/// is still answering Modbus. Must be greater than <see cref="ConnectionOptions.BackendRequestTimeoutMs"/>.
/// Default: 30000 (30 s).
/// </summary>
public int BackendHeartbeatIdleMs { get; init; } = 30000;
/// <summary>
/// Modbus PDU address read by the backend heartbeat FC03 probe. Address 0 (V0) is valid on
/// DL205/DL260 in factory absolute mode. Default: 0.
/// </summary>
public int BackendHeartbeatProbeAddress { get; init; } = 0;
}
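The three `Tcp*` settings above map onto standard socket options. The code that actually applies them is not shown in this diff, so the following is only a minimal sketch under stated assumptions: `KeepaliveSketch`, `Apply`, and `MsToWholeSeconds` are hypothetical names, and rounding the millisecond values up to whole seconds is an assumption made because .NET's `TcpKeepAliveTime` and `TcpKeepAliveInterval` options are expressed in seconds.

```csharp
using System;
using System.Net.Sockets;

// Hypothetical helper; the real mbproxy socket setup is not part of this diff.
public static class KeepaliveSketch
{
    // Round milliseconds up to whole seconds, never below 1 s.
    public static int MsToWholeSeconds(int ms) => Math.Max(1, (ms + 999) / 1000);

    public static void Apply(Socket socket, int idleMs, int intervalMs, int probeCount)
    {
        // Turn SO_KEEPALIVE on, then tune idle time, probe interval, and probe count.
        socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime,
            MsToWholeSeconds(idleMs));
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval,
            MsToWholeSeconds(intervalMs));
        socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount,
            probeCount);
    }
}
```

With the defaults (30000 ms idle, 5000 ms interval, 4 probes), a dead peer would be declared roughly 30 + 4 x 5 = 50 s after the last traffic, assuming the OS honours all three values.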
+36 -8
@@ -11,16 +11,16 @@ public sealed class MbproxyOptions
public ResilienceOptions Resilience { get; init; } = new(); public ResilienceOptions Resilience { get; init; } = new();
/// <summary> /// <summary>
/// Phase 11 — service-wide response-cache settings. The cache is opt-in /// Service-wide response-cache settings. The cache is opt-in per-tag
/// per-tag (<see cref="BcdTagOptions.CacheTtlMs"/>); this section configures the /// (<see cref="BcdTagOptions.CacheTtlMs"/>); this section configures the safety
/// safety knobs that gate / bound the cache. /// knobs that gate / bound the cache.
/// </summary> /// </summary>
public CacheOptions Cache { get; init; } = new(); public CacheOptions Cache { get; init; } = new();
} }
/// <summary> /// <summary>
/// Phase 11 — service-wide response-cache knobs. The cache is OFF by default for every /// Service-wide response-cache knobs. The cache is OFF by default for every tag;
/// tag; this section governs the limits when an operator opts a tag in. /// this section governs the limits when an operator opts a tag in.
/// </summary> /// </summary>
public sealed class CacheOptions public sealed class CacheOptions
{ {
@@ -47,8 +47,8 @@ public sealed class CacheOptions
} }
/// <summary> /// <summary>
/// Schema-level validation for <see cref="MbproxyOptions"/>. /// Schema-level validation for <see cref="MbproxyOptions"/>. Business-rule validation
/// Business-rule validation (duplicate addresses, port conflicts) is deferred to phase 06. /// (duplicate addresses, port conflicts) is delegated to <see cref="Configuration.ReloadValidator"/>.
/// </summary> /// </summary>
public sealed class MbproxyOptionsValidator : IValidateOptions<MbproxyOptions> public sealed class MbproxyOptionsValidator : IValidateOptions<MbproxyOptions>
{ {
@@ -68,7 +68,7 @@ public sealed class MbproxyOptionsValidator : IValidateOptions<MbproxyOptions>
{ {
var plc = options.Plcs[i]; var plc = options.Plcs[i];
// Phase 11 — per-PLC default TTL bounds. // Per-PLC default TTL bounds.
if (plc.DefaultCacheTtlMs < 0) if (plc.DefaultCacheTtlMs < 0)
errors.Add($"Plcs[{i}] ({plc.Name}): DefaultCacheTtlMs must be >= 0."); errors.Add($"Plcs[{i}] ({plc.Name}): DefaultCacheTtlMs must be >= 0.");
else if (plc.DefaultCacheTtlMs > 60_000 && !allowLongTtl) else if (plc.DefaultCacheTtlMs > 60_000 && !allowLongTtl)
@@ -94,6 +94,34 @@ public sealed class MbproxyOptionsValidator : IValidateOptions<MbproxyOptions>
if (options.Cache.EvictionIntervalMs < 0) if (options.Cache.EvictionIntervalMs < 0)
errors.Add($"Cache.EvictionIntervalMs must be >= 0; got {options.Cache.EvictionIntervalMs}."); errors.Add($"Cache.EvictionIntervalMs must be >= 0; got {options.Cache.EvictionIntervalMs}.");
// Connection timeouts must be strictly positive. A 0 or negative value produces
// a CancelAfter(0) that fires immediately and breaks every backend connect/request.
if (options.Connection.BackendConnectTimeoutMs <= 0)
errors.Add(
$"Connection.BackendConnectTimeoutMs must be > 0; got {options.Connection.BackendConnectTimeoutMs}.");
if (options.Connection.BackendRequestTimeoutMs <= 0)
errors.Add(
$"Connection.BackendRequestTimeoutMs must be > 0; got {options.Connection.BackendRequestTimeoutMs}.");
if (options.Connection.GracefulShutdownTimeoutMs <= 0)
errors.Add(
$"Connection.GracefulShutdownTimeoutMs must be > 0; got {options.Connection.GracefulShutdownTimeoutMs}.");
// Keepalive section ranges. Cross-field rules (heartbeat interval vs request
// timeout) are enforced in ReloadValidator.
var ka = options.Connection.Keepalive;
if (ka.TcpIdleTimeMs <= 0)
errors.Add($"Connection.Keepalive.TcpIdleTimeMs must be > 0; got {ka.TcpIdleTimeMs}.");
if (ka.TcpProbeIntervalMs <= 0)
errors.Add($"Connection.Keepalive.TcpProbeIntervalMs must be > 0; got {ka.TcpProbeIntervalMs}.");
if (ka.TcpProbeCount <= 0)
errors.Add($"Connection.Keepalive.TcpProbeCount must be > 0; got {ka.TcpProbeCount}.");
if (ka.BackendHeartbeatIdleMs <= 0)
errors.Add($"Connection.Keepalive.BackendHeartbeatIdleMs must be > 0; got {ka.BackendHeartbeatIdleMs}.");
if (ka.BackendHeartbeatProbeAddress is < 0 or > 65535)
errors.Add(
$"Connection.Keepalive.BackendHeartbeatProbeAddress must be in [0, 65535]; " +
$"got {ka.BackendHeartbeatProbeAddress}.");
return errors.Count > 0 return errors.Count > 0
? ValidateOptionsResult.Fail(errors) ? ValidateOptionsResult.Fail(errors)
: ValidateOptionsResult.Success; : ValidateOptionsResult.Success;
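The per-field range checks above leave the cross-field rule (heartbeat idle time versus backend request timeout) to ReloadValidator, which this diff does not show. A minimal sketch of what such a check could look like, assuming only the rule stated in the `BackendHeartbeatIdleMs` doc comment ("must be greater than BackendRequestTimeoutMs"); the class and method names are hypothetical:

```csharp
using System.Collections.Generic;

// Hypothetical cross-field check; the real ReloadValidator is not shown here.
public static class KeepaliveCrossFieldSketch
{
    public static List<string> Validate(int backendHeartbeatIdleMs, int backendRequestTimeoutMs)
    {
        var errors = new List<string>();

        // The heartbeat must fire less often than a single request may take.
        if (backendHeartbeatIdleMs <= backendRequestTimeoutMs)
        {
            errors.Add(
                "Connection.Keepalive.BackendHeartbeatIdleMs must be greater than " +
                $"Connection.BackendRequestTimeoutMs; got {backendHeartbeatIdleMs} " +
                $"vs {backendRequestTimeoutMs}.");
        }

        return errors;
    }
}
```

Collecting errors into a list, rather than throwing on the first failure, matches the accumulate-then-fail style of the validator above.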
+1 -1
@@ -14,7 +14,7 @@ public sealed class PlcOptions
public PlcBcdOverrides? BcdTags { get; init; } public PlcBcdOverrides? BcdTags { get; init; }
/// <summary> /// <summary>
/// Phase 11 — per-PLC default cache TTL applied to any tag whose explicit /// Per-PLC default cache TTL applied to any tag whose explicit
/// <see cref="BcdTagOptions.CacheTtlMs"/> is unset (null). 0 (the default) means /// <see cref="BcdTagOptions.CacheTtlMs"/> is unset (null). 0 (the default) means
/// "no caching by default at this PLC". Per-tag values always win over the per-PLC /// "no caching by default at this PLC". Per-tag values always win over the per-PLC
/// default when set; an explicit zero on a tag still disables caching for that tag. /// default when set; an explicit zero on a tag still disables caching for that tag.
@@ -10,8 +10,8 @@ public sealed class ResilienceOptions
}; };
/// <summary> /// <summary>
/// Phase 10 — in-flight read coalescing options. Defaults to enabled with a 32-party /// In-flight read coalescing options. Defaults to enabled with a 32-party cap so
/// cap so unconfigured deployments get the de-duplication benefit immediately. /// unconfigured deployments get the de-duplication benefit immediately.
/// </summary> /// </summary>
public ReadCoalescingOptions ReadCoalescing { get; init; } = new(); public ReadCoalescingOptions ReadCoalescing { get; init; } = new();
} }
@@ -29,10 +29,10 @@ public sealed class RecoveryProfile
} }
/// <summary> /// <summary>
/// Phase 10 — knobs for the in-flight read-coalescing feature. The feature attaches /// Knobs for the in-flight read-coalescing feature. The feature attaches late-arriving
/// late-arriving FC03/FC04 reads of identical <c>(unitId, fc, start, qty)</c> tuples to an /// FC03/FC04 reads of identical <c>(unitId, fc, start, qty)</c> tuples to an already-
/// already-in-flight peer, fanning out the single backend response to every attached /// in-flight peer, fanning out the single backend response to every attached upstream
/// upstream client. /// client.
/// ///
/// <para>Zero post-response staleness — coalescing operates entirely within the in-flight /// <para>Zero post-response staleness — coalescing operates entirely within the in-flight
/// window (microseconds to ~10 ms typical). Once the response is delivered, the coalescing /// window (microseconds to ~10 ms typical). Once the response is delivered, the coalescing
@@ -41,10 +41,10 @@ public sealed class RecoveryProfile
public sealed class ReadCoalescingOptions public sealed class ReadCoalescingOptions
{ {
/// <summary> /// <summary>
/// Master switch. When <c>false</c>, every FC03/FC04 request takes the Phase-9 path /// Master switch. When <c>false</c>, every FC03/FC04 request allocates a fresh
/// (allocate a fresh proxy TxId and round-trip to the backend). Hot-reloadable via /// proxy TxId and round-trips to the backend without attempting to coalesce.
/// <c>IOptionsMonitor</c>; flipping to <c>false</c> at runtime does not disturb already- /// Hot-reloadable via <c>IOptionsMonitor</c>; flipping to <c>false</c> at runtime
/// coalesced entries — they drain naturally. /// does not disturb already-coalesced entries — they drain naturally.
/// </summary> /// </summary>
public bool Enabled { get; init; } = true; public bool Enabled { get; init; } = true;
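The coalescing behaviour described above (identical in-flight reads share one backend round-trip, bounded by a party cap) can be sketched as a small lock-guarded map. This is an illustration under assumptions, not the project's InFlightByKeyMap: `ReadKey`, `CoalescingSketch`, and the use of plain `int` party ids are hypothetical simplifications.

```csharp
using System.Collections.Generic;

// Hypothetical key: the same four dimensions the doc comment lists.
public readonly record struct ReadKey(byte UnitId, byte Fc, ushort Start, ushort Qty);

public sealed class CoalescingSketch
{
    private readonly object _gate = new();
    private readonly Dictionary<ReadKey, List<int>> _attachable = new();

    // Returns true when the caller created the entry and must issue the backend read.
    public bool AttachOrCreate(ReadKey key, int partyId, int maxParties, out List<int> parties)
    {
        lock (_gate)
        {
            if (_attachable.TryGetValue(key, out var existing) && existing.Count < maxParties)
            {
                existing.Add(partyId);      // late arrival joins the in-flight read
                parties = existing;
                return false;
            }

            // No entry, or the entry is at its cap: open a fresh one.
            parties = new List<int> { partyId };
            _attachable[key] = parties;
            return true;
        }
    }

    // Called when the backend response arrives; stops further attachment to this entry.
    public void Complete(ReadKey key, List<int> parties)
    {
        lock (_gate)
        {
            if (_attachable.TryGetValue(key, out var current) && ReferenceEquals(current, parties))
                _attachable.Remove(key);
        }
    }
}
```

A single lock keeps the attach-or-create decision atomic, which is the property the feature depends on: two callers can never both conclude they are the first arrival for the same key.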
+14 -10
@@ -1,20 +1,24 @@
using Mbproxy; using Mbproxy;
using Mbproxy.Proxy; using Mbproxy.Proxy;
using Microsoft.Extensions.Hosting.Systemd;
using Microsoft.Extensions.Hosting.WindowsServices; using Microsoft.Extensions.Hosting.WindowsServices;
var builder = Host.CreateApplicationBuilder(args); var builder = Host.CreateApplicationBuilder(args);
// Windows Service support; no-op when running under dotnet run / console. // Init-system integration. Both helpers self-detect their host and are no-ops
// otherwise, so calling both unconditionally is correct on every platform:
// - AddWindowsService(): active only when launched by the Windows SCM.
// - AddSystemd(): active only when launched by systemd (wires sd_notify
// readiness; SIGTERM shutdown is handled by the host).
builder.Services.AddWindowsService(); builder.Services.AddWindowsService();
builder.Services.AddSystemd();
// Phase 08: wire EventLogBridge only when actually running as a Windows Service. // Wire up structured config, Serilog, and typed options. AddMbproxySerilog selects
bool isWindowsService = WindowsServiceHelpers.IsWindowsService(); // the platform diagnostic sink (Windows Event Log / syslog / none) internally.
builder.AddMbproxySerilog();
// Wire up structured config, Serilog, and typed options.
builder.AddMbproxySerilog(addEventLogBridge: isWindowsService);
builder.AddMbproxyOptions(); builder.AddMbproxyOptions();
// PDU pipeline: BcdPduPipeline is stateless (Phase 9: per-call correlation flows through // PDU pipeline: BcdPduPipeline is stateless (per-call correlation flows through
// PerPlcContext.CurrentRequest set by the multiplexer); registering as singleton is fine // PerPlcContext.CurrentRequest set by the multiplexer); registering as singleton is fine
// and avoids repeated construction. // and avoids repeated construction.
builder.Services.AddSingleton<IPduPipeline, BcdPduPipeline>(); builder.Services.AddSingleton<IPduPipeline, BcdPduPipeline>();
@@ -25,9 +29,9 @@ builder.Services.AddSingleton<IPduPipeline, BcdPduPipeline>();
builder.Services.AddSingleton<ProxyWorker>(); builder.Services.AddSingleton<ProxyWorker>();
builder.Services.AddHostedService(sp => sp.GetRequiredService<ProxyWorker>()); builder.Services.AddHostedService(sp => sp.GetRequiredService<ProxyWorker>());
// Phase 07: admin endpoint (Kestrel read-only status page). // Admin endpoint (Kestrel read-only status page). Not registered as IHostedService —
// Phase 12 (W1.5): no longer registered as IHostedService; ProxyWorker drives its // ProxyWorker drives its lifecycle so admin starts after listeners and stops AFTER the
// lifecycle so admin starts after listeners and stops AFTER the in-flight drain. // in-flight drain.
builder.AddMbproxyAdmin(); builder.AddMbproxyAdmin();
await builder.Build().RunAsync(); await builder.Build().RunAsync();
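Program.cs now delegates the choice of diagnostic sink to AddMbproxySerilog, whose body is not shown in this diff. The commit message describes the rule as: Windows Event Log under the SCM, local syslog under systemd, none for interactive runs. A minimal sketch of that decision as a pure function follows; the enum and class names are hypothetical, and the two inputs are assumed to come from `WindowsServiceHelpers.IsWindowsService()` and `SystemdHelpers.IsSystemdService()`.

```csharp
// Hypothetical names; the real DiagnosticSinkSelector is not shown in this diff.
public enum DiagnosticSinkKind { None, WindowsEventLog, Syslog }

public static class SinkSelectionSketch
{
    public static DiagnosticSinkKind Select(bool isWindowsService, bool isSystemdService)
    {
        if (isWindowsService) return DiagnosticSinkKind.WindowsEventLog; // launched by the SCM
        if (isSystemdService) return DiagnosticSinkKind.Syslog;          // launched by systemd
        return DiagnosticSinkKind.None;                                  // interactive run
    }
}
```

Keeping the decision free of host calls makes it testable on any OS, which is in the same spirit as the commit's note about extracting helpers so they are testable cross-OS.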
+35 -19
@@ -4,9 +4,9 @@ namespace Mbproxy.Proxy;
/// <summary> /// <summary>
/// BCD-rewriting PDU pipeline. Registered as the singleton <see cref="IPduPipeline"/> /// BCD-rewriting PDU pipeline. Registered as the singleton <see cref="IPduPipeline"/>
/// in production (replaces <see cref="NoopPduPipeline"/> from Phase 03). /// in production.
/// ///
/// FC scope (per design.md): /// FC scope (per docs/Features/BcdRewriting.md):
/// FC03 / FC04 response — decode covered BCD slots from raw nibbles → binary integer. /// FC03 / FC04 response — decode covered BCD slots from raw nibbles → binary integer.
/// FC06 request — encode binary integer → BCD nibbles. /// FC06 request — encode binary integer → BCD nibbles.
/// FC16 request — per-register over the configured slots. /// FC16 request — per-register over the configured slots.
@@ -15,13 +15,13 @@ namespace Mbproxy.Proxy;
/// MBAP transparency contract: the MBAP length field is NEVER modified. Re-encoded slots /// MBAP transparency contract: the MBAP length field is NEVER modified. Re-encoded slots
/// are the same byte width as the originals (ushort → ushort), so the PDU length is stable. /// are the same byte width as the originals (ushort → ushort), so the PDU length is stable.
/// ///
/// <para><b>Phase 9 — request correlation:</b> FC03/FC04 responses do not carry the /// <para><b>Request correlation:</b> FC03/FC04 responses do not carry the original
/// original start address. The multiplexer builds an <see cref="Multiplexing.InFlightRequest"/> /// start address. The multiplexer builds an <see cref="Multiplexing.InFlightRequest"/>
/// on the request path, stores it in its <see cref="Multiplexing.CorrelationMap"/>, and /// on the request path, stores it in its <see cref="Multiplexing.CorrelationMap"/>, and
/// attaches it to the per-call <see cref="PerPlcContext.CurrentRequest"/> on the response /// attaches it to the per-call <see cref="PerPlcContext.CurrentRequest"/> on the
/// path. The rewriter consumes <c>CurrentRequest</c> instead of a per-pair last-request /// response path. The rewriter consumes <c>CurrentRequest</c>, so concurrent responses
/// slot, so concurrent responses from different upstream clients each decode against /// from different upstream clients each decode against their own request range without
/// their own request range without cross-talk.</para> /// cross-talk.</para>
/// ///
/// <para>This class is stateless. All per-call state arrives via <see cref="PduContext"/> /// <para>This class is stateless. All per-call state arrives via <see cref="PduContext"/>
/// (specifically <see cref="PerPlcContext.CurrentRequest"/> on response). It is safe to /// (specifically <see cref="PerPlcContext.CurrentRequest"/> on response). It is safe to
@@ -156,7 +156,15 @@ internal sealed class BcdPduPipeline : IPduPipeline
ushort startAddress = (ushort)((pdu[1] << 8) | pdu[2]); ushort startAddress = (ushort)((pdu[1] << 8) | pdu[2]);
ushort qty = (ushort)((pdu[3] << 8) | pdu[4]); ushort qty = (ushort)((pdu[3] << 8) | pdu[4]);
// byte byteCount = pdu[5]; (qty * 2, not used directly)
// Validate the request is fully sized for `qty` registers (each 2 bytes after
// the byteCount byte). A client claiming qty=10 with only 4 bytes of register
// data would otherwise have its BCD slots silently skipped by the per-slot
// bounds check below — half the request rewritten, half not. Returning here
// passes the malformed PDU through unchanged so the PLC's own validator
// surfaces the protocol error.
if (pdu.Length < 6 + qty * 2)
return;
if (!ctx.TagMap.TryGetForRange(startAddress, qty, out var hits)) if (!ctx.TagMap.TryGetForRange(startAddress, qty, out var hits))
return; // no BCD tags in this range return; // no BCD tags in this range
@@ -202,6 +210,22 @@ internal sealed class BcdPduPipeline : IPduPipeline
ushort clientLow = (ushort)((pdu[lowByteOff] << 8) | pdu[lowByteOff + 1]); ushort clientLow = (ushort)((pdu[lowByteOff] << 8) | pdu[lowByteOff + 1]);
ushort clientHigh = (ushort)((pdu[highByteOff] << 8) | pdu[highByteOff + 1]); ushort clientHigh = (ushort)((pdu[highByteOff] << 8) | pdu[highByteOff + 1]);
// Validate that BOTH input words are within the base-10000-digit range
// BEFORE reconstructing. Without this guard, an out-of-range word is
// silently renormalised: a client writing (high=1, low=10_000) would be
// rewritten as (high=2, low=0), because `1 * 10_000 + 10_000 = 20_000` is
// still <= the 32-bit BCD ceiling, so Encode32 accepts it and carries the
// excess from the low word into the high word. (An in-range pair such as
// (high=9999, low=9999) gives 99_999_999 and round-trips cleanly.) The
// unconventional wire format ("two base-10000 CDAB digits", per
// docs/Features/BcdRewriting.md) means each word independently must be
// 0..9999 to round-trip cleanly.
if (clientLow > 9999 || clientHigh > 9999)
{
RewriterLogEvents.InvalidBcd(ctx.Logger, ctx.PlcName, tag.Address,
clientLow, "Write");
ctx.Counters.IncrementInvalidBcd();
continue;
}
// Reconstruct the 32-bit binary value (CDAB: low-word = low digits). // Reconstruct the 32-bit binary value (CDAB: low-word = low digits).
int binaryValue = clientHigh * 10_000 + clientLow; int binaryValue = clientHigh * 10_000 + clientLow;
@@ -361,8 +385,8 @@ internal sealed class BcdPduPipeline : IPduPipeline
catch (FormatException) catch (FormatException)
{ {
// Emit invalid_bcd for the low register (first bad word we'd encounter). // Emit invalid_bcd for the low register (first bad word we'd encounter).
ushort badRaw = HasBadNibble(rawLow) ? rawLow : rawHigh; ushort badRaw = BcdCodec.HasBadNibble(rawLow) ? rawLow : rawHigh;
ushort badAddr = HasBadNibble(rawLow) ? tag.Address : tag.HighRegister; ushort badAddr = BcdCodec.HasBadNibble(rawLow) ? tag.Address : tag.HighRegister;
RewriterLogEvents.InvalidBcd(ctx.Logger, ctx.PlcName, badAddr, badRaw, "Read"); RewriterLogEvents.InvalidBcd(ctx.Logger, ctx.PlcName, badAddr, badRaw, "Read");
ctx.Counters.IncrementInvalidBcd(); ctx.Counters.IncrementInvalidBcd();
continue; continue;
@@ -449,12 +473,4 @@ internal sealed class BcdPduPipeline : IPduPipeline
// already counted this slot on the way out. Incrementing again would double-count. // already counted this slot on the way out. Incrementing again would double-count.
} }
// ── Helpers ──────────────────────────────────────────────────────────────
/// <summary>Returns true if any nibble of <paramref name="raw"/> is >= 0xA.</summary>
private static bool HasBadNibble(ushort raw)
=> ((raw >> 12) & 0xF) >= 0xA
|| ((raw >> 8) & 0xF) >= 0xA
|| ((raw >> 4) & 0xF) >= 0xA
|| (raw & 0xF) >= 0xA;
} }
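The 32-bit write path above treats the two client words as base-10000 digits, rejects any word above 9999, and only then reconstructs the value. The arithmetic can be checked with a small self-contained sketch. `Bcd32Sketch` and its members are hypothetical and stand in for the project's BcdCodec, whose source is not part of this diff; only the nibble test is copied from the removed `HasBadNibble` helper.

```csharp
// Hypothetical stand-in for BcdCodec; illustration of the arithmetic only.
public static class Bcd32Sketch
{
    // Combine two base-10000 words; refuse any word outside 0..9999.
    public static bool TryCombine(ushort high, ushort low, out int value)
    {
        value = 0;
        if (high > 9999 || low > 9999) return false;
        value = high * 10_000 + low;
        return true;
    }

    // Split a value back into (high, low) base-10000 words.
    public static (ushort High, ushort Low) Split(int value)
        => ((ushort)(value / 10_000), (ushort)(value % 10_000));

    // Encode one 0..9999 word as four BCD nibbles, e.g. 1234 -> 0x1234.
    public static ushort ToBcdWord(ushort digits)
    {
        int d = digits;
        return (ushort)(((d / 1000) << 12) | ((d / 100 % 10) << 8) | ((d / 10 % 10) << 4) | (d % 10));
    }

    // True if any nibble is >= 0xA, i.e. the word is not valid BCD.
    public static bool HasBadNibble(ushort raw)
        => ((raw >> 12) & 0xF) >= 0xA
        || ((raw >> 8) & 0xF) >= 0xA
        || ((raw >> 4) & 0xF) >= 0xA
        || (raw & 0xF) >= 0xA;
}
```

Without the range guard, (high=1, low=10000) would combine to 20000 and split back as (2, 0), which is the silent mutation the guard exists to prevent.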
+4 -4
@@ -4,10 +4,10 @@ namespace Mbproxy.Proxy.Cache;
/// <summary> /// <summary>
/// Hash key for the per-PLC <see cref="ResponseCache"/>. Structurally identical to /// Hash key for the per-PLC <see cref="ResponseCache"/>. Structurally identical to
/// Phase 10's <see cref="CoalescingKey"/> — both keys discriminate the same dimensions /// the read-coalescing <see cref="CoalescingKey"/> — both keys discriminate the same
/// (UnitId, FunctionCode, StartAddress, Quantity), but the two type aliases live in /// dimensions (UnitId, FunctionCode, StartAddress, Quantity), but the two type aliases
/// different namespaces so the two phases can evolve independently without one shaping /// live in different namespaces so the cache and the coalescer can evolve independently
/// the other's API surface. /// without one shaping the other's API surface.
/// ///
/// <para><b>Equality semantics:</b> record-struct value equality. FC03 and FC04 produce /// <para><b>Equality semantics:</b> record-struct value equality. FC03 and FC04 produce
/// different keys for the same address (different Modbus tables); different /// different keys for the same address (different Modbus tables); different
@@ -1,8 +1,8 @@
namespace Mbproxy.Proxy.Cache; namespace Mbproxy.Proxy.Cache;
/// <summary> /// <summary>
/// Source-generated <see cref="LoggerMessage"/> definitions for the Phase-11 response /// Source-generated <see cref="LoggerMessage"/> definitions for the response cache.
/// cache. Event names are stable — do not rename without updating <c>docs/design.md</c>'s /// Event names are stable — do not rename without updating <c>docs/Reference/LogEvents.md</c>'s
/// Logging event-name table. /// Logging event-name table.
/// ///
/// <para>Levels are conservative — a busy PLC under steady cache pressure would emit one /// <para>Levels are conservative — a busy PLC under steady cache pressure would emit one
@@ -1,7 +1,7 @@
namespace Mbproxy.Proxy.Cache; namespace Mbproxy.Proxy.Cache;
/// <summary> /// <summary>
/// Per-PLC opt-in response cache for FC03 / FC04 read responses. Phase 11. /// Per-PLC opt-in response cache for FC03 / FC04 read responses.
/// ///
/// <para><b>Lifecycle.</b> One instance per PLC, owned by the per-PLC context. The cache /// <para><b>Lifecycle.</b> One instance per PLC, owned by the per-PLC context. The cache
/// is consulted on every FC03/FC04 request before coalescing; populated by the backend /// is consulted on every FC03/FC04 request before coalescing; populated by the backend
+7 -9
@@ -13,17 +13,15 @@ public enum MbapDirection
} }
/// <summary> /// <summary>
/// Per-pair context carried through each PDU pipeline call. /// Per-pair context carried through each PDU pipeline call. Carries only
/// Phase 03: carries only <see cref="PlcName"/>. /// <see cref="PlcName"/> at the base level; <see cref="PerPlcContext"/> extends it with
/// Phase 04 extends this via <see cref="PerPlcContext"/>, which carries the BcdTagMap, /// the BcdTagMap, counters, logger, and per-call <c>CurrentRequest</c> slot for
/// counters, and logger. Phase 09 added the per-call <c>CurrentRequest</c> slot to /// multiplexer-aware response correlation.
/// <see cref="PerPlcContext"/> for multiplexer-aware response correlation.
/// </summary> /// </summary>
public class PduContext public class PduContext
{ {
/// <summary>The configured PLC name (from <c>MbproxyOptions.Plcs[i].Name</c>).</summary> /// <summary>The configured PLC name (from <c>MbproxyOptions.Plcs[i].Name</c>).</summary>
public string PlcName { get; init; } = ""; public string PlcName { get; init; } = "";
// Phase 04 adds: BcdTagMap, counters, logger
} }
/// <summary> /// <summary>
@@ -31,8 +29,8 @@ public class PduContext
/// Called once per frame in each direction (request and response). /// Called once per frame in each direction (request and response).
/// ///
/// Implementations must be safe to call concurrently from multiple connection pairs. /// Implementations must be safe to call concurrently from multiple connection pairs.
/// In Phase 03 the only implementation is <see cref="NoopPduPipeline"/> (pass-through). /// Production wires <see cref="BcdPduPipeline"/>; <see cref="NoopPduPipeline"/> is a
/// Phase 04 replaces it with a BCD rewriter registered via DI. /// pass-through fallback used by tests.
/// </summary> /// </summary>
public interface IPduPipeline public interface IPduPipeline
{ {
@@ -42,6 +40,6 @@ public interface IPduPipeline
/// <param name="direction">Whether this is a request or a response frame.</param> /// <param name="direction">Whether this is a request or a response frame.</param>
/// <param name="mbapHeader">The 7-byte MBAP header (read-only; includes TxId, UnitId, FC is in pdu[0]).</param> /// <param name="mbapHeader">The 7-byte MBAP header (read-only; includes TxId, UnitId, FC is in pdu[0]).</param>
/// <param name="pdu">The PDU bytes starting at the function code. May be mutated in place.</param> /// <param name="pdu">The PDU bytes starting at the function code. May be mutated in place.</param>
/// <param name="context">Per-pair context (PLC name; extended in phase 04).</param> /// <param name="context">Per-pair context (PLC name; extended via <see cref="PerPlcContext"/>).</param>
void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context); void Process(MbapDirection direction, ReadOnlySpan<byte> mbapHeader, Span<byte> pdu, PduContext context);
} }
@@ -1,14 +1,14 @@
namespace Mbproxy.Proxy.Multiplexing; namespace Mbproxy.Proxy.Multiplexing;
/// <summary> /// <summary>
/// Source-generated <see cref="LoggerMessage"/> definitions for the Phase-10 read-coalescing /// Source-generated <see cref="LoggerMessage"/> definitions for the read-coalescing
/// feature. Event names are stable — do not rename without updating docs/design.md's /// feature. Event names are stable — do not rename without updating docs/Reference/LogEvents.md's
/// "Logging" event-name table. /// "Logging" event-name table.
/// ///
/// <para>Levels are intentionally conservative — coalescing fires on every overlapping /// <para>Levels are intentionally conservative — coalescing fires on every overlapping
/// read in a busy fleet (HMIs/historians polling the same screen tags), so the steady-state /// read in a busy fleet (HMIs/historians polling the same screen tags), so the
/// log volume would be deafening at Information. The counters surface the same data at /// steady-state log volume would be deafening at Information. The counters surface the
/// far lower cost.</para> /// same data at far lower cost.</para>
/// </summary> /// </summary>
internal static partial class CoalescingLogEvents internal static partial class CoalescingLogEvents
{ {
@@ -8,9 +8,9 @@ namespace Mbproxy.Proxy.Multiplexing;
/// when the matching response arrives. /// when the matching response arrives.
/// ///
/// <para>Backed by <see cref="ConcurrentDictionary{TKey, TValue}"/>. The single-writer / /// <para>Backed by <see cref="ConcurrentDictionary{TKey, TValue}"/>. The single-writer /
/// single-remover pattern in Phase 9 does not strictly require it — but cascade-on- /// single-remover pattern does not strictly require it — but cascade-on-disconnect walks
/// disconnect walks the map from a separate task and Phase 10 adds upstream-side /// the map from a separate task and the coalescing path adds upstream-side cancellation
/// cancellation paths, so the safer primitive is worth the negligible cost.</para> /// paths, so the safer primitive is worth the negligible cost.</para>
/// </summary> /// </summary>
internal sealed class CorrelationMap internal sealed class CorrelationMap
{ {
@@ -1,16 +1,16 @@
namespace Mbproxy.Proxy.Multiplexing; namespace Mbproxy.Proxy.Multiplexing;
/// <summary> /// <summary>
/// Per-PLC "in-flight by key" map that powers <b>Phase 10 read coalescing</b>. Holds the /// Per-PLC "in-flight by key" map that powers read coalescing. Holds the currently-
/// currently-in-flight FC03/FC04 requests keyed by their <see cref="CoalescingKey"/> so a /// in-flight FC03/FC04 requests keyed by their <see cref="CoalescingKey"/> so a
/// late-arriving request with an identical key can attach to the existing in-flight entry /// late-arriving request with an identical key can attach to the existing in-flight entry
/// instead of opening a second backend round-trip. /// instead of opening a second backend round-trip.
/// ///
/// <para><b>Concurrency model.</b> A single <see cref="object"/> lock serialises every /// <para><b>Concurrency model.</b> A single <see cref="object"/> lock serialises every
/// state-touching method. The simpler-lock-over-CAS choice is deliberate (per the phase /// state-touching method. The simpler-lock-over-CAS choice is deliberate: the map is
/// doc) — the map is per-PLC and the wire rate per PLC is bounded by the ECOM's internal /// per-PLC and the wire rate per PLC is bounded by the ECOM's internal scan cadence
/// scan cadence (~210 ms per request). The lock-free <c>AddOrUpdate</c> alternative is not /// (~210 ms per request). The lock-free <c>AddOrUpdate</c> alternative is not worth the
/// worth the read-and-prove-it-correct burden.</para> /// read-and-prove-it-correct burden.</para>
/// ///
/// <para><b>Mutable-list seam.</b> Each entry stores a <see cref="List{InterestedParty}"/> /// <para><b>Mutable-list seam.</b> Each entry stores a <see cref="List{InterestedParty}"/>
/// that is also exposed through the parent <see cref="InFlightRequest.InterestedParties"/> /// that is also exposed through the parent <see cref="InFlightRequest.InterestedParties"/>
@@ -55,11 +55,8 @@ internal sealed class InFlightByKeyMap
/// already has <paramref name="maxParties"/> attached parties, the next arrival opens /// already has <paramref name="maxParties"/> attached parties, the next arrival opens
/// a fresh entry (and a fresh backend round-trip). This bounds the response-fanout /// a fresh entry (and a fresh backend round-trip). This bounds the response-fanout
/// cost per entry at O(maxParties).</para> /// cost per entry at O(maxParties).</para>
///
/// <para>Returns <c>true</c> always (the bool return matches the phase doc's signature;
/// future evolution could introduce a refusal path).</para>
/// </summary> /// </summary>
public bool TryAttachOrCreate( public void AttachOrCreate(
CoalescingKey key, CoalescingKey key,
InterestedParty party, InterestedParty party,
Func<InFlightRequest> factory, Func<InFlightRequest> factory,
@@ -76,13 +73,12 @@ internal sealed class InFlightByKeyMap
existingList.Add(party); existingList.Add(party);
req = existing; req = existing;
wasNew = false; wasNew = false;
return true; return;
} }
req = factory(); req = factory();
_entries[key] = req; _entries[key] = req;
wasNew = true; wasNew = true;
return true;
} }
} }
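The attach-or-create behaviour in the hunk above can be sketched in isolation. This is a minimal illustration under assumptions, not the repository's code: `InFlightMapSketch`, its generic parameters, and the `bool` return (standing in for the `wasNew` out-parameter) are hypothetical names chosen for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for a per-PLC in-flight-by-key map.
public sealed class InFlightMapSketch<TKey, TParty> where TKey : notnull
{
    // One plain lock serialises every state-touching method.
    private readonly object _gate = new();
    private readonly Dictionary<TKey, List<TParty>> _entries = new();

    // Attaches to an existing entry while it has room; otherwise opens a
    // fresh entry. Returns true when a new entry was created, which in the
    // real multiplexer would correspond to a new backend round-trip.
    public bool AttachOrCreate(TKey key, TParty party, int maxParties)
    {
        lock (_gate)
        {
            if (_entries.TryGetValue(key, out var parties) && parties.Count < maxParties)
            {
                parties.Add(party);
                return false;
            }

            // Missing or full: replace with a fresh single-party entry.
            _entries[key] = new List<TParty> { party };
            return true;
        }
    }

    // Number of parties attached to the current entry for the key.
    public int PartyCount(TKey key)
    {
        lock (_gate)
            return _entries.TryGetValue(key, out var parties) ? parties.Count : 0;
    }
}
```

Capping the party list bounds the fan-out cost per entry at O(maxParties), which is the property the doc comment above describes.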
@@ -6,10 +6,9 @@ namespace Mbproxy.Proxy.Multiplexing;
/// multiplexer must rewrite the response's MBAP TxId back to <see cref="OriginalTxId"/> /// multiplexer must rewrite the response's MBAP TxId back to <see cref="OriginalTxId"/>
/// before handing the frame to the pipe, so each upstream sees the proxy as transparent. /// before handing the frame to the pipe, so each upstream sees the proxy as transparent.
/// ///
/// <para><b>Phase 9 invariant:</b> exactly one <see cref="InterestedParty"/> per /// <para>Read coalescing fans out a single backend response to multiple upstream parties
/// <see cref="InFlightRequest"/>. <b>Phase 10 (read coalescing)</b> reuses this exact /// via this record. Do not collapse this into a single field on
/// shape to fan-out a single backend response to multiple upstream parties. Do not /// <see cref="InFlightRequest"/>.</para>
/// collapse this into a single field on <see cref="InFlightRequest"/>.</para>
/// </summary> /// </summary>
internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId); internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);
@@ -22,16 +21,19 @@ internal sealed record InterestedParty(UpstreamPipe Pipe, ushort OriginalTxId);
/// <item><description>Provide the BCD rewriter with the originating request's /// <item><description>Provide the BCD rewriter with the originating request's
/// <c>StartAddress</c> / <c>Qty</c> for FC03/FC04 response decoding — the response /// <c>StartAddress</c> / <c>Qty</c> for FC03/FC04 response decoding — the response
/// PDU itself does not carry the start address.</description></item> /// PDU itself does not carry the start address.</description></item>
/// <item><description>Measure backend round-trip time via <see cref="SentAtUtc"/> /// <item><description>Measure backend round-trip time via <see cref="SentAtUtc"/>.</description></item>
/// (replaces the per-pair stopwatch slot from the 1:1 model).</description></item>
/// </list> /// </list>
/// ///
/// <para><b>Phase 9:</b> <see cref="InterestedParties"/> always has exactly one element. /// <para>The <see cref="InterestedParties"/> list shape is the load-bearing seam that
/// The list shape is the load-bearing seam that <b>Phase 10 — read coalescing</b> hooks /// read coalescing uses to fan out a single PLC response to multiple upstream clients.
/// into to fan out a single PLC response to multiple upstream clients without further /// Reviewer note: do <i>not</i> simplify back to a single <c>UpstreamPipe</c> field.</para>
/// refactor of the multiplexer's data model. Reviewer note: do <i>not</i> simplify back
/// to a single <c>UpstreamPipe</c> field.</para>
/// </summary> /// </summary>
/// <param name="IsHeartbeat">
/// <c>true</c> for the synthetic FC03 keepalive probe issued by the backend heartbeat
/// loop. Heartbeat entries carry no <see cref="InterestedParties"/>: the backend reader
/// drops the response (no fan-out, no rewriter, no cache) and the timeout watchdog tears
/// the backend down instead of dispatching a 0x0B exception. Defaults to <c>false</c>.
/// </param>
internal sealed record InFlightRequest( internal sealed record InFlightRequest(
byte UnitId, byte UnitId,
byte Fc, byte Fc,
@@ -39,4 +41,5 @@ internal sealed record InFlightRequest(
ushort Qty, ushort Qty,
IReadOnlyList<InterestedParty> InterestedParties, IReadOnlyList<InterestedParty> InterestedParties,
DateTimeOffset SentAtUtc, DateTimeOffset SentAtUtc,
int ResolvedCacheTtlMs = 0); int ResolvedCacheTtlMs = 0,
bool IsHeartbeat = false);
@@ -0,0 +1,54 @@
namespace Mbproxy.Proxy.Multiplexing;
/// <summary>
/// Source-generated <see cref="LoggerMessage"/> definitions for the backend keepalive
/// heartbeat. Event names are stable — do not rename without updating
/// docs/Reference/LogEvents.md's event-name table.
/// </summary>
internal static partial class KeepaliveLogEvents
{
/// <summary>
/// Emitted each time the heartbeat loop issues a synthetic FC03 probe on an idle
/// backend socket. Debug level — one per <c>BackendHeartbeatIdleMs</c> per idle PLC.
/// </summary>
[LoggerMessage(
EventId = 150,
EventName = "mbproxy.keepalive.heartbeat.sent",
Level = LogLevel.Debug,
Message = "Keepalive heartbeat sent: Plc={Plc} ProxyTxId={ProxyTxId} Address={Address}")]
public static partial void HeartbeatSent(
ILogger logger,
string plc,
ushort proxyTxId,
ushort address);
/// <summary>
/// Emitted when a keepalive heartbeat probe is not answered within
/// <c>BackendRequestTimeoutMs</c>. The backend is connected-but-not-answering; the
/// multiplexer tears it down (see <see cref="BackendIdleDisconnect"/>).
/// </summary>
[LoggerMessage(
EventId = 151,
EventName = "mbproxy.keepalive.heartbeat.timeout",
Level = LogLevel.Warning,
Message = "Keepalive heartbeat timed out: Plc={Plc} ProxyTxId={ProxyTxId} ElapsedMs={ElapsedMs}")]
public static partial void HeartbeatTimeout(
ILogger logger,
string plc,
ushort proxyTxId,
long elapsedMs);
/// <summary>
/// Emitted when a failed keepalive heartbeat triggers a proactive backend teardown.
/// Every attached upstream pipe is cascaded; clients reconnect on their next request.
/// </summary>
[LoggerMessage(
EventId = 152,
EventName = "mbproxy.keepalive.backend.idle_disconnect",
Level = LogLevel.Information,
Message = "Backend torn down by keepalive: Plc={Plc} HeartbeatElapsedMs={ElapsedMs}")]
public static partial void BackendIdleDisconnect(
ILogger logger,
string plc,
long elapsedMs);
}
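The idle decision that gates a heartbeat probe can be sketched as follows. This is an illustrative assumption about the shape of the check, not the multiplexer's actual loop: `IdleTrackerSketch` and its method names are hypothetical, and the clock is passed in so the logic can be exercised deterministically.

```csharp
using System;
using System.Threading;

// Hypothetical helper: tracks the last backend activity and reports idleness.
public sealed class IdleTrackerSketch
{
    // UTC ticks of the last send or receive; accessed across tasks.
    private long _lastActivityTicks;

    // Called by writer and reader paths whenever the socket carries traffic.
    public void MarkActivity(DateTime utcNow) =>
        Interlocked.Exchange(ref _lastActivityTicks, utcNow.Ticks);

    // True when no activity has been recorded for at least idleThreshold.
    public bool IsIdle(DateTime utcNow, TimeSpan idleThreshold) =>
        utcNow.Ticks - Interlocked.Read(ref _lastActivityTicks) >= idleThreshold.Ticks;
}
```

A heartbeat loop would call `IsIdle` on each tick and issue a probe only when it returns true, so real client traffic suppresses the synthetic request.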
@@ -3,14 +3,12 @@ namespace Mbproxy.Proxy.Multiplexing;
/// <summary> /// <summary>
/// Source-generated <see cref="LoggerMessage"/> definitions for the TxId-multiplexing /// Source-generated <see cref="LoggerMessage"/> definitions for the TxId-multiplexing
/// connection layer. Event names are stable — do not rename without updating /// connection layer. Event names are stable — do not rename without updating
/// docs/design.md's "Logging" event-name table. /// docs/Reference/LogEvents.md's "Logging" event-name table.
/// </summary> /// </summary>
internal static partial class MultiplexerLogEvents internal static partial class MultiplexerLogEvents
{ {
/// <summary> /// <summary>
/// Emitted once per upstream client accept. Replaces the per-pair /// Emitted once per upstream client accept.
/// <c>mbproxy.client.connected</c> event from the 1:1 model (same event name,
/// same property shape — operators' log queries are unchanged).
/// </summary> /// </summary>
[LoggerMessage( [LoggerMessage(
EventId = 110, EventId = 110,
@@ -84,9 +82,7 @@ internal static partial class MultiplexerLogEvents
string remoteEp); string remoteEp);
/// <summary> /// <summary>
/// Emitted when the backend connect Polly pipeline fails. Mirrors the existing /// Emitted when the backend connect Polly pipeline fails.
/// <c>mbproxy.backend.failed</c> event from the 1:1 model so operators' alerts keep
/// working unchanged after Phase 9.
/// </summary> /// </summary>
[LoggerMessage( [LoggerMessage(
EventId = 115, EventId = 115,
@@ -48,19 +48,25 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
private readonly ConnectionOptions _connectionOptions; private readonly ConnectionOptions _connectionOptions;
private readonly IPduPipeline _pipeline; private readonly IPduPipeline _pipeline;
// Phase 12 (W1.1) — `_ctx` is volatile so a hot-reload reseat can swap it on the running // `_ctx` is volatile so a hot-reload reseat can swap it on the running
// multiplexer. Each method that uses the context snapshots it into a local at the start // multiplexer. Each method that uses the context snapshots it into a local at the start
// of the operation so a single PDU sees a consistent (TagMap, Cache) pair even if the // of the operation so a single PDU sees a consistent (TagMap, Cache) pair even if the
// swap fires mid-PDU. ReplaceContext is the single mutator. // swap fires mid-PDU. ReplaceContext is the single mutator.
private volatile PerPlcContext _ctx; private volatile PerPlcContext _ctx;
private readonly ILogger<PlcMultiplexer> _logger; private readonly ILogger<PlcMultiplexer> _logger;
private readonly ResiliencePipeline? _backendConnectPipeline; private readonly ResiliencePipeline? _backendConnectPipeline;
// Phase 10: live read-coalescing config accessor. The accessor is read per-PDU on the // Live read-coalescing config accessor. The accessor is read per-PDU on the
// request path so a hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled` // request path so a hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled`
// propagates immediately. Production wires this to // propagates immediately. Production wires this to
// `() => optionsMonitor.CurrentValue.Resilience.ReadCoalescing`. Tests default to a // `() => optionsMonitor.CurrentValue.Resilience.ReadCoalescing`. Tests default to a
// fresh `ReadCoalescingOptions()` (Enabled = true, MaxParties = 32). // fresh `ReadCoalescingOptions()` (Enabled = true, MaxParties = 32).
private readonly Func<ReadCoalescingOptions> _coalescingOptions; private readonly Func<ReadCoalescingOptions> _coalescingOptions;
// Live keepalive config accessor. Read at backend-connect time (TCP SO_KEEPALIVE) and
// on each heartbeat-loop tick (idle threshold + probe address) so a hot-reload of
// `Connection.Keepalive` propagates without a listener restart. Production wires this
// to `() => optionsMonitor.CurrentValue.Connection.Keepalive`; the fallback reads the
// construction-time `ConnectionOptions` snapshot.
private readonly Func<KeepaliveOptions> _keepaliveOptions;
private readonly TxIdAllocator _allocator = new(); private readonly TxIdAllocator _allocator = new();
private readonly CorrelationMap _correlation = new(); private readonly CorrelationMap _correlation = new();
@@ -74,8 +80,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
SingleWriter = false, SingleWriter = false,
}); });
// Attached pipes — Phase 9 needs the list for the status page; Phase 10 will need it for // Attached pipes — used by the status page and by coalescing fan-out.
// coalescing (fan-out). ConcurrentDictionary keyed on UpstreamPipe.Id for O(1) detach. // ConcurrentDictionary keyed on UpstreamPipe.Id for O(1) detach.
private readonly ConcurrentDictionary<Guid, UpstreamPipe> _pipes = new(); private readonly ConcurrentDictionary<Guid, UpstreamPipe> _pipes = new();
// Lifecycle plumbing. Backend tasks share a CTS; cascading disconnect cancels it, // Lifecycle plumbing. Backend tasks share a CTS; cascading disconnect cancels it,
@@ -86,9 +92,26 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
private CancellationTokenSource? _backendCts; private CancellationTokenSource? _backendCts;
private Task? _backendWriterTask; private Task? _backendWriterTask;
private Task? _backendReaderTask; private Task? _backendReaderTask;
private Task? _backendHeartbeatTask;
// UTC ticks of the last backend socket activity (a send OR a received frame). Updated
// by the writer and reader tasks; read by the heartbeat loop to decide whether the
// socket has been idle long enough to warrant a probe. Interlocked for cross-task
// coherence.
private long _lastBackendActivityTicks;
// Unit ID of the most recent upstream request. The synthetic heartbeat reuses it so
// the probe targets the same Modbus unit the real clients successfully talk to.
// Defaults to 0 until the first upstream frame is seen; by the time a heartbeat can
// fire the backend socket exists, which means at least one upstream frame arrived.
private int _lastSeenUnitId;
private readonly CancellationTokenSource _disposeCts = new(); private readonly CancellationTokenSource _disposeCts = new();
private bool _disposed; // Volatile so the disposing thread's write is observed by every hot-path reader
// (OnUpstreamFrameAsync, ReplaceContext, Attach, etc.) without a separate fence.
// On x86/x64 the hardware already orders plain loads and stores strongly, so this
// mainly guards against JIT read-caching and weaker-ordered ARM hosts.
private volatile bool _disposed;
private Task? _watchdogTask; private Task? _watchdogTask;
public PlcMultiplexer( public PlcMultiplexer(
@@ -98,7 +121,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
PerPlcContext perPlcContext, PerPlcContext perPlcContext,
ILogger<PlcMultiplexer> logger, ILogger<PlcMultiplexer> logger,
ResiliencePipeline? backendConnectPipeline = null, ResiliencePipeline? backendConnectPipeline = null,
Func<ReadCoalescingOptions>? coalescingOptions = null) Func<ReadCoalescingOptions>? coalescingOptions = null,
Func<KeepaliveOptions>? keepaliveOptions = null)
{ {
_plc = plc; _plc = plc;
_connectionOptions = connectionOptions; _connectionOptions = connectionOptions;
@@ -107,9 +131,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_logger = logger; _logger = logger;
_backendConnectPipeline = backendConnectPipeline; _backendConnectPipeline = backendConnectPipeline;
_coalescingOptions = coalescingOptions ?? (static () => new ReadCoalescingOptions()); _coalescingOptions = coalescingOptions ?? (static () => new ReadCoalescingOptions());
_keepaliveOptions = keepaliveOptions ?? (() => _connectionOptions.Keepalive);
// Phase 11 — register the per-PLC cache as the live stats source for the snapshot // Register the per-PLC cache as the live stats source for the snapshot path.
// path. Cache may be null when the per-PLC context has not been wired with one // Cache may be null when the per-PLC context has not been wired with one
// (every tag uncached, or unit tests). // (every tag uncached, or unit tests).
if (_ctx.Cache is not null) if (_ctx.Cache is not null)
_ctx.Counters.SetCacheStatsProvider(new CacheStatsAdapter(_ctx.Cache)); _ctx.Counters.SetCacheStatsProvider(new CacheStatsAdapter(_ctx.Cache));
@@ -151,8 +176,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
/// <summary> /// <summary>
/// Phase 12 (W1.1) — atomically swaps the per-PLC context on a running multiplexer. /// Atomically swaps the per-PLC context on a running multiplexer. Called by
/// Called by <see cref="Supervision.PlcListenerSupervisor.ReplaceContextAsync"/> when a /// <see cref="Supervision.PlcListenerSupervisor.ReplaceContextAsync"/> when a
/// hot-reload tag-list change is applied to a PLC whose listener is already bound. /// hot-reload tag-list change is applied to a PLC whose listener is already bound.
/// ///
/// <para>The new context's tag map and (optional) response cache become visible on the /// <para>The new context's tag map and (optional) response cache become visible on the
@@ -170,13 +195,23 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
{ {
if (_disposed) return; if (_disposed) return;
_ctx = newContext; // Provider FIRST, then _ctx. The status page's snapshot path reads
// `_cacheStatsProvider` independently of `_ctx`. If we swapped `_ctx` first, a
// Re-register the cache stats provider on the (preserved) counters so the status // snapshot taken in the gap between the two writes would still hold the OLD
// page sees the new cache's count/bytes immediately. Pass null when the new context // adapter wrapping the OLD cache — which the supervisor is about to dispose
// opted out of caching to clear any stale provider from the previous context. // (`PlcListenerSupervisor.ReplaceContextAsync` runs `oldCache.Dispose()` after we
// return). Setting the provider first means snapshots in the swap window read
// either (old provider, old ctx) or (new provider, new ctx) — both coherent —
// never (old provider after old cache disposed).
//
// In the typical reseat case `oldContext.Counters == newContext.Counters` (the
// reconciler preserves counters across reseat), so this updates the same instance
// both paths share. The order still matters because the snapshot reads the
// provider field, not the per-context counters reference.
newContext.Counters.SetCacheStatsProvider( newContext.Counters.SetCacheStatsProvider(
newContext.Cache is not null ? new CacheStatsAdapter(newContext.Cache) : null); newContext.Cache is not null ? new CacheStatsAdapter(newContext.Cache) : null);
_ctx = newContext;
} }
/// <summary> /// <summary>
@@ -240,7 +275,11 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
_pipes.Clear(); _pipes.Clear();
_disposeCts.Dispose(); // Guard the CTS dispose against a watchdog tick that raced past the WaitAsync
// above (e.g. a slow Task.Delay completion observing cancellation late). Also
// dispose the connect-gate semaphore.
try { _disposeCts.Dispose(); } catch (ObjectDisposedException) { /* already disposed */ }
try { _connectGate.Dispose(); } catch (ObjectDisposedException) { /* already disposed */ }
} }
// ── Backend connect / teardown ──────────────────────────────────────────── // ── Backend connect / teardown ────────────────────────────────────────────
@@ -264,6 +303,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// Build a fresh backend socket and Polly-connect. // Build a fresh backend socket and Polly-connect.
var backend = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp) var backend = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp)
{ NoDelay = true }; { NoDelay = true };
SocketKeepalive.Apply(backend, _keepaliveOptions());
try try
{ {
@@ -300,8 +340,11 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
{ {
_backendSocket = backend; _backendSocket = backend;
_backendCts = cts2; _backendCts = cts2;
// Seed the idle timer so the heartbeat loop measures idleness from connect.
Interlocked.Exchange(ref _lastBackendActivityTicks, DateTime.UtcNow.Ticks);
_backendWriterTask = Task.Run(() => RunBackendWriterAsync(backend, cts2.Token), CancellationToken.None); _backendWriterTask = Task.Run(() => RunBackendWriterAsync(backend, cts2.Token), CancellationToken.None);
_backendReaderTask = Task.Run(() => RunBackendReaderAsync(backend, cts2.Token), CancellationToken.None); _backendReaderTask = Task.Run(() => RunBackendReaderAsync(backend, cts2.Token), CancellationToken.None);
_backendHeartbeatTask = Task.Run(() => RunBackendHeartbeatAsync(cts2.Token), CancellationToken.None);
} }
_ctx.Counters.IncrementConnectSuccess(); _ctx.Counters.IncrementConnectSuccess();
@@ -318,27 +361,65 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
private async Task TearDownBackendAsync(string reason, bool cascadeUpstreams) private async Task TearDownBackendAsync(string reason, bool cascadeUpstreams)
{ {
// Phase 12 (W1.4) — serialise tear-down vs connect-up via the connect gate. Without // Serialise tear-down vs connect-up via the connect gate. Without this, a fresh
// this, a fresh EnsureBackendConnectedAsync racing with the channel drain below // EnsureBackendConnectedAsync racing with the channel drain below could see
// could see stranded frames sent on its new socket with old (already-released) TxIds, // stranded frames sent on its new socket with old (already-released) TxIds,
// producing orphaned responses that hang upstream peers via the watchdog. // producing orphaned responses that hang upstream peers via the watchdog.
await _connectGate.WaitAsync().ConfigureAwait(false); //
// Bounded wait: a long Polly-wrapped EnsureBackendConnectedAsync against an
// unreachable host can hold the gate for the full BackendConnectTimeoutMs *
// MaxAttempts window, blocking DisposeAsync (and therefore ProxyWorker.StopAsync)
// for that duration. A 2 s teardown deadline bounds disposal latency; if the gate
// is unavailable we proceed best-effort without it (the worst-case consequence is
// one orphaned in-flight cycle on the dying backend, which the upstream watchdog
// will surface as exception 0x0B).
//
// KNOWN RACE on the gate-not-held path: a concurrent EnsureBackendConnectedAsync
// that DOES hold the gate may TryAllocate a TxId that collides (after wraparound
// in the allocator's forward scan) with a TxId we're about to release from the
// channel-drain step below. The double-release would mark the new request's slot
// as free even though it's legitimately in-flight, allowing the next allocation
// to reuse the same slot and CorrelationMap.TryAdd to fail (silent request drop).
// Probability is very low (requires gate timeout + new accept landing during
// cascade + TxId collision in a 65,536-slot space); the only consequence is one
// dropped request that the client retries. Accepted as best-effort behaviour.
bool gateHeld = false;
try
{
using var teardownCts = new CancellationTokenSource(TimeSpan.FromSeconds(2));
await _connectGate.WaitAsync(teardownCts.Token).ConfigureAwait(false);
gateHeld = true;
}
catch (OperationCanceledException)
{
// Best-effort: proceed without the gate. Concurrent connect attempts will
// observe _disposed (or the now-null _backendSocket) and short-circuit.
}
catch (ObjectDisposedException)
{
// _connectGate already disposed — TearDown is racing past DisposeAsync.
// Skip the body entirely; there's nothing useful to do at this point.
return;
}
try try
{ {
Socket? oldSocket; Socket? oldSocket;
CancellationTokenSource? oldCts; CancellationTokenSource? oldCts;
Task? writer, reader; Task? writer, reader, heartbeat;
lock (_backendLock) lock (_backendLock)
{ {
oldSocket = _backendSocket; oldSocket = _backendSocket;
oldCts = _backendCts; oldCts = _backendCts;
writer = _backendWriterTask; writer = _backendWriterTask;
reader = _backendReaderTask; reader = _backendReaderTask;
heartbeat = _backendHeartbeatTask;
_backendSocket = null; _backendSocket = null;
_backendCts = null; _backendCts = null;
_backendWriterTask = null; _backendWriterTask = null;
_backendReaderTask = null; _backendReaderTask = null;
_backendHeartbeatTask = null;
} }
if (oldSocket is null && oldCts is null) return; if (oldSocket is null && oldCts is null) return;
@@ -356,8 +437,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_allocator.Release(kvp.Key); _allocator.Release(kvp.Key);
} }
// Phase 10 — also drain the in-flight-by-key map so a brand-new identical request // Also drain the in-flight-by-key map so a brand-new identical request through
// through the freshly-reconnected backend is treated as a miss (no stale entries // the freshly-reconnected backend is treated as a miss (no stale entries
// outlive the backend they were destined for). // outlive the backend they were destined for).
_inFlightByKey.DrainAll(); _inFlightByKey.DrainAll();
@@ -366,7 +447,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
{ {
// Close every attached pipe that had a request in flight; the others will // Close every attached pipe that had a request in flight; the others will
// simply re-issue on next request through a fresh backend connect. // simply re-issue on next request through a fresh backend connect.
// Per the design doc, ALL attached upstreams cascade on backend disconnect. // Per docs/Architecture/ConnectionModel.md, ALL attached upstreams cascade on backend disconnect.
upstreamCount = _pipes.Count; upstreamCount = _pipes.Count;
// Snapshot keys before disposal modifies the dictionary indirectly. // Snapshot keys before disposal modifies the dictionary indirectly.
@@ -381,11 +462,11 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_ctx.Counters.AddDisconnectCascades(upstreamCount); _ctx.Counters.AddDisconnectCascades(upstreamCount);
} }
// Phase 12 (W1.4) — drain any stranded frames left in the outbound channel by // Drain any stranded frames left in the outbound channel by the writer task
// the writer task that just faulted/cancelled. Released their proxy TxIds back // that just faulted/cancelled. Release their proxy TxIds back to the
// to the allocator. By the time we reach this line the writer has stopped // allocator. By the time we reach this line the writer has stopped reading
// reading from the channel (cancelled CTS) and the upstream pipes have been // from the channel (cancelled CTS) and the upstream pipes have been cascaded
// cascaded (no more enqueues), so the channel state is stable. // (no more enqueues), so the channel state is stable.
int strandedDropped = 0; int strandedDropped = 0;
while (_outboundChannel.Reader.TryRead(out byte[]? stranded)) while (_outboundChannel.Reader.TryRead(out byte[]? stranded))
{ {
@@ -400,6 +481,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// Best-effort join. // Best-effort join.
try { if (writer is not null) await writer.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ } try { if (writer is not null) await writer.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ }
try { if (reader is not null) await reader.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ } try { if (reader is not null) await reader.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ }
try { if (heartbeat is not null) await heartbeat.WaitAsync(TimeSpan.FromSeconds(2)).ConfigureAwait(false); } catch { /* swallow */ }
oldCts?.Dispose(); oldCts?.Dispose();
@@ -408,7 +490,12 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
finally finally
{ {
_connectGate.Release(); // Only release if we acquired — best-effort path may have skipped.
if (gateHeld)
{
try { _connectGate.Release(); }
catch (ObjectDisposedException) { /* dispose race — harmless */ }
}
} }
} }
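The bounded-wait pattern in the teardown hunk above (acquire the gate with a deadline, proceed best-effort if that fails, release only when held) can be reduced to a small sketch. Note the swap: this version uses the `SemaphoreSlim.WaitAsync(TimeSpan)` overload, which returns `false` on timeout, rather than the cancellation-token form in the diff. `BoundedGateSketch` and `RunWithBoundedGateAsync` are hypothetical names.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class BoundedGateSketch
{
    // Runs body under the gate when it can be acquired within timeout;
    // otherwise runs it best-effort without the gate. Returns whether the
    // gate was held, so callers can observe which path was taken.
    public static async Task<bool> RunWithBoundedGateAsync(
        SemaphoreSlim gate, TimeSpan timeout, Func<Task> body)
    {
        bool gateHeld = await gate.WaitAsync(timeout).ConfigureAwait(false);
        try
        {
            await body().ConfigureAwait(false);
        }
        finally
        {
            // Releasing a semaphore that was never acquired would corrupt
            // its count, so release only on the acquired path.
            if (gateHeld) gate.Release();
        }
        return gateHeld;
    }
}
```

The trade-off is the one the comment in the diff accepts: disposal latency is bounded by the timeout, at the cost of occasionally running the body without mutual exclusion.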
@@ -430,6 +517,9 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
if (n == 0) throw new SocketException((int)SocketError.ConnectionReset); if (n == 0) throw new SocketException((int)SocketError.ConnectionReset);
sent += n; sent += n;
} }
// A send counts as backend activity — it suppresses the idle heartbeat.
Interlocked.Exchange(ref _lastBackendActivityTicks, DateTime.UtcNow.Ticks);
} }
} }
catch (OperationCanceledException) catch (OperationCanceledException)
@@ -438,8 +528,12 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
catch (Exception ex) catch (Exception ex)
{ {
// Backend failure — cascade. // Backend failure — cascade. Skip if disposal is already in progress;
_ = TearDownBackendAsync($"writer fault: {ex.Message}", cascadeUpstreams: true); // DisposeAsync runs an explicit TearDown and the fire-and-forget here would
// race against it, hitting a disposed _connectGate and producing an
// unobserved-task exception.
if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync($"writer fault: {ex.Message}", cascadeUpstreams: true);
} }
} }
@@ -479,6 +573,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
if (!await FillAsync(backend, frame, MbapFrame.HeaderSize, pduBodyLen, ct).ConfigureAwait(false)) if (!await FillAsync(backend, frame, MbapFrame.HeaderSize, pduBodyLen, ct).ConfigureAwait(false))
break; break;
// A received frame counts as backend activity — it suppresses (and, for a
// heartbeat response, satisfies) the idle heartbeat.
Interlocked.Exchange(ref _lastBackendActivityTicks, DateTime.UtcNow.Ticks);
if (!_correlation.TryRemove(proxyTxId, out var inFlight)) if (!_correlation.TryRemove(proxyTxId, out var inFlight))
{ {
// No correlation entry — either a stale response after cascade, or // No correlation entry — either a stale response after cascade, or
@@ -489,10 +587,18 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
// Free the allocator slot immediately so it can be reused. // Free the allocator slot immediately so it can be reused.
_allocator.Release(proxyTxId); _allocator.Release(proxyTxId);
// Phase 10 — for FC03/FC04 reads, also clear the coalescing-by-key entry so // Keepalive heartbeat response — the probe came back, the backend is alive.
// a brand-new identical request issued AFTER this response is treated as a // The activity timestamp was already refreshed above. There is no upstream
// miss (opens a fresh round-trip). The TryRemove is best-effort: a watchdog // party, no cache eligibility, and no rewriting to do: drop the payload and
// timeout or cascade may have already removed it. // skip the EWMA update so the synthetic probe never pollutes the
// client-facing round-trip metric.
if (inFlight.IsHeartbeat)
continue;
// For FC03/FC04 reads, also clear the coalescing-by-key entry so a
// brand-new identical request issued AFTER this response is treated as a
// miss (opens a fresh round-trip). The TryRemove is best-effort: a
// watchdog timeout or cascade may have already removed it.
if (inFlight.Fc is 0x03 or 0x04) if (inFlight.Fc is 0x03 or 0x04)
{ {
var coalKey = new CoalescingKey(inFlight.UnitId, inFlight.Fc, var coalKey = new CoalescingKey(inFlight.UnitId, inFlight.Fc,
@@ -500,11 +606,9 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_inFlightByKey.TryRemove(coalKey, out _); _inFlightByKey.TryRemove(coalKey, out _);
} }
// Update EWMA round-trip from when we sent the request. // Update EWMA round-trip from when we sent the request. UpdateRoundTripEwma
long elapsedMs = (DateTimeOffset.UtcNow - inFlight.SentAtUtc).Ticks * 100; // 100 ns per tick // expects Stopwatch ticks; convert from the wall-clock SentAtUtc timestamp.
// UpdateRoundTripEwma expects Stopwatch ticks, but we have wall-clock. long ticks = (long)((DateTimeOffset.UtcNow - inFlight.SentAtUtc).TotalSeconds * Stopwatch.Frequency);
// Convert ms back to Stopwatch ticks:
long ticks = (long)((double)(DateTimeOffset.UtcNow - inFlight.SentAtUtc).TotalSeconds * Stopwatch.Frequency);
if (ticks > 0) if (ticks > 0)
_ctx.Counters.UpdateRoundTripEwma(ticks); _ctx.Counters.UpdateRoundTripEwma(ticks);
@@ -517,14 +621,19 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
frame.AsSpan(MbapFrame.HeaderSize, pduBodyLen), frame.AsSpan(MbapFrame.HeaderSize, pduBodyLen),
responseCtx); responseCtx);
// Phase 11 — post-rewriter cache update: // Post-rewriter cache update:
// * FC03/FC04 successful responses are stored when the request was // * FC03/FC04 successful responses are stored when the request was
// cache-eligible (resolvedTtlMs > 0). // cache-eligible (resolvedTtlMs > 0).
// * FC06/FC16 successful responses invalidate every cached entry whose // * FC06/FC16 successful responses invalidate every cached entry whose
// address range overlaps the write. // address range overlaps the write.
//
// Exception bit comes from the post-rewriter buffer (the rewriter never
// touches the FC byte today, but reading from inFlight.Fc would lose the
// exception bit). The base FC for routing decisions uses inFlight.Fc —
// the request side knows what was sent.
if (_ctx.Cache is { } postCache) if (_ctx.Cache is { } postCache)
{ {
byte fcInResponse = frame[MbapFrame.HeaderSize]; // post-rewriter, but the FC byte is never rewritten byte fcInResponse = frame[MbapFrame.HeaderSize];
bool isException = (fcInResponse & 0x80) != 0; bool isException = (fcInResponse & 0x80) != 0;
if (!isException) if (!isException)
@@ -555,6 +664,16 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
else if (inFlight.Fc is 0x06 or 0x10) else if (inFlight.Fc is 0x06 or 0x10)
{ {
// The design contract "invalidations during a recovering
// listener state are skipped" is upheld IMPLICITLY here:
// invalidation only fires inside the backend reader task when
// a non-exception FC06/FC16 response arrives. A `Recovering`
// listener has no backend reader (the multiplexer is torn
// down between recovery attempts), so no response can land
// here, so no invalidation. The gating is structural, not
// conditional. If a future change ever produces a write
// response off the live backend, an explicit recovering-state
// check would need to be added.
int invalidated = postCache.Invalidate( int invalidated = postCache.Invalidate(
inFlight.UnitId, inFlight.StartAddress, inFlight.Qty); inFlight.UnitId, inFlight.StartAddress, inFlight.Qty);
if (invalidated > 0) if (invalidated > 0)
@@ -569,23 +688,23 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
// Fan out to each interested party with their original TxId restored. // Fan out to each interested party with their original TxId restored.
// Phase 9: always exactly one party. Phase 10: N parties (read coalescing). // Without coalescing there is exactly one party; with coalescing there
// Note: the InFlightByKey TryRemove above (for FC03/FC04) guarantees no // are N. The InFlightByKey TryRemove above (for FC03/FC04) guarantees no
// further attaches can occur — the parties list is now a stable snapshot. // further attaches can occur — the parties list is now a stable snapshot.
// //
// Phase 12 (W1.3) — non-blocking fan-out via `TrySendResponse`. The // Non-blocking fan-out via `TrySendResponse`. The single backend reader
// single backend reader task must NEVER `await` a per-upstream channel // task must NEVER `await` a per-upstream channel write: a wedged upstream
// write: a wedged upstream (full bounded response channel) would otherwise // (full bounded response channel) would otherwise stall the reader and
// stall the reader and starve every other client on this PLC. A drop here // starve every other client on this PLC. A drop here is recorded via
// is recorded via `responseDropForFullUpstream`; the wedged upstream loses // `responseDropForFullUpstream`; the wedged upstream loses its own
// its own response and will be reaped by its own socket-close path. // response and will be reaped by its own socket-close path.
foreach (var party in inFlight.InterestedParties) foreach (var party in inFlight.InterestedParties)
{ {
if (!party.Pipe.IsAlive) if (!party.Pipe.IsAlive)
{ {
// Phase 10 — record the dead-upstream skip only for FC03/FC04 (the // Record the dead-upstream skip only for FC03/FC04 (the only
// only function codes that take the coalescing path). For non- // function codes that take the coalescing path). For
// coalescing FCs this branch is silent — the Phase-9 behaviour. // non-coalescing FCs this branch is silent.
if (inFlight.Fc is 0x03 or 0x04 if (inFlight.Fc is 0x03 or 0x04
&& inFlight.InterestedParties.Count > 1) && inFlight.InterestedParties.Count > 1)
{ {
@@ -597,10 +716,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
continue; continue;
} }
// The frame buffer is private to this iteration; if there are multiple // The frame buffer is private to this iteration; if there are
// parties (Phase 10), each gets its own copy with its own original TxId // multiple coalesced parties, each gets its own copy with its own
// patched in. Phase 9 always has Count == 1, so the single-buffer path // original TxId patched in. The single-party case reuses the buffer
// is the common case; we copy to keep Phase-10 forward compatibility. // directly as the common-case fast path.
byte[] outFrame = inFlight.InterestedParties.Count == 1 byte[] outFrame = inFlight.InterestedParties.Count == 1
? frame ? frame
: (byte[])frame.Clone(); : (byte[])frame.Clone();
@@ -612,11 +731,20 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
{ {
_ctx.Counters.IncrementResponseDropForFullUpstream(); _ctx.Counters.IncrementResponseDropForFullUpstream();
} }
else
{
// Count outbound bytes per delivered party. With coalescing, one
// backend response fans out to N parties and produces
// N × frame.Length bytes leaving the proxy upstream-side.
_ctx.Counters.AddBytes(up: 0, down: outFrame.Length);
}
} }
} }
// Reader exited cleanly — backend closed by remote. Cascade. // Reader exited cleanly — backend closed by remote. Cascade. Skip if
_ = TearDownBackendAsync("backend reader EOF", cascadeUpstreams: true); // dispose is already in progress (see writer-side comment above).
if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync("backend reader EOF", cascadeUpstreams: true);
} }
catch (OperationCanceledException) catch (OperationCanceledException)
{ {
@@ -624,7 +752,8 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
catch (Exception ex) catch (Exception ex)
{ {
_ = TearDownBackendAsync($"reader fault: {ex.Message}", cascadeUpstreams: true); if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync($"reader fault: {ex.Message}", cascadeUpstreams: true);
} }
} }
@@ -641,11 +770,20 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
out ushort originalTxId, out _, out _, out byte unitId)) out ushort originalTxId, out _, out _, out byte unitId))
return; return;
// Parse the PDU FC + start/qty. FC03/FC04 reads use start/qty for the coalescing key // Remember the unit ID so the backend keepalive heartbeat probes the same Modbus
// and (Phase 11) for the cache lookup. FC06 writes carry [addr][value]; we treat qty // unit the real clients are known to reach successfully.
// as 1 for invalidation. FC16 carries [start][qty][byteCount]...; qty is the write Volatile.Write(ref _lastSeenUnitId, unitId);
// span used for cache invalidation. Phase 11: FC06/FC16 start/qty drive cache
// invalidation by overlap rather than exact key. // Count inbound bytes from the upstream client. Surfaces in bytes.upstreamIn on
// the status page. Counted ONCE per parsed frame regardless of subsequent
// routing (cache hit, coalesce, backend round-trip, exception).
_ctx.Counters.AddBytes(up: frame.Length, down: 0);
// Parse the PDU FC + start/qty. FC03/FC04 reads use start/qty for the coalescing
// key and for the cache lookup. FC06 writes carry [addr][value]; we treat qty as
// 1 for invalidation. FC16 carries [start][qty][byteCount]...; qty is the write
// span used for cache invalidation. FC06/FC16 start/qty drive cache invalidation
// by overlap rather than exact key.
int pduOffset = MbapFrame.HeaderSize; int pduOffset = MbapFrame.HeaderSize;
byte fcByte = frame.Length > pduOffset ? frame[pduOffset] : (byte)0; byte fcByte = frame.Length > pduOffset ? frame[pduOffset] : (byte)0;
ushort startAddr = 0; ushort startAddr = 0;
@@ -669,12 +807,12 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
qty = (ushort)((frame[pduOffset + 3] << 8) | frame[pduOffset + 4]); qty = (ushort)((frame[pduOffset + 3] << 8) | frame[pduOffset + 4]);
} }
// Phase 11 — response-cache path. Cache check happens BEFORE coalescing AND before // Response-cache path. Cache check happens BEFORE coalescing AND before we
// we attempt to bring up the backend connection. A hit short-circuits everything, // attempt to bring up the backend connection. A hit short-circuits everything,
// including the EnsureBackendConnectedAsync call — operators with all reads cached // including the EnsureBackendConnectedAsync call — operators with all reads
// and the backend down still get served (the cache survives backend disconnects per // cached and the backend down still get served (the cache survives backend
// the design contract). The cache only fires for FC03/FC04 and only when the read // disconnects per the design contract). The cache only fires for FC03/FC04 and
// range's resolved TTL > 0. // only when the read range's resolved TTL > 0.
int resolvedCacheTtlMs = 0; int resolvedCacheTtlMs = 0;
if (fcByte is 0x03 or 0x04 && _ctx.Cache is { } responseCache) if (fcByte is 0x03 or 0x04 && _ctx.Cache is { } responseCache)
{ {
@@ -689,25 +827,32 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
byte[] hitFrame = BuildCacheHitFrame(originalTxId, unitId, cached.PduBytes); byte[] hitFrame = BuildCacheHitFrame(originalTxId, unitId, cached.PduBytes);
await pipe.SendResponseAsync(hitFrame, ct).ConfigureAwait(false); await pipe.SendResponseAsync(hitFrame, ct).ConfigureAwait(false);
// Outbound bytes for cache-hit response.
_ctx.Counters.AddBytes(up: 0, down: hitFrame.Length);
return; return;
} }
// Per design contract: "miss" = "fell through to coalescing/backend".
// When two upstream peers issue the same cache-eligible read, both increment
// CacheMiss; only one then opens a backend round-trip (the second coalesces
// onto the first via the InFlightByKey path below). So `CacheMiss` does NOT
// equal "produced a backend round-trip" — it equals "did not find a fresh
// cache entry". The identity `Hit + Miss = cache-eligible requests` holds.
_ctx.Counters.IncrementCacheMiss(); _ctx.Counters.IncrementCacheMiss();
CacheLogEvents.Miss(_logger, _plc.Name, unitId, fcByte, startAddr, qty); CacheLogEvents.Miss(_logger, _plc.Name, unitId, fcByte, startAddr, qty);
} }
} }
// Ensure backend is connected. Failure here means we cannot service the request; // Ensure backend is connected. Failure here means we cannot service the request;
// close the upstream pipe (consistent with the 1:1 model's behaviour on connect // close the upstream pipe.
// failure).
if (!await EnsureBackendConnectedAsync(ct).ConfigureAwait(false)) if (!await EnsureBackendConnectedAsync(ct).ConfigureAwait(false))
{ {
try { await pipe.DisposeAsync().ConfigureAwait(false); } catch { /* best effort */ } try { await pipe.DisposeAsync().ConfigureAwait(false); } catch { /* best effort */ }
return; return;
} }
// Phase 10 — read-coalescing path. Only FC03/FC04 are coalescable; only when the // Read-coalescing path. Only FC03/FC04 are coalescable; only when the feature
// feature is enabled in the live config. If the late-arriving request matches an // is enabled in the live config. If the late-arriving request matches an
// already-in-flight peer, we attach to the existing entry and skip the backend // already-in-flight peer, we attach to the existing entry and skip the backend
// round-trip entirely. The existing entry's response will fan out to both parties. // round-trip entirely. The existing entry's response will fan out to both parties.
var coalescingOpts = _coalescingOptions(); var coalescingOpts = _coalescingOptions();
@@ -716,18 +861,18 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
var key = new CoalescingKey(unitId, fcByte, startAddr, qty); var key = new CoalescingKey(unitId, fcByte, startAddr, qty);
var newParty = new InterestedParty(pipe, originalTxId); var newParty = new InterestedParty(pipe, originalTxId);
// The factory does the Phase-9 work: allocate a proxy TxId, build the // The factory allocates a proxy TxId, builds the InFlightRequest with a
// InFlightRequest with a mutable List<InterestedParty>, add to the correlation // mutable List<InterestedParty>, and adds to the correlation map. We
// map. We deliberately do NOT enqueue to the outbound channel inside the // deliberately do NOT enqueue to the outbound channel inside the factory —
// factory — that's done outside the InFlightByKey lock to keep the lock // that's done outside the InFlightByKey lock to keep the lock scope tight
// scope tight and to avoid holding the lock across an async send. // and to avoid holding the lock across an async send.
// //
// proxyTxIdForSend / inFlightForSend communicate the factory's allocation back // proxyTxIdForSend / inFlightForSend communicate the factory's allocation
// out of the lock so the post-lock code can finish the send. // back out of the lock so the post-lock code can finish the send.
ushort proxyTxIdForSend = 0; ushort proxyTxIdForSend = 0;
InFlightRequest? inFlightForSend = null; InFlightRequest? inFlightForSend = null;
_inFlightByKey.TryAttachOrCreate( _inFlightByKey.AttachOrCreate(
key, key,
newParty, newParty,
factory: () => factory: () =>
@@ -786,43 +931,48 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
return; return;
} }
// Coalesce miss: we just opened a fresh in-flight entry. // Coalesce miss: this request did not attach to an in-flight peer. Per the
// design contract `coalescedHitCount + coalescedMissCount = total FC03/FC04`,
// so even saturation-failure paths (factory below returns null inFlightForSend)
// count as a miss — every FC03/FC04 entered the coalescing path exactly once.
// "Miss" here means "did not coalesce", NOT "produced a backend round-trip".
_ctx.Counters.IncrementCoalescedMiss(); _ctx.Counters.IncrementCoalescedMiss();
CoalescingLogEvents.Miss(_logger, _plc.Name, unitId, fcByte, startAddr, qty); CoalescingLogEvents.Miss(_logger, _plc.Name, unitId, fcByte, startAddr, qty);
if (inFlightForSend is null) if (inFlightForSend is null)
{ {
// Phase 12 (W1.2) — the factory hit the allocator-saturation path or a // The factory hit the allocator-saturation path or a duplicate-key race
// duplicate-key race and stored a stub `InFlightRequest` under `key`. Late // and stored a stub `InFlightRequest` under `key`. Late attachers may
// attachers may have joined the stub between the factory call and this // have joined the stub between the factory call and this cleanup; we
// cleanup; we must deliver the saturation exception to ALL of them, not just // must deliver the saturation exception to ALL of them, not just the
// the leader, otherwise the late attachers wait forever for a response that // leader, otherwise the late attachers wait forever for a response that
// never comes (the stub has no proxy TxId, so no backend round-trip will // never comes (the stub has no proxy TxId, so no backend round-trip will
// ever fire). // ever fire).
MultiplexerLogEvents.Saturated(_logger, _plc.Name, pipe.RemoteEp?.ToString() ?? "?"); MultiplexerLogEvents.Saturated(_logger, _plc.Name, pipe.RemoteEp?.ToString() ?? "?");
if (_inFlightByKey.TryRemove(key, out var stub)) if (_inFlightByKey.TryRemove(key, out var stub))
{ {
// Non-blocking delivery via TrySendResponse — the per-PLC fan-out
// path must never await per-pipe writes (a wedged late-attacher's
// full bounded channel would otherwise stall delivery to its peers).
foreach (var party in stub.InterestedParties) foreach (var party in stub.InterestedParties)
{ {
byte[] excFrame = BuildExceptionFrame(party.OriginalTxId, unitId, fcByte, exceptionCode: 4); byte[] excFrame = BuildExceptionFrame(party.OriginalTxId, unitId, fcByte, exceptionCode: 4);
try if (!party.Pipe.TrySendResponse(excFrame))
{ _ctx.Counters.IncrementResponseDropForFullUpstream();
await party.Pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false); else
} _ctx.Counters.AddBytes(up: 0, down: excFrame.Length);
catch
{
// Best-effort delivery. A dead pipe will be collected by its own
// socket close path; nothing more we can do here.
}
} }
} }
else else
{ {
// The stub was already removed by another path (extremely unlikely, but // The stub was already removed by another path (extremely unlikely,
// defensive). Surface the exception to the original requester. // but defensive). Surface the exception to the original requester.
byte[] excFrame = BuildExceptionFrame(originalTxId, unitId, fcByte, exceptionCode: 4); byte[] excFrame = BuildExceptionFrame(originalTxId, unitId, fcByte, exceptionCode: 4);
await pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false); if (!pipe.TrySendResponse(excFrame))
_ctx.Counters.IncrementResponseDropForFullUpstream();
else
_ctx.Counters.AddBytes(up: 0, down: excFrame.Length);
} }
return; return;
} }
@@ -853,15 +1003,16 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
return; return;
} }
// Non-coalescing path (FC06/FC16 writes, FC03/04 with coalescing disabled, or any // Non-coalescing path (FC06/FC16 writes, FC03/04 with coalescing disabled, or
// other FC). This is the Phase-9 path verbatim — every request gets its own proxy // any other FC). Every request gets its own proxy TxId and its own backend
// TxId and its own backend round-trip. // round-trip.
if (!_allocator.TryAllocate(out ushort proxyTxIdFc)) if (!_allocator.TryAllocate(out ushort proxyTxIdFc))
{ {
MultiplexerLogEvents.Saturated(_logger, _plc.Name, pipe.RemoteEp?.ToString() ?? "?"); MultiplexerLogEvents.Saturated(_logger, _plc.Name, pipe.RemoteEp?.ToString() ?? "?");
byte[] excFrame = BuildExceptionFrame(originalTxId, unitId, fcByte, exceptionCode: 4); byte[] excFrame = BuildExceptionFrame(originalTxId, unitId, fcByte, exceptionCode: 4);
await pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false); await pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false);
_ctx.Counters.AddBytes(up: 0, down: excFrame.Length);
return; return;
} }
@@ -883,10 +1034,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
return; return;
} }
// Phase 10 — even when the coalescing path is bypassed (e.g. coalescing disabled // Even when the coalescing path is bypassed (e.g. coalescing disabled for
// for FC03/04), we still report the request as a Miss so Hit + Miss = total // FC03/04), we still report the request as a Miss so Hit + Miss = total
// FC03/FC04 requests across snapshots. FC06/FC16 are not counted here (they are // FC03/FC04 requests across snapshots. FC06/FC16 are not counted here (they
// not coalescable in any sense). // are not coalescable in any sense).
if (fcByte is 0x03 or 0x04) if (fcByte is 0x03 or 0x04)
_ctx.Counters.IncrementCoalescedMiss(); _ctx.Counters.IncrementCoalescedMiss();
@@ -927,17 +1078,15 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
/// Modbus exception (code 0x0B / Gateway Target Device Failed To Respond) to each /// Modbus exception (code 0x0B / Gateway Target Device Failed To Respond) to each
/// interested party with the original TxId restored. /// interested party with the original TxId restored.
/// ///
/// <para><b>Why this exists.</b> In the 1:1 connection model, a lost response would /// <para><b>Why this exists.</b> In a multiplexed connection model a single missing
/// fault the dedicated backend socket and the upstream pair would close. The multiplexed /// or mis-routed response would otherwise leak a correlation entry forever and hang
/// model needs an explicit per-request timer because a single missing or mis-routed /// the upstream pipe indefinitely. Real-world causes: PLC drops a response, network
/// response would otherwise leak a correlation entry forever and hang the upstream /// packet loss, backend that mis-echoes MBAP TxIds.</para>
/// pipe indefinitely. Real-world causes: PLC drops a response, network packet loss,
/// backend that mis-echoes MBAP TxIds.</para>
/// </summary> /// </summary>
private async Task RunRequestTimeoutWatchdogAsync(CancellationToken ct) private async Task RunRequestTimeoutWatchdogAsync(CancellationToken ct)
{ {
// Tick at ~quarter of the request timeout for responsive cleanup, but cap to a // Tick at ~quarter of the request timeout for responsive cleanup, but cap to a
// 1-second floor so the watchdog doesn't busy-wake on very small timeouts. // 100 ms floor so the watchdog doesn't busy-wake on very small timeouts.
int tickMs = Math.Max(100, _connectionOptions.BackendRequestTimeoutMs / 4); int tickMs = Math.Max(100, _connectionOptions.BackendRequestTimeoutMs / 4);
try try
@@ -960,10 +1109,27 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
_allocator.Release(proxyTxId); _allocator.Release(proxyTxId);
// Phase 10 — also clear the coalescing-by-key entry. A late attach that // Keepalive heartbeat that never came back. The backend is no longer
// raced in just before the watchdog claim will still receive the 0x0B // answering Modbus even though the socket may still look connected —
// exception via this entry's InterestedParties list (List<T> mutations // tear it down proactively (cascading every attached pipe) so the
// happen before fan-out begins). // failure is found here, during idle, instead of corrupting the next
// real client request. There is no upstream party to send a 0x0B to.
if (req.IsHeartbeat)
{
long hbElapsedMs = (long)(DateTimeOffset.UtcNow - req.SentAtUtc).TotalMilliseconds;
KeepaliveLogEvents.HeartbeatTimeout(_logger, _plc.Name, proxyTxId, hbElapsedMs);
_ctx.Counters.IncrementBackendHeartbeatFailed();
_ctx.Counters.IncrementBackendIdleDisconnect();
KeepaliveLogEvents.BackendIdleDisconnect(_logger, _plc.Name, hbElapsedMs);
if (!_disposeCts.IsCancellationRequested)
_ = TearDownBackendAsync("keepalive heartbeat timeout", cascadeUpstreams: true);
continue;
}
// Also clear the coalescing-by-key entry. A late attach that raced
// in just before the watchdog claim will still receive the 0x0B
// exception via this entry's InterestedParties list (List<T>
// mutations happen before fan-out begins).
if (req.Fc is 0x03 or 0x04) if (req.Fc is 0x03 or 0x04)
{ {
var coalKey = new CoalescingKey(req.UnitId, req.Fc, req.StartAddress, req.Qty); var coalKey = new CoalescingKey(req.UnitId, req.Fc, req.StartAddress, req.Qty);
@@ -987,6 +1153,7 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
try try
{ {
await party.Pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false); await party.Pipe.SendResponseAsync(excFrame, ct).ConfigureAwait(false);
_ctx.Counters.AddBytes(up: 0, down: excFrame.Length);
} }
catch catch
{ {
@@ -1007,6 +1174,124 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
} }
// ── Backend keepalive heartbeat ───────────────────────────────────────────
/// <summary>
/// Backend keepalive heartbeat loop. Started alongside the writer/reader on each
/// successful connect and cancelled with them on teardown. While the backend socket
/// has been idle (no send or receive) for longer than
/// <see cref="KeepaliveOptions.BackendHeartbeatIdleMs"/>, it issues a synthetic FC03
/// qty=1 read so the path stays warm against middlebox idle-drop and a backend that is
/// connected-but-not-answering is detected here rather than on the next client request.
///
/// <para>The probe response is consumed by <see cref="RunBackendReaderAsync"/> (which
/// recognises <see cref="InFlightRequest.IsHeartbeat"/> and drops it); a probe that
/// never returns is timed out by <see cref="RunRequestTimeoutWatchdogAsync"/>, which
/// tears the backend down. The heartbeat keeps an <i>existing</i> backend warm — it
/// never resurrects a dead one (reconnect stays gated on the next upstream frame).</para>
/// </summary>
private async Task RunBackendHeartbeatAsync(CancellationToken ct)
{
try
{
while (!ct.IsCancellationRequested)
{
var ka = _keepaliveOptions();
int idleMs = Math.Max(1000, ka.BackendHeartbeatIdleMs);
// Tick at a quarter of the idle window so a freshly-elapsed idle period is
// noticed promptly, floored at 500 ms so the loop never busy-wakes.
int tickMs = Math.Max(500, idleMs / 4);
await Task.Delay(tickMs, ct).ConfigureAwait(false);
if (!ka.Enabled)
continue;
long lastTicks = Interlocked.Read(ref _lastBackendActivityTicks);
double idleElapsedMs =
(DateTime.UtcNow - new DateTime(lastTicks, DateTimeKind.Utc)).TotalMilliseconds;
if (idleElapsedMs < idleMs)
continue;
SendHeartbeat(ka);
}
}
catch (OperationCanceledException)
{
// Normal teardown.
}
catch (Exception ex)
{
_logger.LogError(ex, "Backend heartbeat loop faulted: Plc={Plc}", _plc.Name);
}
}
/// <summary>
/// Builds and enqueues one synthetic FC03 qty=1 heartbeat request onto the backend
/// outbound channel. The correlation entry is flagged <see cref="InFlightRequest.IsHeartbeat"/>
/// so the reader and watchdog treat it specially; it carries no interested parties and
/// bypasses the coalescing and cache paths entirely.
/// </summary>
private void SendHeartbeat(KeepaliveOptions ka)
{
// A saturated TxId space means the backend is busy (65,536 requests in flight),
// which is the opposite of idle — skip this tick rather than force a probe.
if (!_allocator.TryAllocate(out ushort proxyTxId))
return;
byte unitId = (byte)Volatile.Read(ref _lastSeenUnitId);
ushort address = (ushort)ka.BackendHeartbeatProbeAddress;
var inFlight = new InFlightRequest(
UnitId: unitId,
Fc: 0x03,
StartAddress: address,
Qty: 1,
InterestedParties: Array.Empty<InterestedParty>(),
SentAtUtc: DateTimeOffset.UtcNow,
ResolvedCacheTtlMs: 0,
IsHeartbeat: true);
if (!_correlation.TryAdd(proxyTxId, inFlight))
{
_allocator.Release(proxyTxId);
return;
}
byte[] frame = BuildHeartbeatFrame(proxyTxId, unitId, address);
// Non-blocking enqueue: if the channel is full the backend is not idle (a race), and
// if it is completed the backend is tearing down — either way, undo and skip.
if (!_outboundChannel.Writer.TryWrite(frame))
{
if (_correlation.TryRemove(proxyTxId, out _))
_allocator.Release(proxyTxId);
return;
}
_ctx.Counters.IncrementBackendHeartbeatSent();
KeepaliveLogEvents.HeartbeatSent(_logger, _plc.Name, proxyTxId, address);
}
/// <summary>
/// Builds a 12-byte MBAP-framed FC03 (Read Holding Registers) request reading one
/// register at <paramref name="address"/> — the keepalive heartbeat probe PDU.
/// </summary>
private static byte[] BuildHeartbeatFrame(ushort proxyTxId, byte unitId, ushort address)
{
// PDU = [fc=03][addrHi][addrLo][qtyHi][qtyLo]. MBAP length = UnitId(1) + PDU(5) = 6.
var frame = new byte[MbapFrame.HeaderSize + 5];
frame[0] = (byte)(proxyTxId >> 8);
frame[1] = (byte)(proxyTxId & 0xFF);
frame[2] = 0; frame[3] = 0; // ProtocolId
frame[4] = 0; frame[5] = 6; // Length
frame[6] = unitId;
frame[7] = 0x03; // FC03 Read Holding Registers
frame[8] = (byte)(address >> 8);
frame[9] = (byte)(address & 0xFF);
frame[10] = 0; frame[11] = 1; // Qty = 1
return frame;
}
// ── Helpers ─────────────────────────────────────────────────────────────── // ── Helpers ───────────────────────────────────────────────────────────────
/// <summary> /// <summary>
@@ -1039,10 +1324,10 @@ internal sealed class PlcMultiplexer : IAsyncDisposable, IMultiplexCountersProvi
} }
/// <summary> /// <summary>
/// Phase 11 — builds an MBAP-framed response from cached PDU bytes for the given /// Builds an MBAP-framed response from cached PDU bytes for the given upstream
/// upstream party. The cache stores POST-rewriter PDU bodies (no MBAP); each hit /// party. The cache stores POST-rewriter PDU bodies (no MBAP); each hit stamps a
/// stamps a fresh MBAP header carrying the requesting party's original TxId so the /// fresh MBAP header carrying the requesting party's original TxId so the response
/// response looks indistinguishable from a fresh backend reply. /// looks indistinguishable from a fresh backend reply.
/// </summary> /// </summary>
private static byte[] BuildCacheHitFrame(ushort originalTxId, byte unitId, byte[] cachedPdu) private static byte[] BuildCacheHitFrame(ushort originalTxId, byte unitId, byte[] cachedPdu)
{ {
@@ -1,6 +1,7 @@
using System.Net; using System.Net;
using System.Net.Sockets; using System.Net.Sockets;
using System.Threading.Channels; using System.Threading.Channels;
using Mbproxy.Options;
namespace Mbproxy.Proxy.Multiplexing; namespace Mbproxy.Proxy.Multiplexing;
@@ -49,10 +50,11 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
// Internal CTS lets the multiplexer signal "drop this pipe now" without waiting for // Internal CTS lets the multiplexer signal "drop this pipe now" without waiting for
// the upstream socket to close cleanly. // the upstream socket to close cleanly.
private readonly CancellationTokenSource _cts = new(); private readonly CancellationTokenSource _cts = new();
private bool _disposed; // Volatile so writes from DisposeAsync are observed by IsAlive / TrySendResponse on
// other threads without a fence.
private volatile bool _disposed;
// Phase 9: per-pipe forwarded-PDU counter (replaces the per-pair counter from the // Per-pipe forwarded-PDU counter. Read by the status page.
// 1:1 model). Read by the status page.
private long _pdusForwardedCount; private long _pdusForwardedCount;
/// <summary>Stable identity for status-page reporting and cascade cleanup.</summary> /// <summary>Stable identity for status-page reporting and cascade cleanup.</summary>
@@ -76,10 +78,15 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
/// </summary> /// </summary>
public bool IsAlive => !_disposed && !_cts.IsCancellationRequested; public bool IsAlive => !_disposed && !_cts.IsCancellationRequested;
public UpstreamPipe(Socket upstream, string plcName, ILogger logger) public UpstreamPipe(Socket upstream, string plcName, ILogger logger, KeepaliveOptions? keepalive = null)
{ {
_upstream = upstream; _upstream = upstream;
_upstream.NoDelay = true; _upstream.NoDelay = true;
// Enable OS TCP keepalive on the accepted client socket so a half-open/dead
// client (gone without a TCP FIN) faults the read loop and is reaped, instead of
// leaking a pipe + correlation slots until the proxy next tries to write to it.
if (keepalive is not null)
SocketKeepalive.Apply(_upstream, keepalive);
RemoteEp = upstream.RemoteEndPoint as IPEndPoint; RemoteEp = upstream.RemoteEndPoint as IPEndPoint;
_plcName = plcName; _plcName = plcName;
_logger = logger; _logger = logger;
@@ -225,11 +232,11 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
} }
/// <summary> /// <summary>
/// Phase 12 (W1.3) — non-blocking response enqueue. Returns <c>true</c> when the frame /// Non-blocking response enqueue. Returns <c>true</c> when the frame was queued for
/// was queued for delivery, <c>false</c> when the pipe is dead OR the response channel /// delivery, <c>false</c> when the pipe is dead OR the response channel is full.
/// is full. Used by the per-PLC backend reader's fan-out loop so a single wedged /// Used by the per-PLC backend reader's fan-out loop so a single wedged upstream
/// upstream cannot stall responses to peers sharing the same backend socket — without /// cannot stall responses to peers sharing the same backend socket — without this, a
/// this, a full <c>_responseChannel</c> on one pipe would block the reader task. /// full <c>_responseChannel</c> on one pipe would block the reader task.
/// ///
/// <para>A <c>false</c> return indicates the frame is the multiplexer's responsibility /// <para>A <c>false</c> return indicates the frame is the multiplexer's responsibility
/// to drop and (optionally) account for via a counter. The wedged upstream's socket /// to drop and (optionally) account for via a counter. The wedged upstream's socket
@@ -270,8 +277,6 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
Socket socket, byte[] buf, int offset, int count, CancellationToken ct) Socket socket, byte[] buf, int offset, int count, CancellationToken ct)
{ {
int remaining = count; int remaining = count;
bool firstRead = true;
while (remaining > 0) while (remaining > 0)
{ {
int received = await socket.ReceiveAsync( int received = await socket.ReceiveAsync(
@@ -279,11 +284,11 @@ internal sealed partial class UpstreamPipe : IAsyncDisposable
SocketFlags.None, SocketFlags.None,
ct).ConfigureAwait(false); ct).ConfigureAwait(false);
// Clean EOF (pre-frame or mid-frame) — caller treats both the same.
if (received == 0) if (received == 0)
return firstRead && remaining == count ? false : false; return false;
remaining -= received; remaining -= received;
firstRead = false;
} }
return true; return true;
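Note on the non-blocking enqueue documented in the `UpstreamPipe` hunk above: one way to get that behaviour is `ChannelWriter<T>.TryWrite` on a bounded channel, which returns `false` instead of waiting when the channel is full. The sketch below is illustrative only. `ResponseQueueSketch` is a hypothetical type, and it folds the drop counter into the queue for brevity, whereas the diff keeps the counter on the multiplexer side.

```csharp
using System.Threading;
using System.Threading.Channels;

// Hypothetical stand-in for a pipe's bounded response channel.
public sealed class ResponseQueueSketch
{
    private readonly Channel<byte[]> _channel;
    private long _dropped;

    public ResponseQueueSketch(int capacity)
    {
        _channel = Channel.CreateBounded<byte[]>(new BoundedChannelOptions(capacity)
        {
            // Wait mode: WriteAsync would block when full, but TryWrite returns false immediately.
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
        });
    }

    public long Dropped => Interlocked.Read(ref _dropped);

    // Returns true when the frame was queued; false when the channel is full or completed.
    public bool TryEnqueue(byte[] frame)
    {
        if (_channel.Writer.TryWrite(frame))
            return true;

        Interlocked.Increment(ref _dropped);
        return false;
    }
}
```

Because the caller never awaits, a single slow consumer costs only its own frames; the shared reader loop moves on to the next pipe.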
+2 -3
@@ -2,8 +2,8 @@ namespace Mbproxy.Proxy;
/// <summary> /// <summary>
/// No-op PDU pipeline: passes every frame through byte-for-byte without rewriting. /// No-op PDU pipeline: passes every frame through byte-for-byte without rewriting.
/// Registered as the <see cref="IPduPipeline"/> singleton in Phase 03. /// Used by tests and fallback paths; production wires
/// Phase 04 replaces this registration with BcdPduPipeline. /// <see cref="BcdPduPipeline"/> as the <see cref="IPduPipeline"/> singleton.
/// </summary> /// </summary>
internal sealed class NoopPduPipeline : IPduPipeline internal sealed class NoopPduPipeline : IPduPipeline
{ {
@@ -14,6 +14,5 @@ internal sealed class NoopPduPipeline : IPduPipeline
PduContext context) PduContext context)
{ {
// Intentional no-op: bytes forwarded unmodified. // Intentional no-op: bytes forwarded unmodified.
// Phase 04: replace this registration with BcdPduPipeline.
} }
} }
+15 -16
@@ -14,20 +14,20 @@ namespace Mbproxy.Proxy;
/// served by the same <see cref="Multiplexing.PlcMultiplexer"/>; all mutable state is /// served by the same <see cref="Multiplexing.PlcMultiplexer"/>; all mutable state is
/// accessed through <see cref="ProxyCounters"/> which uses Interlocked for thread-safety. /// accessed through <see cref="ProxyCounters"/> which uses Interlocked for thread-safety.
/// ///
/// <para><b>Phase 9 — request correlation:</b> the multiplexer sets <see cref="CurrentRequest"/> /// <para><b>Request correlation:</b> the multiplexer sets <see cref="CurrentRequest"/>
/// before calling the pipeline on each direction. On the request path the pipeline can /// before calling the pipeline on each direction. On the request path the pipeline can
/// peek at the future correlation entry it just enqueued; on the response path the pipeline /// peek at the future correlation entry it just enqueued; on the response path the
/// uses the request's <c>StartAddress</c>/<c>Qty</c> to decode FC03/FC04 BCD slots. Different /// pipeline uses the request's <c>StartAddress</c>/<c>Qty</c> to decode FC03/FC04 BCD
/// in-flight responses use different <see cref="InFlightRequest"/> instances, so there is no /// slots. Different in-flight responses use different <see cref="InFlightRequest"/>
/// cross-talk between concurrent multiplexed requests.</para> /// instances, so there is no cross-talk between concurrent multiplexed requests.</para>
/// ///
/// <para><b>Concurrency:</b> a single <see cref="PerPlcContext"/> instance is shared across /// <para><b>Concurrency:</b> a single <see cref="PerPlcContext"/> instance is shared
/// the per-upstream read tasks (which call the pipeline on the request path) and the /// across the per-upstream read tasks (which call the pipeline on the request path) and
/// single backend reader task (which calls the pipeline on the response path). Because the /// the single backend reader task (which calls the pipeline on the response path).
/// per-call <see cref="CurrentRequest"/> would be racy if mutated on the shared context, /// Because the per-call <see cref="CurrentRequest"/> would be racy if mutated on the
/// the multiplexer constructs a lightweight per-call clone (<see cref="WithCurrentRequest"/>) /// shared context, the multiplexer constructs a lightweight per-call clone
/// for each pipeline invocation. The shared mutable state — the tag map, counters, logger — /// (<see cref="WithCurrentRequest"/>) for each pipeline invocation. The shared mutable
/// is read-only or Interlocked.</para> /// state — the tag map, counters, logger — is read-only or Interlocked.</para>
/// </summary> /// </summary>
internal class PerPlcContext : PduContext internal class PerPlcContext : PduContext
{ {
@@ -46,10 +46,9 @@ internal class PerPlcContext : PduContext
internal InFlightRequest? CurrentRequest { get; init; } internal InFlightRequest? CurrentRequest { get; init; }
/// <summary> /// <summary>
/// Phase 11 — optional per-PLC response cache. <c>null</c> on contexts that opt out /// Optional per-PLC response cache. <c>null</c> on contexts that opt out (every BCD
/// (every BCD tag has <see cref="BcdTag.CacheTtlMs"/> = 0) or in unit tests that don't /// tag has <see cref="BcdTag.CacheTtlMs"/> = 0) or in unit tests that don't exercise
/// exercise the cache. The multiplexer constructs and disposes the cache alongside /// the cache. The multiplexer constructs and disposes the cache alongside itself.
/// itself.
/// </summary> /// </summary>
internal ResponseCache? Cache { get; init; } internal ResponseCache? Cache { get; init; }
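Note on the per-call clone described in the `PerPlcContext` doc comment above: the body of `WithCurrentRequest` is not shown in this diff. A minimal sketch of the pattern, assuming init-only properties and a manual copy; `ContextSketch` and `InFlightSketch` are hypothetical stand-ins for the project's types.

```csharp
// Hypothetical stand-in for InFlightRequest.
public sealed record InFlightSketch(ushort StartAddress, ushort Qty);

// Hypothetical stand-in for PerPlcContext: shared fields are copied by reference,
// and only CurrentRequest differs between clones.
public sealed class ContextSketch
{
    public string PlcName { get; init; } = "";
    public InFlightSketch? CurrentRequest { get; init; }

    // Builds a new instance per pipeline call, so the shared context is never mutated.
    public ContextSketch WithCurrentRequest(InFlightSketch request) =>
        new ContextSketch { PlcName = PlcName, CurrentRequest = request };
}
```

Each pipeline invocation sees its own `CurrentRequest`, which is why concurrent request-path and response-path calls do not race on that property.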
+23 -18
@@ -11,15 +11,13 @@ namespace Mbproxy.Proxy;
/// Owns one <see cref="TcpListener"/> bound to a PLC's configured listen port and one /// Owns one <see cref="TcpListener"/> bound to a PLC's configured listen port and one
/// <see cref="PlcMultiplexer"/> that owns the single backend connection to the PLC. /// <see cref="PlcMultiplexer"/> that owns the single backend connection to the PLC.
/// ///
/// <para><b>Phase 9 — TxId multiplexing:</b> the listener no longer pairs each upstream /// <para>Every accepted upstream is wrapped in an <see cref="UpstreamPipe"/> and handed
/// socket with a dedicated backend socket. Instead, every accepted upstream is wrapped /// to the multiplexer, which TxId-multiplexes them onto a single backend socket — this
/// in an <see cref="UpstreamPipe"/> and handed to the multiplexer. The multiplexer holds /// eliminates the H2-ECOM100's 4-concurrent-client cap from the upstream side.</para>
/// at most one TCP connection to the PLC, eliminating the H2-ECOM100's 4-concurrent-client
/// cap from the upstream side.</para>
/// ///
/// <para>The listener's accept loop is otherwise unchanged. <see cref="StartAsync"/> /// <para><see cref="StartAsync"/> binds the socket; <see cref="RunAsync"/> runs until
/// binds the socket; <see cref="RunAsync"/> runs until cancelled or the listener faults; /// cancelled or the listener faults; <see cref="DisposeAsync"/> tears down both the
/// <see cref="DisposeAsync"/> tears down both the listener and the multiplexer.</para> /// listener and the multiplexer.</para>
/// </summary> /// </summary>
internal sealed partial class PlcListener : IAsyncDisposable internal sealed partial class PlcListener : IAsyncDisposable
{ {
@@ -32,6 +30,10 @@ internal sealed partial class PlcListener : IAsyncDisposable
private readonly PerPlcContext? _perPlcContext; private readonly PerPlcContext? _perPlcContext;
private readonly ResiliencePipeline? _backendConnectPipeline; private readonly ResiliencePipeline? _backendConnectPipeline;
private readonly Func<ReadCoalescingOptions>? _coalescingOptions; private readonly Func<ReadCoalescingOptions>? _coalescingOptions;
// Live keepalive accessor (TCP SO_KEEPALIVE on accepted upstream sockets + the backend
// heartbeat). Non-null after construction — falls back to the construction-time
// ConnectionOptions snapshot when no live accessor is supplied.
private readonly Func<KeepaliveOptions> _keepaliveOptions;
private TcpListener? _listener; private TcpListener? _listener;
private PlcMultiplexer? _multiplexer; private PlcMultiplexer? _multiplexer;
@@ -49,9 +51,9 @@ internal sealed partial class PlcListener : IAsyncDisposable
=> _multiplexer?.AttachedPipes ?? Array.Empty<UpstreamPipe>(); => _multiplexer?.AttachedPipes ?? Array.Empty<UpstreamPipe>();
/// <summary> /// <summary>
/// Phase 12 (W1.1) — exposes the running multiplexer so a hot-reload reseat can swap /// Exposes the running multiplexer so a hot-reload reseat can swap the per-PLC
/// the per-PLC context on the live instance. <c>null</c> between StopAsync and a fresh /// context on the live instance. <c>null</c> between StopAsync and a fresh start;
/// start; callers must null-check. /// callers must null-check.
/// </summary> /// </summary>
internal PlcMultiplexer? Multiplexer => _multiplexer; internal PlcMultiplexer? Multiplexer => _multiplexer;
@@ -64,7 +66,8 @@ internal sealed partial class PlcListener : IAsyncDisposable
ILogger pipeLogger, ILogger pipeLogger,
PerPlcContext? perPlcContext = null, PerPlcContext? perPlcContext = null,
ResiliencePipeline? backendConnectPipeline = null, ResiliencePipeline? backendConnectPipeline = null,
Func<ReadCoalescingOptions>? coalescingOptions = null) Func<ReadCoalescingOptions>? coalescingOptions = null,
Func<KeepaliveOptions>? keepaliveOptions = null)
{ {
_plc = plc; _plc = plc;
_connectionOptions = connectionOptions; _connectionOptions = connectionOptions;
@@ -75,6 +78,7 @@ internal sealed partial class PlcListener : IAsyncDisposable
_perPlcContext = perPlcContext; _perPlcContext = perPlcContext;
_backendConnectPipeline = backendConnectPipeline; _backendConnectPipeline = backendConnectPipeline;
_coalescingOptions = coalescingOptions; _coalescingOptions = coalescingOptions;
_keepaliveOptions = keepaliveOptions ?? (() => _connectionOptions.Keepalive);
} }
/// <summary> /// <summary>
@@ -89,10 +93,10 @@ internal sealed partial class PlcListener : IAsyncDisposable
_listener.Start(); _listener.Start();
LogBound(_listenerLogger, _plc.Name, _plc.ListenPort); LogBound(_listenerLogger, _plc.Name, _plc.ListenPort);
// The multiplexer needs a PerPlcContext to share the BCD tag map and counters with // The multiplexer needs a PerPlcContext to share the BCD tag map and counters
// the pipeline. If the caller (typically a test or pre-Phase-6 startup path) didn't // with the pipeline. If the caller (typically a test) didn't supply one,
// supply one, construct a minimal context that exposes only the PlcName so the // construct a minimal context that exposes only the PlcName so the multiplexer
// multiplexer + a noop/passthrough pipeline still round-trip frames correctly. // + a noop/passthrough pipeline still round-trip frames correctly.
var ctx = _perPlcContext ?? new PerPlcContext var ctx = _perPlcContext ?? new PerPlcContext
{ {
PlcName = _plc.Name, PlcName = _plc.Name,
@@ -105,7 +109,8 @@ internal sealed partial class PlcListener : IAsyncDisposable
ctx, ctx,
_multiplexerLogger, _multiplexerLogger,
_backendConnectPipeline, _backendConnectPipeline,
_coalescingOptions); _coalescingOptions,
_keepaliveOptions);
} }
/// <summary> /// <summary>
@@ -127,7 +132,7 @@ internal sealed partial class PlcListener : IAsyncDisposable
{ {
Socket upstream = await _listener.AcceptSocketAsync(ct).ConfigureAwait(false); Socket upstream = await _listener.AcceptSocketAsync(ct).ConfigureAwait(false);
var pipe = new UpstreamPipe(upstream, _plc.Name, _pipeLogger); var pipe = new UpstreamPipe(upstream, _plc.Name, _pipeLogger, _keepaliveOptions());
var pipeTask = Task.Run(async () => var pipeTask = Task.Run(async () =>
{ {
try try
+107 -69
@@ -1,13 +1,11 @@
namespace Mbproxy.Proxy; namespace Mbproxy.Proxy;
/// <summary> /// <summary>
/// Immutable snapshot of per-PLC counters. Consumed by Phase 07's status page. /// Immutable snapshot of per-PLC counters. Consumed by the status page.
/// All fields are point-in-time reads; no ordering guarantees across fields. /// All fields are point-in-time reads; no ordering guarantees across fields.
/// ///
/// <para><b>Backwards-compat policy (see docs/kpi.md):</b> fields are <i>added</i>, never /// <para><b>Backwards-compat policy:</b> fields are <i>added</i>, never
/// renamed or removed. Phase 9 appended <c>InFlightCount</c>, <c>MaxInFlight</c>, /// renamed or removed.</para>
/// <c>TxIdWraps</c>, <c>BackendDisconnectCascades</c>, and <c>BackendQueueDepth</c> for
/// the TxId-multiplexer telemetry surface (Tier 1.6 in docs/kpi.md).</para>
/// </summary> /// </summary>
public sealed record CounterSnapshot( public sealed record CounterSnapshot(
long PdusForwarded, long PdusForwarded,
@@ -53,86 +51,106 @@ public sealed record CounterSnapshot(
long ConnectsFailed, long ConnectsFailed,
/// <summary> /// <summary>
/// Number of Modbus requests currently in flight on this PLC's multiplexed backend /// Number of Modbus requests currently in flight on this PLC's multiplexed backend
/// connection (point-in-time snapshot of the correlation map size). Phase 9. /// connection (point-in-time snapshot of the correlation map size).
/// </summary> /// </summary>
long InFlightCount, long InFlightCount,
/// <summary> /// <summary>
/// Peak <see cref="InFlightCount"/> observed since the multiplexer was constructed. /// Peak <see cref="InFlightCount"/> observed since the multiplexer was constructed.
/// Updated via <see cref="Interlocked"/> CAS so concurrent in-flight increments do not /// Updated via <see cref="Interlocked"/> CAS so concurrent in-flight increments do
/// lose the high-water mark. Phase 9. /// not lose the high-water mark.
/// </summary> /// </summary>
long MaxInFlight, long MaxInFlight,
/// <summary> /// <summary>
/// Number of times the per-PLC TxId allocator's rolling cursor has wrapped /// Number of times the per-PLC TxId allocator's rolling cursor has wrapped
/// 0xFFFF → 0x0000. A non-zero value is benign; a sudden burst suggests extreme /// 0xFFFF → 0x0000. A non-zero value is benign; a sudden burst suggests extreme
/// in-flight churn. Phase 9. /// in-flight churn.
/// </summary> /// </summary>
long TxIdWraps, long TxIdWraps,
/// <summary> /// <summary>
/// Cumulative count of upstream pipes closed as a side effect of a backend disconnect. /// Cumulative count of upstream pipes closed as a side effect of a backend
/// Each backend reconnect cycle adds the number of attached upstream clients at the /// disconnect. Each backend reconnect cycle adds the number of attached upstream
/// time of the disconnect. Phase 9. /// clients at the time of the disconnect.
/// </summary> /// </summary>
long BackendDisconnectCascades, long BackendDisconnectCascades,
/// <summary> /// <summary>
/// Current depth of the per-PLC outbound channel feeding the backend writer task /// Current depth of the per-PLC outbound channel feeding the backend writer task
/// (frames queued, not yet on the wire). A sustained non-zero value indicates the /// (frames queued, not yet on the wire). A sustained non-zero value indicates the
/// backend is slower than upstream demand. Phase 9. /// backend is slower than upstream demand.
/// </summary> /// </summary>
long BackendQueueDepth, long BackendQueueDepth,
/// <summary> /// <summary>
/// Phase 10 — cumulative count of FC03/FC04 requests that attached to an already-in-flight /// Cumulative count of FC03/FC04 requests that attached to an already-in-flight
/// peer instead of opening a fresh backend round-trip. <c>CoalescedHitCount + CoalescedMissCount</c> /// peer instead of opening a fresh backend round-trip.
/// equals total FC03/FC04 requests seen by the multiplexer. /// <c>CoalescedHitCount + CoalescedMissCount</c> equals total FC03/FC04 requests
/// seen by the multiplexer.
/// </summary> /// </summary>
long CoalescedHitCount, long CoalescedHitCount,
/// <summary> /// <summary>
/// Phase 10 — cumulative count of FC03/FC04 requests that opened a fresh in-flight entry /// Cumulative count of FC03/FC04 requests that opened a fresh in-flight entry (no
/// (no matching peer was in flight, or the matching peer had reached its <c>MaxParties</c> /// matching peer was in flight, or the matching peer had reached its
/// cap). With <c>ReadCoalescing.Enabled = false</c>, every FC03/FC04 request becomes a miss. /// <c>MaxParties</c> cap). With <c>ReadCoalescing.Enabled = false</c>, every
/// FC03/FC04 request becomes a miss.
/// </summary> /// </summary>
long CoalescedMissCount, long CoalescedMissCount,
/// <summary> /// <summary>
/// Phase 10 — count of coalesced response fan-outs that were skipped because the /// Count of coalesced response fan-outs that were skipped because the attached
/// attached upstream pipe had already disconnected. A spike is a churn indicator; the /// upstream pipe had already disconnected. A spike is a churn indicator; the metric
/// metric itself is informational (Tier 2 in <c>docs/kpi.md</c>). /// itself is informational.
/// </summary> /// </summary>
long CoalescedResponseToDeadUpstream, long CoalescedResponseToDeadUpstream,
/// <summary> /// <summary>
/// Phase 11 — cumulative count of FC03/FC04 requests served from the response cache. /// Cumulative count of FC03/FC04 requests served from the response cache.
/// <c>CacheHitCount + CacheMissCount</c> equals total FC03/FC04 requests whose resolved /// <c>CacheHitCount + CacheMissCount</c> equals total FC03/FC04 requests whose
/// TTL was &gt; 0 (cache-eligible). Reads against tags with TTL = 0 increment neither. /// resolved TTL was &gt; 0 (cache-eligible). Reads against tags with TTL = 0
/// increment neither.
/// </summary> /// </summary>
long CacheHitCount, long CacheHitCount,
/// <summary> /// <summary>
/// Phase 11 — cumulative count of cache-eligible FC03/FC04 requests that fell through /// Cumulative count of cache-eligible FC03/FC04 requests that fell through to
/// to coalescing / backend (no fresh entry was present or the entry had expired). /// coalescing / backend (no fresh entry was present or the entry had expired).
/// </summary> /// </summary>
long CacheMissCount, long CacheMissCount,
/// <summary> /// <summary>
/// Phase 11 — cumulative count of cache entries invalidated by overlapping FC06/FC16 /// Cumulative count of cache entries invalidated by overlapping FC06/FC16 write
/// write responses. A high rate suggests caching is fighting writes; consider lower /// responses. A high rate suggests caching is fighting writes; consider lower TTLs
/// TTLs on cache-overlapping tags. /// on cache-overlapping tags.
/// </summary> /// </summary>
long CacheInvalidations, long CacheInvalidations,
/// <summary> /// <summary>
/// Phase 11 — point-in-time snapshot of the per-PLC <see cref="Cache.ResponseCache"/> /// Point-in-time snapshot of the per-PLC <see cref="Cache.ResponseCache"/> entry
/// entry count. Read on the snapshot path; 0 when no cache is wired. /// count. Read on the snapshot path; 0 when no cache is wired.
/// </summary> /// </summary>
long CacheEntryCount, long CacheEntryCount,
/// <summary> /// <summary>
/// Phase 11 — point-in-time approximation of cached PDU bytes for this PLC. Sum of /// Point-in-time approximation of cached PDU bytes for this PLC. Sum of
/// <see cref="Cache.CacheEntry.Length"/> across entries. Read on the snapshot path. /// <see cref="Cache.CacheEntry.Length"/> across entries. Read on the snapshot path.
/// </summary> /// </summary>
long CacheBytes, long CacheBytes,
/// <summary> /// <summary>
/// Phase 12 (W1.3) — cumulative count of backend response frames the per-PLC reader /// Cumulative count of backend response frames the per-PLC reader task dropped
/// task dropped because the destination upstream pipe's bounded response channel was /// because the destination upstream pipe's bounded response channel was full. A
/// full. A non-zero value indicates one or more upstream clients are not draining their /// non-zero value indicates one or more upstream clients are not draining their
/// socket fast enough to keep up with the backend; the wedged client loses its own /// socket fast enough to keep up with the backend; the wedged client loses its own
/// responses but its peers on the same PLC continue to receive theirs. /// responses but its peers on the same PLC continue to receive theirs.
/// </summary> /// </summary>
long ResponseDropForFullUpstream); long ResponseDropForFullUpstream,
/// <summary>
/// Cumulative count of backend keepalive heartbeat probes issued (synthetic FC03
/// qty=1 reads sent on an idle backend socket).
/// </summary>
long BackendHeartbeatsSent,
/// <summary>
/// Cumulative count of backend keepalive heartbeat probes that were not answered
/// within <c>BackendRequestTimeoutMs</c>. Each failure triggers a proactive backend
/// teardown (see <see cref="BackendIdleDisconnects"/>).
/// </summary>
long BackendHeartbeatsFailed,
/// <summary>
/// Cumulative count of backend teardowns triggered by a failed keepalive heartbeat.
/// Distinct from <see cref="BackendDisconnectCascades"/> (which counts cascaded
/// pipes); this counts the disconnect <i>events</i> attributed to keepalive.
/// </summary>
long BackendIdleDisconnects);
/// <summary> /// <summary>
/// Thread-safe per-PLC counters backed by <see cref="System.Threading.Interlocked"/> longs. /// Thread-safe per-PLC counters backed by <see cref="System.Threading.Interlocked"/> longs.
@@ -163,34 +181,39 @@ internal sealed class ProxyCounters
private long _connectsSuccess; private long _connectsSuccess;
private long _connectsFailed; private long _connectsFailed;
// Phase 9 multiplexer telemetry. // Multiplexer telemetry.
private long _maxInFlight; private long _maxInFlight;
private long _backendDisconnectCascades; private long _backendDisconnectCascades;
// Phase 10 — coalescing counters. Hit + Miss = total FC03/FC04 requests. // Coalescing counters. Hit + Miss = total FC03/FC04 requests.
private long _coalescedHitCount; private long _coalescedHitCount;
private long _coalescedMissCount; private long _coalescedMissCount;
private long _coalescedResponseToDeadUpstream; private long _coalescedResponseToDeadUpstream;
// Phase 11 — response-cache counters. Hit + Miss = total cache-eligible FC03/FC04. // Response-cache counters. Hit + Miss = total cache-eligible FC03/FC04.
private long _cacheHitCount; private long _cacheHitCount;
private long _cacheMissCount; private long _cacheMissCount;
private long _cacheInvalidations; private long _cacheInvalidations;
// Phase 12 (W1.3) — backend-reader fan-out drop counter. Increments when the reader // Backend-reader fan-out drop counter. Increments when the reader task tried to
// task tried to enqueue a response to an upstream pipe whose bounded response channel // enqueue a response to an upstream pipe whose bounded response channel was full.
// was full. Without the non-blocking enqueue this would deadlock the reader; with it // Without the non-blocking enqueue this would deadlock the reader; with it we drop
// we drop and account. // and account.
private long _responseDropForFullUpstream; private long _responseDropForFullUpstream;
// Phase 11 — live cache state pulled from a per-PLC ResponseCache on each snapshot. // Backend keepalive heartbeat counters.
// The multiplexer registers a single provider via SetCacheStatsProvider so the status private long _backendHeartbeatsSent;
private long _backendHeartbeatsFailed;
private long _backendIdleDisconnects;
// Live cache state pulled from a per-PLC ResponseCache on each snapshot. The
// multiplexer registers a single provider via SetCacheStatsProvider so the status
// page sees current entry-count / bytes without a separate poll. // page sees current entry-count / bytes without a separate poll.
private volatile ICacheStatsProvider? _cacheStatsProvider; private volatile ICacheStatsProvider? _cacheStatsProvider;
// Phase 9: live state pulled from the multiplexer's allocator/map/queue on each // Live state pulled from the multiplexer's allocator/map/queue on each snapshot.
// snapshot. The multiplexer registers a single provider via SetMultiplexProvider. // The multiplexer registers a single provider via SetMultiplexProvider. We use a
// We use a volatile reference for lock-free read on the snapshot path. // volatile reference for lock-free read on the snapshot path.
private volatile IMultiplexCountersProvider? _multiplexProvider; private volatile IMultiplexCountersProvider? _multiplexProvider;
// LastBindError is a string (not a long); accessed via volatile field on ProxyCounters // LastBindError is a string (not a long); accessed via volatile field on ProxyCounters
// but actually stored on the supervisor. We expose it here for snapshot parity. // but actually stored on the supervisor. We expose it here for snapshot parity.
@@ -269,61 +292,73 @@ internal sealed class ProxyCounters
=> Interlocked.Increment(ref _connectsFailed); => Interlocked.Increment(ref _connectsFailed);
/// <summary> /// <summary>
/// Records <paramref name="n"/> upstream pipes closed by a backend disconnect cascade. /// Records <paramref name="n"/> upstream pipes closed by a backend disconnect
/// Phase 9. /// cascade.
/// </summary> /// </summary>
public void AddDisconnectCascades(int n) public void AddDisconnectCascades(int n)
=> Interlocked.Add(ref _backendDisconnectCascades, n); => Interlocked.Add(ref _backendDisconnectCascades, n);
/// <summary> /// <summary>
/// Phase 10 — records one FC03/FC04 request that attached to an already-in-flight peer. /// Records one FC03/FC04 request that attached to an already-in-flight peer.
/// </summary> /// </summary>
public void IncrementCoalescedHit() public void IncrementCoalescedHit()
=> Interlocked.Increment(ref _coalescedHitCount); => Interlocked.Increment(ref _coalescedHitCount);
/// <summary> /// <summary>
/// Phase 10 — records one FC03/FC04 request that opened a fresh in-flight entry /// Records one FC03/FC04 request that opened a fresh in-flight entry (no matching
/// (no matching peer was in flight, or the matching peer had reached MaxParties). /// peer was in flight, or the matching peer had reached MaxParties).
/// </summary> /// </summary>
public void IncrementCoalescedMiss() public void IncrementCoalescedMiss()
=> Interlocked.Increment(ref _coalescedMissCount); => Interlocked.Increment(ref _coalescedMissCount);
/// <summary> /// <summary>
/// Phase 10 — records one coalesced response fan-out that was skipped because the /// Records one coalesced response fan-out that was skipped because the attached
/// attached upstream pipe had already disconnected. Informational only. /// upstream pipe had already disconnected. Informational only.
/// </summary> /// </summary>
public void IncrementCoalescedResponseToDeadUpstream() public void IncrementCoalescedResponseToDeadUpstream()
=> Interlocked.Increment(ref _coalescedResponseToDeadUpstream); => Interlocked.Increment(ref _coalescedResponseToDeadUpstream);
/// <summary>Phase 11 — records one FC03/FC04 cache hit.</summary> /// <summary>Records one FC03/FC04 cache hit.</summary>
public void IncrementCacheHit() public void IncrementCacheHit()
=> Interlocked.Increment(ref _cacheHitCount); => Interlocked.Increment(ref _cacheHitCount);
/// <summary>Phase 11 — records one cache-eligible FC03/FC04 read that missed.</summary> /// <summary>Records one cache-eligible FC03/FC04 read that missed.</summary>
public void IncrementCacheMiss() public void IncrementCacheMiss()
=> Interlocked.Increment(ref _cacheMissCount); => Interlocked.Increment(ref _cacheMissCount);
/// <summary>Phase 11 — records <paramref name="n"/> cache entries invalidated by a write.</summary> /// <summary>Records <paramref name="n"/> cache entries invalidated by a write.</summary>
public void AddCacheInvalidations(int n) public void AddCacheInvalidations(int n)
=> Interlocked.Add(ref _cacheInvalidations, n); => Interlocked.Add(ref _cacheInvalidations, n);
/// <summary> /// <summary>
/// Phase 12 (W1.3) — records one backend response frame dropped because the destination /// Records one backend response frame dropped because the destination upstream
/// upstream pipe's response channel was full. /// pipe's response channel was full.
/// </summary> /// </summary>
public void IncrementResponseDropForFullUpstream() public void IncrementResponseDropForFullUpstream()
=> Interlocked.Increment(ref _responseDropForFullUpstream); => Interlocked.Increment(ref _responseDropForFullUpstream);
/// <summary>Records one backend keepalive heartbeat probe sent.</summary>
public void IncrementBackendHeartbeatSent()
=> Interlocked.Increment(ref _backendHeartbeatsSent);
/// <summary>Records one backend keepalive heartbeat probe that timed out.</summary>
public void IncrementBackendHeartbeatFailed()
=> Interlocked.Increment(ref _backendHeartbeatsFailed);
/// <summary>Records one backend teardown triggered by a failed keepalive heartbeat.</summary>
public void IncrementBackendIdleDisconnect()
=> Interlocked.Increment(ref _backendIdleDisconnects);
/// <summary> /// <summary>
/// Phase 11 — wires the per-PLC <see cref="Cache.ResponseCache"/> as the live stats /// Wires the per-PLC <see cref="Cache.ResponseCache"/> as the live stats source for
/// source for the snapshot path. Pass <c>null</c> to detach during disposal. /// the snapshot path. Pass <c>null</c> to detach during disposal.
/// </summary> /// </summary>
internal void SetCacheStatsProvider(ICacheStatsProvider? provider) internal void SetCacheStatsProvider(ICacheStatsProvider? provider)
=> _cacheStatsProvider = provider; => _cacheStatsProvider = provider;
/// <summary> /// <summary>
/// CAS-updates the peak in-flight high-water mark. Called on every successful /// CAS-updates the peak in-flight high-water mark. Called on every successful
/// allocation by the multiplexer. Phase 9. /// allocation by the multiplexer.
/// </summary> /// </summary>
public void ObserveInFlight(int currentInFlight) public void ObserveInFlight(int currentInFlight)
{ {
@@ -341,7 +376,7 @@ internal sealed class ProxyCounters
/// Wires the live multiplexer telemetry source into this counter set. Called by /// Wires the live multiplexer telemetry source into this counter set. Called by
/// <see cref="Mbproxy.Proxy.Multiplexing.PlcMultiplexer"/> at construction time so /// <see cref="Mbproxy.Proxy.Multiplexing.PlcMultiplexer"/> at construction time so
/// the status page's <see cref="Snapshot"/> can include live in-flight / queue-depth /// the status page's <see cref="Snapshot"/> can include live in-flight / queue-depth
/// values without polling the multiplexer separately. Phase 9. /// values without polling the multiplexer separately.
/// </summary> /// </summary>
internal void SetMultiplexProvider(IMultiplexCountersProvider? provider) internal void SetMultiplexProvider(IMultiplexCountersProvider? provider)
=> _multiplexProvider = provider; => _multiplexProvider = provider;
@@ -444,7 +479,10 @@ internal sealed class ProxyCounters
CacheInvalidations: Interlocked.Read(ref _cacheInvalidations), CacheInvalidations: Interlocked.Read(ref _cacheInvalidations),
CacheEntryCount: cacheEntries, CacheEntryCount: cacheEntries,
CacheBytes: cacheBytes, CacheBytes: cacheBytes,
ResponseDropForFullUpstream: Interlocked.Read(ref _responseDropForFullUpstream)); ResponseDropForFullUpstream: Interlocked.Read(ref _responseDropForFullUpstream),
BackendHeartbeatsSent: Interlocked.Read(ref _backendHeartbeatsSent),
BackendHeartbeatsFailed: Interlocked.Read(ref _backendHeartbeatsFailed),
BackendIdleDisconnects: Interlocked.Read(ref _backendIdleDisconnects));
} }
} }
@@ -454,7 +492,7 @@ internal sealed class ProxyCounters
/// and registered with <see cref="ProxyCounters.SetMultiplexProvider"/> so /// and registered with <see cref="ProxyCounters.SetMultiplexProvider"/> so
/// <see cref="ProxyCounters.Snapshot"/> can include live mux telemetry without holding /// <see cref="ProxyCounters.Snapshot"/> can include live mux telemetry without holding
/// a direct reference to the multiplexer (which would couple counter snapshots to the /// a direct reference to the multiplexer (which would couple counter snapshots to the
/// connection layer's lifecycle). Phase 9. /// connection layer's lifecycle).
/// </summary> /// </summary>
internal interface IMultiplexCountersProvider internal interface IMultiplexCountersProvider
{ {
@@ -469,8 +507,8 @@ internal interface IMultiplexCountersProvider
} }
/// <summary> /// <summary>
/// Phase 11 — read-only window into the per-PLC <see cref="Cache.ResponseCache"/>'s live /// Read-only window into the per-PLC <see cref="Cache.ResponseCache"/>'s live state
/// state for the snapshot path. The multiplexer wires this on cache construction so the /// for the snapshot path. The multiplexer wires this on cache construction so the
/// status page sees live counts without holding a direct reference to the cache. /// status page sees live counts without holding a direct reference to the cache.
/// </summary> /// </summary>
internal interface ICacheStatsProvider internal interface ICacheStatsProvider
+99 -75
@@ -1,3 +1,4 @@
using System.Collections.Concurrent;
using System.Diagnostics; using System.Diagnostics;
using Mbproxy.Admin; using Mbproxy.Admin;
using Mbproxy.Bcd; using Mbproxy.Bcd;
@@ -13,7 +14,7 @@ namespace Mbproxy.Proxy;
/// <summary> /// <summary>
/// <see cref="BackgroundService"/> that owns all <see cref="PlcListenerSupervisor"/> instances. /// <see cref="BackgroundService"/> that owns all <see cref="PlcListenerSupervisor"/> instances.
/// ///
/// Startup posture (matches design doc "eager, continue on per-port failure"): /// Startup posture (matches docs/Architecture/Overview.md "eager, continue on per-port failure"):
/// <list type="number"> /// <list type="number">
/// <item>Enumerate <see cref="MbproxyOptions.Plcs"/> and build one supervisor per PLC.</item> /// <item>Enumerate <see cref="MbproxyOptions.Plcs"/> and build one supervisor per PLC.</item>
/// <item>Start all supervisors in parallel. Each supervisor attempts to bind immediately /// <item>Start all supervisors in parallel. Each supervisor attempts to bind immediately
@@ -23,8 +24,8 @@ namespace Mbproxy.Proxy;
/// log <c>mbproxy.startup.ready</c> with bound/configured counts.</item> /// log <c>mbproxy.startup.ready</c> with bound/configured counts.</item>
/// </list> /// </list>
/// ///
/// Phase 06: passes the supervisor dictionary to <see cref="ConfigReconciler.Attach"/> /// Passes the supervisor dictionary to <see cref="ConfigReconciler.Attach"/> after
/// after initial startup so hot-reload changes are applied by the reconciler. /// initial startup so hot-reload changes are applied by the reconciler.
/// ///
/// Stop: cancels all supervisors in parallel with a 5-second hard deadline. /// Stop: cancels all supervisors in parallel with a 5-second hard deadline.
/// </summary> /// </summary>
@@ -35,24 +36,30 @@ internal sealed partial class ProxyWorker : BackgroundService
private readonly ILogger<ProxyWorker> _logger; private readonly ILogger<ProxyWorker> _logger;
private readonly ILoggerFactory _loggerFactory; private readonly ILoggerFactory _loggerFactory;
private readonly ConfigReconciler _reconciler; private readonly ConfigReconciler _reconciler;
// Phase 12 (W1.5) — admin endpoint is no longer IHostedService; ProxyWorker drives its // Admin endpoint is not registered as IHostedService; ProxyWorker drives its
// lifecycle directly so the design's "drain THEN stop admin" ordering is honoured. // lifecycle directly so the design's "drain THEN stop admin" ordering is honoured.
// //
// Resolved LAZILY (in ExecuteAsync) rather than in the constructor because the DI graph // Resolved LAZILY (in ExecuteAsync) rather than in the constructor because the DI
// is circular: AdminEndpointHost → StatusSnapshotBuilder → ProxyWorker. A constructor // graph is circular: AdminEndpointHost → StatusSnapshotBuilder → ProxyWorker. A
// GetService<AdminEndpointHost>() during ProxyWorker's own construction returns null // constructor GetService<AdminEndpointHost>() during ProxyWorker's own construction
// silently. Lazy resolution sidesteps the cycle — by the time ExecuteAsync runs the DI // returns null silently. Lazy resolution sidesteps the cycle — by the time
// container is fully built. // ExecuteAsync runs the DI container is fully built.
private readonly IServiceProvider _services; private readonly IServiceProvider _services;
private AdminEndpointHost? _admin; private AdminEndpointHost? _admin;
// Phase 06: supervisors are now managed jointly by ProxyWorker (initial bootstrap) // Supervisors are managed jointly by ProxyWorker (initial bootstrap) and
// and ConfigReconciler (subsequent hot-reload changes). The dictionary is shared // ConfigReconciler (subsequent hot-reload changes). The dictionary is shared via
// via ConfigReconciler.Attach() after initial startup. // ConfigReconciler.Attach() after initial startup.
private readonly Dictionary<string, PlcListenerSupervisor> _supervisors = new(StringComparer.Ordinal); //
// ConcurrentDictionary because ConfigReconciler mutates this from parallel
// Task.WhenAll continuations (Add/Remove/Restart paths). The outer Apply is
// serialised by a semaphore but the inner per-PLC tasks run concurrently.
// Status-page reads via IReadOnlyDictionary still work without locking.
private readonly ConcurrentDictionary<string, PlcListenerSupervisor> _supervisors =
new(StringComparer.Ordinal);
/// <summary> /// <summary>
/// Read-only view of the live supervisor dictionary. Consumed by Phase 07's /// Read-only view of the live supervisor dictionary. Consumed by
/// <see cref="Admin.StatusSnapshotBuilder"/> to enumerate per-PLC state. /// <see cref="Admin.StatusSnapshotBuilder"/> to enumerate per-PLC state.
/// The caller should read this on the status-page path only (not the hot path). /// The caller should read this on the status-page path only (not the hot path).
/// </summary> /// </summary>
@@ -72,7 +79,7 @@ internal sealed partial class ProxyWorker : BackgroundService
_loggerFactory = loggerFactory; _loggerFactory = loggerFactory;
_reconciler = reconciler; _reconciler = reconciler;
_services = services; _services = services;
// Phase 12 (W1.5) — admin endpoint resolved lazily in ExecuteAsync (see field comment). // Admin endpoint resolved lazily in ExecuteAsync (see field comment).
} }
protected override async Task ExecuteAsync(CancellationToken stoppingToken) protected override async Task ExecuteAsync(CancellationToken stoppingToken)
@@ -100,11 +107,11 @@ internal sealed partial class ProxyWorker : BackgroundService
continue; continue;
} }
// Phase 11 — construct a per-PLC response cache only when at least one // Construct a per-PLC response cache only when at least one resolved tag
// resolved tag opts in (CacheTtlMs > 0). Skipping cache construction for a // opts in (CacheTtlMs > 0). Skipping cache construction for a PLC with no
// PLC with no cacheable tags keeps the no-cache path free of the eviction // cacheable tags keeps the no-cache path free of the eviction timer and the
// timer and the per-call resolution cost, preserving "default behaviour = // per-call resolution cost, preserving the "no caching" default behaviour
// Phase 10 unchanged" when no operator has opted any tag in. // when no operator has opted any tag in.
var cache = HasAnyCacheableTag(result.Map) var cache = HasAnyCacheableTag(result.Map)
? new ResponseCache(opts.Cache.MaxEntriesPerPlc, opts.Cache.EvictionIntervalMs) ? new ResponseCache(opts.Cache.MaxEntriesPerPlc, opts.Cache.EvictionIntervalMs)
: null; : null;
@@ -137,12 +144,17 @@ internal sealed partial class ProxyWorker : BackgroundService
resilienceOpts.ListenerRecovery, resilienceOpts.ListenerRecovery,
_loggerFactory.CreateLogger($"Mbproxy.Proxy.ListenerRecovery.{plc.Name}")); _loggerFactory.CreateLogger($"Mbproxy.Proxy.ListenerRecovery.{plc.Name}"));
// Phase 10 — give the supervisor a live accessor for ReadCoalescingOptions // Give the supervisor a live accessor for ReadCoalescingOptions so a
// so a hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled` propagates // hot-reload of `Mbproxy.Resilience.ReadCoalescing.Enabled` propagates to
// to the multiplexer's per-PDU coalescing decision. // the multiplexer's per-PDU coalescing decision.
Func<ReadCoalescingOptions> coalescingAccessor = Func<ReadCoalescingOptions> coalescingAccessor =
() => _options.CurrentValue.Resilience.ReadCoalescing; () => _options.CurrentValue.Resilience.ReadCoalescing;
// Live accessor for KeepaliveOptions so a hot-reload of `Connection.Keepalive`
// propagates to the backend heartbeat loop and to upstream-socket keepalive.
Func<KeepaliveOptions> keepaliveAccessor =
() => _options.CurrentValue.Connection.Keepalive;
var supervisor = new PlcListenerSupervisor( var supervisor = new PlcListenerSupervisor(
plc, plc,
opts.Connection, opts.Connection,
@@ -154,17 +166,24 @@ internal sealed partial class ProxyWorker : BackgroundService
recoveryPipeline, recoveryPipeline,
_loggerFactory.CreateLogger<PlcListenerSupervisor>(), _loggerFactory.CreateLogger<PlcListenerSupervisor>(),
backendPipeline, backendPipeline,
coalescingAccessor); coalescingAccessor,
keepaliveAccessor);
_supervisors[plc.Name] = supervisor; _supervisors[plc.Name] = supervisor;
} }
// ── Phase 06: wire reconciler BEFORE starting supervisors ───────────────── // ── Wire reconciler BEFORE starting supervisors ──────────────────────────
// Attach hands the reconciler the authoritative supervisor dictionary and the // Attach hands the reconciler the authoritative supervisor dictionary and the
// initial options snapshot. The reconciler won't process OnChange events until // initial options snapshot. The reconciler won't process OnChange events until
// after this call — the brief window between Attach and first supervisor start // after this call — the brief window between Attach and first supervisor start
// is safe because the channel signal only enqueues; apply runs asynchronously. // is safe because the channel signal only enqueues; apply runs asynchronously.
_reconciler.Attach(_supervisors, opts); // Pass the live coalescing accessor so reconciler-built supervisors
// (add/restart paths) honour hot-reloaded ReadCoalescing values.
Func<ReadCoalescingOptions> reconcilerCoalescingAccessor =
() => _options.CurrentValue.Resilience.ReadCoalescing;
Func<KeepaliveOptions> reconcilerKeepaliveAccessor =
() => _options.CurrentValue.Connection.Keepalive;
_reconciler.Attach(_supervisors, opts, reconcilerCoalescingAccessor, reconcilerKeepaliveAccessor);
if (_supervisors.Count == 0) if (_supervisors.Count == 0)
{ {
@@ -202,10 +221,10 @@ internal sealed partial class ProxyWorker : BackgroundService
int boundCount = _supervisors.Values.Count(s => s.Snapshot().State == SupervisorState.Bound); int boundCount = _supervisors.Values.Count(s => s.Snapshot().State == SupervisorState.Bound);
LogStartupReady(_logger, boundCount, plcsConfigured); LogStartupReady(_logger, boundCount, plcsConfigured);
// Phase 12 (W1.5) — start the admin endpoint AFTER listeners are bound so the // Start the admin endpoint AFTER listeners are bound so the status page can
// status page can never observe the service in a "no PLCs configured yet" state. // never observe the service in a "no PLCs configured yet" state. The admin
// The admin endpoint is no longer registered as IHostedService (the host's reverse // endpoint is not registered as IHostedService (the host's reverse stop order
// stop order would tear it down BEFORE drain). ProxyWorker drives both ends. // would tear it down BEFORE drain). ProxyWorker drives both ends.
// //
// Resolution happens here, not in the constructor — the DI graph is circular // Resolution happens here, not in the constructor — the DI graph is circular
// (admin → StatusSnapshotBuilder → ProxyWorker) and a constructor-time lookup // (admin → StatusSnapshotBuilder → ProxyWorker) and a constructor-time lookup
@@ -222,6 +241,15 @@ internal sealed partial class ProxyWorker : BackgroundService
_logger.LogError(ex, "Admin endpoint failed to start: {Message}", ex.Message); _logger.LogError(ex, "Admin endpoint failed to start: {Message}", ex.Message);
} }
} }
else
{
// Surface the absence. The lazy lookup returns null silently if
// AddMbproxyAdmin() is missing from Program.cs; a single warning makes a
// botched composition observable without blocking startup.
_logger.LogWarning(
"Admin endpoint not registered (AddMbproxyAdmin() missing from composition). " +
"Status page will be unavailable; service continues without it.");
}
// ── 6. Keep the worker alive until the host signals stop ───────────────────── // ── 6. Keep the worker alive until the host signals stop ─────────────────────
// Supervisors run their own background loops; ExecuteAsync just waits. // Supervisors run their own background loops; ExecuteAsync just waits.
@@ -229,32 +257,53 @@ internal sealed partial class ProxyWorker : BackgroundService
} }
/// <summary> /// <summary>
/// Phase 12 (W1.5) — graceful shutdown sequence (replaces the deleted /// Graceful shutdown sequence:
/// <c>ShutdownCoordinator</c>):
/// <list type="number"> /// <list type="number">
/// <item>Cancel <see cref="ExecuteAsync"/> via <c>base.StopAsync</c>.</item> /// <item>Cancel <see cref="ExecuteAsync"/> via <c>base.StopAsync</c>.</item>
/// <item>Stop all supervisors with a 5 s hard deadline (no new connections; existing /// <item><b>Snapshot</b> per-PLC in-flight counts BEFORE stopping supervisors —
/// pipes are cascaded by <see cref="PlcListenerSupervisor"/> teardown).</item> /// this is the only honest reading of "how many requests were in flight when
/// <item>Wait for in-flight PDUs to drain via the live /// we decided to stop." Once supervisors stop, their multiplexers are torn
/// <see cref="ConnectionOptions.GracefulShutdownTimeoutMs"/> (read fresh from /// down and the per-mux counter providers are nulled, so any later read
/// <see cref="IOptionsMonitor{T}.CurrentValue"/> so a hot-reloaded value is /// returns 0 regardless of what was actually dropped.</item>
/// honoured at stop time).</item> /// <item>Stop all supervisors with the configured graceful timeout. Supervisor
/// <item>Stop the admin endpoint LAST so the status page survives the drain phase /// stop is the actual drain — it cancels the listener, which exits its
/// and an operator polling it sees the in-flight count fall to zero.</item> /// accept loop, which disposes the multiplexer, which cascades all attached
/// pipes. There is no separate "drain in-flight" phase because there is
/// nothing to drain that wouldn't be killed by the supervisor stop itself.</item>
/// <item>Stop the admin endpoint LAST so the status page survives the supervisor
/// stop phase and operators can observe the live state right up to shutdown.</item>
/// <item>Dispose every supervisor to release sockets, channels, and watchdog timers.</item> /// <item>Dispose every supervisor to release sockets, channels, and watchdog timers.</item>
/// </list> /// </list>
/// Logs <c>mbproxy.shutdown.complete</c> on the way out with the in-flight count at /// Logs <c>mbproxy.shutdown.complete</c> with <c>InFlightAtCancel</c> equal to the
/// drain-deadline (zero on a clean shutdown, positive when forced cancel). /// snapshot count from step 2 (= the number of in-flight requests dropped by the
/// stop) and <c>ElapsedMs</c> for the whole sequence.
/// </summary> /// </summary>
public override async Task StopAsync(CancellationToken cancellationToken) public override async Task StopAsync(CancellationToken cancellationToken)
{ {
// Snapshot in-flight BEFORE base.StopAsync so the field matches its name: "the
// count at the moment the host signalled stop", not "the count at the moment we
// got around to computing it." `base.StopAsync` cancels the ExecuteAsync
// stoppingToken; in the milliseconds before it returns, in-flight requests
// whose responses arrive will be removed from _correlation and the watchdog can
// clear stale entries — the count would otherwise drift downward.
//
// Must run BEFORE supervisor stop too: after supervisor.StopAsync, multiplexers
// are disposed and CountInFlight returns 0 unconditionally.
int inFlightAtCancel = CountInFlight();
// Cancel ExecuteAsync first. // Cancel ExecuteAsync first.
await base.StopAsync(cancellationToken).ConfigureAwait(false); await base.StopAsync(cancellationToken).ConfigureAwait(false);
var sw = Stopwatch.StartNew(); var sw = Stopwatch.StartNew();
// ── 1. Stop accepting new connections ───────────────────────────────────────── // Supervisor stop deadline read from the live config so a hot-reloaded
using var stopCts = new CancellationTokenSource(TimeSpan.FromSeconds(5)); // GracefulShutdownTimeoutMs is honoured. Supervisor stop is the drain:
// cancelling the supervisor cancels the listener, which exits accept, which
// disposes the multiplexer, which cascades all attached pipes.
int gracefulMs = _options.CurrentValue.Connection.GracefulShutdownTimeoutMs;
// ── 1. Stop accepting new connections + drain (one combined phase) ────────────
using var stopCts = new CancellationTokenSource(TimeSpan.FromMilliseconds(gracefulMs));
using var linked = CancellationTokenSource.CreateLinkedTokenSource( using var linked = CancellationTokenSource.CreateLinkedTokenSource(
stopCts.Token, cancellationToken); stopCts.Token, cancellationToken);
@@ -271,31 +320,7 @@ internal sealed partial class ProxyWorker : BackgroundService
// Best effort — don't let individual supervisor failures block shutdown. // Best effort — don't let individual supervisor failures block shutdown.
} }
// ── 2. Drain in-flight PDUs ─────────────────────────────────────────────────── // ── 2. Stop admin endpoint LAST ───────────────────────────────────────────────
// Reads the current configured deadline so a hot-reloaded
// GracefulShutdownTimeoutMs is honoured at stop time, not frozen at process start.
int drainDeadlineMs = _options.CurrentValue.Connection.GracefulShutdownTimeoutMs;
int inFlightAtCancel = 0;
if (drainDeadlineMs > 0)
{
using var drainCts = new CancellationTokenSource(TimeSpan.FromMilliseconds(drainDeadlineMs));
try
{
while (!drainCts.Token.IsCancellationRequested)
{
int total = CountInFlight();
if (total == 0) break;
await Task.Delay(10, drainCts.Token).ConfigureAwait(false);
}
}
catch (OperationCanceledException)
{
inFlightAtCancel = CountInFlight();
}
}
// ── 3. Stop admin endpoint LAST ───────────────────────────────────────────────
if (_admin is not null) if (_admin is not null)
{ {
try try
@@ -309,7 +334,7 @@ internal sealed partial class ProxyWorker : BackgroundService
} }
} }
// ── 4. Dispose supervisors (releases sockets, channels, watchdog timers) ───── // ── 3. Dispose supervisors (releases sockets, channels, watchdog timers) ─────
foreach (var supervisor in _supervisors.Values) foreach (var supervisor in _supervisors.Values)
await supervisor.DisposeAsync().ConfigureAwait(false); await supervisor.DisposeAsync().ConfigureAwait(false);
@@ -329,11 +354,11 @@ internal sealed partial class ProxyWorker : BackgroundService
// ── Logging ─────────────────────────────────────────────────────────────────────────── // ── Logging ───────────────────────────────────────────────────────────────────────────
/// <summary> /// <summary>
/// Phase 11 — returns <c>true</c> when at least one BcdTag in the resolved map has a /// Returns <c>true</c> when at least one BcdTag in the resolved map has a positive
/// positive <see cref="BcdTag.CacheTtlMs"/>. A PLC with no cacheable tags skips the /// <see cref="BcdTag.CacheTtlMs"/>. A PLC with no cacheable tags skips the
/// <see cref="Mbproxy.Proxy.Cache.ResponseCache"/> entirely (no eviction timer, no /// <see cref="Mbproxy.Proxy.Cache.ResponseCache"/> entirely (no eviction timer, no
/// per-call cache resolution cost), so the default-OFF deployment is byte-identical /// per-call cache resolution cost), so the default-OFF deployment runs the
/// to a Phase-10 deployment. /// no-cache code path.
/// </summary> /// </summary>
private static bool HasAnyCacheableTag(BcdTagMap map) private static bool HasAnyCacheableTag(BcdTagMap map)
{ {
@@ -352,7 +377,6 @@ internal sealed partial class ProxyWorker : BackgroundService
Message = "Failed to bind listener: Plc={Plc} Port={Port} Reason={Reason}")] Message = "Failed to bind listener: Plc={Plc} Port={Port} Reason={Reason}")]
private static partial void LogBindFailed(ILogger logger, string plc, int port, string reason); private static partial void LogBindFailed(ILogger logger, string plc, int port, string reason);
// Phase 12 (W1.5) — moved here from the deleted ShutdownCoordinator.
[LoggerMessage(EventId = 80, EventName = "mbproxy.shutdown.complete", [LoggerMessage(EventId = 80, EventName = "mbproxy.shutdown.complete",
Level = LogLevel.Information, Level = LogLevel.Information,
Message = "Graceful shutdown complete: InFlightAtCancel={InFlightAtCancel} ElapsedMs={ElapsedMs}")] Message = "Graceful shutdown complete: InFlightAtCancel={InFlightAtCancel} ElapsedMs={ElapsedMs}")]
@@ -2,7 +2,7 @@ namespace Mbproxy.Proxy;
/// <summary> /// <summary>
/// Source-generated <see cref="LoggerMessage"/> definitions for the BCD rewriter pipeline. /// Source-generated <see cref="LoggerMessage"/> definitions for the BCD rewriter pipeline.
/// All event names are stable — do not rename without updating docs/design.md. /// All event names are stable — do not rename without updating docs/Reference/LogEvents.md.
/// </summary> /// </summary>
internal static partial class RewriterLogEvents internal static partial class RewriterLogEvents
{ {
@@ -0,0 +1,49 @@
using System.Net.Sockets;
using Mbproxy.Options;
namespace Mbproxy.Proxy;
/// <summary>
/// Applies OS-level TCP keepalive (<c>SO_KEEPALIVE</c> plus the idle-time / probe-interval /
/// probe-count tunables) to a socket. Used on both the backend socket (proxy → PLC) and
/// accepted upstream sockets (client → proxy) so the OS detects a dead peer on an
/// otherwise-idle connection — the DL205/DL260 ECOM never emits keepalives of its own.
/// </summary>
internal static class SocketKeepalive
{
/// <summary>
/// Enables TCP keepalive on <paramref name="socket"/> from <paramref name="options"/>.
/// A no-op when <see cref="KeepaliveOptions.Enabled"/> is <c>false</c>.
///
/// <para>Failures are swallowed: keepalive is a best-effort belt-and-suspenders measure
/// (the backend application heartbeat is the load-bearing mechanism) and must never
/// abort a connection. The three TCP tunables are also not honoured on every platform;
/// a refusal there is benign.</para>
/// </summary>
public static void Apply(Socket socket, KeepaliveOptions options)
{
if (!options.Enabled) return;
try
{
socket.SetSocketOption(SocketOptionLevel.Socket, SocketOptionName.KeepAlive, true);
// SocketOptionName.TcpKeepAliveTime / TcpKeepAliveInterval are specified in
// SECONDS; round the configured milliseconds up to at least one second.
int idleSec = Math.Max(1, (options.TcpIdleTimeMs + 999) / 1000);
int intervalSec = Math.Max(1, (options.TcpProbeIntervalMs + 999) / 1000);
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveTime, idleSec);
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveInterval, intervalSec);
socket.SetSocketOption(SocketOptionLevel.Tcp, SocketOptionName.TcpKeepAliveRetryCount, options.TcpProbeCount);
}
catch (SocketException)
{
// Platform refused a tunable — keepalive stays best-effort.
}
catch (ObjectDisposedException)
{
// Socket closed concurrently — nothing to do.
}
}
}
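The millisecond-to-second conversion in `SocketKeepalive.Apply` above is a ceiling division clamped to a minimum of one second. A minimal sketch of that arithmetic in isolation follows; the helper name `KeepaliveRounding.ToWholeSeconds` is hypothetical and is not part of the commit, which inlines the expression.

```csharp
using System;

// Hypothetical helper for illustration only; the commit inlines this expression.
public static class KeepaliveRounding
{
    // Ceiling-divide milliseconds by 1000, then clamp to at least one second,
    // because the TCP keepalive idle-time and interval options take whole seconds.
    public static int ToWholeSeconds(int milliseconds)
        => Math.Max(1, (milliseconds + 999) / 1000);
}
```

Under this sketch, 1500 ms becomes 2 s, 1000 ms stays 1 s, and 0 ms is clamped up to 1 s, so a misconfigured zero never reaches the socket option.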
@@ -1,4 +1,5 @@
using Mbproxy.Options; using Mbproxy.Options;
using Mbproxy.Proxy.Cache;
using Mbproxy.Proxy.Multiplexing; using Mbproxy.Proxy.Multiplexing;
using Polly; using Polly;
@@ -37,6 +38,7 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
private readonly ILogger<PlcListenerSupervisor> _logger; private readonly ILogger<PlcListenerSupervisor> _logger;
private readonly ResiliencePipeline? _backendConnectPipeline; private readonly ResiliencePipeline? _backendConnectPipeline;
private readonly Func<ReadCoalescingOptions>? _coalescingOptions; private readonly Func<ReadCoalescingOptions>? _coalescingOptions;
private readonly Func<KeepaliveOptions>? _keepaliveOptions;
// ── Mutable state ──────────────────────────────────────────────────────────────────── // ── Mutable state ────────────────────────────────────────────────────────────────────
@@ -45,15 +47,15 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
private volatile string? _lastBindError; private volatile string? _lastBindError;
private int _recoveryAttempts; // Interlocked private int _recoveryAttempts; // Interlocked
// Phase 07: current active listener for status-page pair enumeration. // Current active listener for status-page pair enumeration.
private volatile PlcListener? _currentListener; private volatile PlcListener? _currentListener;
// Phase 06: _perPlcContext is now mutable so ReplaceContextAsync can swap it. // _perPlcContext is mutable so ReplaceContextAsync can swap it. Access from the accept
// Access from the accept loop (RunAsync) and from ReplaceContextAsync must be // loop (RunAsync) and from ReplaceContextAsync must be coherent; we use a volatile
// coherent; we use a volatile reference so the accept loop always reads the latest // reference so the accept loop always reads the latest context without locking. The
// context without locking. The PlcListener created on each Polly attempt holds // PlcListener created on each Polly attempt holds its own copy of the context at
// its own copy of the context at construction time; existing in-flight connections // construction time; existing in-flight connections keep their old reference until they
// keep their old reference until they complete. // complete.
private volatile PerPlcContext? _currentContext; private volatile PerPlcContext? _currentContext;
/// <summary> /// <summary>
@@ -66,6 +68,18 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
private bool _disposed; private bool _disposed;
// Completes when the supervisor has transitioned out of Stopped for the first time
// (reached Bound or Recovering). Used by WaitForInitialBindAttemptAsync to avoid
// racing fast Stopped→Bound→Stopped transitions or hanging if the supervisor task
// throws inside Polly.
//
// Non-readonly so StartAsync can re-arm it for a re-Started supervisor. Without
// re-arming, a restart-after-stop scenario would have WaitForInitialBindAttemptAsync
// return immediately on the previous run's signal, never observing the new run's
// bind status.
private TaskCompletionSource _firstAttemptCompleted = new(
TaskCreationOptions.RunContinuationsAsynchronously);
// ── Public surface ──────────────────────────────────────────────────────────────────── // ── Public surface ────────────────────────────────────────────────────────────────────
public string PlcName => _plc.Name; public string PlcName => _plc.Name;
@@ -81,7 +95,8 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
ResiliencePipeline recoveryPipeline, ResiliencePipeline recoveryPipeline,
ILogger<PlcListenerSupervisor> logger, ILogger<PlcListenerSupervisor> logger,
ResiliencePipeline? backendConnectPipeline = null, ResiliencePipeline? backendConnectPipeline = null,
Func<ReadCoalescingOptions>? coalescingOptions = null) Func<ReadCoalescingOptions>? coalescingOptions = null,
Func<KeepaliveOptions>? keepaliveOptions = null)
{ {
_plc = plc; _plc = plc;
_connectionOptions = connectionOptions; _connectionOptions = connectionOptions;
@@ -90,11 +105,12 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
_multiplexerLogger = multiplexerLogger; _multiplexerLogger = multiplexerLogger;
_pipeLogger = pipeLogger; _pipeLogger = pipeLogger;
_perPlcContext = perPlcContext; _perPlcContext = perPlcContext;
_currentContext = perPlcContext; // Phase 06: live context slot _currentContext = perPlcContext; // live context slot
_recoveryPipeline = recoveryPipeline; _recoveryPipeline = recoveryPipeline;
_logger = logger; _logger = logger;
_backendConnectPipeline = backendConnectPipeline; _backendConnectPipeline = backendConnectPipeline;
_coalescingOptions = coalescingOptions; _coalescingOptions = coalescingOptions;
_keepaliveOptions = keepaliveOptions;
} }
/// <summary> /// <summary>
@@ -107,7 +123,7 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
/// <summary> /// <summary>
/// Live collection of active <see cref="UpstreamPipe"/> instances attached to this /// Live collection of active <see cref="UpstreamPipe"/> instances attached to this
/// PLC's multiplexer. Returns an empty collection when the listener is not bound. /// PLC's multiplexer. Returns an empty collection when the listener is not bound.
/// Consumed by Phase 07's status page (renamed from <c>ActivePairs</c> in Phase 9). /// Consumed by the status page.
/// </summary> /// </summary>
public IReadOnlyCollection<UpstreamPipe> ActiveUpstreams public IReadOnlyCollection<UpstreamPipe> ActiveUpstreams
=> _currentListener?.ActiveUpstreams ?? Array.Empty<UpstreamPipe>(); => _currentListener?.ActiveUpstreams ?? Array.Empty<UpstreamPipe>();
@@ -123,7 +139,28 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
/// </summary> /// </summary>
public Task StartAsync(CancellationToken ct) public Task StartAsync(CancellationToken ct)
{ {
// Refuse to re-Start an already-running or already-disposed supervisor. After
// Stop the state machine returns to Stopped and StartAsync can re-arm; the per-
// Start state (CTS, TCS) is refreshed below so no leak or stale signal carries
// across cycles.
if (_disposed)
throw new ObjectDisposedException(nameof(PlcListenerSupervisor));
if (_state != SupervisorState.Stopped || !_supervisorTask.IsCompleted)
throw new InvalidOperationException(
$"Supervisor for Plc='{_plc.Name}' has already been started.");
// Dispose the previous CTS before reassigning so a re-Start cycle does not leak
// the prior CTS (and any registrations linked to it). Idempotent: the
// ObjectDisposed catch covers the very-first-Start case where the field-init CTS
// is still fresh.
try { _supervisorCts.Dispose(); } catch (ObjectDisposedException) { /* fresh */ }
_supervisorCts = CancellationTokenSource.CreateLinkedTokenSource(ct); _supervisorCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
// Re-arm the first-attempt TCS so a re-Started supervisor doesn't immediately
// observe the previous run's signal in WaitForInitialBindAttemptAsync.
_firstAttemptCompleted = new TaskCompletionSource(
TaskCreationOptions.RunContinuationsAsynchronously);
_supervisorTask = Task.Run(() => RunSupervisorAsync(_supervisorCts.Token), CancellationToken.None); _supervisorTask = Task.Run(() => RunSupervisorAsync(_supervisorCts.Token), CancellationToken.None);
return Task.CompletedTask; return Task.CompletedTask;
} }
@@ -133,13 +170,22 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
/// (transitioned to <see cref="SupervisorState.Bound"/> or /// (transitioned to <see cref="SupervisorState.Bound"/> or
/// <see cref="SupervisorState.Recovering"/>). /// <see cref="SupervisorState.Recovering"/>).
/// Returns immediately if the supervisor is already past that point. /// Returns immediately if the supervisor is already past that point.
///
/// <para>Backed by a <see cref="TaskCompletionSource"/> set when the supervisor task
/// first transitions out of <see cref="SupervisorState.Stopped"/>. This avoids both
/// racing fast bind+stop sequences and hanging if the supervisor task throws before
/// any state write happens.</para>
/// </summary> /// </summary>
public async Task WaitForInitialBindAttemptAsync(CancellationToken ct) public async Task WaitForInitialBindAttemptAsync(CancellationToken ct)
{ {
while (_state == SupervisorState.Stopped && !ct.IsCancellationRequested if (_firstAttemptCompleted.Task.IsCompleted) return;
&& !_supervisorTask.IsCompleted) try
{ {
await Task.Delay(10, ct).ConfigureAwait(false); await _firstAttemptCompleted.Task.WaitAsync(ct).ConfigureAwait(false);
}
catch (OperationCanceledException)
{
// Caller cancelled; not a fault.
} }
} }
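The wait-for-first-attempt change above swaps a 10 ms polling loop for a completion signal. A minimal sketch of that pattern follows, assuming .NET 6 or later for `Task.WaitAsync(CancellationToken)`. `FirstAttemptSignal`, `Rearm`, and `Signal` are hypothetical names used for illustration only; they are not types or members from this commit.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for the supervisor: only the signalling pattern is shown.
public sealed class FirstAttemptSignal
{
    // RunContinuationsAsynchronously keeps waiter continuations off the thread
    // that calls Signal().
    private TaskCompletionSource _done =
        new(TaskCreationOptions.RunContinuationsAsynchronously);

    // Called on each (re-)Start so a new run does not observe the old signal.
    public void Rearm() =>
        _done = new TaskCompletionSource(TaskCreationOptions.RunContinuationsAsynchronously);

    // Safe to call from every exit path: TrySetResult is a no-op after the first call.
    public void Signal() => _done.TrySetResult();

    public async Task WaitAsync(CancellationToken ct)
    {
        Task task = _done.Task;          // capture once so a concurrent Rearm cannot swap it mid-wait
        if (task.IsCompleted) return;    // already past the first attempt
        try
        {
            await task.WaitAsync(ct).ConfigureAwait(false);
        }
        catch (OperationCanceledException)
        {
            // Caller cancelled; treated as a normal return, not a fault.
        }
    }
}
```

Because `TrySetResult` is idempotent, it can be called from the bind-failure path, the bind-success path, and a `finally` block alike, which is what prevents a waiter from hanging when the background task exits early.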
@@ -173,23 +219,50 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
} }
} }
/// <summary>Returns a point-in-time snapshot of this supervisor's state.</summary> /// <summary>
public SupervisorSnapshot Snapshot() => new( /// Returns a point-in-time snapshot of this supervisor's state.
State: _state, ///
LastBindError: _lastBindError, /// <para>Reads the three observable fields under a single lock so the status page
RecoveryAttempts: Interlocked.CompareExchange(ref _recoveryAttempts, 0, 0)); /// can never report inconsistent triples like
/// <c>(State=Bound, LastBindError=&lt;previous&gt;, RecoveryAttempts&gt;0)</c>. The
/// supervisor task uses <see cref="TransitionTo"/> which takes the same lock, so a
/// snapshot reads a transition-consistent view.</para>
/// </summary>
public SupervisorSnapshot Snapshot()
{
lock (_snapshotLock)
{
return new SupervisorSnapshot(
State: _state,
LastBindError: _lastBindError,
RecoveryAttempts: _recoveryAttempts);
}
}
private readonly object _snapshotLock = new();
/// <summary>
/// Atomic three-field transition. State, lastBindError, and (optionally) the
/// recoveryAttempts increment all happen under one lock so a concurrent
/// <see cref="Snapshot"/> never sees a half-applied transition.
/// </summary>
private void TransitionTo(SupervisorState newState, string? lastBindError, bool incrementRecoveryAttempt)
{
lock (_snapshotLock)
{
_state = newState;
_lastBindError = lastBindError;
if (incrementRecoveryAttempt)
_recoveryAttempts++;
}
}
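As a rough illustration of the lock-guarded transition and snapshot described above, the sketch below keeps three fields consistent by writing and reading them under one lock. `Phase`, `StateSnapshot`, and `StateHolder` are hypothetical names for this example; the real supervisor carries more state and different types.

```csharp
#nullable enable
using System;

public enum Phase { Stopped, Recovering, Bound }

// Value-type snapshot; record structs get value equality for free.
public readonly record struct StateSnapshot(Phase Phase, string? LastError, int Attempts);

public sealed class StateHolder
{
    private readonly object _gate = new();
    private Phase _phase = Phase.Stopped;
    private string? _lastError;
    private int _attempts;

    // All three fields change under one lock, so a reader never sees a
    // half-applied transition.
    public void TransitionTo(Phase next, string? lastError, bool countAttempt)
    {
        lock (_gate)
        {
            _phase = next;
            _lastError = lastError;
            if (countAttempt) _attempts++;
        }
    }

    // Readers take the same lock and copy the fields out together.
    public StateSnapshot Read()
    {
        lock (_gate)
        {
            return new StateSnapshot(_phase, _lastError, _attempts);
        }
    }
}
```

Once the counter is only touched inside the lock, a plain `++` is sufficient and the earlier `Interlocked` calls become unnecessary, which matches the direction of the change in this hunk.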
/// <summary> /// <summary>
/// Atomically swaps the per-PLC context (tag map + optional response cache) on the /// Atomically swaps the per-PLC context (tag map + optional response cache) on the
/// running listener AND its live multiplexer. /// running listener AND its live multiplexer. The swap propagates into the running
/// /// mux via <see cref="PlcMultiplexer.ReplaceContext"/>, so the very next PDU sees
/// <para><b>Phase 12 (W1.1)</b> — previously this method only updated the supervisor's /// the new tag map / new cache. Counters are preserved (the new context carries the
/// <c>_currentContext</c> slot, which meant the running <see cref="PlcMultiplexer"/> /// same <c>ProxyCounters</c> instance) so operator history is not reset.
/// kept using the OLD context (it captured the reference at construction). A reload
/// only became visible on the next listener fault. Now the swap propagates into the
/// running mux via <see cref="PlcMultiplexer.ReplaceContext"/>, so the very next PDU
/// sees the new tag map / new cache. Counters are preserved (the new context carries
/// the same <c>ProxyCounters</c> instance) so operator history is not reset.</para>
/// ///
/// <para><b>Old cache lifecycle</b>: the supervisor disposes the outgoing context's /// <para><b>Old cache lifecycle</b>: the supervisor disposes the outgoing context's
/// cache AFTER the multiplexer has been swapped to the new context. By that point no /// cache AFTER the multiplexer has been swapped to the new context. By that point no
@@ -204,18 +277,22 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
// subsequent fault recovery) will pick up newCtx through this slot. // subsequent fault recovery) will pick up newCtx through this slot.
_currentContext = newCtx; _currentContext = newCtx;
// Phase 12 (W1.1) — push the swap into the running multiplexer so existing // Push the swap into the running multiplexer so existing connections see the new
// connections see the new tag map / new cache on their next PDU. _currentListener // tag map / new cache on their next PDU. _currentListener may be null between
// may be null between Polly retry attempts; in that case the next listener built // Polly retry attempts; in that case the next listener built inside the Polly loop
// inside the Polly loop will pick up newCtx through _currentContext above. // will pick up newCtx through _currentContext above.
_currentListener?.Multiplexer?.ReplaceContext(newCtx); _currentListener?.Multiplexer?.ReplaceContext(newCtx);
// Phase 12 (W1.1 + W2.8 prereq) — drop the outgoing cache AFTER the swap so the // Drop the outgoing cache AFTER the swap so the running multiplexer can no longer
// running multiplexer can no longer reach it. Dispose stops the eviction loop and // reach it. Clear() snapshots the entry count for the mbproxy.cache.flushed log
// releases the timer. (The cache.flushed log event is W2.8 work; this Wave-1 fix // event before disposing the cache (which stops the eviction loop and releases
// is the "no longer in use, safe to drop" piece.) // the timer).
if (oldCache is not null && !ReferenceEquals(oldCache, newCtx.Cache)) if (oldCache is not null && !ReferenceEquals(oldCache, newCtx.Cache))
{
int dropped = oldCache.Clear();
CacheLogEvents.Flushed(_logger, _plc.Name, "tag-list-reload", dropped);
oldCache.Dispose(); oldCache.Dispose();
}
return Task.CompletedTask; return Task.CompletedTask;
} }
@@ -237,11 +314,11 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
// A faulted listener's TcpListener socket must be disposed before // A faulted listener's TcpListener socket must be disposed before
// re-binding. We create a new PlcListener on each attempt. // re-binding. We create a new PlcListener on each attempt.
// //
// Phase 06: use _currentContext (volatile) so that a ReplaceContextAsync // Use _currentContext (volatile) so that a ReplaceContextAsync call
// call between Polly retry attempts is picked up here. Each listener // between Polly retry attempts is picked up here. Each listener captures
// captures the context at construction time; existing in-flight pairs // the context at construction time; existing in-flight pairs keep their
// keep their own reference. See ReplaceContextAsync for the transition // own reference. See ReplaceContextAsync for the transition window
// window documentation. // documentation.
var listener = new PlcListener( var listener = new PlcListener(
_plc, _plc,
_connectionOptions, _connectionOptions,
@@ -251,9 +328,10 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
_pipeLogger, _pipeLogger,
_currentContext, _currentContext,
_backendConnectPipeline, _backendConnectPipeline,
_coalescingOptions); _coalescingOptions,
_keepaliveOptions);
// Phase 07: expose the current listener for status-page pair enumeration. // Expose the current listener for status-page pair enumeration.
_currentListener = listener; _currentListener = listener;
try try
@@ -268,13 +346,12 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
_currentListener = null; _currentListener = null;
await listener.DisposeAsync().ConfigureAwait(false); await listener.DisposeAsync().ConfigureAwait(false);
Interlocked.Increment(ref _recoveryAttempts); string truncated = Truncate(bindEx.Message, 256);
string reason = bindEx.Message; TransitionTo(SupervisorState.Recovering, truncated, incrementRecoveryAttempt: true);
string truncated = reason.Length > 256 ? reason[..256] : reason; // Signal the first transition out of Stopped.
_lastBindError = truncated; _firstAttemptCompleted.TrySetResult();
_state = SupervisorState.Recovering;
// Also update the per-PLC counters if available (Phase 07 reads these). // Also update the per-PLC counters if available (status page reads these).
_currentContext?.Counters.IncrementRecoveryAttempt(truncated); _currentContext?.Counters.IncrementRecoveryAttempt(truncated);
LogBindFailed(_logger, _plc.Name, _plc.ListenPort, truncated); LogBindFailed(_logger, _plc.Name, _plc.ListenPort, truncated);
@@ -297,9 +374,10 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
} }
// Clear the last bind error on a successful bind. // Clear the last bind error on a successful bind.
_lastBindError = null; TransitionTo(SupervisorState.Bound, lastBindError: null, incrementRecoveryAttempt: false);
_currentContext?.Counters.ClearLastBindError(); _currentContext?.Counters.ClearLastBindError();
_state = SupervisorState.Bound; // Signal the first transition out of Stopped.
_firstAttemptCompleted.TrySetResult();
// ── Run the accept loop ────────────────────────────────────────── // ── Run the accept loop ──────────────────────────────────────────
// RunAsync returns when: (a) token is cancelled (normal shutdown), // RunAsync returns when: (a) token is cancelled (normal shutdown),
@@ -324,10 +402,11 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
_currentListener = null; _currentListener = null;
await listener.DisposeAsync().ConfigureAwait(false); await listener.DisposeAsync().ConfigureAwait(false);
Interlocked.Increment(ref _recoveryAttempts); string truncated = Truncate(runEx.Message, 256);
string truncated = runEx.Message.Length > 256 ? runEx.Message[..256] : runEx.Message; TransitionTo(SupervisorState.Recovering, truncated, incrementRecoveryAttempt: true);
_lastBindError = truncated; // Also signal first-attempt-completed in case the very first
_state = SupervisorState.Recovering; // listener.RunAsync faulted before the bind-success path signalled it.
_firstAttemptCompleted.TrySetResult();
// Also update the per-PLC counters if available. // Also update the per-PLC counters if available.
_currentContext?.Counters.IncrementRecoveryAttempt(truncated); _currentContext?.Counters.IncrementRecoveryAttempt(truncated);
@@ -346,10 +425,8 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
// Otherwise (listener closed without cancellation — e.g., OS event), // Otherwise (listener closed without cancellation — e.g., OS event),
// treat as a fault and re-enter recovery. // treat as a fault and re-enter recovery.
Interlocked.Increment(ref _recoveryAttempts);
const string unexpectedEnd = "Listener accept loop ended unexpectedly"; const string unexpectedEnd = "Listener accept loop ended unexpectedly";
_lastBindError = unexpectedEnd; TransitionTo(SupervisorState.Recovering, unexpectedEnd, incrementRecoveryAttempt: true);
_state = SupervisorState.Recovering;
_currentContext?.Counters.IncrementRecoveryAttempt(unexpectedEnd); _currentContext?.Counters.IncrementRecoveryAttempt(unexpectedEnd);
LogListenerEnded(_logger, _plc.Name, _plc.ListenPort); LogListenerEnded(_logger, _plc.Name, _plc.ListenPort);
throw new InvalidOperationException(unexpectedEnd); throw new InvalidOperationException(unexpectedEnd);
@@ -369,11 +446,26 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
} }
finally finally
{ {
_state = SupervisorState.Stopped; // Snapshot consistency: state goes back to Stopped without changing the last
// bind error so operators can still see WHY the supervisor exited.
lock (_snapshotLock)
{
_state = SupervisorState.Stopped;
}
_currentListener = null; _currentListener = null;
// Defensive: if RunSupervisorAsync exits before any bind attempt fired
// (e.g. construction-time fault), unblock any awaiting
// WaitForInitialBindAttemptAsync caller so it doesn't hang.
_firstAttemptCompleted.TrySetResult();
} }
} }
/// <summary>
/// Shared helper that caps an exception message at <paramref name="max"/> characters.
/// Used by the supervisor's bind-failure and run-fault recovery paths (the
/// unexpected-end path uses a constant message and needs no truncation).
private static string Truncate(string s, int max) => s.Length > max ? s[..max] : s;
// ── IAsyncDisposable ───────────────────────────────────────────────────────────────── // ── IAsyncDisposable ─────────────────────────────────────────────────────────────────
public async ValueTask DisposeAsync() public async ValueTask DisposeAsync()
@@ -391,8 +483,8 @@ internal sealed partial class PlcListenerSupervisor : IAsyncDisposable
// Best-effort cleanup. // Best-effort cleanup.
} }
// Phase 11: dispose the response cache (if any) — its eviction timer would // Dispose the response cache (if any) — its eviction timer would otherwise
// otherwise outlive the supervisor. // outlive the supervisor.
_currentContext?.Cache?.Dispose(); _currentContext?.Cache?.Dispose();
_supervisorCts.Dispose(); _supervisorCts.Dispose();
@@ -26,14 +26,14 @@ public enum SupervisorState
} }
/// <summary> /// <summary>
/// Immutable point-in-time snapshot of a supervisor's state. Consumed by Phase 07's /// Immutable point-in-time snapshot of a supervisor's state. Consumed by the status
/// status page via <see cref="PlcListenerSupervisor.Snapshot"/>. /// page via <see cref="PlcListenerSupervisor.Snapshot"/>.
/// ///
/// <para><b>RecoveryAttempts semantics</b>: this counter <em>accumulates over the lifetime /// <para><b>RecoveryAttempts semantics</b>: this counter <em>accumulates over the lifetime
/// of the supervisor</em> and is never reset. Operators reading the status page should /// of the supervisor</em> and is never reset. Operators reading the status page should
/// interpret it as "how many times has this listener faulted or failed to bind since /// interpret it as "how many times has this listener faulted or failed to bind since
/// the service started" — useful for detecting port-flapping or repeated OS network /// the service started" — useful for detecting port-flapping or repeated OS network
/// resets. Phase 07 surfaces it as-is.</para> /// resets.</para>
/// </summary> /// </summary>
/// <param name="State">Current state of the supervisor.</param> /// <param name="State">Current state of the supervisor.</param>
/// <param name="LastBindError"> /// <param name="LastBindError">
+1 -1
@@ -2,7 +2,7 @@ namespace Mbproxy;
/// <summary> /// <summary>
/// Service-wide counters for the mbproxy host. Tracks reload accept/reject counts and /// Service-wide counters for the mbproxy host. Tracks reload accept/reject counts and
/// timestamps so Phase 07's status page can surface them without coupling to the reconciler. /// timestamps so the status page can surface them without coupling to the reconciler.
/// ///
/// <para>Constructed once at DI startup and shared as a singleton. All writes are via /// <para>Constructed once at DI startup and shared as a singleton. All writes are via
/// dedicated methods that use <see cref="Interlocked"/> so reads from the status page /// dedicated methods that use <see cref="Interlocked"/> so reads from the status page
-54
@@ -1,54 +0,0 @@
{
"Mbproxy": {
"BcdTags": {
"Global": []
},
"Plcs": [],
"AdminPort": 8080,
"Connection": {
"BackendConnectTimeoutMs": 3000,
"BackendRequestTimeoutMs": 3000
},
"Resilience": {
"BackendConnect": {
"MaxAttempts": 3,
"BackoffMs": [ 100, 500, 2000 ]
},
"ListenerRecovery": {
"InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ],
"SteadyStateMs": 30000
},
"ReadCoalescing": {
"Enabled": true,
"MaxParties": 32
}
}
},
"Serilog": {
"Using": [ "Serilog.Sinks.Console", "Serilog.Sinks.File" ],
"MinimumLevel": {
"Default": "Information",
"Override": {
"Microsoft": "Warning",
"System": "Warning"
}
},
"WriteTo": [
{
"Name": "Console",
"Args": {
"outputTemplate": "[{Timestamp:HH:mm:ss} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}"
}
},
{
"Name": "File",
"Args": {
"path": "C:\\ProgramData\\mbproxy\\logs\\mbproxy-.log",
"rollingInterval": "Day",
"retainedFileCountLimit": 30,
"outputTemplate": "[{Timestamp:yyyy-MM-dd HH:mm:ss.fff zzz} {Level:u3}] {Message:lj} {Properties:j}{NewLine}{Exception}"
}
}
]
}
}
@@ -330,6 +330,40 @@ public sealed class AdminEndpointTests
System.IO.File.Move(tmp, path, overwrite: true); System.IO.File.Move(tmp, path, overwrite: true);
} }
// ── non-GET methods rejected ─────────────────────────────────────────
/// <summary>
/// Verifies the admin endpoint rejects non-GET methods (POST / PUT / DELETE / PATCH)
/// with HTTP 405 Method Not Allowed. The design intentionally exposes only <c>GET /</c>
/// and <c>GET /status.json</c>; this test guards against an accidental MapPost/Map* being
/// added later.
/// </summary>
[Theory(Timeout = 5_000)]
[InlineData("POST")]
[InlineData("PUT")]
[InlineData("DELETE")]
[InlineData("PATCH")]
public async Task NonGetMethod_AgainstAdminRoutes_Returns405(string method)
{
int adminPort = PickFreePort();
int proxyPort = PickFreePort();
var host = BuildHost(adminPort: adminPort, simHost: "127.0.0.1", simPort: 502,
proxyPort: proxyPort, bcd16Addresses: []);
await using var _ = new AsyncHostDispose(host);
await host.StartAsync(TestContext.Current.CancellationToken);
await WaitForAdminAsync(adminPort);
foreach (string path in new[] { "/", "/status.json" })
{
using var req = new HttpRequestMessage(new HttpMethod(method),
$"http://127.0.0.1:{adminPort}{path}");
using var resp = await HttpClient.SendAsync(req, TestContext.Current.CancellationToken);
resp.StatusCode.ShouldBe(HttpStatusCode.MethodNotAllowed,
$"{method} {path} must be rejected (admin endpoint is read-only)");
}
}
// ── Helpers ─────────────────────────────────────────────────────────────── // ── Helpers ───────────────────────────────────────────────────────────────
private static IHost BuildHost( private static IHost BuildHost(
@@ -43,17 +43,19 @@ public sealed class StatusHtmlRendererTests
ListenPort: 5020, ListenPort: 5020,
Listener: new PlcListenerStatus(state, lastBindError, recoveryAttempts), Listener: new PlcListenerStatus(state, lastBindError, recoveryAttempts),
Clients: new PlcClientsStatus(clients?.Count ?? 0, clients ?? noClients), Clients: new PlcClientsStatus(clients?.Count ?? 0, clients ?? noClients),
Pdus: new PlcPdusStatus(100, new FcCounts(50, 10, 20, 15, 5), 30, 2), Pdus: new PlcPdusStatus(100, new FcCounts(50, 10, 20, 15, 5), 30, 2, 0),
Backend: new PlcBackendStatus( Backend: new PlcBackendStatus(
ConnectsSuccess: 0, ConnectsFailed: 0, ConnectsSuccess: 0, ConnectsFailed: 0,
ExceptionsByCode: new ExceptionCounts(1, 0, 0, 0), ExceptionsByCode: new ExceptionCounts(1, 0, 0, 0, 0),
LastRoundTripMs: 3.5, LastRoundTripMs: 3.5,
InFlight: 0, MaxInFlight: 0, TxIdWraps: 0, InFlight: 0, MaxInFlight: 0, TxIdWraps: 0,
DisconnectCascades: 0, QueueDepth: 0, DisconnectCascades: 0, QueueDepth: 0,
CoalescedHitCount: 0, CoalescedMissCount: 0, CoalescedHitCount: 0, CoalescedMissCount: 0,
CoalescedResponseToDeadUpstream: 0, CoalescedResponseToDeadUpstream: 0,
CacheHitCount: 0, CacheMissCount: 0, CacheHitCount: 0, CacheMissCount: 0,
CacheInvalidations: 0, CacheEntryCount: 0, CacheBytes: 0), CacheInvalidations: 0, CacheEntryCount: 0, CacheBytes: 0,
BackendHeartbeatsSent: 0, BackendHeartbeatsFailed: 0,
BackendIdleDisconnects: 0),
Bytes: new PlcBytesStatus(1024, 2048)); Bytes: new PlcBytesStatus(1024, 2048));
} }
@@ -10,8 +10,8 @@ namespace Mbproxy.Tests.Bcd;
/// NOTE on allocation profile: /// NOTE on allocation profile:
/// BcdCodec is a purely static class operating on value types (ushort, int, tuples). /// BcdCodec is a purely static class operating on value types (ushort, int, tuples).
/// It allocates only when constructing exception objects (the error path), never on /// It allocates only when constructing exception objects (the error path), never on
/// the success path. TryGet / hot-path decode callers in Phase 04 will be /// the success path. TryGet / hot-path decode callers are allocation-free for valid
/// allocation-free for valid BCD registers. /// BCD registers.
/// </summary> /// </summary>
[Trait("Category", "Unit")] [Trait("Category", "Unit")]
public sealed class BcdCodecTests public sealed class BcdCodecTests
@@ -44,6 +44,20 @@ public sealed class BcdCodecTests
.ParamName.ShouldBe("value"); .ParamName.ShouldBe("value");
} }
/// <summary>
/// Locks the boundary contract for the <c>(uint)value > Max16</c> range check.
/// <c>int.MinValue</c> cast to <c>uint</c> becomes <c>0x80000000</c>, which is well above
/// <c>Max16</c> (= 9999), so the throw fires cleanly with no arithmetic surprise. Guards
/// against regressions if the bounds check is ever rewritten using signed int
/// arithmetic that could overflow on extreme negatives.
/// </summary>
[Fact]
public void Encode16_IntMinValue_Throws_OutOfRange_NoArithmeticSurprise()
{
Should.Throw<ArgumentOutOfRangeException>(() => BcdCodec.Encode16(int.MinValue))
.ParamName.ShouldBe("value");
}
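The unsigned-cast range check that this test pins down can be sketched in isolation as follows. `RangeCheck` and `InRange` are hypothetical names, and declaring `Max16` as a `uint` constant is an assumption made for the example; only the value 9999 comes from the comment above.

```csharp
using System;

public static class RangeCheck
{
    // Assumed declaration for illustration; the source only states Max16 = 9999.
    private const uint Max16 = 9999;

    // One comparison covers both bounds: any negative int reinterpreted as uint
    // is at least 0x80000000, far above Max16. The explicit unchecked keeps the
    // cast from throwing if the project compiles with overflow checking on.
    public static bool InRange(int value) => unchecked((uint)value) <= Max16;

    public static void Require(int value)
    {
        if (!InRange(value))
            throw new ArgumentOutOfRangeException(nameof(value));
    }
}
```

The design choice here is that no subtraction or negation is performed on the input, so there is no intermediate result that could overflow for `int.MinValue`.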
// ── Decode16 ──────────────────────────────────────────────────────────── // ── Decode16 ────────────────────────────────────────────────────────────
[Fact] [Fact]
@@ -98,62 +98,71 @@ public sealed class BcdTagMapBuilderTests
tag.Width.ShouldBe((byte)32); tag.Width.ShouldBe((byte)32);
} }
/// <summary>
/// Duplicates within Global itself are detected pre-collapse and produce a
/// DuplicateAddress error. (A naive input dictionary would silently collapse
/// to last-write-wins, leaving the validator dead.)
/// </summary>
[Fact] [Fact]
public void Build_DuplicateAddressInGlobal_ReturnsDuplicateAddressError() public void Build_DuplicateAddressInGlobal_ReturnsDuplicateAddressError()
{ {
// Two options with the same address in Global.
// The working dictionary collapses them (last-write-wins),
// so a true duplicate is one in Add that matches Global after step 3
// has already resolved — which the builder handles as "Add wins" (no error).
// This test instead validates the case where Global has a structural duplicate
// after the full resolution results in one address appearing twice, which can
// happen if the options list is constructed with the same address twice.
var global = new BcdTagListOptions var global = new BcdTagListOptions
{ {
Global = Global =
[ [
new BcdTagOptions { Address = 1072, Width = 16 }, new BcdTagOptions { Address = 1072, Width = 16 },
new BcdTagOptions { Address = 1072, Width = 32 }, // same address, different width new BcdTagOptions { Address = 1072, Width = 32 }, // duplicate within Global
] ]
}; };
// The dictionary collapses to one entry (last-write-wins in the dictionary).
// A real duplicate-detection scenario: two separately-identical entries through Add.
// Let's construct a true duplicate through the Add path overwriting Global
// and then adding the same address again.
// Actually: our builder uses Dictionary<ushort, BcdTagOptions> which deduplicates
// by key. The DuplicateAddress error fires when seenAddresses already contains addr,
// which can only happen if working has two entries with the same key — but Dictionary
// prevents that. The correct scenario is: two Add entries with the same address in
// the IReadOnlyList (list allows duplication even though dict collapses them).
// Since the builder iterates the list and adds to dict, duplicates in the list
// get silently resolved. The DuplicateAddress error is thus for a theoretical
// future path; let's verify the "Add with same address as existing" path instead.
var result = BcdTagMapBuilder.Build(global, perPlc: null); var result = BcdTagMapBuilder.Build(global, perPlc: null);
// Should resolve cleanly (dict collapses to last write). result.Errors.ShouldContain(e => e.Kind == BcdValidationError.DuplicateAddress
result.Errors.ShouldBeEmpty(); && e.Address == 1072);
result.Map.Count.ShouldBe(1);
} }
/// <summary>
/// Duplicates within the per-PLC Add list itself are detected pre-collapse.
/// (Cross-list collisions Global vs Add remain the legitimate width-override
/// pattern and are NOT errors — see the next test.)
/// </summary>
[Fact] [Fact]
public void Build_DuplicateAddress_Via_AddList_Produces_No_Error_LastWriteWins() public void Build_DuplicateAddress_Within_AddList_ReturnsDuplicateAddressError()
{ {
// The Add list has two entries for the same address; builder sees the last one.
// This is intentional: it allows width overrides. No duplicate error expected.
var global = Global((1072, 16)); var global = Global((1072, 16));
var perPlc = new PlcBcdOverrides var perPlc = new PlcBcdOverrides
{ {
Add = Add =
[ [
new BcdTagOptions { Address = 1072, Width = 16 }, new BcdTagOptions { Address = 1080, Width = 16 },
new BcdTagOptions { Address = 1072, Width = 32 }, // override the first Add new BcdTagOptions { Address = 1080, Width = 32 }, // duplicate within Add
], ],
Remove = [], Remove = [],
}; };
var result = BcdTagMapBuilder.Build(global, perPlc); var result = BcdTagMapBuilder.Build(global, perPlc);
result.Errors.ShouldContain(e => e.Kind == BcdValidationError.DuplicateAddress
&& e.Address == 1080);
}
/// <summary>
/// Same-address entries appearing in BOTH Global AND Add are the documented
/// width-override pattern (docs/Features/BcdRewriting.md "Hybrid tag resolution"). They must NOT
/// be flagged as duplicates; Add wins.
/// </summary>
[Fact]
public void Build_AddOverridesGlobalAtSameAddress_NoDuplicateError_AddWins()
{
var global = Global((1072, 16));
var perPlc = new PlcBcdOverrides
{
Add = [ new BcdTagOptions { Address = 1072, Width = 32 } ],
Remove = [],
};
var result = BcdTagMapBuilder.Build(global, perPlc);
result.Errors.ShouldBeEmpty(); result.Errors.ShouldBeEmpty();
result.Map.TryGet(1072, out var tag).ShouldBeTrue(); result.Map.TryGet(1072, out var tag).ShouldBeTrue();
tag.Width.ShouldBe((byte)32); tag.Width.ShouldBe((byte)32);
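The three tests above distinguish duplicates inside one list (an error) from the same address appearing in both Global and Add (a legitimate override). One way to get that behaviour is to scan each list on its own before merging it into a dictionary. The sketch below shows only that idea; `DuplicateCheck` and `FindDuplicates` are hypothetical and do not reflect how `BcdTagMapBuilder` is actually implemented.

```csharp
using System.Collections.Generic;

public static class DuplicateCheck
{
    // Returns each address that occurs more than once within a single list,
    // reported once, in order of first repetition.
    public static List<ushort> FindDuplicates(IEnumerable<ushort> addresses)
    {
        var seen = new HashSet<ushort>();
        var reported = new HashSet<ushort>();
        var duplicates = new List<ushort>();
        foreach (ushort address in addresses)
        {
            // HashSet.Add returns false when the value is already present.
            if (!seen.Add(address) && reported.Add(address))
                duplicates.Add(address);
        }
        return duplicates;
    }
}
```

Running the check separately over the Global list and over the Add list, and only then letting Add entries overwrite Global entries in the merged map, would flag in-list duplicates while leaving the cross-list override untouched.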
@@ -109,7 +109,7 @@ public sealed class ConfigReconcilerTests : IAsyncDisposable
_supervisors.Add(supA); _supervisors.Add(supA);
await supA.StartAsync(CancellationToken.None); await supA.StartAsync(CancellationToken.None);
var supervisors = new Dictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal) var supervisors = new System.Collections.Concurrent.ConcurrentDictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal)
{ {
["A"] = supA, ["A"] = supA,
}; };
@@ -149,7 +149,7 @@ public sealed class ConfigReconcilerTests : IAsyncDisposable
_supervisors.Add(supA); _supervisors.Add(supA);
await supA.StartAsync(CancellationToken.None); await supA.StartAsync(CancellationToken.None);
var supervisors = new Dictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal) var supervisors = new System.Collections.Concurrent.ConcurrentDictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal)
{ {
["A"] = supA, ["A"] = supA,
}; };
@@ -207,7 +207,7 @@ public sealed class ConfigReconcilerTests : IAsyncDisposable
await supA.WaitForInitialBindAttemptAsync(waitCts.Token); await supA.WaitForInitialBindAttemptAsync(waitCts.Token);
Assert.Equal(SupervisorState.Bound, supA.Snapshot().State); Assert.Equal(SupervisorState.Bound, supA.Snapshot().State);
var supervisors = new Dictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal) var supervisors = new System.Collections.Concurrent.ConcurrentDictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal)
{ {
["A"] = supA, ["A"] = supA,
}; };
@@ -245,7 +245,7 @@ public sealed class ConfigReconcilerTests : IAsyncDisposable
var counters = new ServiceCounters(); var counters = new ServiceCounters();
var reconciler = BuildReconciler(monitor, counters); var reconciler = BuildReconciler(monitor, counters);
_reconcilers.Add(reconciler); _reconcilers.Add(reconciler);
reconciler.Attach(new Dictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal), initial); reconciler.Attach(new System.Collections.Concurrent.ConcurrentDictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal), initial);
// Fire 5 concurrent Apply calls — they must execute one-at-a-time. // Fire 5 concurrent Apply calls — they must execute one-at-a-time.
var opts = MakeOptions([]); var opts = MakeOptions([]);
@@ -280,6 +280,72 @@ public sealed class ConfigReconcilerTests : IAsyncDisposable
// under concurrent load. // under concurrent load.
Assert.Equal(5, counters.ReloadAppliedCount); Assert.Equal(5, counters.ReloadAppliedCount);
} }
/// <summary>
/// Stress-tests the live supervisor dictionary and the coalescing-accessor wiring.
/// Many concurrent Apply calls drive add/remove of many distinct PLCs; the inner
/// Task.WhenAll continuations must not corrupt the dictionary or crash with
/// KeyNotFoundException or ArgumentException. The test asserts: all applies
/// complete, no exceptions are thrown, and the reload counter is exactly the
/// apply count.
/// </summary>
[Fact(Timeout = 30_000)]
public async Task Apply_ManyConcurrentReloads_With_PlcChurn_NoCorruption()
{
// Empty initial — first Apply will Add all PLCs.
var initial = MakeOptions([]);
var monitor = new FakeOptionsMonitor(initial);
var supervisors = new System.Collections.Concurrent.ConcurrentDictionary<string, PlcListenerSupervisor>(StringComparer.Ordinal);
var counters = new ServiceCounters();
var reconciler = BuildReconciler(monitor, counters);
_reconcilers.Add(reconciler);
reconciler.Attach(supervisors, initial);
// Build 8 different option snapshots, each a different PLC roster.
// Each Apply triggers Add+Remove churn against the live supervisor dict, which is
// exactly the mutation path the ConcurrentDictionary protects from corruption.
const int snapshots = 8;
const int plcsPerSnapshot = 4;
var snaps = new MbproxyOptions[snapshots];
var allPlcs = new List<PlcOptions>();
for (int s = 0; s < snapshots; s++)
{
var plcsForSnap = new PlcOptions[plcsPerSnapshot];
for (int p = 0; p < plcsPerSnapshot; p++)
{
plcsForSnap[p] = MakePlc($"PLC-{s}-{p}", PickFreePort());
allPlcs.Add(plcsForSnap[p]);
}
snaps[s] = MakeOptions(plcsForSnap);
}
using var cts = new CancellationTokenSource(TimeSpan.FromSeconds(25));
// Fire 16 concurrent applies cycling through the 8 snapshots so each is
// submitted twice. Inner per-PLC Task.WhenAll continuations run in parallel
// and stress-test the dictionary mutation safety.
var tasks = Enumerable.Range(0, 16)
.Select(i => Task.Run(() => reconciler.ApplyAsync(snaps[i % snapshots], cts.Token), cts.Token))
.ToArray();
var results = await Task.WhenAll(tasks);
Assert.All(results, r => Assert.True(r, "every Apply must succeed"));
Assert.Equal(16, counters.ReloadAppliedCount);
// Final dictionary state: ideally every key comes from the last-applied snapshot,
// but which snapshot is applied last depends on scheduling. So we only verify there
// are NO orphan entries: every supervisor in the dict must correspond to some
// snapshot's PLCs.
var validNames = new HashSet<string>(allPlcs.Select(p => p.Name));
foreach (var name in supervisors.Keys)
Assert.Contains(name, validNames);
// Track supervisors for cleanup.
foreach (var s in supervisors.Values)
_supervisors.Add(s);
}
} }
/// <summary> /// <summary>
@@ -320,6 +320,129 @@ public sealed class HotReloadE2ETests : IAsyncLifetime
await host.StopAsync(stopCts.Token); await host.StopAsync(stopCts.Token);
} }
// ── cache flush on tag-list reload ──────────────────────────────────────────────────
/// <summary>
/// Verifies that a tag-list reload for a PLC with a cacheable tag emits
/// <c>mbproxy.cache.flushed</c>. The cache count is 0 (no real backend to populate
/// it), but the event must still fire — it's the operator's signal that the in-memory
/// cache state was reset by a config reload.
/// </summary>
[Fact(Timeout = 8_000)]
public async Task E2E_TagListReload_OnCacheablePlc_EmitsCacheFlushedEvent()
{
int port = PickFreePort();
int adminPort = PickFreePort();
WriteConfigWithCacheableTag(_configPath, port, adminPort, address: 1024, cacheTtlMs: 60_000);
var sink = new HotReloadCapturingSink();
using var host = BuildHost(_configPath, logSink: sink);
using var startCts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
await host.StartAsync(startCts.Token);
await WaitForAsync(() => CanConnect(port), TimeSpan.FromSeconds(5),
"listener should be reachable after startup");
// Mutate the tag list (different address, still cacheable) — this is a Reseat,
// not an Add/Remove, so ReplaceContextAsync runs and the cache flush fires.
WriteConfigWithCacheableTag(_configPath, port, adminPort, address: 1080, cacheTtlMs: 60_000);
// First confirm the reconciler actually applied the reload at all — gives a clearer
// failure mode than a bare timeout if Reseat isn't firing.
await WaitForAsync(
() => sink.Events.Any(e => e.MessageTemplate.Text.Contains("Config reload applied")),
TimeSpan.FromSeconds(5),
"Config reload applied must fire first; verifies reconciler picked up the change");
await WaitForAsync(
() => sink.Events.Any(e => e.MessageTemplate.Text.Contains("Cache flushed")),
TimeSpan.FromSeconds(2),
"expected mbproxy.cache.flushed after tag-list reload on a cacheable PLC");
using var stopCts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
await host.StopAsync(stopCts.Token);
}
// ── ReadCoalescing.Enabled hot-reload flip ──────────────────────────────────────────
/// <summary>
/// Verifies that flipping <c>Mbproxy.Resilience.ReadCoalescing.Enabled</c> at
/// runtime via hot-reload propagates to the live <see cref="IOptionsMonitor{T}"/>
/// snapshot. The accessor is wired through to add/restart supervisors; the
/// multiplexer reads it per-PDU. Proving the IOptionsMonitor sees the new value
/// is sufficient — the per-PDU read path is unit-tested at the multiplexer level.
/// </summary>
[Fact(Timeout = 8_000)]
public async Task E2E_ReadCoalescingEnabled_FlipAtRuntime_PropagatesToOptionsMonitor()
{
int port = PickFreePort();
int adminPort = PickFreePort();
WriteConfigWithCoalescing(_configPath, port, adminPort, enabled: true);
using var host = BuildHost(_configPath);
using var startCts = new CancellationTokenSource(TimeSpan.FromSeconds(10));
await host.StartAsync(startCts.Token);
await WaitForAsync(() => CanConnect(port), TimeSpan.FromSeconds(5),
"listener should be reachable after startup");
var monitor = host.Services
.GetRequiredService<Microsoft.Extensions.Options.IOptionsMonitor<Mbproxy.Options.MbproxyOptions>>();
monitor.CurrentValue.Resilience.ReadCoalescing.Enabled.ShouldBeTrue(
"initial config sets Enabled=true");
// Flip to false and re-save.
WriteConfigWithCoalescing(_configPath, port, adminPort, enabled: false);
await WaitForAsync(
() => monitor.CurrentValue.Resilience.ReadCoalescing.Enabled == false,
TimeSpan.FromSeconds(5),
"IOptionsMonitor.CurrentValue must reflect Enabled=false after hot-reload");
using var stopCts = new CancellationTokenSource(TimeSpan.FromSeconds(5));
await host.StopAsync(stopCts.Token);
}
private static void WriteConfigWithCoalescing(
string path, int listenPort, int adminPort, bool enabled)
{
var doc = new
{
Mbproxy = new
{
AdminPort = adminPort,
BcdTags = new { Global = Array.Empty<object>() },
Plcs = new[] { new { Name = "PLC-A", ListenPort = listenPort, Host = "127.0.0.1", Port = 502 } },
Connection = new { BackendConnectTimeoutMs = 500, BackendRequestTimeoutMs = 500 },
Resilience = new
{
ReadCoalescing = new { Enabled = enabled, MaxParties = 32 },
},
},
};
string tmp = path + ".tmp";
File.WriteAllText(tmp, JsonSerializer.Serialize(doc, new JsonSerializerOptions { WriteIndented = true }));
File.Move(tmp, path, overwrite: true);
}
private static void WriteConfigWithCacheableTag(
string path, int listenPort, int adminPort, int address, int cacheTtlMs)
{
var doc = new
{
Mbproxy = new
{
AdminPort = adminPort,
BcdTags = new { Global = new[] { new { Address = address, Width = 16, CacheTtlMs = cacheTtlMs } } },
Plcs = new[] { new { Name = "PLC-A", ListenPort = listenPort, Host = "127.0.0.1", Port = 502 } },
Connection = new { BackendConnectTimeoutMs = 500, BackendRequestTimeoutMs = 500 },
},
};
string tmp = path + ".tmp";
File.WriteAllText(tmp, JsonSerializer.Serialize(doc, new JsonSerializerOptions { WriteIndented = true }));
File.Move(tmp, path, overwrite: true);
}
// ── Helpers ───────────────────────────────────────────────────────────────────────────
private static bool CanConnect(int port)
@@ -155,4 +155,183 @@ public sealed class ReloadValidatorTests
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("non-empty"));
}
// ── Cache.AllowLongTtl gate ─────────────────────────────────────────────────────────
/// <summary>
/// Per-tag CacheTtlMs > 60_000 without Cache.AllowLongTtl is rejected.
/// </summary>
[Fact]
public void Validate_PerTagCacheTtl_Above60s_Without_AllowLongTtl_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
BcdTags = new BcdTagListOptions
{
Global = [ new BcdTagOptions { Address = 1024, Width = 16, CacheTtlMs = 120_000 } ],
},
Cache = new CacheOptions { AllowLongTtl = false },
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("AllowLongTtl") && e.Contains("60_000"));
}
/// <summary>
/// Same value passes when AllowLongTtl is true (operator opt-in).
/// </summary>
[Fact]
public void Validate_PerTagCacheTtl_Above60s_With_AllowLongTtl_Passes()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
BcdTags = new BcdTagListOptions
{
Global = [ new BcdTagOptions { Address = 1024, Width = 16, CacheTtlMs = 120_000 } ],
},
Cache = new CacheOptions { AllowLongTtl = true },
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.True(valid);
Assert.Empty(errors);
}
/// <summary>
/// A per-PLC DefaultCacheTtlMs > 60_000, inherited by a tag whose CacheTtlMs is null,
/// is rejected. The per-PLC default check already fails for this config; the test
/// also exercises the defensive re-check of the resolved per-tag value.
/// </summary>
[Fact]
public void Validate_ResolvedTtl_FromPerPlcDefault_AboveCap_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [
new PlcOptions
{
Name = "PLC-A", ListenPort = 5020, Host = "127.0.0.1", Port = 502,
DefaultCacheTtlMs = 90_000,
},
],
BcdTags = new BcdTagListOptions
{
// Tag with no explicit CacheTtlMs — inherits the per-PLC 90_000.
Global = [ new BcdTagOptions { Address = 1024, Width = 16 } ],
},
Cache = new CacheOptions { AllowLongTtl = false },
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("60_000"));
}
// ── ConnectionOptions validation ────────────────────────────────────────────────────
[Fact]
public void Validate_ZeroBackendConnectTimeoutMs_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions { BackendConnectTimeoutMs = 0 },
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("BackendConnectTimeoutMs"));
}
[Fact]
public void Validate_NegativeGracefulShutdownTimeoutMs_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions { GracefulShutdownTimeoutMs = -1 },
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("GracefulShutdownTimeoutMs"));
}
// ── Keepalive section ─────────────────────────────────────────────────────
[Fact]
public void Validate_DefaultKeepalive_Passes()
{
// Default ConnectionOptions → default KeepaliveOptions (idle 30 s, request 3 s).
var opts = MakeOptions([MakePlc("PLC-A", 5020)]);
bool valid = ReloadValidator.Validate(opts, out _);
Assert.True(valid);
}
[Fact]
public void Validate_NonPositiveTcpProbeCount_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions
{
Keepalive = new KeepaliveOptions { TcpProbeCount = 0 },
},
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("TcpProbeCount"));
}
[Fact]
public void Validate_OutOfRangeHeartbeatProbeAddress_Fails()
{
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions
{
Keepalive = new KeepaliveOptions { BackendHeartbeatProbeAddress = 70000 },
},
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("BackendHeartbeatProbeAddress"));
}
[Fact]
public void Validate_HeartbeatIdleNotAboveRequestTimeout_Fails()
{
// BackendHeartbeatIdleMs must sit ABOVE BackendRequestTimeoutMs, else a heartbeat
// would be timed out as fast as it could be issued.
var opts = new MbproxyOptions
{
Plcs = [MakePlc("PLC-A", 5020)],
Connection = new ConnectionOptions
{
BackendRequestTimeoutMs = 3000,
Keepalive = new KeepaliveOptions { BackendHeartbeatIdleMs = 3000 },
},
};
bool valid = ReloadValidator.Validate(opts, out var errors);
Assert.False(valid);
Assert.Contains(errors, e => e.Contains("BackendHeartbeatIdleMs"));
}
}

Some files were not shown because too many files have changed in this diff