b330faff0383aed667b3935fddd60b866586618d
13 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
0868613890 |
mbproxy: add keepalive / connection monitoring
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket can be silently dropped by a middlebox (switch, firewall, NAT) after 2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream sockets, and drive a periodic synthetic FC03 heartbeat on each idle backend socket so a dead path is detected before a real client request hits it. Controlled by Connection.Keepalive (ON by default). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7466a46aa7 |
mbproxy/docs: retire superseded design/plan docs and dissolve DL260/
The standalone design.md, kpi.md, operations.md, and the docs/plan/ phase tree were point-in-time planning artefacts now superseded by the topic-organized docs/ tree (Architecture/, Features/, Operations/, Reference/, Testing/). The DL260/ folder mixed a device-reference doc, a test fixture, a sample test, and a screenshot; its contents now live in their natural homes (dl205.md + mbtcp_settings.JPG under docs/Reference/, dl205.json next to its launcher in tests/sim/, sample test dropped). All cross-references in the surviving docs, README, CLAUDE.md, the config template, and source comments are repointed to the new locations. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1a2856526a |
mbproxy: strip historical phase/wave/plan references from source comments
Comments described the *history* of how the code arrived (phase numbers, wave IDs, review IDs, dated TODOs) instead of what it does today. That scaffolding rotted as the codebase evolved. Cleaned 60 source files + .gitignore; behaviour unchanged (387/387 tests still pass). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
b3b8313e9c |
mbproxy: Wave 6 — wire ProxyCounters.AddBytes (bytes counters were always 0)
The 10-min stress test (1.46 M PDUs through the proxy) revealed that
status.json's bytes.upstreamIn / bytes.upstreamOut counters always read 0
because ProxyCounters.AddBytes was defined but never called from anywhere.
Same shape as the original review's W2.22 finding (counter wired in DTO +
HTML but no increment site), missed for the bytes counters specifically.
Wired five increment sites in PlcMultiplexer:
OnUpstreamFrameAsync (request side, parsed frame)
AddBytes(up: frame.Length, down: 0) — counted ONCE per parsed frame
regardless of subsequent routing (cache hit, coalesce, backend
round-trip, exception).
RunBackendReaderAsync fan-out (response side, after TrySendResponse=true)
AddBytes(up: 0, down: outFrame.Length) per delivered party. With
coalescing, one backend response fans out to N parties and produces
N × frame.Length bytes leaving the proxy upstream-side. Drops
(TrySendResponse=false) increment ResponseDropForFullUpstream
instead.
Cache hit path
AddBytes(up: 0, down: hitFrame.Length) for the BuildCacheHitFrame
response (no backend round-trip but still bytes leaving the proxy).
Saturation cleanup (W1.2 path, both branches)
AddBytes(up: 0, down: excFrame.Length) per delivered exception 0x04.
Non-coalescing-path saturation
AddBytes(up: 0, down: excFrame.Length) for the single exception 0x04.
Watchdog timeout exception delivery
AddBytes(up: 0, down: excFrame.Length) per delivered exception 0x0B.
Backend-side bytes (proxy ↔ PLC) are NOT counted by these counters — the
field name is `BytesUpstreamIn/Out` which is upstream-only by contract.
Tests: 387 pass / 0 fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
59d0b5deb9 |
mbproxy: Wave 5 — fixes from third re-review pass
Closes findings from the third focused re-review pass on the post-W4-followup state (recorded in codereviews/2026-05-14/ReReviewAfterRemediation.md). W5/M1 — AdminEndpointHost OnChange callback can resurrect Kestrel after StopAsync The hot-reload OnChange handler at AdminEndpointHost.StartAsync did fire-and-forget `_ = Task.Run(...)` with no _disposed check. If AdminPort was hot-reloaded during shutdown, the queued Task could land between StopAsync's registration-dispose and DisposeAsync's _lock-dispose, take the lock, and bind a fresh Kestrel WebApplication on the new port — resurrecting admin AFTER the host considered it shut down. Worse, if DisposeAsync had already run _lock.Dispose, the queued Task throws ObjectDisposedException as an unobserved Task exception. Fix: _disposed guard at the top of the OnChange lambda AND inside the queued Task.Run, plus try/catch (ObjectDisposedException) around _lock.WaitAsync and _lock.Release. W5/m2 — inFlightAtCancel computed AFTER base.StopAsync The W4/NC1 fix correctly snapshotted inFlight BEFORE supervisor.StopAsync (so the multiplexers' counter providers were still wired), but it computed the snapshot AFTER base.StopAsync(cancellationToken). Between those two lines, in-flight requests whose responses arrive get removed from _correlation, and the watchdog can clear stale entries. The reported count therefore drifted downward from "in-flight at signal time" to "in-flight at compute time." Fix: snapshot at the very top of StopAsync before any cancellation is propagated. W5/m1 — Cascade gate-not-held path race (accepted as documented best-effort) When TearDownBackendAsync's _connectGate.WaitAsync(2s) times out, the body runs unprotected. A concurrent EnsureBackendConnectedAsync that DOES hold the gate may TryAllocate a TxId that collides (after wraparound in the allocator's forward scan) with one being released by the channel drain. The double-release would mark the new request's slot as free even though it's legitimately in-flight, allowing the next allocation to reuse the same slot and CorrelationMap.TryAdd to fail (silent request drop). Probability is very low (gate timeout AND new accept landing AND TxId collision in 65,536-slot space); the only consequence is one dropped request the client retries. Documented inline at PlcMultiplexer.cs near the gateHeld declaration as accepted best-effort behaviour. W5/m3 — CountInFlight allocates a CounterSnapshot record per supervisor Trivial (~5 KB on a 54-PLC fleet, called once per shutdown). Skipped per re-review verdict. Tests: 387 pass / 0 fail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
9251c564c1 |
mbproxy: resolve remaining items from ReReviewAfterRemediation.md
Closes the latent + minor + test-discipline items left after Wave 4. Updates
the re-review doc with a final resolution table — every actionable finding
now marked Resolved or Accepted with rationale.
NM3 — _supervisorCts leaks on re-Start
StartAsync now disposes the previous CTS before reassigning. Idempotent:
a try/catch (ObjectDisposedException) covers the very-first-Start case
where the field-init CTS is still fresh.
NM4 — W2.15 TCS is single-shot
_firstAttemptCompleted is no longer readonly; StartAsync re-creates it
after the W2.16 guard so a re-Started supervisor's
WaitForInitialBindAttemptAsync doesn't observe the previous run's signal.
Nm6 — _admin GetService<> returns null silently
ProxyWorker.ExecuteAsync now logs a Warning when admin isn't registered.
Preserves the loud-failure intent from the original IHostedService
registration without forcing test hosts to wire admin.
Nm7 — AdminEndpointHost.DisposeAsync no double-dispose guard
Added a volatile bool _disposed flag with an early-return at the top of
DisposeAsync. Symmetry with PlcMultiplexer; protects against
ProxyWorker.StopAsync explicitly disposing then DI disposing the singleton
again on host shutdown.
T3 — RemoveInheritedAppsettings only fires on Build
AfterTargets="Build;Publish" + a second Delete against $(PublishDir)
so a `dotnet publish` against the test csproj doesn't ship the example
PLCs from the linked install template.
T4 — Stale TryAttachOrCreate_*_ReturnsTrue_* test method names
Renamed to AttachOrCreate_*_WasNew{True,False} after W3 dropped the bool
return.
Accepted (with rationale documented in ReReviewAfterRemediation.md):
Nm2 — CoalescedHit semantic is per-design
Nm4 — _lastBindError preservation on clean exit is intentional forensics
Nm5 — EventLogBridge has no injectable logger
Nm8 — Cosmetic log noise
T1 — Reflection on private fields documented as maintenance trap
Tests: 387 pass / 0 fail.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
7a435957ee |
mbproxy: Wave 4 — fix issues introduced by the Wave-1/2 fixes
Closes the new findings from the post-remediation re-review (codereviews/2026-05-14/ReReviewAfterRemediation.md): NC1 — ProxyWorker.StopAsync drain loop is structurally always-zero Wave 1's W1.5 inherited the original ShutdownCoordinator bug it was meant to replace. Supervisor.StopAsync nulls the per-mux counter provider before the drain loop runs, so CountInFlight always returns 0 and the drain budget is never spent on actual draining. Fix: snapshot the in-flight count BEFORE supervisor stop, drop the theatrical post-stop loop, and report InFlightAtCancel as the snapshot count (= the number of in-flight requests dropped by the stop). The supervisor stop IS the drain — there is nothing to drain that wouldn't be killed by the stop itself. NM1 — TearDownBackendAsync._connectGate.WaitAsync uncancellable Without a token, a long Polly-wrapped EnsureBackendConnectedAsync against an unreachable host could hold the gate for the full BackendConnectTimeoutMs * MaxAttempts window, blocking DisposeAsync (and therefore ProxyWorker.StopAsync) for that duration. Fix: bound the wait with a 2 s teardown deadline; on timeout proceed best-effort without the gate. Worst-case consequence is one orphaned in-flight cycle on the dying backend, surfaced to upstream as exception 0x0B by the watchdog. NM2 — ReplaceContext non-atomic ctx + provider swap Snapshot path reads `_cacheStatsProvider` independently of `_ctx`. If `_ctx` was swapped first, a snapshot taken in the gap would still hold the OLD adapter wrapping the OLD cache — which the supervisor disposes immediately after we return. Fix: set the provider FIRST, then swap `_ctx`. Snapshots in the swap window now read either (old, old) or (new, new), never (old-after-disposed). NM5 — Self-cascade ObjectDisposedException after dispose Writer/reader fault catches fired `_ = TearDownBackendAsync(...)` unconditionally. After DisposeAsync runs `_connectGate.Dispose()`, the fire-and-forget TearDown threw ObjectDisposedException on WaitAsync as an unobserved Task exception. Fix: skip self-cascade when `_disposeCts.IsCancellationRequested` — DisposeAsync runs an explicit TearDown anyway. Nm1 — Saturation cleanup uses await SendResponseAsync W1.2's per-attacher delivery loop awaited the blocking SendResponseAsync, which would serialise on a wedged late-attacher's full bounded channel and stall delivery to its peers — contradicting the W1.3 doctrine that the fan-out path must never await per-pipe writes. Fix: use TrySendResponse and increment ResponseDropForFullUpstream on drop. T2 — WatchdogVsResponse_Race seeded Random fragility Used `new Random(12345)` over [350, 450) ms with watchdog at 400 ms; Random's algorithm is implementation-defined across .NET major versions (legacy → Xoshiro128 in .NET 6) so a runtime upgrade could land all samples on one side of the deadline and break the "both branches must fire" assertion. Fix: deterministic counter-based alternation (15 fast + 15 slow across 30 iterations) — guaranteed by construction. Latent items NM3 (_supervisorCts leak on re-Start) and NM4 (TCS single-shot semantics) are unfixed: no caller actually re-Starts a supervisor today; both become real only if the reconciler ever changes to re-Start instead of dispose-and-rebuild. Documented in the re-review. Tests: 387 pass / 0 fail. Three back-to-back race-test runs in isolation all green (T2 alternation is deterministic). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
7ead3581ab |
mbproxy: Wave 3 cleanups, docs, and test gaps from 2026-05-14 review
Closes the Wave 3 (cleanup) tier of codereviews/2026-05-14/RemediationPlan.md.
Tests: 378 pass / 0 fail (baseline 370 + 8 new W3 regression tests).
Code cleanups:
* PlcMultiplexer: removed dead `elapsedMs` calculation (the actual EWMA
conversion uses Stopwatch ticks two lines below).
* UpstreamPipe.FillAsync: dropped the meaningless `firstRead && remaining
== count ? false : false` ternary; both branches were `false`.
* InFlightByKeyMap.TryAttachOrCreate (always returned `true`) renamed to
`AttachOrCreate` and made `void`. Test sites updated to drop the dead
`bool ok = ...; ok.ShouldBeTrue();` assertions.
* BcdCodec.HasBadNibble promoted from private to internal; the duplicate
copy in BcdPduPipeline removed and the call sites updated to
`BcdCodec.HasBadNibble`.
* PlcMultiplexer watchdog comment fixed: said "1-second floor", code uses
100 ms. Now both agree.
* StatusSnapshotBuilder: simplified the unreachable
`RemoteEp?.ToString() ?? RemoteEp?.Address.ToString() ?? "?"` to
`RemoteEp?.ToString() ?? "?"`.
* Mbproxy.csproj: stale "deferred" Polly comment replaced with a real
description of where Polly is used (BackendConnect + ListenerRecovery).
Doc updates:
* README: added a callout about the unconventional 32-bit BCD wire format
("two base-10000 digits in CDAB", not standard binary CDAB Int32) so
integrators using off-the-shelf clients learn about the silent-corruption
hazard before configuring writes.
* docs/design.md: clarified `cacheMissCount` and `coalescedMissCount`
semantics — "miss" means "did not find a fresh entry / did not coalesce",
NOT "produced a backend round-trip". Operators wanting actual backend
traffic should compute `miss − coalescedHit − exception04`.
* docs/Architecture/ResponseCache.md: documented the structural
"skip invalidation while recovering" gating (no backend reader during
recovery → no FC06/FC16 response → no invalidation).
* docs/Operations/Configuration.md: noted that the Event Log sink is the
custom EventLogBridge, not Serilog.Sinks.EventLog (W2.23 cached check).
* docs/plan/README.md: added a Phase 12 row pointing at the remediation
plan and linking out to codereviews/2026-05-14/.
Test additions (W3 high-value gaps):
* BcdPduPipelineTests:
- FC16_WriteStartsOnHighWord_Of32BitPair_PassesThroughRaw_WithPartialWarning
(symmetric inverse of the existing low-side partial-overlap test).
- FC03_Mixed_16Bit_32Bit_AndNonBcd_InOneRead_OnlyConfiguredSlotsRewritten
(mixed-slot routing in a single FC03 read).
- FC16_Response_PassesThroughUnchanged_RegardlessOfTagMap (FC16 response
carries no register data; rewriter must pass through).
* AdminEndpointTests:
- NonGetMethod_AgainstAdminRoutes_Returns405 (Theory: POST/PUT/DELETE/
PATCH against `/` and `/status.json` must return 405; guards against
an accidental MapPost being added later).
* HotReloadE2ETests:
- E2E_TagListReload_OnCacheablePlc_EmitsCacheFlushedEvent (validates the
W2.8 cache.flushed wiring end-to-end via the real FileSystemWatcher
reload path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
e66b17fe5f |
mbproxy: Wave 2 fixes from 2026-05-14 code review
Resolves the 21 Major findings catalogued in
codereviews/2026-05-14/RemediationPlan.md (Wave 2). Tests: 370 pass / 0 fail
(baseline 363 + 7 new W2 regression tests).
Multiplexer / concurrency:
W2.1 ConfigReconciler.Attach now threads the live coalescingAccessor through
to add/restart-built supervisors so a hot-reload of
ReadCoalescing.{Enabled,MaxParties} propagates to PLCs added or
restarted via reload.
W2.2 PlcMultiplexer._disposed and UpstreamPipe._disposed are now volatile
for ARM/portability defense.
W2.3 ProxyWorker._supervisors / ConfigReconciler._supervisors switched from
Dictionary to ConcurrentDictionary; reconciler uses TryRemove. The
outer Apply is serialised by a semaphore but the inner Add/Remove/
Restart Task.WhenAll continuations run in parallel.
W2.4 Counter parity for cache miss + coalescing-saturation miss documented
inline (per-design contract; behavior unchanged).
W2.5 _disposeCts.Dispose() and _connectGate.Dispose() guarded against late
watchdog ticks.
W2.6 _connectGate disposed in DisposeAsync.
W2.7 Inline doc clarifying the post-rewriter FC byte read.
Cache / hot-reload:
W2.8 PlcListenerSupervisor.ReplaceContextAsync now calls Clear() to capture
the entry count, emits mbproxy.cache.flushed, then disposes the old
cache. Previously the event was defined but never emitted.
W2.9 Inline doc explaining the implicit "skip cache invalidation while
recovering" gating (no backend reader during recovery → no FC06/FC16
response → no invalidation).
W2.10 ReloadValidator now re-checks resolved per-tag CacheTtlMs against
Cache.AllowLongTtl after BcdTagMapBuilder folds the per-PLC default.
BCD rewriter:
W2.11 Duplicate addresses detected within Global itself and within the per-PLC
Add list itself, BEFORE the working dictionary collapses keys. Cross-list
collisions (Global vs Add) remain the documented width-override pattern.
Previously the DuplicateAddress error was unreachable dead code.
W2.12 OverlappingHighRegister reports each colliding pair exactly once
(canonicalised low/high pair tracked in a HashSet).
W2.13 FC16 32-bit write rejects clientLow > 9999 or clientHigh > 9999 BEFORE
the high*10000+low reconstruction. Without this guard, (high=9999,
low=9999) silently re-encoded as (high=9998, low=9999), losing 1 from
the high word.
W2.14 FC16 validates pdu.Length >= 6 + qty*2 upfront — no half-rewritten
requests when a malformed client claims more registers than it ships.
Supervisor:
W2.15 WaitForInitialBindAttemptAsync now backed by TaskCompletionSource
instead of 10ms busy-poll. Resolves race against fast Stopped→Bound→
Stopped transitions and hangs when the supervisor task throws.
W2.16 StartAsync refuses re-entry on a non-Stopped supervisor (was leaking
the previous _supervisorCts).
W2.17 New TransitionTo helper writes _state, _lastBindError, and (optionally)
_recoveryAttempts under one lock. Snapshot() reads under the same lock
so the status page never reports an inconsistent triple. Truncate
helper extracted (was copy-pasted across three sites).
W2.18 MbproxyOptionsValidator + ReloadValidator reject Connection.{Backend
ConnectTimeoutMs, BackendRequestTimeoutMs, GracefulShutdownTimeoutMs}
<= 0. Misconfigured 0 produces immediate CancelAfter(0) failures.
Hosting / diagnostics:
W2.20 ProxyWorker.StopAsync supervisor-stop deadline now reads from
IOptionsMonitor.CurrentValue.Connection.GracefulShutdownTimeoutMs
(was hard-coded 5s).
W2.21 src/Mbproxy/appsettings.json deleted; the published file is now a Link
to install/mbproxy.config.template.json so the binary ships with a
usable, fully-commented example config instead of an empty stub. Tests
strip the inherited file from their bin via an AfterTargets="Build"
Target so they don't pick up the template's example PLCs.
W2.22 invalidBcdWarnings (PlcPdusStatus) and codeOther (ExceptionCounts)
added to StatusDto, plumbed through StatusSnapshotBuilder, surfaced
in StatusHtmlRenderer table cells.
W2.23 EventLogBridge caches EventLog.SourceExists at construction so Emit
doesn't hit the registry on every Error+ log line.
New regression tests:
ReloadValidatorTests:
Validate_PerTagCacheTtl_Above60s_Without_AllowLongTtl_Fails
Validate_PerTagCacheTtl_Above60s_With_AllowLongTtl_Passes
Validate_ResolvedTtl_FromPerPlcDefault_AboveCap_Fails
Validate_ZeroBackendConnectTimeoutMs_Fails
Validate_NegativeGracefulShutdownTimeoutMs_Fails
BcdPduPipelineTests:
FC16_32Bit_ClientHighOrLowAbove9999_PassesThroughRaw_WithInvalidBcdWarning
FC16_TruncatedRegisterData_PassesThroughRaw_NoPartialRewrite
Reworked tests in BcdTagMapBuilderTests for the W2.11 contract (Global dup,
Add dup, Add-overrides-Global accepted as width override).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
ce32c5cee8 |
mbproxy: Wave 1 fixes from 2026-05-14 code review
Resolves the four critical correctness defects + the ShutdownCoordinator double-stop ordering bug called out in codereviews/2026-05-14/Overview.md. Tests: 362 pass / 0 fail (baseline 358 + 4 new W1 regression tests). W1.1 — Context swap on running multiplexer. PlcMultiplexer._ctx becomes volatile with a new ReplaceContext() method that re-registers the cache stats provider on the (preserved) counters. PlcListener exposes its multiplexer; PlcListenerSupervisor.ReplaceContextAsync swaps the running mux first, then disposes the old cache. Hot-reload tag-list changes and the cache-flush-on-reload contract now actually take effect on the next PDU instead of waiting for the next listener fault. W1.2 — Coalescing factory leak. When the InFlightByKey factory soft-fails (allocator saturation or duplicate TxId), the cleanup path now TryRemoves the stub and walks every party on it (including late attachers) to deliver Modbus exception 0x04. Previously only the leader got the exception; late attachers waited forever for a response that no backend round-trip would ever fire. W1.3 — Backend-reader head-of-line block. UpstreamPipe gains TrySendResponse for non-blocking enqueue. The per-PLC backend reader's fan-out loop uses it instead of awaiting SendResponseAsync, so a wedged upstream's full bounded response channel can no longer stall the single backend reader and starve every other client on that PLC. New responseDropForFullUpstream counter on ProxyCounters / CounterSnapshot records the drops. W1.4 — Stranded outbound frames after cascade. TearDownBackendAsync acquires _connectGate and drains any frames left in _outboundChannel after the writer task faulted/cancelled, releasing their proxy TxIds back to the allocator. Without this, a fresh EnsureBackendConnectedAsync racing the cascade would send stranded frames with old TxIds onto the new backend socket; the responses would arrive with no correlation entry and the upstream peers would hang on the watchdog until BackendRequestTimeoutMs. W1.5 — Delete ShutdownCoordinator (Option B). Drain logic moved into ProxyWorker.StopAsync. AdminEndpointHost is no longer registered as IHostedService; ProxyWorker drives its lifecycle directly so admin starts after listeners are bound and stops AFTER the in-flight drain (the design's documented contract). Admin is resolved lazily in ExecuteAsync to break the circular DI graph (Admin -> StatusSnapshotBuilder -> ProxyWorker). GracefulShutdownTimeoutMs is now read fresh from IOptionsMonitor.CurrentValue at stop time, so a hot-reloaded value is honoured. Removes ShutdownCoordinator + tests. New tests: PlcMultiplexerTests.ReplaceContext_NewTagMap_VisibleOnNextPdu PlcMultiplexerTests.ReplaceContext_NewCache_NextReadGoesToBackend_NotOldCache UpstreamPipeTests.TrySendResponse_WhenChannelFull_ReturnsFalse_WithoutBlocking UpstreamPipeTests.TrySendResponse_AfterDispose_ReturnsFalse Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
1db900edef |
mbproxy: add opt-in response cache (Phase 11)
Layers a per-PLC, per-tag response cache on top of Phase 10's coalescing.
Cache is OFF by default per tag (CacheTtlMs = 0); a fresh deployment with no
TTL config behaves identically to Phase 10. Operators opt tags in by setting
CacheTtlMs > 0 on a BcdTagOptions entry (or DefaultCacheTtlMs > 0 on a
PlcOptions entry), explicitly acknowledging the staleness window.
Cache lookup order: cache -> coalesce -> backend. A cache hit short-circuits
both Phase 10's coalescing path and Phase 9's backend send. Cache stores
POST-rewriter PDU bytes so hits never re-invoke the BCD rewriter. FC06/FC16
write responses invalidate every cached entry whose address range overlaps
the write (half-open interval math).
New types (Mbproxy.Proxy.Cache, all internal):
- CacheKey (record-struct, same shape as CoalescingKey but kept SEPARATE so
the two phases evolve independently).
- CacheEntry, ResponseCache (IDisposable; LRU + PeriodicTimer eviction
loop), CacheInvalidator (pure overlap matcher), CacheLogEvents (stable
mbproxy.cache.* names).
Multi-tag range TTL = min(TTLs); any tag with TTL = 0 in the range disables
caching for the whole read (conservative-by-design).
Options surface:
- BcdTagOptions.CacheTtlMs (nullable int; null = fall through to PLC default)
- PlcOptions.DefaultCacheTtlMs
- MbproxyOptions.Cache.{AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs}
- TTL > 60_000 ms requires Cache.AllowLongTtl = true (reload validation).
Admin counters (Tier 1.8 + Tier 2 cache-memory KPIs from docs/kpi.md):
- CacheHitCount, CacheMissCount, CacheInvalidations on ProxyCounters.
- CacheEntryCount, CacheBytes via a new ICacheStatsProvider snapshot path.
- /status.json and the HTML page surface a new Cache cell per PLC row.
Hot-reload: any tag-list change to a PLC reseats the per-PLC context with a
fresh cache; the old cache is disposed inside ReplaceContextAsync. Per-tag
flush granularity is intentionally not implemented in v1.
PLCs with no cache-eligible tags (every resolved tag has CacheTtlMs = 0)
get Cache = null on the context and skip the eviction timer entirely, so
the no-cache path is byte-identical to Phase 10.
Tests (32 new unit + 5 new E2E = 37 new; suite now 314 unit + 48 E2E):
- CacheKeyTests, CacheEntryTests (records + boundary semantics).
- CacheInvalidatorTests: full overlap, both partials, adjacent-not-
overlapping, disjoint, different unit ID + auxiliary FC-filter / zero-qty.
- ResponseCacheTests: round-trip, lazy expiry, range invalidation,
unit-id filter, LRU bound, LRU access tracking, concurrent get/set,
dispose, clear, approximate-bytes accounting.
- ResponseCacheMultiplexerTests (stub-backend): hit short-circuits
coalescing, BCD-decoded bytes are cached not raw, FC06 invalidates
overlapping, non-overlapping write does not invalidate, multi-tag
TTL=min rule, regression-cache-disabled-by-default-is-Phase-10, hit
works even when backend unreachable.
- ResponseCacheE2ETests (pymodbus DL205 sim, sequential reads):
* Headline: 10 reads with TTL=1000 ms -> 9 hits, 1 miss, 1 backend trip.
* TTL expiry path with sleep > TTL.
* Write invalidation through the proxy on a scratch register.
* BCD-decoded bytes are cached, not raw BCD nibbles.
* Regression: Cache disabled by default -> behaviour byte-identical to
Phase 10.
Pre-existing flake hardened: BackendDisconnect_CascadesToAllUpstreams now
polls briefly for the cascade counter to absorb the inherent scheduling
gap between "upstream EOF observed" and "counter incremented inside
TearDownBackendAsync." Counter semantics unchanged.
Phase doc updated with implementation clarifications discovered during
this work (CacheKey kept separate from CoalescingKey, LastUsedTick is
long, FC06/FC16 startAddr/qty parsing extension, cache-pre-connect
short-circuit, write-invalidation only on successful responses).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
||
|
|
a2dba4bd07 |
mbproxy: add in-flight read coalescing (Phase 10)
When two or more upstream clients send the same FC03/FC04 read while a matching request is already in flight on the same PLC's multiplexed backend socket, attach the late arrivals to the existing InFlightRequest .InterestedParties list instead of opening a second backend round-trip. The single backend response fans out to every attached party with each party's original MBAP TxId restored individually. Zero post-response staleness — coalescing operates entirely within the in-flight window (microseconds to ~10 ms typical); the proxy is NOT a cache layer. Headline mechanism: - New record struct CoalescingKey(UnitId, Fc, StartAddress, Qty) keys the per-PLC InFlightByKeyMap. FC03 and FC04 are separate Modbus tables and never share a key; different unit IDs never coalesce; writes (FC06/FC16) bypass the coalescing path entirely. - InFlightByKeyMap uses a simple lock around a Dictionary; atomic TryAttachOrCreate either appends a new party to the in-flight request's mutable List<InterestedParty> or invokes a factory to build a fresh entry. Per-entry MaxParties cap (default 32) bounds fan-out cost; past the cap, the next arrival opens a new entry. - PlcMultiplexer.OnUpstreamFrameAsync takes the coalescing path for FC03/FC04 when Mbproxy.Resilience.ReadCoalescing.Enabled. The factory closure does the Phase-9 work (allocate TxId, add to CorrelationMap); the channel send happens AFTER returning from TryAttachOrCreate so the map lock is not held across the async send. - Response fan-out in RunBackendReaderAsync removes the entry from InFlightByKeyMap before iterating InterestedParties, ensuring no concurrent attach can mutate the list during iteration. - Cascade + watchdog paths also drain the key map so a stale entry cannot outlive its backend round-trip. Counter accounting balance (per snapshot): CoalescedHitCount + CoalescedMissCount equals total FC03 + FC04 requests since startup. Even with coalescing disabled, every read still bumps Miss so dashboard math stays balanced. New surface (additive only): - src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs - src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs - src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs - ReadCoalescingOptions on ResilienceOptions - CoalescedHitCount / CoalescedMissCount / CoalescedResponseToDeadUpstream counters surfaced on /status.json per PLC and as a compact "Coal" cell on the HTML status page. Phase 9 test patch: TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire previously read the same register from both clients (which now coalesces). Patched to read two different addresses so the test still proves distinct backend TxIds without violating the coalescing contract. Tests added: 24 new (19 unit + 5 E2E): - CoalescingKeyTests (5) - InFlightByKeyMapTests (6, includes concurrent stress) - ReadCoalescingTests (8, stub-backend with deterministic delay) - ReadCoalescingE2ETests (5, pymodbus simulator; coalescing-active during overlap is proven against the stub, not the sim, due to pymodbus 3.13's known concurrent-frame bug) Total: 325 tests passing (282 unit + 43 E2E). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |
||
|
|
56eee3c563 |
mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the production-ready single-listener / 1:1-backend transparent Modbus TCP proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260 fleet. Phase 9 replaces the connection layer with a single backend socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's 4-concurrent-client cap as an operational ceiling. Phase 9 additions of note: - PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap - InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing for Phase 10 read coalescing — do not collapse to a single field) - Per-request watchdog: surfaces Modbus exception 0x0B to upstream on BackendRequestTimeoutMs, defending against lost responses, dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed- request bug (its ServerRequestHandler.last_pdu state race) - Status DTO + HTML gain inFlight / maxInFlight / txIdWraps / disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md) Tests: 263 unit + 38 E2E. Multiplexer correctness under truly concurrent backend traffic is proved against a stub backend in PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus 3.13's single-PDU framer stays in known-good mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> |