Fixes every finding from the codereviews/2026-05-16 multi-agent review
(2 Critical, 20 Major, 38 Minor) and adds that review to the repo.
Highlights: dashboard XSS escape; response cache invalidated on the
write request (not just the response); ReloadValidator now runs at
startup so port collisions / duplicate names / malformed Resilience
profiles fail fast; AdminPort 0 genuinely disables the admin endpoint;
PlcListener accept-loop faults propagate to the supervisor's faulted
path; reconciler Restart builds before removing; Resilience pipelines
are restart-only from a frozen snapshot; multiplexer connect-race leak,
watchdog party-list snapshot, backend-response and FC16 framing
validation; frontend reconnect retry and util.js load guard; plus the
log-event/doc drift sweep and test-port hygiene.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Resolves the remaining Minor items from the 2026-05-15 review so the
web-UI dashboard work has no open follow-ups: a real-HubConnection
end-to-end test for the SignalR feed, stable mbproxy.admin.broadcast.*
log-event names, keyboard/aria accessibility on the fleet table,
frontend JS hardening (URL-decode guard, NaN guards, shared util.js),
reconciler<->capture-registry coverage, throwing-sink and embedded-asset
tests, broadcaster polish, and a soft upper bound on AdminPushIntervalMs.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the on-demand-capture leak cluster from the code review. The capture's armed state was driven off SignalR's ConnectionId, which changes on every transport reconnect, so a reconnect-during-view leaked a subscriber and left the capture armed forever with no viewer. PlcSubscriptionTracker now keys on a stable per-page-load tabId, and StatusBroadcaster reconciles capture arm state from the live viewer set each push cycle — making arming single-threaded and reconnect-safe. Also fixes the TagValueCapture disarm-vs-record race, the bind-failure broadcaster/listener leak, removes dead JSON-context code, and reworks the frontend cold-start retry plus an unknown-PLC watchdog. Adds tracker / broadcaster-loop / race / wire-shape test coverage.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The single auto-refreshing zero-JS status page gave operators a 25-column
wall and no way to drill into one connection. This adds a Bootstrap fleet
dashboard (filterable/sortable KPI table) and a per-PLC detail page with a
real-time debug view of raw PLC-side BCD vs. decoded client-side values,
streamed live over a SignalR feed. The debug view is fed by an on-demand
per-tag value capture, armed only while a detail page is open. All assets
(Bootstrap, SignalR client, fonts) are embedded so the UI works unchanged
on firewalled networks; GET /status.json is untouched for scrapers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Make the service build, run, and install on Linux as a first-class
target while keeping the Windows Service + Event Log behaviour intact.
- Build: drop the hardcoded win-x64 RID — single-file publish now works
for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts.
- Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host —
Windows Event Log under the SCM, local syslog under systemd
(Serilog.Sinks.SyslogMessages), none for interactive runs. The
EventLog truncation helper is extracted so it is testable cross-OS.
- Host: Program.cs registers AddSystemd() alongside AddWindowsService().
- Config: a RID-conditioned appsettings template ships Windows or Unix
paths; both templates are schema-validated by a test.
- Install: systemd unit (Type=exec) plus install.sh / uninstall.sh.
Also fixes two cross-platform bugs found while testing: install.ps1
and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in
PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths.
- Docs updated across README, CLAUDE.md, and docs/ for dual-platform.
413 tests pass on Windows; 374 (all non-simulator) on Linux.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The standalone design.md, kpi.md, operations.md, and the docs/plan/
phase tree were point-in-time planning artefacts now superseded by the
topic-organized docs/ tree (Architecture/, Features/, Operations/,
Reference/, Testing/). The DL260/ folder mixed a device-reference doc, a
test fixture, a sample test, and a screenshot; its contents now live in
their natural homes (dl205.md + mbtcp_settings.JPG under docs/Reference/,
dl205.json next to its launcher in tests/sim/, sample test dropped).
All cross-references in the surviving docs, README, CLAUDE.md, the config
template, and source comments are repointed to the new locations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Closes the Wave 3 (cleanup) tier of codereviews/2026-05-14/RemediationPlan.md.
Tests: 378 pass / 0 fail (baseline 370 + 8 new W3 regression tests).
Code cleanups:
* PlcMultiplexer: removed dead `elapsedMs` calculation (the actual EWMA
conversion uses Stopwatch ticks two lines below).
* UpstreamPipe.FillAsync: dropped the meaningless `firstRead && remaining
== count ? false : false` ternary; both branches were `false`.
* InFlightByKeyMap.TryAttachOrCreate (always returned `true`) renamed to
`AttachOrCreate` and made `void`. Test sites updated to drop the dead
`bool ok = ...; ok.ShouldBeTrue();` assertions.
* BcdCodec.HasBadNibble promoted from private to internal; the duplicate
copy in BcdPduPipeline removed and the call sites updated to
`BcdCodec.HasBadNibble`.
* PlcMultiplexer watchdog comment fixed: said "1-second floor", code uses
100 ms. Now both agree.
* StatusSnapshotBuilder: simplified the unreachable
`RemoteEp?.ToString() ?? RemoteEp?.Address.ToString() ?? "?"` to
`RemoteEp?.ToString() ?? "?"`.
* Mbproxy.csproj: stale "deferred" Polly comment replaced with a real
description of where Polly is used (BackendConnect + ListenerRecovery).
Doc updates:
* README: added a callout about the unconventional 32-bit BCD wire format
("two base-10000 digits in CDAB", not standard binary CDAB Int32) so
integrators using off-the-shelf clients learn about the silent-corruption
hazard before configuring writes.
* docs/design.md: clarified `cacheMissCount` and `coalescedMissCount`
semantics — "miss" means "did not find a fresh entry / did not coalesce",
NOT "produced a backend round-trip". Operators wanting actual backend
traffic should compute `miss − coalescedHit − exception04`.
* docs/Architecture/ResponseCache.md: documented the structural
"skip invalidation while recovering" gating (no backend reader during
recovery → no FC06/FC16 response → no invalidation).
* docs/Operations/Configuration.md: noted that the Event Log sink is the
custom EventLogBridge, not Serilog.Sinks.EventLog (W2.23 cached check).
* docs/plan/README.md: added a Phase 12 row pointing at the remediation
plan and linking out to codereviews/2026-05-14/.
Test additions (W3 high-value gaps):
* BcdPduPipelineTests:
- FC16_WriteStartsOnHighWord_Of32BitPair_PassesThroughRaw_WithPartialWarning
(symmetric inverse of the existing low-side partial-overlap test).
- FC03_Mixed_16Bit_32Bit_AndNonBcd_InOneRead_OnlyConfiguredSlotsRewritten
(mixed-slot routing in a single FC03 read).
- FC16_Response_PassesThroughUnchanged_RegardlessOfTagMap (FC16 response
carries no register data; rewriter must pass through).
* AdminEndpointTests:
- NonGetMethod_AgainstAdminRoutes_Returns405 (Theory: POST/PUT/DELETE/
PATCH against `/` and `/status.json` must return 405; guards against
an accidental MapPost being added later).
* HotReloadE2ETests:
- E2E_TagListReload_OnCacheablePlc_EmitsCacheFlushedEvent (validates the
W2.8 cache.flushed wiring end-to-end via the real FileSystemWatcher
reload path).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three sections in operations.md duplicated the new focused docs:
- "Configuration" → Operations/Configuration.md + Features/HotReload.md
- "Status page" → Operations/StatusPage.md
- "Common failure modes" → Operations/Troubleshooting.md + Reference/LogEvents.md
Replaced each with a short pointer block. The runbook now keeps only
content unique to day-two ops: install steps, upgrade procedure, uninstall,
log file locations / retention / archival, and the first-install smoke
checklist. 271 -> 176 lines.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The 24-row event-name table in design.md duplicated docs/Reference/LogEvents.md
and had drifted out of date (config.reload.applied and .rejected listed wrong
property templates; LogContext.PushProperty scoping described but not actually
used in source). LogEvents.md is the authoritative catalog now; design.md
keeps the logging-architecture rationale (sinks, naming convention, stability
contract) and routes to LogEvents.md for the names.
Also corrects the property-emission description: properties go through
[LoggerMessage] templates, not LogContext.PushProperty.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Layers a per-PLC, per-tag response cache on top of Phase 10's coalescing.
Cache is OFF by default per tag (CacheTtlMs = 0); a fresh deployment with no
TTL config behaves identically to Phase 10. Operators opt tags in by setting
CacheTtlMs > 0 on a BcdTagOptions entry (or DefaultCacheTtlMs > 0 on a
PlcOptions entry), explicitly acknowledging the staleness window.
Cache lookup order: cache -> coalesce -> backend. A cache hit short-circuits
both Phase 10's coalescing path and Phase 9's backend send. Cache stores
POST-rewriter PDU bytes so hits never re-invoke the BCD rewriter. FC06/FC16
write responses invalidate every cached entry whose address range overlaps
the write (half-open interval math).
New types (Mbproxy.Proxy.Cache, all internal):
- CacheKey (record-struct, same shape as CoalescingKey but kept SEPARATE so
the two phases evolve independently).
- CacheEntry, ResponseCache (IDisposable; LRU + PeriodicTimer eviction
loop), CacheInvalidator (pure overlap matcher), CacheLogEvents (stable
mbproxy.cache.* names).
Multi-tag range TTL = min(TTLs); any tag with TTL = 0 in the range disables
caching for the whole read (conservative-by-design).
Options surface:
- BcdTagOptions.CacheTtlMs (nullable int; null = fall through to PLC default)
- PlcOptions.DefaultCacheTtlMs
- MbproxyOptions.Cache.{AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs}
- TTL > 60_000 ms requires Cache.AllowLongTtl = true (reload validation).
Admin counters (Tier 1.8 + Tier 2 cache-memory KPIs from docs/kpi.md):
- CacheHitCount, CacheMissCount, CacheInvalidations on ProxyCounters.
- CacheEntryCount, CacheBytes via a new ICacheStatsProvider snapshot path.
- /status.json and the HTML page surface a new Cache cell per PLC row.
Hot-reload: any tag-list change to a PLC reseats the per-PLC context with a
fresh cache; the old cache is disposed inside ReplaceContextAsync. Per-tag
flush granularity is intentionally not implemented in v1.
PLCs with no cache-eligible tags (every resolved tag has CacheTtlMs = 0)
get Cache = null on the context and skip the eviction timer entirely, so
the no-cache path is byte-identical to Phase 10.
Tests (32 new unit + 5 new E2E = 37 new; suite now 314 unit + 48 E2E):
- CacheKeyTests, CacheEntryTests (records + boundary semantics).
- CacheInvalidatorTests: full overlap, both partials, adjacent-not-
overlapping, disjoint, different unit ID + auxiliary FC-filter / zero-qty.
- ResponseCacheTests: round-trip, lazy expiry, range invalidation,
unit-id filter, LRU bound, LRU access tracking, concurrent get/set,
dispose, clear, approximate-bytes accounting.
- ResponseCacheMultiplexerTests (stub-backend): hit short-circuits
coalescing, BCD-decoded bytes are cached not raw, FC06 invalidates
overlapping, non-overlapping write does not invalidate, multi-tag
TTL=min rule, regression-cache-disabled-by-default-is-Phase-10, hit
works even when backend unreachable.
- ResponseCacheE2ETests (pymodbus DL205 sim, sequential reads):
* Headline: 10 reads with TTL=1000 ms -> 9 hits, 1 miss, 1 backend trip.
* TTL expiry path with sleep > TTL.
* Write invalidation through the proxy on a scratch register.
* BCD-decoded bytes are cached, not raw BCD nibbles.
* Regression: Cache disabled by default -> behaviour byte-identical to
Phase 10.
Pre-existing flake hardened: BackendDisconnect_CascadesToAllUpstreams now
polls briefly for the cascade counter to absorb the inherent scheduling
gap between "upstream EOF observed" and "counter incremented inside
TearDownBackendAsync." Counter semantics unchanged.
Phase doc updated with implementation clarifications discovered during
this work (CacheKey kept separate from CoalescingKey, LastUsedTick is
long, FC06/FC16 startAddr/qty parsing extension, cache-pre-connect
short-circuit, write-invalidation only on successful responses).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Lands the design-contract pivot ahead of any cache implementation code so
reviewers can evaluate the change to the "purely transparent proxy" stance
independently of the Phase-11 code that depends on it.
- docs/design.md: rewrite "What this is" / Read-coalescing / Failure-modes
sections to acknowledge the opt-in cache; add new "Response cache (Phase
11)" section covering lookup order (cache -> coalesce -> backend), multi-
tag range TTL = min, post-rewriter storage, address-range-overlap write
invalidation, hot-reload PLC-wide flush, no-persistence, AllowLongTtl gate,
and LRU-bounded capacity. Extend log event table with mbproxy.cache.*
events. Extend per-PLC status field table with cacheHitCount /
cacheMissCount / cacheInvalidations / cacheEntryCount / cacheBytes.
Extend hot-reload propagation table with CacheTtlMs / Cache.* rows.
- docs/kpi.md: graduate Tier 1.8 (response cache) from "requires Phase 11"
to "shipped in Phase 11" and add Tier 2.4a cache-memory section.
- CLAUDE.md (mbproxy): update Purpose paragraph and the Architecture
headline bullets to reflect the transparent-by-default + opt-in-cache
contract; flip "Implementation complete through Phase 10" to "through
Phase 11".
- install/mbproxy.config.template.json: add a fully-commented Mbproxy.Cache
block and a CacheTtlMs example on a BcdTags.Global entry, with prominent
staleness commentary documenting the design contract.
No code changes in this commit - implementation lands in a follow-up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When two or more upstream clients send the same FC03/FC04 read while a
matching request is already in flight on the same PLC's multiplexed
backend socket, attach the late arrivals to the existing InFlightRequest
.InterestedParties list instead of opening a second backend round-trip.
The single backend response fans out to every attached party with each
party's original MBAP TxId restored individually. Zero post-response
staleness — coalescing operates entirely within the in-flight window
(microseconds to ~10 ms typical); the proxy is NOT a cache layer.
Headline mechanism:
- New record struct CoalescingKey(UnitId, Fc, StartAddress, Qty) keys
the per-PLC InFlightByKeyMap. FC03 and FC04 are separate Modbus
tables and never share a key; different unit IDs never coalesce;
writes (FC06/FC16) bypass the coalescing path entirely.
- InFlightByKeyMap uses a simple lock around a Dictionary; atomic
TryAttachOrCreate either appends a new party to the in-flight
request's mutable List<InterestedParty> or invokes a factory to
build a fresh entry. Per-entry MaxParties cap (default 32) bounds
fan-out cost; past the cap, the next arrival opens a new entry.
- PlcMultiplexer.OnUpstreamFrameAsync takes the coalescing path for
FC03/FC04 when Mbproxy.Resilience.ReadCoalescing.Enabled. The
factory closure does the Phase-9 work (allocate TxId, add to
CorrelationMap); the channel send happens AFTER returning from
TryAttachOrCreate so the map lock is not held across the async send.
- Response fan-out in RunBackendReaderAsync removes the entry from
InFlightByKeyMap before iterating InterestedParties, ensuring no
concurrent attach can mutate the list during iteration.
- Cascade + watchdog paths also drain the key map so a stale entry
cannot outlive its backend round-trip.
Counter accounting balance (per snapshot): CoalescedHitCount +
CoalescedMissCount equals total FC03 + FC04 requests since startup.
Even with coalescing disabled, every read still bumps Miss so dashboard
math stays balanced.
New surface (additive only):
- src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs
- src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs
- src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs
- ReadCoalescingOptions on ResilienceOptions
- CoalescedHitCount / CoalescedMissCount /
CoalescedResponseToDeadUpstream counters surfaced on /status.json
per PLC and as a compact "Coal" cell on the HTML status page.
Phase 9 test patch: TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire
previously read the same register from both clients (which now
coalesces). Patched to read two different addresses so the test still
proves distinct backend TxIds without violating the coalescing
contract.
Tests added: 24 new (19 unit + 5 E2E):
- CoalescingKeyTests (5)
- InFlightByKeyMapTests (6, includes concurrent stress)
- ReadCoalescingTests (8, stub-backend with deterministic delay)
- ReadCoalescingE2ETests (5, pymodbus simulator; coalescing-active
during overlap is proven against the stub, not the sim, due to
pymodbus 3.13's known concurrent-frame bug)
Total: 325 tests passing (282 unit + 43 E2E).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.
Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
on BackendRequestTimeoutMs, defending against lost responses,
dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)
Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>