Commit Graph

14 Commits

Author SHA1 Message Date
Joseph Doherty 374eecd205 mbproxy: fix the dashboard's C2/M-series review findings
Closes the on-demand-capture leak cluster from the code review. The capture's armed state was driven off SignalR's ConnectionId, which changes on every transport reconnect, so a reconnect-during-view leaked a subscriber and left the capture armed forever with no viewer. PlcSubscriptionTracker now keys on a stable per-page-load tabId, and StatusBroadcaster reconciles capture arm state from the live viewer set each push cycle — making arming single-threaded and reconnect-safe. Also fixes the TagValueCapture disarm-vs-record race, the bind-failure broadcaster/listener leak, removes dead JSON-context code, and reworks the frontend cold-start retry plus an unknown-PLC watchdog. Adds tracker / broadcaster-loop / race / wire-shape test coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 16:12:43 -04:00
Joseph Doherty 554b05d28c mbproxy: fix dashboard review findings, add named BCD tags + fleet config
Reviewed the new SignalR dashboard and fixed its two top findings: a stored XSS on the connection-detail page (unescaped tag name / direction / timestamp rendered into innerHTML) and FC03/FC04 cache hits bypassing the debug-view capture, which left cached tags frozen while their age climbed. Also adds an optional human-friendly Name to BCD tags surfaced on the debug view, and loads the real fleet config from tags.txt (12 named BCD tags, PLC Z28061) so the published appsettings.json is deploy-ready.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-16 03:39:39 -04:00
Joseph Doherty e719dd51c1 mbproxy: replace status page with a live SignalR web dashboard
The single auto-refreshing zero-JS status page gave operators a 25-column
wall and no way to drill into one connection. This adds a Bootstrap fleet
dashboard (filterable/sortable KPI table) and a per-PLC detail page with a
real-time debug view of raw PLC-side BCD vs. decoded client-side values,
streamed live over a SignalR feed. The debug view is fed by an on-demand
per-tag value capture, armed only while a detail page is open. All assets
(Bootstrap, SignalR client, fonts) are embedded so the UI works unchanged
on firewalled networks; GET /status.json is untouched for scrapers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 10:41:02 -04:00
Joseph Doherty b330faff03 mbproxy: cross-platform support — Linux/systemd alongside Windows
Make the service build, run, and install on Linux as a first-class
target while keeping the Windows Service + Event Log behaviour intact.

- Build: drop the hardcoded win-x64 RID — single-file publish now works
  for any RID. publish.ps1 gains -Rid; new publish.sh for Linux hosts.
- Diagnostics: DiagnosticSinkSelector picks the Error+ sink per host —
  Windows Event Log under the SCM, local syslog under systemd
  (Serilog.Sinks.SyslogMessages), none for interactive runs. The
  EventLog truncation helper is extracted so it is testable cross-OS.
- Host: Program.cs registers AddSystemd() alongside AddWindowsService().
- Config: a RID-conditioned appsettings template ships Windows or Unix
  paths; both templates are schema-validated by a test.
- Install: systemd unit (Type=exec) plus install.sh / uninstall.sh.
  Also fixes two cross-platform bugs found while testing: install.ps1
  and uninstall.ps1 used New-EventLog / Remove-EventLog (absent in
  PowerShell 7), and the E2E sim launcher hardcoded Windows venv paths.
- Docs updated across README, CLAUDE.md, and docs/ for dual-platform.

413 tests pass on Windows; 374 (all non-simulator) on Linux.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:41:59 -04:00
Joseph Doherty 0868613890 mbproxy: add keepalive / connection monitoring
The DL205/DL260 ECOM emits no TCP keepalives, so an idle backend socket
can be silently dropped by a middlebox (switch, firewall, NAT) after
2-5 minutes. Enable OS SO_KEEPALIVE on backend and accepted upstream
sockets, and drive a periodic synthetic FC03 heartbeat on each idle
backend socket so a dead path is detected before a real client request
hits it. Controlled by Connection.Keepalive (ON by default).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 09:40:54 -04:00
Joseph Doherty 7466a46aa7 mbproxy/docs: retire superseded design/plan docs and dissolve DL260/
The standalone design.md, kpi.md, operations.md, and the docs/plan/
phase tree were point-in-time planning artefacts now superseded by the
topic-organized docs/ tree (Architecture/, Features/, Operations/,
Reference/, Testing/). The DL260/ folder mixed a device-reference doc, a
test fixture, a sample test, and a screenshot; its contents now live in
their natural homes (dl205.md + mbtcp_settings.JPG under docs/Reference/,
dl205.json next to its launcher in tests/sim/, sample test dropped).

All cross-references in the surviving docs, README, CLAUDE.md, the config
template, and source comments are repointed to the new locations.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-15 07:37:48 -04:00
Joseph Doherty 7ead3581ab mbproxy: Wave 3 cleanups, docs, and test gaps from 2026-05-14 review
Closes the Wave 3 (cleanup) tier of codereviews/2026-05-14/RemediationPlan.md.
Tests: 378 pass / 0 fail (baseline 370 + 8 new W3 regression tests).

Code cleanups:
  * PlcMultiplexer: removed dead `elapsedMs` calculation (the actual EWMA
    conversion uses Stopwatch ticks two lines below).
  * UpstreamPipe.FillAsync: dropped the meaningless `firstRead && remaining
    == count ? false : false` ternary; both branches were `false`.
  * InFlightByKeyMap.TryAttachOrCreate (always returned `true`) renamed to
    `AttachOrCreate` and made `void`. Test sites updated to drop the dead
    `bool ok = ...; ok.ShouldBeTrue();` assertions.
  * BcdCodec.HasBadNibble promoted from private to internal; the duplicate
    copy in BcdPduPipeline removed and the call sites updated to
    `BcdCodec.HasBadNibble`.
  * PlcMultiplexer watchdog comment fixed: said "1-second floor", code uses
    100 ms. Now both agree.
  * StatusSnapshotBuilder: simplified the unreachable
    `RemoteEp?.ToString() ?? RemoteEp?.Address.ToString() ?? "?"` to
    `RemoteEp?.ToString() ?? "?"`.
  * Mbproxy.csproj: stale "deferred" Polly comment replaced with a real
    description of where Polly is used (BackendConnect + ListenerRecovery).

Doc updates:
  * README: added a callout about the unconventional 32-bit BCD wire format
    ("two base-10000 digits in CDAB", not standard binary CDAB Int32) so
    integrators using off-the-shelf clients learn about the silent-corruption
    hazard before configuring writes.
  * docs/design.md: clarified `cacheMissCount` and `coalescedMissCount`
    semantics — "miss" means "did not find a fresh entry / did not coalesce",
    NOT "produced a backend round-trip". Operators wanting actual backend
    traffic should compute `miss − coalescedHit − exception04`.
  * docs/Architecture/ResponseCache.md: documented the structural
    "skip invalidation while recovering" gating (no backend reader during
    recovery → no FC06/FC16 response → no invalidation).
  * docs/Operations/Configuration.md: noted that the Event Log sink is the
    custom EventLogBridge, not Serilog.Sinks.EventLog (W2.23 cached check).
  * docs/plan/README.md: added a Phase 12 row pointing at the remediation
    plan and linking out to codereviews/2026-05-14/.

Test additions (W3 high-value gaps):
  * BcdPduPipelineTests:
    - FC16_WriteStartsOnHighWord_Of32BitPair_PassesThroughRaw_WithPartialWarning
      (symmetric inverse of the existing low-side partial-overlap test).
    - FC03_Mixed_16Bit_32Bit_AndNonBcd_InOneRead_OnlyConfiguredSlotsRewritten
      (mixed-slot routing in a single FC03 read).
    - FC16_Response_PassesThroughUnchanged_RegardlessOfTagMap (FC16 response
      carries no register data; rewriter must pass through).
  * AdminEndpointTests:
    - NonGetMethod_AgainstAdminRoutes_Returns405 (Theory: POST/PUT/DELETE/
      PATCH against `/` and `/status.json` must return 405; guards against
      an accidental MapPost being added later).
  * HotReloadE2ETests:
    - E2E_TagListReload_OnCacheablePlc_EmitsCacheFlushedEvent (validates the
      W2.8 cache.flushed wiring end-to-end via the real FileSystemWatcher
      reload path).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 06:06:52 -04:00
Joseph Doherty b4c82bf379 mbproxy/docs: slim operations.md to runbook content + pointers
Three sections in operations.md duplicated the new focused docs:
  - "Configuration" → Operations/Configuration.md + Features/HotReload.md
  - "Status page"   → Operations/StatusPage.md
  - "Common failure modes" → Operations/Troubleshooting.md + Reference/LogEvents.md

Replaced each with a short pointer block. The runbook now keeps only
content unique to day-two ops: install steps, upgrade procedure, uninstall,
log file locations / retention / archival, and the first-install smoke
checklist. 271 -> 176 lines.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:49:34 -04:00
Joseph Doherty 0eb560a7f6 mbproxy/docs: slim design.md log-event section to a pointer
The 24-row event-name table in design.md duplicated docs/Reference/LogEvents.md
and had drifted out of date (config.reload.applied and .rejected listed wrong
property templates; LogContext.PushProperty scoping described but not actually
used in source). LogEvents.md is the authoritative catalog now; design.md
keeps the logging-architecture rationale (sinks, naming convention, stability
contract) and routes to LogEvents.md for the names.

Also corrects the property-emission description: properties go through
[LoggerMessage] templates, not LogContext.PushProperty.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:45:39 -04:00
Joseph Doherty f49e27e316 mbproxy/docs: split deep docs into focused PascalCase files per StyleGuide
Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.

Architecture/
  - Overview.md         (150 lines) — listener topology, request flow, per-PLC isolation
  - ConnectionModel.md  (247 lines) — TxId multiplexer, watchdog, disconnect cascade
  - ReadCoalescing.md   (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
  - ResponseCache.md    (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation

Features/
  - BcdRewriting.md     (252 lines) — codec, CDAB, FC scope, partial-overlap policy
  - HotReload.md        (189 lines) — IOptionsMonitor + per-change-kind reconcile rules

Operations/
  - Configuration.md    (422 lines) — every Mbproxy:* option + validation rules
  - StatusPage.md       (334 lines) — admin endpoint surface, every JSON field
  - Troubleshooting.md  (364 lines) — diagnosis playbook keyed to log events

Reference/
  - LogEvents.md        (499 lines) — 28 events across 7 categories, grep-verified

Testing/
  - Simulator.md        (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk

Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.

Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:44:34 -04:00
Joseph Doherty 1db900edef mbproxy: add opt-in response cache (Phase 11)
Layers a per-PLC, per-tag response cache on top of Phase 10's coalescing.
Cache is OFF by default per tag (CacheTtlMs = 0); a fresh deployment with no
TTL config behaves identically to Phase 10. Operators opt tags in by setting
CacheTtlMs > 0 on a BcdTagOptions entry (or DefaultCacheTtlMs > 0 on a
PlcOptions entry), explicitly acknowledging the staleness window.

Cache lookup order: cache -> coalesce -> backend. A cache hit short-circuits
both Phase 10's coalescing path and Phase 9's backend send. Cache stores
POST-rewriter PDU bytes so hits never re-invoke the BCD rewriter. FC06/FC16
write responses invalidate every cached entry whose address range overlaps
the write (half-open interval math).

New types (Mbproxy.Proxy.Cache, all internal):
- CacheKey (record-struct, same shape as CoalescingKey but kept SEPARATE so
  the two phases evolve independently).
- CacheEntry, ResponseCache (IDisposable; LRU + PeriodicTimer eviction
  loop), CacheInvalidator (pure overlap matcher), CacheLogEvents (stable
  mbproxy.cache.* names).

Multi-tag range TTL = min(TTLs); any tag with TTL = 0 in the range disables
caching for the whole read (conservative-by-design).

Options surface:
- BcdTagOptions.CacheTtlMs (nullable int; null = fall through to PLC default)
- PlcOptions.DefaultCacheTtlMs
- MbproxyOptions.Cache.{AllowLongTtl, MaxEntriesPerPlc, EvictionIntervalMs}
- TTL > 60_000 ms requires Cache.AllowLongTtl = true (reload validation).

Admin counters (Tier 1.8 + Tier 2 cache-memory KPIs from docs/kpi.md):
- CacheHitCount, CacheMissCount, CacheInvalidations on ProxyCounters.
- CacheEntryCount, CacheBytes via a new ICacheStatsProvider snapshot path.
- /status.json and the HTML page surface a new Cache cell per PLC row.

Hot-reload: any tag-list change to a PLC reseats the per-PLC context with a
fresh cache; the old cache is disposed inside ReplaceContextAsync. Per-tag
flush granularity is intentionally not implemented in v1.

PLCs with no cache-eligible tags (every resolved tag has CacheTtlMs = 0)
get Cache = null on the context and skip the eviction timer entirely, so
the no-cache path is byte-identical to Phase 10.

Tests (32 new unit + 5 new E2E = 37 new; suite now 314 unit + 48 E2E):
- CacheKeyTests, CacheEntryTests (records + boundary semantics).
- CacheInvalidatorTests: full overlap, both partials, adjacent-not-
  overlapping, disjoint, different unit ID + auxiliary FC-filter / zero-qty.
- ResponseCacheTests: round-trip, lazy expiry, range invalidation,
  unit-id filter, LRU bound, LRU access tracking, concurrent get/set,
  dispose, clear, approximate-bytes accounting.
- ResponseCacheMultiplexerTests (stub-backend): hit short-circuits
  coalescing, BCD-decoded bytes are cached not raw, FC06 invalidates
  overlapping, non-overlapping write does not invalidate, multi-tag
  TTL=min rule, regression-cache-disabled-by-default-is-Phase-10, hit
  works even when backend unreachable.
- ResponseCacheE2ETests (pymodbus DL205 sim, sequential reads):
  * Headline: 10 reads with TTL=1000 ms -> 9 hits, 1 miss, 1 backend trip.
  * TTL expiry path with sleep > TTL.
  * Write invalidation through the proxy on a scratch register.
  * BCD-decoded bytes are cached, not raw BCD nibbles.
  * Regression: Cache disabled by default -> behaviour byte-identical to
    Phase 10.

Pre-existing flake hardened: BackendDisconnect_CascadesToAllUpstreams now
polls briefly for the cascade counter to absorb the inherent scheduling
gap between "upstream EOF observed" and "counter incremented inside
TearDownBackendAsync." Counter semantics unchanged.

Phase doc updated with implementation clarifications discovered during
this work (CacheKey kept separate from CoalescingKey, LastUsedTick is
long, FC06/FC16 startAddr/qty parsing extension, cache-pre-connect
short-circuit, write-invalidation only on successful responses).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:08:51 -04:00
Joseph Doherty 892b10baf4 mbproxy/docs: pivot design contract for Phase 11 response cache
Lands the design-contract pivot ahead of any cache implementation code so
reviewers can evaluate the change to the "purely transparent proxy" stance
independently of the Phase-11 code that depends on it.

- docs/design.md: rewrite "What this is" / Read-coalescing / Failure-modes
  sections to acknowledge the opt-in cache; add new "Response cache (Phase
  11)" section covering lookup order (cache -> coalesce -> backend), multi-
  tag range TTL = min, post-rewriter storage, address-range-overlap write
  invalidation, hot-reload PLC-wide flush, no-persistence, AllowLongTtl gate,
  and LRU-bounded capacity. Extend log event table with mbproxy.cache.*
  events. Extend per-PLC status field table with cacheHitCount /
  cacheMissCount / cacheInvalidations / cacheEntryCount / cacheBytes.
  Extend hot-reload propagation table with CacheTtlMs / Cache.* rows.
- docs/kpi.md: graduate Tier 1.8 (response cache) from "requires Phase 11"
  to "shipped in Phase 11" and add Tier 2.4a cache-memory section.
- CLAUDE.md (mbproxy): update Purpose paragraph and the Architecture
  headline bullets to reflect the transparent-by-default + opt-in-cache
  contract; flip "Implementation complete through Phase 10" to "through
  Phase 11".
- install/mbproxy.config.template.json: add a fully-commented Mbproxy.Cache
  block and a CacheTtlMs example on a BcdTags.Global entry, with prominent
  staleness commentary documenting the design contract.

No code changes in this commit - implementation lands in a follow-up.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:35:35 -04:00
Joseph Doherty a2dba4bd07 mbproxy: add in-flight read coalescing (Phase 10)
When two or more upstream clients send the same FC03/FC04 read while a
matching request is already in flight on the same PLC's multiplexed
backend socket, attach the late arrivals to the existing InFlightRequest
.InterestedParties list instead of opening a second backend round-trip.
The single backend response fans out to every attached party with each
party's original MBAP TxId restored individually. Zero post-response
staleness — coalescing operates entirely within the in-flight window
(microseconds to ~10 ms typical); the proxy is NOT a cache layer.

Headline mechanism:

- New record struct CoalescingKey(UnitId, Fc, StartAddress, Qty) keys
  the per-PLC InFlightByKeyMap. FC03 and FC04 are separate Modbus
  tables and never share a key; different unit IDs never coalesce;
  writes (FC06/FC16) bypass the coalescing path entirely.
- InFlightByKeyMap uses a simple lock around a Dictionary; atomic
  TryAttachOrCreate either appends a new party to the in-flight
  request's mutable List<InterestedParty> or invokes a factory to
  build a fresh entry. Per-entry MaxParties cap (default 32) bounds
  fan-out cost; past the cap, the next arrival opens a new entry.
- PlcMultiplexer.OnUpstreamFrameAsync takes the coalescing path for
  FC03/FC04 when Mbproxy.Resilience.ReadCoalescing.Enabled. The
  factory closure does the Phase-9 work (allocate TxId, add to
  CorrelationMap); the channel send happens AFTER returning from
  TryAttachOrCreate so the map lock is not held across the async send.
- Response fan-out in RunBackendReaderAsync removes the entry from
  InFlightByKeyMap before iterating InterestedParties, ensuring no
  concurrent attach can mutate the list during iteration.
- Cascade + watchdog paths also drain the key map so a stale entry
  cannot outlive its backend round-trip.

Counter accounting balance (per snapshot): CoalescedHitCount +
CoalescedMissCount equals total FC03 + FC04 requests since startup.
Even with coalescing disabled, every read still bumps Miss so dashboard
math stays balanced.

New surface (additive only):
- src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs
- src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs
- src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs
- ReadCoalescingOptions on ResilienceOptions
- CoalescedHitCount / CoalescedMissCount /
  CoalescedResponseToDeadUpstream counters surfaced on /status.json
  per PLC and as a compact "Coal" cell on the HTML status page.

Phase 9 test patch: TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire
previously read the same register from both clients (which now
coalesces). Patched to read two different addresses so the test still
proves distinct backend TxIds without violating the coalescing
contract.

Tests added: 24 new (19 unit + 5 E2E):
- CoalescingKeyTests (5)
- InFlightByKeyMapTests (6, includes concurrent stress)
- ReadCoalescingTests (8, stub-backend with deterministic delay)
- ReadCoalescingE2ETests (5, pymodbus simulator; coalescing-active
  during overlap is proven against the stub, not the sim, due to
  pymodbus 3.13's known concurrent-frame bug)

Total: 325 tests passing (282 unit + 43 E2E).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 02:26:06 -04:00
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00