# mbproxy — design plan
Architectural design for the `mbproxy` Modbus TCP proxy service: how it fronts ~54 AutomationDirect DirectLOGIC DL205/DL260 controllers, rewrites BCD tags bidirectionally inline, and recovers from listener and backend failures. Settled in a design Q&A on 2026-05-13.
**Status:** plan; no code yet. Each decision below is load-bearing — change deliberately, not by drift.
Context (what the service does and why it exists) lives in [`../CLAUDE.md`](../CLAUDE.md) under "What this is" and "Purpose: bidirectional BCD rewrite". This file is the *how*. Device quirks the design depends on live in [`../DL260/dl205.md`](../DL260/dl205.md).
Runtime shape: **.NET 10 Generic Host** worker service registered as a **Windows Service** via `Microsoft.Extensions.Hosting.WindowsServices`.
## Listener topology — per-PLC port (one port → one PLC)
The host opens **one `TcpListener` per PLC** on a distinct port. Upstream clients reach a specific PLC by connecting to its assigned proxy port; no protocol-level routing is needed.
```
Client A ──┐
Client B ──┼──→ proxy:5020 ──→ PLC #1  (10.0.1.1:502)
           ├──→ proxy:5021 ──→ PLC #2  (10.0.1.2:502)
           │    ...
           └──→ proxy:5073 ──→ PLC #54 (10.0.1.54:502)
```
## Connection model — single backend socket per PLC, multiplexed via MBAP TxId rewriting
Each PLC has **one persistent backend TCP socket**, owned by a `PlcMultiplexer`. Many upstream client connections share that single backend socket; the multiplexer distinguishes their in-flight requests by **rewriting the MBAP transaction ID** on each request and restoring each client's original TxId on the matching response. Implemented in [Phase 09](plan/09-txid-multiplexing.md); replaced the prior 1:1 per-upstream-client backend-socket model.
```
Client A ─┐
Client B ─┼─→ proxy:5020 ─[ PlcMultiplexer ]─→ PLC #1 (10.0.1.1:502)
Client C ─┘                      │   (one persistent socket)
                                 ▼
                       CorrelationMap[proxyTxId]
                       TxIdAllocator (16-bit space)
```
- **Upstream → multiplexer**: each accepted upstream socket is wrapped in an `UpstreamPipe` (read loop + bounded response channel). The pipe's read loop hands every parsed MBAP frame to the multiplexer's `OnUpstreamFrameAsync`, which allocates a free 16-bit `proxyTxId`, stores an `InFlightRequest` in a `CorrelationMap` keyed by that proxyTxId, BCD-rewrites the request payload, overwrites the MBAP header's TxId field with `proxyTxId`, and enqueues the frame into the per-PLC outbound channel.
- **Multiplexer → backend**: a single backend writer task drains the outbound channel and sends each frame to the PLC over the shared socket. A single backend reader task reads MBAP frames back, looks each up by `proxyTxId` in the correlation map, BCD-rewrites the response, restores each interested party's original TxId, and routes the frame to that party's `UpstreamPipe._responseChannel`. The single-writer / single-reader invariant on the backend socket eliminates the need for socket-level synchronisation.
- **Per-request timeout watchdog**: a periodic task scans the correlation map at a quarter of `Connection.BackendRequestTimeoutMs` and times out any in-flight request whose response has not arrived. Timed-out requests get a Modbus exception 0x0B (Gateway Target Device Failed To Respond) delivered to their upstream party and free their allocator slot. Without this watchdog, a single lost or mis-routed response would leak a correlation entry forever and hang the upstream pipe indefinitely.
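
The allocate / rewrite / restore cycle in the bullets above can be sketched as follows. This is a minimal illustration, not the project's `TxIdAllocator` or `CorrelationMap`: the class name, the linear-probing strategy, and the method names are all assumptions, and the sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: hands out proxy-side MBAP transaction IDs and
// remembers which client TxId each one stands in for.
public sealed class TxIdAllocatorSketch
{
    private readonly Dictionary<ushort, ushort> _originalByProxy = new();
    private ushort _next;

    // Returns false when all 65,536 IDs are in flight (the "saturated" case).
    public bool TryAllocate(ushort originalTxId, out ushort proxyTxId)
    {
        for (int probe = 0; probe <= ushort.MaxValue; probe++)
        {
            ushort candidate = _next;
            _next = unchecked((ushort)(_next + 1));   // wraps 65535 -> 0
            if (_originalByProxy.TryAdd(candidate, originalTxId))
            {
                proxyTxId = candidate;
                return true;
            }
        }
        proxyTxId = 0;
        return false;
    }

    // Called when the response (or a timeout) arrives: frees the slot and
    // yields the TxId the client originally sent.
    public bool TryRelease(ushort proxyTxId, out ushort originalTxId) =>
        _originalByProxy.Remove(proxyTxId, out originalTxId);

    // MBAP header bytes 0-1 carry the transaction ID, big-endian.
    public static void WriteTxId(Span<byte> mbapFrame, ushort txId)
    {
        mbapFrame[0] = (byte)(txId >> 8);
        mbapFrame[1] = (byte)(txId & 0xFF);
    }
}
```

Two clients that happen to send the same original TxId still receive distinct proxy TxIds, which is what lets their requests share one backend socket.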
**Operational consequence (replaces the prior 4-client warning).** The H2-ECOM100's 4-concurrent-TCP-client cap (see [`../DL260/dl205.md`](../DL260/dl205.md) → Behavioral Oddities) no longer limits upstream-side connection count — the proxy holds exactly one slot per PLC regardless of how many upstream clients are attached. The wire-rate ceiling is unchanged (the ECOM internally serializes requests at ~2–10 ms per scan); the multiplexer shifts where serialization happens (proxy outbound queue vs PLC accept queue) rather than adding throughput.
> ⚠ **Backend disconnect cascades upstream.** When the backend socket dies (PLC reboot, network partition, middlebox idle drop), the multiplexer closes every attached upstream pipe in the same cycle and increments `BackendDisconnectCascades` by the upstream count. Clients reconnect on their own next request and the multiplexer Polly-reconnects to the backend on the first upstream frame.
> ⚠ **pymodbus 3.13.0 simulator quirk (test-only).** The pymodbus simulator's `ServerRequestHandler` stores a single `last_pdu` per connection and schedules deferred handlers via `asyncio.call_soon`. Two MBAP frames arriving in the same recv buffer (as the multiplexer can produce on its shared backend connection) overwrite `last_pdu` before the first handler runs, and both responses then carry the later request's TxId. The real DL260 ECOM does not suffer this — it echoes per-request TxIds correctly. Multiplexer correctness under truly concurrent backend traffic is therefore proved against a stub backend in `PlcMultiplexerTests`; the E2E suite paces requests to keep pymodbus in known-good single-PDU mode. The per-request watchdog is the production defence against any backend (real or simulated) that mis-echoes a TxId.
## Configuration — single `appsettings.json`
All configuration lives in one file, loaded via `Microsoft.Extensions.Configuration` and bound to typed POCOs. No sidecar YAML/CSV.
```jsonc
{
  "Mbproxy": {
    "BcdTags": {
      "Global": [
        { "Address": 1072, "Width": 16 },
        { "Address": 1080, "Width": 32 }
      ]
    },
    "Plcs": [
      {
        "Name": "Line1-Mixer",
        "ListenPort": 5020,
        "Host": "10.0.1.1",
        "BcdTags": {
          "Add": [ { "Address": 1200, "Width": 32 } ],
          "Remove": [ 1080 ]
        }
      },
      { "Name": "Line1-Conveyor", "ListenPort": 5021, "Host": "10.0.1.2" }
      // ... 54 PLC rows
    ],
    "AdminPort": 8080,
    "Connection": {
      "BackendConnectTimeoutMs": 3000,
      "BackendRequestTimeoutMs": 3000
    },
    "Resilience": {
      "BackendConnect": { "MaxAttempts": 3, "BackoffMs": [100, 500, 2000] },
      "ListenerRecovery": { "InitialBackoffMs": [1000, 2000, 5000, 15000, 30000], "SteadyStateMs": 30000 }
    },
    "Cache": {
      "AllowLongTtl": false,          // gate for any tag CacheTtlMs > 60_000
      "MaxEntriesPerPlc": 1000,
      "EvictionIntervalMs": 5000
    }
  }
}
```
A BCD tag may optionally carry `CacheTtlMs` (default 0 = off); a `PlcOptions` entry may optionally carry `DefaultCacheTtlMs` (default 0 = off). Resolution order: explicit per-tag → per-PLC default → 0.
**Hybrid tag resolution.** For each PLC, the effective BCD tag list is `Global ∪ Add − Remove`. `Remove` matches by address; if the same address appears in both `Add` and `Global` the `Add` entry wins (this is how a width override is expressed). Validation at startup must:
- reject duplicate addresses within a single PLC's resolved list
- reject 32-bit entries that would have their high register overlap a separate 16-bit entry
- warn on `Remove` entries that don't match any global tag (probably stale config)
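
The hybrid resolution and its first two validation rules can be sketched as below. This is an illustration under stated assumptions rather than the project's validator: `TagResolverSketch` is a hypothetical name, the `BcdTag` record is repeated only to keep the block self-contained, and the sketch assumes `Remove` is applied to the global list before `Add` entries are overlaid.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed record BcdTag(ushort Address, byte Width);

public static class TagResolverSketch
{
    // Assumption: Remove filters the global list first, then Add entries
    // overlay what remains, so an Add at a global address overrides it.
    public static IReadOnlyList<BcdTag> Resolve(
        IEnumerable<BcdTag> global, IEnumerable<BcdTag> add, IEnumerable<ushort> remove)
    {
        var removed = new HashSet<ushort>(remove);
        var byAddress = new Dictionary<ushort, BcdTag>();

        foreach (var tag in global.Where(t => !removed.Contains(t.Address)))
        {
            if (byAddress.ContainsKey(tag.Address))
                throw new InvalidOperationException($"Duplicate global address {tag.Address}.");
            byAddress[tag.Address] = tag;
        }

        var seenInAdd = new HashSet<ushort>();
        foreach (var tag in add)
        {
            if (!seenInAdd.Add(tag.Address))
                throw new InvalidOperationException($"Duplicate Add address {tag.Address}.");
            byAddress[tag.Address] = tag;             // Add wins over Global
        }

        // A 32-bit tag also occupies Address + 1; no other tag may start there.
        foreach (var tag in byAddress.Values.Where(t => t.Width == 32))
        {
            if (byAddress.ContainsKey((ushort)(tag.Address + 1)))
                throw new InvalidOperationException(
                    $"32-bit tag at {tag.Address} overlaps the entry at {tag.Address + 1}.");
        }

        return byAddress.Values.OrderBy(t => t.Address).ToList();
    }
}
```

With the sample config above, `Line1-Mixer` resolves to the 16-bit tag at 1072 plus the 32-bit tag at 1200, because the global 1080 entry is removed.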
## Configuration hot-reload
`Microsoft.Extensions.Configuration` loads `appsettings.json` with `reloadOnChange: true`, and all consumers read via `IOptionsMonitor<MbproxyOptions>` so a save to the config file propagates without restarting the service. Each change kind has explicit reconcile semantics:
| Change in appsettings | Propagation |
|-----------------------|-------------|
| `BcdTags.Global` add/remove/width | Rewriter dereferences the monitor per-PDU. Next PDU sees the new map; in-flight reads/writes are not retroactively touched. |
| `Plcs[i].BcdTags.{Add,Remove}` | Same: next-PDU resolution. |
| New `Plcs[i]` entry | Listener supervisor binds the new port subject to the same eager-then-auto-recover policy. |
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream client connections for that PLC. |
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
| `Connection.Backend*TimeoutMs` | Next backend connect/request uses the new value. In-flight operations keep their already-applied timeout. |
| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` (Phase 11) | Tag-map reseat for the affected PLC drops the entire PLC cache; entries re-populate on demand under the new TTL. Per-tag flush granularity is intentionally not implemented in v1. |
| `Cache.AllowLongTtl`, `Cache.MaxEntriesPerPlc`, `Cache.EvictionIntervalMs` (Phase 11) | `AllowLongTtl` is enforced on next reload-validation; `MaxEntriesPerPlc` applies to subsequent inserts (existing entries not pruned); `EvictionIntervalMs` is read by each fresh eviction loop. |
| Invalid reload (schema break, duplicate ports, duplicate addresses in a resolved tag list, `CacheTtlMs > 60_000` without `Cache.AllowLongTtl = true`) | Reload is rejected as a whole; current in-memory config stays in effect; `mbproxy.config.reload.rejected` is logged at Error. |
Every accepted reload emits `mbproxy.config.reload.applied` at Information with a summary of which PLCs were added/removed and the size of the tag-list delta.
## BCD tag shape
```csharp
public sealed record BcdTag(ushort Address, byte Width); // Width ∈ { 16, 32 }
```
- **16-bit BCD** — one register holds 4 BCD digits (0–9999). Wire value `0x1234` decodes to decimal 1234.
- **32-bit BCD** — a CDAB-ordered register pair at `Address` and `Address+1`. The register at `Address` holds the **low 4 digits**; the register at `Address+1` holds the **high 4 digits**. Decoded decimal = `high * 10000 + low`. This follows directly from DirectLOGIC's CDAB word order (see [`../DL260/dl205.md`](../DL260/dl205.md) → Word Order).
- **Unsigned only.** DL205/DL260 BCD is non-negative in the default ladder pattern; the proxy does not implement signed BCD.
- **Holding-register and input-register addresses share the same space.** The rewriter applies the configured tag list against both FC03 and FC04 reads.
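
The two encodings above work out as in the following sketch. `BcdCodecSketch` and its method names are hypothetical; returning `false` on a non-decimal nibble is an assumption about how an invalid raw value might be surfaced, not a statement of the rewriter's actual behaviour.

```csharp
using System;

public static class BcdCodecSketch
{
    // 0x1234 -> 1234. Returns false if any nibble is not a decimal digit.
    public static bool TryDecode16(ushort raw, out int value)
    {
        value = 0;
        for (int shift = 12; shift >= 0; shift -= 4)
        {
            int digit = (raw >> shift) & 0xF;
            if (digit > 9) return false;
            value = value * 10 + digit;
        }
        return true;
    }

    // 1234 -> 0x1234. Valid for 0..9999.
    public static ushort Encode16(int value)
    {
        if (value < 0 || value > 9999) throw new ArgumentOutOfRangeException(nameof(value));
        int raw = 0;
        for (int shift = 0; shift < 16; shift += 4)
        {
            raw |= (value % 10) << shift;   // least-significant digit first
            value /= 10;
        }
        return (ushort)raw;
    }

    // CDAB pair: the register at Address holds the low 4 digits,
    // the register at Address + 1 holds the high 4 digits.
    public static bool TryDecode32(ushort lowRegister, ushort highRegister, out int value)
    {
        value = 0;
        if (!TryDecode16(lowRegister, out int low) || !TryDecode16(highRegister, out int high))
            return false;
        value = high * 10000 + low;
        return true;
    }
}
```

For example, a pair with `0x5678` at `Address` and `0x1234` at `Address+1` decodes to 12345678.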
## Read coalescing (Phase 10)
After Phase 10, FC03 / FC04 requests are additionally subject to **in-flight read coalescing** before they reach the backend. When two or more upstream clients send the same `(unitId, fc, startAddress, qty)` tuple within the in-flight window of an already-routed request, the multiplexer attaches each late arrival to the existing `InFlightRequest.InterestedParties` list instead of opening a second backend round-trip. The single backend response is fanned out to every attached party with each party's original MBAP TxId restored individually.
Properties:
- **Zero post-response staleness.** Coalescing operates entirely between "first request sent to backend" and "response received from backend" (microseconds to ~10 ms typical). Once the response is fanned out, the coalescing entry dies. Coalescing alone is NOT a cache layer — the value each upstream sees is the same value an uncoalesced request would have returned within the PLC's scan-time precision. (Phase 11 layers an opt-in cache on top — see "Response cache" below.)
- **Only FC03 / FC04.** Writes (FC06 / FC16) are non-idempotent on BCD tags and never coalesced. Different function codes never share a `CoalescingKey` even at the same address (FC03 and FC04 read different Modbus tables). Different `unitId` bytes never coalesce (different PLC personalities behind a shared socket).
- **Bounded fan-out via `MaxParties`** (default 32 in `Mbproxy.Resilience.ReadCoalescing.MaxParties`). Once an entry has `MaxParties` interested clients, the next arrival opens a fresh entry — bounds the response-fanout cost per entry at O(MaxParties) and shields the backend reader task from pathological pile-on.
- **Hot-reloadable on/off.** `Mbproxy.Resilience.ReadCoalescing.Enabled` defaults to `true`. Flipping it to `false` at runtime leaves running coalesced entries to drain naturally; subsequent FC03/04 requests take the Phase-9 (one round-trip per upstream request) path.
- **Transparency contract preserved.** Each upstream client still sees its own original MBAP TxId on the response. The BCD rewriter runs once on the shared response buffer; per-party copies are only made when fan-out has more than one party.
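
The keying and `MaxParties` rules above can be sketched as follows. The `(unitId, fc, startAddress, qty)` tuple comes from the text; everything else (`CoalescerSketch`, `InFlightRead`, string party names) is a simplified stand-in for the real `InFlightRequest.InterestedParties` bookkeeping, and the sketch is not thread-safe.

```csharp
using System.Collections.Generic;

// A record struct gives value equality, so identical tuples share a dictionary slot.
public readonly record struct CoalescingKey(byte UnitId, byte Fc, ushort Start, ushort Qty);

public sealed class InFlightRead
{
    public List<string> Parties { get; } = new();
}

public sealed class CoalescerSketch
{
    // Maps each key to the newest in-flight read that can still accept parties.
    private readonly Dictionary<CoalescingKey, InFlightRead> _newest = new();
    private readonly int _maxParties;

    public CoalescerSketch(int maxParties) => _maxParties = maxParties;

    // coalesced = true: the caller joined an existing round-trip (hit).
    // coalesced = false: the caller must send a fresh backend request (miss).
    public InFlightRead Attach(CoalescingKey key, string party, out bool coalesced)
    {
        if (_newest.TryGetValue(key, out var entry) && entry.Parties.Count < _maxParties)
        {
            coalesced = true;
        }
        else
        {
            entry = new InFlightRead();
            _newest[key] = entry;
            coalesced = false;
        }
        entry.Parties.Add(party);
        return entry;
    }

    // Called after fan-out: the entry dies with its response.
    public void Complete(CoalescingKey key, InFlightRead entry)
    {
        if (_newest.TryGetValue(key, out var current) && ReferenceEquals(current, entry))
            _newest.Remove(key);
    }
}
```

Because the function code is part of the key, an FC03 read and an FC04 read of the same address range never share an entry.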
Counter accounting balance (per snapshot): `coalescedHitCount + coalescedMissCount` equals the total number of FC03 + FC04 requests seen since the multiplexer was constructed. The identity holds whether or not coalescing is enabled: when it is disabled, `coalescedHitCount` stays at 0 and every read increments `coalescedMissCount`.
## Response cache (Phase 11) — opt-in bounded-staleness cache
**⚠ Design-contract pivot.** Through Phase 10 the proxy is *purely transparent* — every upstream read corresponds 1:1 to a recent backend round-trip (or, with Phase 10, to a peer's in-flight backend round-trip in the same microseconds-to-milliseconds window). Phase 11 changes that contract: the proxy gains an **opt-in per-tag response cache** that may serve upstream FC03/FC04 reads from in-process memory with bounded staleness up to the operator-configured `CacheTtlMs`. **The cache is OFF by default** (`CacheTtlMs = 0` on every BCD tag unless explicitly set); a fresh post-Phase-11 deployment with no TTL configuration behaves identically to a Phase-10 deployment. Operators opt tags in explicitly as their acknowledgement of the staleness window.
### Cache contract
- **Per-tag TTL.** Each BCD tag carries an optional `CacheTtlMs` (in `BcdTagOptions`). `CacheTtlMs = 0` (the default) disables caching for that tag. The TTL resolution order is **explicit per-tag → per-PLC `DefaultCacheTtlMs` → 0**.
- **Multi-tag read range: effective TTL = `min(TTLs)`.** When a single FC03/FC04 read covers multiple configured tags, the cache uses the smallest TTL among them. If any tag in the read range has `CacheTtlMs = 0`, the **whole read is uncached** — the conservative-by-design choice.
- **Lookup order: cache → coalesce → backend.** A cache hit short-circuits Phase 10's coalescing entirely. Only on a miss does the request engage coalescing (Phase 10) and then the Phase 9 backend send path.
- **Cache populates on demand only.** No polling, no predictive prefetch. Entries are created in the backend reader task **after** the BCD rewriter has run on the response — the cache stores **POST-rewriter bytes**, so hits never re-invoke the rewriter (CPU win + behaviour-stable).
- **Write invalidation by ADDRESS RANGE OVERLAP.** A successful FC06 / FC16 response (non-exception) invalidates every cached FC03/FC04 entry whose address range `[StartAddress, StartAddress + Qty)` overlaps the write range. A write to register 105 invalidates a cached `[100..110]` read but not a cached `[200..210]` read. Exception responses do not invalidate (the write didn't take effect).
- **Different unit IDs never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3,4})`.
- **Cache survives backend disconnects.** A cached entry's data was valid when stored; a disconnect does not retroactively invalidate it. Invalidations during a `recovering` listener state are skipped (the write never reached the backend, the cached read remains valid).
- **No persistence.** Process restart wipes the cache. No file/Redis backing store, no last-known-good snapshot.
- **Hot-reload flushes the entire PLC cache.** Any tag-list change to a PLC drops every cached entry for that PLC. Per-tag flush granularity is intentionally not done in v1 — the simple correctness move is "any tag-list reload → drop all entries for the affected PLC and let them re-populate."
- **TTL > 60 s requires `Cache.AllowLongTtl = true`.** Validation rejects reloads that set `CacheTtlMs > 60_000` without this opt-in. Prevents "left at 1 hour by accident" deployments.
- **LRU-bounded capacity.** Each PLC's cache is capped at `Cache.MaxEntriesPerPlc` (default 1000). When full, the next insert evicts the least-recently-used entry. A background eviction loop (interval `Cache.EvictionIntervalMs`, default 5000) also scans for expired entries.
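
Two of the rules above reduce to small pure functions: the `min(TTLs)` rule for multi-tag reads and the half-open range-overlap test used for write invalidation. The sketch below is illustrative only. `CacheRulesSketch` is a hypothetical name, and treating a read that covers no configured tag as uncached is an assumption the text does not state.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class CacheRulesSketch
{
    // Effective TTL for one read: the minimum TTL over the configured tags it
    // covers. 0 means "do not cache"; any covered tag with TTL 0 forces 0.
    // Assumption: a read covering no configured tags is also uncached.
    public static int EffectiveTtlMs(IEnumerable<int> coveredTagTtlsMs)
    {
        var ttls = coveredTagTtlsMs.ToList();
        return ttls.Count == 0 ? 0 : Math.Max(0, ttls.Min());
    }

    // Half-open ranges [start, start + qty) overlap when each one starts
    // before the other ends.
    public static bool Overlaps(int readStart, int readQty, int writeStart, int writeQty) =>
        readStart < writeStart + writeQty && writeStart < readStart + readQty;
}
```

A single-register write at 105 overlaps a cached read starting at 100 with quantity 10, and does not overlap one starting at 200, matching the example in the invalidation bullet.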
### Cache and the rewriter
The BCD rewriter runs once on the cache-miss path (the backend reader task decodes the response and stores the decoded bytes in the cache). Cache hits return pre-decoded bytes directly without re-invoking the rewriter — this is both a CPU optimisation and a correctness guarantee (any future rewriter change would not retroactively re-transform an entry that was decoded against an earlier rewriter version).
### Hot-reload semantics
| Change | Cache behaviour |
|--------|----------------|
| Tag's `CacheTtlMs` changed (any direction, 0 → N, N → 0, N → M) | Entire PLC cache is flushed; entries re-populate on demand under the new TTL. |
| New PLC added / removed | New PLC starts with empty cache; removed PLC's cache is discarded with the multiplexer. |
| `Cache.AllowLongTtl` flipped | Validation runs on next reload; existing entries unaffected. |
| `Cache.MaxEntriesPerPlc` changed | Existing entries unaffected; cap applies to subsequent inserts. |
| `Cache.EvictionIntervalMs` changed | Existing eviction loop continues until next dispose; subsequent loops use new interval. |
### Counter accounting
- `cacheHitCount` — FC03/FC04 requests served from the cache.
- `cacheMissCount` — FC03/FC04 requests that fell through to the coalescing/backend path. (Cache hit + Cache miss = total FC03/FC04 requests that were cache-eligible, i.e. whose resolved TTL was > 0; reads whose effective TTL is 0 increment neither.)
- `cacheInvalidations` — count of cache entries invalidated by FC06/FC16 write responses.
- `cacheEntryCount` — point-in-time snapshot of `ResponseCache.Count` (Tier-2 memory-watch KPI).
- `cacheBytes` — point-in-time approximation of cached PDU bytes (Tier-2 memory-watch KPI).
## Rewriter — function code scope
The rewriter inspects and rewrites payloads only for these function codes; every other FC (coils, discrete inputs, diagnostics, exception responses) passes through byte-for-byte:
| FC | Direction | Action |
|----|-----------|--------|
| 03 | request + response | FC03 requests may be coalesced with peers before reaching the backend (see Phase-10 section above); response decodes covered BCD slots from raw nibbles → binary integer |
| 04 | request + response | Same coalescing eligibility as FC03; response decoding the same as FC03 (input-register table also surfaces V-memory) |
| 06 | request | Re-encode binary integer → BCD nibbles before forwarding |
| 06 | response | Decode BCD nibbles → binary integer on the echo (clients validate that the echoed value equals the value they sent; without this, NModbus-style clients throw on the round-trip) |
| 16 | request | Re-encode binary integer → BCD nibbles per register over the configured slots, then forward |
**Partial-overlap policy.** A request that touches only ONE register of a configured 32-bit BCD pair (qty=1 at the low addr, or any read/write of the high addr alone) **passes through raw** with a `mbproxy.rewrite.partial_bcd` warning. The proxy never synthesises a Modbus exception for a partial-overlap — that response code is reserved for transport failure.
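
The partial-overlap test for a 32-bit pair can be sketched as a three-way classification. `OverlapPolicySketch` and the `Coverage` enum are hypothetical names; the sketch only decides which case applies and leaves the "pass through raw and warn" action to the caller.

```csharp
public static class OverlapPolicySketch
{
    public enum Coverage { None, Full, Partial }

    // Classifies how a request range [start, start + qty) touches a 32-bit BCD
    // pair occupying registers pairLow and pairLow + 1.
    public static Coverage Classify(int start, int qty, int pairLow)
    {
        int end = start + qty;                                   // exclusive
        bool hasLow = start <= pairLow && pairLow < end;
        bool hasHigh = start <= pairLow + 1 && pairLow + 1 < end;
        if (hasLow && hasHigh) return Coverage.Full;             // rewrite the pair
        return hasLow || hasHigh ? Coverage.Partial              // pass through raw + warn
                                 : Coverage.None;                // pair not involved
    }
}
```

For a pair at 1080, a `qty=1` read at 1080 or at 1081 is `Partial`, while a `qty=2` read at 1080 is `Full`.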
## Failure modes — transparent pass-through with Polly-bounded backend connect
- **PLC returns a Modbus exception (codes 01–04)** → forward verbatim with the original MBAP transaction ID. The client sees the real DL205/DL260 exception.
- **Backend connect refused or initial connect timeout** → retry under a Polly resilience pipeline: 3 attempts at 100ms / 500ms / 2000ms backoff (tuned via `Resilience.BackendConnect`). If all attempts fail, the multiplexer closes the upstream client connection that triggered the connect.
- **Backend mid-stream broken socket** → the multiplexer's reader/writer task throws; the backend tear-down path cancels both tasks, drains the correlation map, and **cascades the disconnect by closing every attached upstream pipe**. The next upstream request to any pipe triggers a fresh backend connect through the Polly pipeline. `BackendDisconnectCascades` counter records the upstream-pipe count at each cascade event.
- **Backend request timeout** → the per-request watchdog times out any correlation entry older than `Connection.BackendRequestTimeoutMs`, delivers Modbus exception 0x0B (Gateway Target Device Failed To Respond) with the original TxId to the upstream party, and frees the proxy TxId. **No mid-request retries**: FC06 / FC16 are non-idempotent on BCD tags (a partially applied multi-register write could leave a 32-bit BCD tag mid-transition), so every in-flight request is one-shot. The client interprets the 0x0B as a transport failure and reconnects through its normal path.
- **Partial-BCD overlap** → forward raw + warn (see Rewriter section).
- **One slow PLC does not stall the rest of the fleet.** Each PLC has its own `PlcMultiplexer`, with its own backend socket, correlation map, and outbound channel; per-PLC failures are local. A slow or dead backend on one PLC only impacts that PLC's clients.
- **Cache during backend recovery (Phase 11).** Cache hits remain valid during a `recovering` listener state — the data was fresh when cached, and recovery only affects future requests. Writes that arrive during recovery never reach the backend, so the invalidation never happens. This is consistent: the write also didn't take effect on the PLC. Cached entries simply remain until their TTL expires.
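
The 0x0B response delivered on a backend request timeout is a standard Modbus TCP exception frame. A minimal sketch of building one, assuming a hypothetical `GatewayExceptionSketch` helper (the proxy's own frame builder is not shown in this document):

```csharp
using System;

public static class GatewayExceptionSketch
{
    // Builds a 9-byte Modbus TCP exception response: a 7-byte MBAP header,
    // the function code with its high bit set, then the exception code.
    public static byte[] Build(ushort originalTxId, byte unitId, byte functionCode,
                               byte exceptionCode = 0x0B)
    {
        return new byte[]
        {
            (byte)(originalTxId >> 8), (byte)(originalTxId & 0xFF), // transaction ID, big-endian
            0x00, 0x00,                                             // protocol ID (0 = Modbus)
            0x00, 0x03,                                             // length: unit ID + FC + code
            unitId,
            (byte)(functionCode | 0x80),                            // exception flag
            exceptionCode,
        };
    }
}
```

The transaction ID written here is the client's original TxId, not the proxy TxId, so the client can match the exception to its own request.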
## Startup posture — eager, continue on per-port failure
At startup the host attempts to bind **all 54 listen sockets up front**. Each failure (port already in use, invalid IP, malformed PLC entry) is logged at Error and handed off to the listener supervisor (next section). The service proceeds with whichever PLCs bound on the first attempt; the rest converge in the background. Monitoring should alert on `mbproxy.startup.bind.failed` so missing PLCs aren't silently dropped, and watch for `mbproxy.listener.recovered` to confirm late binds eventually succeeded.
## Listener auto-recovery (Polly-backed supervisor)
Each PLC's listener runs under a **supervisor task** that owns its bind lifecycle. If a bind fails at startup, or if a listener faults at runtime (port stolen by another process, transient OS network reset), the supervisor reattempts via a Polly retry pipeline: 5 attempts at 1s / 2s / 5s / 15s / 30s backoff, then steady-state retries every 30s indefinitely (tuned via `Resilience.ListenerRecovery`). Each attempt logs at Debug; the bind that finally succeeds emits one `mbproxy.listener.recovered` Information event.
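
The backoff schedule described above (a fixed initial ladder, then a steady-state interval forever) can be expressed as a plain function of the attempt number. This sketch is independent of Polly and does not show how the supervisor wires the delay into its retry pipeline; `ListenerBackoffSketch` and `DelayFor` are hypothetical names.

```csharp
using System;
using System.Collections.Generic;

public static class ListenerBackoffSketch
{
    // attempt is 1-based. The first initialBackoffMs.Count attempts use the
    // listed delays; every later attempt uses steadyStateMs.
    public static TimeSpan DelayFor(int attempt, IReadOnlyList<int> initialBackoffMs, int steadyStateMs)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        int ms = attempt <= initialBackoffMs.Count
            ? initialBackoffMs[attempt - 1]
            : steadyStateMs;
        return TimeSpan.FromMilliseconds(ms);
    }
}
```

With `InitialBackoffMs = [1000, 2000, 5000, 15000, 30000]` and `SteadyStateMs = 30000`, attempt 3 waits 5 s and every attempt from the fifth onward waits 30 s.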
While a supervisor is between attempts, the corresponding PLC is reported as `listener.state = recovering` on the status page. Hot-reload uses the same supervisor to bring newly-added PLCs online and to tear down removed ones — there is exactly one code path for "bring up a listener" and one for "shut a listener down."
## Logging — Serilog, structured, console + rolling file
Serilog wired through the Microsoft.Extensions.Logging bridge:
- **Console sink** for interactive `--console` runs.
- **Rolling-file sink** under `%ProgramData%\mbproxy\logs\`.
- **Default level** Information. Per-PLC and per-client scopes via `LogContext.PushProperty("Plc", name)` / `("Client", remoteEp)` so log lines are greppable across the fleet.
Stable event names (keep these stable so log queries don't churn):
| Event | Level | Properties |
|-------|-------|------------|
| `mbproxy.startup.bind` | Info | `Plc`, `Port` |
| `mbproxy.startup.bind.failed` | Error | `Plc`, `Port`, `Reason` |
| `mbproxy.listener.recovered` | Info | `Plc`, `Port`, `AttemptCount` |
| `mbproxy.client.connected` | Info | `Plc`, `RemoteEp` |
| `mbproxy.client.disconnected` | Info | `Plc`, `RemoteEp`, `Reason` |
| `mbproxy.backend.failed` | Warning | `Plc`, `Reason` |
| `mbproxy.rewrite.partial_bcd` | Warning | `Plc`, `Address`, `ClientStart`, `ClientQty` |
| `mbproxy.rewrite.invalid_bcd` | Warning | `Plc`, `Address`, `RawValue`, `Direction` |
| `mbproxy.exception.passthrough` | Info | `Plc`, `Fc`, `ExceptionCode` |
| `mbproxy.config.reload.applied` | Info | `PlcsAdded`, `PlcsRemoved`, `TagDelta` |
| `mbproxy.config.reload.rejected` | Error | `Reason` |
| `mbproxy.admin.bind.failed` | Error | `Port`, `Reason` |
| `mbproxy.multiplex.backend.connected` | Info | `Plc`, `Host`, `Port` |
| `mbproxy.multiplex.backend.disconnected` | Warning | `Plc`, `UpstreamCount`, `InFlightCount`, `Reason` |
| `mbproxy.multiplex.saturated` | Error | `Plc`, `RemoteEp` (16-bit TxId space full) |
| `mbproxy.multiplex.request.timeout` | Warning | `Plc`, `ProxyTxId`, `OriginalTxId`, `Fc`, `ElapsedMs` |
| `mbproxy.coalesce.hit` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty`, `PartyCount` |
| `mbproxy.coalesce.miss` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.coalesce.dead_upstream` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.cache.hit` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.cache.miss` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty` |
| `mbproxy.cache.store` | Debug | `Plc`, `UnitId`, `Fc`, `Start`, `Qty`, `TtlMs` |
| `mbproxy.cache.invalidated` | Debug | `Plc`, `UnitId`, `WriteStart`, `WriteQty`, `Count` |
| `mbproxy.cache.flushed` | Info | `Plc`, `Reason`, `Count` |
## Status page — read-only HTTP endpoint
A separate **Kestrel-hosted minimal API** runs on `Mbproxy.AdminPort` (default `8080`, distinct from the Modbus listen ports). The endpoint set is intentionally narrow — read-only telemetry; **no admin actions** (kick client, force reload, restart listener) are exposed:
- `GET /` — single self-contained HTML page rendering a table of all configured PLCs with their state and live counters. Auto-refreshes every 5s via a meta-refresh tag (no JS bundle, no external assets).
- `GET /status.json` — the same data as JSON for monitoring scrapers.
Authentication is assumed to live at the network layer (trusted internal segment behind a firewall). Surface that assumption in deployment docs when they exist.
**Service-wide fields:**
| Field | Meaning |
|-------|---------|
| `service.uptime` | Seconds since service start |
| `service.version` | Assembly informational version |
| `service.config.lastReloadUtc` | Timestamp of last accepted hot-reload (or `null`) |
| `service.config.reloadCount` | Number of reloads accepted since start |
| `service.config.reloadRejectedCount` | Number of reloads rejected since start |
| `listeners.bound` / `listeners.configured` | Bound listener count vs configured PLC count |
**Per-PLC fields** (one row per `Plcs[i]`):
| Field | Meaning |
|-------|---------|
| `name`, `host`, `listenPort` | Identity from config |
| `listener.state` | `bound` / `recovering` / `stopped` |
| `listener.lastBindError` | Most recent bind failure message (when `recovering`) |
| `listener.recoveryAttempts` | Polly retry count since last successful bind |
| `clients.connected` | Currently connected upstream client count |
| `clients.remoteEndpoints` | Array of `{ remote, connectedAtUtc, pdusForwarded }` |
| `pdus.forwarded` | Total PDUs (request+response) forwarded since start |
| `pdus.byFc` | `{ fc03, fc04, fc06, fc16, other }` request counts |
| `pdus.rewrittenSlots` | Count of register slots BCD-rewritten |
| `pdus.partialBcdWarnings` | Count of partial-overlap pass-throughs |
| `backend.connects.success` / `backend.connects.failed` | Polly-final-result counters |
| `backend.exceptions.byCode` | `{ "01": n, "02": n, "03": n, "04": n }` |
| `backend.lastRoundTripMs` | EWMA of recent successful round-trip times |
| `backend.coalescedHitCount` | FC03/04 requests that attached to an already-in-flight peer (Phase 10) |
| `backend.coalescedMissCount` | FC03/04 requests that opened a fresh backend round-trip (Phase 10). `Hit + Miss` = total FC03/04 requests |
| `backend.coalescedResponseToDeadUpstream` | Coalesced fan-out responses skipped because the attached upstream had already disconnected (Phase 10) |
| `backend.cacheHitCount` | FC03/04 reads served from the response cache (Phase 11) |
| `backend.cacheMissCount` | FC03/04 reads that fell through to coalescing/backend after a cache miss (Phase 11) |
| `backend.cacheInvalidations` | Cache entries invalidated by overlapping FC06/FC16 write responses (Phase 11) |
| `backend.cacheEntryCount` | Point-in-time snapshot of the per-PLC cache's entry count (Phase 11, Tier-2 memory-watch) |
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC (Phase 11, Tier-2 memory-watch) |
| `bytes.upstreamIn` / `bytes.upstreamOut` | Bytes forwarded each direction |
Counters are `long` fields updated through `System.Threading.Interlocked` and read atomically when a status request is served; the read path takes no locks.
## Test simulator — pymodbus DL260/DL205 server
The pymodbus profile at [`../DL260/dl205.json`](../DL260/dl205.json) already models the DL205/DL260 quirks (BCD nibbles at known addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings, etc.) as concrete register seeds. The test infrastructure wraps it as a managed lifecycle so every integration / e2e test gets a fresh known-good DL-series target without needing real hardware.
|
||
|
||
Harness shape (lives under `tests/sim/`):
|
||
|
||
- **Launcher script** — `tests/sim/run-dl205-sim.ps1` provisions a Python venv under `tests/sim/.venv` on first run (`python -m venv` + `pip install pymodbus`), then launches `pymodbus.server` with the `dl205.json` profile on a configurable port. Idempotent: re-runs reuse the venv.
|
||
- **xUnit fixture** — `Mbproxy.Tests.Sim.DL205SimulatorFixture : IAsyncLifetime` that:
|
||
- `InitializeAsync`: spawns the simulator subprocess, polls `TcpClient.ConnectAsync` against the port until success or a 10 s deadline, captures stdout/stderr to test output.
|
||
- `DisposeAsync`: signals graceful shutdown (Ctrl-C on the process group on Windows), then `Process.Kill(entireProcessTree: true)` as a safety net.
|
||
- Exposes `Host`, `Port`, `LogTail` (last N lines of sim stderr for diagnosis).
|
||
- **Test collection** — `[CollectionDefinition(nameof(DL205SimulatorCollection))]` so the fixture is shared across all integration/e2e classes that opt in (cheap startup, expensive process churn).
|
||
- **Skip policy** — if Python or pymodbus isn't available and the auto-provision fails (no network, locked-down CI image, etc.), `InitializeAsync` records the reason and tests skip via `Assert.Skip(sim.SkipReason)`. CI must have Python 3.10+ available; local devs running only the rewriter unit tests need nothing extra.
|
||
- **Alternate profiles** — additional scenarios (e.g., a profile that seeds a specific partial-overlap test case, or a profile with strict `type exception: true` to verify the proxy doesn't depend on lax pymodbus behaviour) live alongside `dl205.json` and are selected via `MODBUS_SIM_PROFILE` env var, matching the pattern already established by [`../DL260/DL205BcdQuirkTests.cs`](../DL260/DL205BcdQuirkTests.cs).
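
The readiness check described for `InitializeAsync` (poll a TCP connect until it succeeds or a deadline passes) can be sketched as follows. This is a Python illustration of the polling pattern only, not the fixture's .NET code. The function name `wait_for_port` and the 100 ms retry interval are assumptions; the 10 s default mirrors the deadline named in the list.

```python
import socket
import time


def wait_for_port(host: str, port: int,
                  deadline_s: float = 10.0, interval_s: float = 0.1) -> bool:
    """Return True once a TCP connect to (host, port) succeeds.

    Returns False if no connect succeeded before the deadline elapsed.
    """
    stop_at = time.monotonic() + deadline_s
    while time.monotonic() < stop_at:
        try:
            # A successful connect means the simulator is accepting clients.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            # Refused or timed out: the listener is not up yet, so retry.
            time.sleep(interval_s)
    return False
```

A fixture built this way would treat a `False` result as a startup failure and surface the captured simulator output, rather than letting every test fail on its first connect.
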

The simulator IS the proxy's end-to-end test bed. A standard e2e test does:

1. Start the simulator at `127.0.0.1:<simPort>`.
2. Configure the proxy with one PLC entry `Host=127.0.0.1, Port=<simPort>, ListenPort=<proxyPort>`.
3. Start the proxy (in-process via `WebApplicationFactory`-style host construction).
4. Drive a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `127.0.0.1:<proxyPort>`.
5. Assert two directions:
   - **Read**: client sees the BCD-decoded integer (proxy rewrote the response).
   - **Write**: simulator's register state shows the BCD-encoded nibbles (proxy rewrote the request).
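
The two assertions in step 5 compare the same value in two encodings. A minimal Python sketch for a single 16-bit register, assuming four decimal digits packed one per nibble, shows the relationship. The helper names are hypothetical, and raising on a non-decimal nibble is a choice made for this sketch rather than documented proxy behaviour.

```python
def bcd_encode(value: int) -> int:
    """Pack a decimal value 0..9999 into a 16-bit BCD register."""
    if not 0 <= value <= 9999:
        raise ValueError(f"{value} does not fit in four BCD digits")
    register = 0
    for shift in (0, 4, 8, 12):  # least significant digit first
        value, digit = divmod(value, 10)
        register |= digit << shift
    return register


def bcd_decode(register: int) -> int:
    """Unpack a 16-bit BCD register into its decimal value."""
    value = 0
    for shift in (12, 8, 4, 0):  # most significant nibble first
        nibble = (register >> shift) & 0xF
        if nibble > 9:
            raise ValueError(f"0x{register:04X} is not valid BCD")
        value = value * 10 + nibble
    return value


# Write direction: the client sends 1234, the simulator should hold 0x1234.
assert bcd_encode(1234) == 0x1234
# Read direction: the simulator holds 0x1234, the client should see 1234.
assert bcd_decode(0x1234) == 1234
```

Read as a plain integer, `0x1234` is 4660, which is the value a client would see if the rewrite were skipped.
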

## Testing

- **Unit tests** — drive the BCD rewriter with synthetic Modbus PDU byte arrays. No network, no simulator. Cover every FC03/04/06/16 × {single 16-bit, full 32-bit pair, partial-overlap low, partial-overlap high, mixed-with-non-BCD} cell.
- **Integration tests** — drive the proxy end-to-end against the pymodbus simulator described in the previous section, using a plain Modbus TCP client (`NModbus` or `FluentModbus`) against `proxy:<listenPort>` and asserting the decoded value rather than the raw register bytes.
- **Auto-recovery tests** — bind a `TcpListener` on a target port BEFORE starting the proxy, assert that the supervisor enters `recovering` state, release the port, and assert the next supervisor attempt succeeds and `mbproxy.listener.recovered` fires. Also cover the runtime-fault path by forcing the accept loop to throw and asserting the supervisor reattempts.
- **Hot-reload tests** — write a temp `appsettings.json`, start the host, mutate the file (add a PLC, remove a PLC, change a global tag width), and assert: (a) supervisor adds/removes the affected listener, (b) the rewriter on the next PDU reflects the new tag map, (c) a malformed reload is rejected without breaking the running config. Cover both `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` paths.
- **Status page tests** — start the host, induce known events (connect 2 clients, force a backend exception, trigger a partial-BCD warning), and assert `GET /status.json` returns the expected counters. The HTML page is verified separately as a smoke test that the route returns 200 with `text/html`.
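
To make the unit-test bullet concrete, here is a Python sketch of one matrix cell: an FC03 response mixing a BCD register with a non-BCD one. The PDU layout (function code, byte count, big-endian registers) is standard Modbus. The `rewrite_fc03_response` helper and its slot-index parameter are hypothetical stand-ins for the real rewriter, which would resolve BCD slots from the configured tag map.

```python
import struct


def fc03_response(registers):
    """Build an FC03 response PDU: function code, byte count, register data."""
    body = struct.pack(f">{len(registers)}H", *registers)
    return bytes([0x03, len(body)]) + body


def rewrite_fc03_response(pdu: bytes, bcd_slots) -> bytes:
    """Return a copy of the PDU with the given register slots BCD-decoded."""
    count = pdu[1] // 2
    registers = list(struct.unpack(f">{count}H", pdu[2:2 + 2 * count]))
    for slot in bcd_slots:
        # The hex digits of a BCD register are its decimal digits, so
        # 0x1234 formats as "1234". int() rejects a nibble above 9.
        registers[slot] = int(f"{registers[slot]:04x}")
    return fc03_response(registers)


# Cell "FC03 x mixed-with-non-BCD": slot 0 is BCD, slot 1 passes through.
raw = fc03_response([0x1234, 0x00FF])
rewritten = rewrite_fc03_response(raw, bcd_slots={0})
assert rewritten == bytes([0x03, 0x04, 0x04, 0xD2, 0x00, 0xFF])  # 1234 == 0x04D2
```

Because the input and expected output are plain byte arrays, each cell of the matrix reduces to a table-driven case with no network or simulator involved.
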