mbproxy/docs: split deep docs into focused PascalCase files per StyleGuide

Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.

Architecture/
  - Overview.md         (150 lines) — listener topology, request flow, per-PLC isolation
  - ConnectionModel.md  (247 lines) — TxId multiplexer, watchdog, disconnect cascade
  - ReadCoalescing.md   (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
  - ResponseCache.md    (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation

Features/
  - BcdRewriting.md     (252 lines) — codec, CDAB, FC scope, partial-overlap policy
  - HotReload.md        (189 lines) — IOptionsMonitor + per-change-kind reconcile rules

Operations/
  - Configuration.md    (422 lines) — every Mbproxy:* option + validation rules
  - StatusPage.md       (334 lines) — admin endpoint surface, every JSON field
  - Troubleshooting.md  (364 lines) — diagnosis playbook keyed to log events

Reference/
  - LogEvents.md        (499 lines) — 28 events across 7 categories, grep-verified

Testing/
  - Simulator.md        (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk

Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.

Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-14 03:44:34 -04:00
parent 4fcda87ecd
commit f49e27e316
12 changed files with 3363 additions and 0 deletions
+422
@@ -0,0 +1,422 @@
# Configuration Reference
`mbproxy` binds its runtime configuration from `appsettings.json` under the `Mbproxy` section. This document is the full reference for every supported key, its type, default, range, and validation rules.
## File Location
The configuration loader resolves `appsettings.json` relative to the executable.
- **Development run** (`dotnet run`): `src/Mbproxy/appsettings.json` next to the build output.
- **Single-file publish** (`dotnet publish -c Release -r win-x64`): `appsettings.json` next to `Mbproxy.exe` in the publish folder.
- **Installed as a Windows Service**: `%ProgramData%\mbproxy\appsettings.json`. The install script copies the template at `install/mbproxy.config.template.json` to this path the first time only — an existing file is preserved across reinstalls.
The file is loaded with `reloadOnChange: true`. All consumers read through `IOptionsMonitor<MbproxyOptions>`, so a save propagates without restarting the service. See [`../Features/HotReload.md`](../Features/HotReload.md) for per-key propagation semantics.
The .NET configuration provider accepts `//` and `/* */` comments (JSONC) in `appsettings.json` when loaded through `Host.CreateApplicationBuilder`. The install template ships with comments.
Environment variables and command-line arguments are also accepted by the host. Either form can override any `Mbproxy:*` key; for example, `Mbproxy__AdminPort=9090` (double-underscore segment separator) overrides the JSON. Environment overrides are useful for ephemeral diagnostic switches but should not replace the file as the source of truth — `ReloadValidator` runs against the merged configuration on every reload.
## Top-Level Schema
Every supported key under `Mbproxy:*`, populated to a representative default:
```jsonc
{
  "Mbproxy": {
    // Global BCD tag list: applies to every PLC unless overridden per-PLC.
    "BcdTags": {
      "Global": [
        { "Address": 1024, "Width": 16 },  // 16-bit BCD register
        { "Address": 1056, "Width": 32 },  // 32-bit BCD pair (CDAB)
        { "Address": 1088, "Width": 16, "CacheTtlMs": 1000 }  // opt-in cache, 1 s TTL
      ]
    },
    // One entry per PLC. Each maps an upstream proxy port to a backend Modbus TCP endpoint.
    "Plcs": [
      {
        "Name": "Line1-Mixer",
        "ListenPort": 5020,
        "Host": "10.0.1.1",
        "Port": 502,
        "DefaultCacheTtlMs": 0,
        "BcdTags": {
          "Add": [ { "Address": 1200, "Width": 32 } ],
          "Remove": [ 1056 ]
        }
      }
    ],
    // Read-only HTTP status page. Must be in [1, 65535].
    "AdminPort": 8080,
    // Backend connection / request / shutdown timeouts.
    "Connection": {
      "BackendConnectTimeoutMs": 3000,
      "BackendRequestTimeoutMs": 3000,
      "GracefulShutdownTimeoutMs": 10000
    },
    // Polly resilience policies.
    "Resilience": {
      "BackendConnect": {
        "MaxAttempts": 3,
        "BackoffMs": [ 100, 500, 2000 ]
      },
      "ListenerRecovery": {
        "InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ],
        "SteadyStateMs": 30000
      },
      "ReadCoalescing": {
        "Enabled": true,
        "MaxParties": 32
      }
    },
    // Response-cache safety knobs. The cache is off by default per tag.
    "Cache": {
      "AllowLongTtl": false,
      "MaxEntriesPerPlc": 1000,
      "EvictionIntervalMs": 5000
    }
  }
}
```
`Serilog` configuration is documented in [`./Troubleshooting.md`](./Troubleshooting.md) and lives outside the `Mbproxy` section.
## `Mbproxy.AdminPort`
Port for the read-only HTTP status server. Binds to all interfaces on startup.
| Field | Type | Default | Range |
|-------|------|---------|-------|
| `AdminPort` | int | `8080` | `[1, 65535]` |
`ReloadValidator` rejects values outside `[1, 65535]` and rejects collisions with any `Plcs[i].ListenPort`. Source: `MbproxyOptions.AdminPort`.
The server exposes `GET /` (auto-refreshing HTML) and `GET /status.json`. See [`./StatusPage.md`](./StatusPage.md) for the schema.
Authentication is assumed at the network layer (trusted internal segment). The endpoint is read-only — there are no `POST` / `PUT` / `DELETE` routes — so the risk surface is limited to status disclosure. Place the admin port behind a firewall rule that allows only operator workstations.
## `Mbproxy.Plcs[]`
One entry per PLC. The array drives the listener supervisor; on reload, entries added here cause new listeners to bind and entries removed here cause listeners to stop. Source: `PlcOptions.cs`.
| Field | Type | Default | Required | Notes |
|-------|------|---------|----------|-------|
| `Name` | string | `""` | yes | Non-empty, unique across the array. Shown on the status page and in structured logs as `plc`. |
| `ListenPort` | int | `0` | yes | Port the proxy listens on. `[1, 65535]`. Unique across the array. Cannot collide with `AdminPort`. |
| `Host` | string | `""` | yes | PLC IP address or hostname. |
| `Port` | int | `502` | no | Backend Modbus TCP port on the PLC. |
| `BcdTags` | object | `null` | no | Per-PLC overrides on top of `Mbproxy.BcdTags.Global`. See below. |
| `DefaultCacheTtlMs` | int | `0` | no | Fallback TTL in milliseconds for any tag on this PLC whose explicit `CacheTtlMs` is unset (`null`). `0` disables caching by default. |
### `Plcs[i].BcdTags`
Per-PLC override block. Resolution: the effective tag list for a PLC is `Global ∪ Add − Remove`, with `Add` winning on width when an address appears in both `Global` and `Add`. Source: `BcdTagListOptions.PlcBcdOverrides`.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `Add` | `BcdTagOptions[]` | `[]` | Tags to append for this PLC. Can override a `Global` entry's `Width` by repeating the address. Each entry follows the `BcdTagOptions` shape (see next section). |
| `Remove` | `ushort[]` | `[]` | Addresses to drop from this PLC's effective list. Matches by address. |
The full tag-list resolution algorithm — `Add` width override semantics, overlap detection, and per-PLC tag-map flushing on reload — is documented in [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md).
A subtle case worth pinning down: when an address appears in both `Mbproxy.BcdTags.Global[]` and `Plcs[i].BcdTags.Add[]`, the per-PLC `Add` entry wins on `Width` and `CacheTtlMs`. This is how a per-PLC width override is expressed (for example, a 16-bit tag globally, promoted to 32-bit on the one PLC that uses the wider format). To strip a global tag from a PLC entirely, use `Remove`; do not add a same-address entry with `Width = 0`.
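The resolution rules above can be modelled in a few lines. This is an illustrative Python sketch, not the `BcdTagMapBuilder` implementation: the `resolve_tags` helper and the dict-based tag shape are hypothetical, and applying `Remove` after `Add` is an assumption about ordering.

```python
def resolve_tags(global_tags, add=(), remove=()):
    """Return the effective tag list for one PLC, sorted by address."""
    # Start from the fleet-wide list, keyed by address.
    effective = {tag["Address"]: dict(tag) for tag in global_tags}
    # A per-PLC Add entry replaces a Global entry at the same address.
    for tag in add:
        effective[tag["Address"]] = dict(tag)
    # Assumption: Remove is applied after Add.
    for address in remove:
        effective.pop(address, None)
    return [effective[a] for a in sorted(effective)]


GLOBAL = [
    {"Address": 1024, "Width": 16},
    {"Address": 1056, "Width": 32},
    {"Address": 1088, "Width": 16, "CacheTtlMs": 1000},
]

mixer = resolve_tags(GLOBAL, add=[{"Address": 1200, "Width": 32}], remove=[1056])
print([t["Address"] for t in mixer])  # [1024, 1088, 1200]
```

Repeating a global address in `add` with a different `Width` yields the per-PLC width in the result, which mirrors the override case described above.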
## `Mbproxy.BcdTags.Global[]`
The fleet-wide BCD tag list. Every PLC starts with this set, then applies its per-PLC `Add` / `Remove` overrides. Source: `BcdTagListOptions.Global`, entries of type `BcdTagOptions`.
| Field | Type | Default | Range | Notes |
|-------|------|---------|-------|-------|
| `Address` | ushort | `0` | `[0, 65535]` | Modbus PDU address (decimal). Address `0` is valid on DL205/DL260 — do not skip it. Octal V-memory addresses must be converted: `V2000` octal = decimal `1024`. |
| `Width` | byte | `0` | `{ 16, 32 }` | Bit width. `16` is one register holding 4 BCD digits (`0–9999`). `32` is a CDAB-ordered register pair at `Address` (low word) and `Address+1` (high word). |
| `CacheTtlMs` | int? | `null` | `>= 0`, `<= 60000` unless `Cache.AllowLongTtl = true` | Optional per-tag opt-in to the response cache. `null` falls back to the PLC's `DefaultCacheTtlMs`. `0` explicitly disables caching for this tag even when the PLC default is non-zero. |
`MbproxyOptionsValidator` rejects any entry whose `Width` is not `16` or `32`. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) for the wire encoding rules and the multi-tag-overlap validation that runs in `BcdTagMapBuilder`.
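To make the two widths concrete, here is a minimal decoding sketch. It assumes standard packed BCD (one decimal digit per nibble) and, for 32-bit tags, that the high word carries the upper four digits. Both function names are hypothetical; the proxy's actual codec is described in `BcdRewriting.md`.

```python
def bcd16_to_int(word):
    """Decode one 16-bit packed-BCD register (four digits) to an int."""
    value = 0
    for shift in (12, 8, 4, 0):
        digit = (word >> shift) & 0xF
        if digit > 9:
            raise ValueError(f"invalid BCD nibble {digit:#x} in {word:#06x}")
        value = value * 10 + digit
    return value


def bcd32_cdab_to_int(low_word, high_word):
    """Decode a register pair: low word at Address, high word at Address+1."""
    # Assumption: the high word holds the upper four decimal digits.
    return bcd16_to_int(high_word) * 10_000 + bcd16_to_int(low_word)


print(bcd16_to_int(0x1234))               # 1234
print(bcd32_cdab_to_int(0x5678, 0x1234))  # 12345678
```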
Address conversion examples for operators coming from DirectLOGIC ladder:
| V-memory (octal) | Modbus PDU (decimal) |
|------------------|----------------------|
| `V2000` | `1024` |
| `V2040` | `1056` |
| `V2100` | `1088` |
| `V2200` | `1152` |
The proxy expects PDU-decimal addresses. Do not use octal V-memory addresses and do not use 1-based `4xxxx` Modbus references — both will resolve to the wrong register.
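The conversion in the table is a plain octal-to-decimal parse. A small sketch, with the hypothetical helper name `vmem_to_pdu`:

```python
def vmem_to_pdu(v_address):
    """Convert a V-memory label such as 'V2000' to a decimal PDU address."""
    digits = v_address.upper().lstrip("V")
    return int(digits, 8)  # V-memory addresses are written in octal


for label in ("V2000", "V2040", "V2100", "V2200"):
    print(label, vmem_to_pdu(label))
# V2000 1024
# V2040 1056
# V2100 1088
# V2200 1152
```

Because the parse is base 8, a label containing the digit `8` or `9` raises `ValueError`, which is a cheap way to catch an address that was never octal in the first place.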
## `Mbproxy.Connection`
Backend connection and shutdown timeouts. Source: `ConnectionOptions.cs`.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `BackendConnectTimeoutMs` | int | `3000` | Max time in milliseconds to wait for one TCP connect to the backend PLC. Each Polly retry attempt is bounded by its own copy of this timeout — total worst-case connect time is `MaxAttempts * BackendConnectTimeoutMs` plus the configured backoffs. |
| `BackendRequestTimeoutMs` | int | `3000` | Max time in milliseconds to wait for the PLC to respond to a forwarded PDU. On timeout the upstream client is disconnected. FC06 / FC16 writes are not retried because they are non-idempotent on BCD tags; FC03 / FC04 reads are also not retried mid-request (a fresh upstream request takes the full pipeline again). |
| `GracefulShutdownTimeoutMs` | int | `10000` | Max time in milliseconds the shutdown coordinator waits for in-flight PDUs to drain after a stop signal (`sc.exe stop` or Windows Service stop). After this deadline, remaining work is cancelled. Keep at or below the Service Control Manager wait hint (30 s). |
On hot reload, `BackendConnectTimeoutMs` and `BackendRequestTimeoutMs` apply to the next backend connect or request — in-flight operations keep their already-applied timeout. `GracefulShutdownTimeoutMs` is sampled only at shutdown.
Operational sizing notes:
- The default 3 s connect timeout is appropriate for a local Ethernet segment to a healthy ECOM100. On WAN paths or for devices behind switches with slow MAC-table aging, raise to 5–10 s.
- A 3 s request timeout is generous compared with typical DL205/DL260 scan times (a few ms to tens of ms for FC03 of 100 registers). The slack absorbs PLC scan-overlap jitter without faulting the upstream client.
- `GracefulShutdownTimeoutMs` should be less than the Service Control Manager's stop deadline. The default 10 s suits a fleet of 54 PLCs; on a much larger fleet, raise both the SCM wait hint and this value in lockstep.
## `Mbproxy.Resilience`
Polly retry pipelines for backend connect, listener bind, and the in-flight read coalescer. Source: `ResilienceOptions.cs`.
### `Mbproxy.Resilience.BackendConnect`
Bounded retries on the backend TCP connect path. Mid-request failures (during a forwarded PDU) are never retried.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `MaxAttempts` | int | `3` | Total connect tries, including the first. `1` disables retries. |
| `BackoffMs` | int[] | `[100, 500, 2000]` | Delay in milliseconds between attempts. Must have `MaxAttempts - 1` entries. |
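The worst-case connect time follows from `MaxAttempts`, `BackoffMs`, and `Connection.BackendConnectTimeoutMs`. The sketch below works the arithmetic for the defaults. It assumes one delay between consecutive attempts, so only the first `MaxAttempts - 1` backoff entries are consumed; the helper name is hypothetical.

```python
def worst_case_connect_ms(max_attempts, backoff_ms, connect_timeout_ms):
    """Upper bound on time spent connecting before the pipeline gives up."""
    # Each attempt may run for the full connect timeout.
    attempts = max_attempts * connect_timeout_ms
    # Assumption: delays happen only between attempts, never after the last one.
    delays = sum(backoff_ms[: max_attempts - 1])
    return attempts + delays


print(worst_case_connect_ms(3, [100, 500, 2000], 3000))  # 9600
```

Under that assumption the defaults give 3 × 3000 ms of connect time plus 100 ms and 500 ms of backoff, or 9.6 s in total.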
### `Mbproxy.Resilience.ListenerRecovery`
Unbounded retries on the listener bind path. If a PLC's `ListenPort` cannot be bound (port in use, bad interface, transient OS error), the supervisor cycles through `InitialBackoffMs` once, then repeats `SteadyStateMs` forever. The same recovery code path also reacts to a listener that faults at runtime (for example, the underlying socket dies) and to listeners that come online from a hot-reload that adds a new PLC.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `InitialBackoffMs` | int[] | `[1000, 2000, 5000, 15000, 30000]` | Backoff schedule for the first N retries after a fault. |
| `SteadyStateMs` | int | `30000` | Backoff for every retry after the initial schedule is exhausted. Runs indefinitely. |
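The schedule described above (initial backoffs once, then the steady-state delay forever) can be expressed as an unbounded iterator. This is an illustrative model only; `recovery_delays` is a hypothetical name, not a type in the service.

```python
from itertools import chain, islice, repeat


def recovery_delays(initial_backoff_ms, steady_state_ms):
    """Yield bind-retry delays: the initial schedule once, then steady state forever."""
    return chain(initial_backoff_ms, repeat(steady_state_ms))


first_eight = list(islice(recovery_delays([1000, 2000, 5000, 15000, 30000], 30000), 8))
print(first_eight)
# [1000, 2000, 5000, 15000, 30000, 30000, 30000, 30000]
```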
### `Mbproxy.Resilience.ReadCoalescing`
In-flight de-duplication of identical FC03 / FC04 reads. When multiple upstream clients issue the same `(unitId, fc, startAddress, qty)` tuple while a matching backend round-trip is already in flight, the late arrivals attach to the existing entry and the single response is fanned out to every party. See [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md).
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `Enabled` | bool | `true` | Master switch. Hot-reloadable; flipping to `false` lets already-coalesced entries drain naturally. |
| `MaxParties` | int | `32` | Per-entry cap on attached parties. Past this cap, the next identical request opens a fresh in-flight entry. |
Writes (FC06 / FC16) are never coalesced. FC03 and FC04 never share an entry. Different `unitId` bytes never share an entry.
Total FC03 + FC04 request accounting is preserved across the coalescing path: `coalescedHitCount + coalescedMissCount` equals the total reads observed by the multiplexer since startup. `coalescedHitCount` stays at `0` while `Enabled = false`, but every read still increments `coalescedMissCount`. See [`./StatusPage.md`](./StatusPage.md) for the full counter catalogue.
## `Mbproxy.Cache`
Service-wide safety knobs for the opt-in response cache. The cache is off by default per tag — this section only governs the limits when an operator opts a tag in via `CacheTtlMs` or `DefaultCacheTtlMs`. Source: `CacheOptions` in `MbproxyOptions.cs`.
| Field | Type | Default | Notes |
|-------|------|---------|-------|
| `AllowLongTtl` | bool | `false` | Gate for any `CacheTtlMs > 60_000`. When `false`, `ReloadValidator` rejects any tag or PLC default that exceeds 60 s. Set to `true` to opt in explicitly. |
| `MaxEntriesPerPlc` | int | `1000` | LRU cap on the number of entries per PLC. When full, the next insert evicts the least-recently-used entry. Must be `>= 0`. `0` is accepted but means "evict every insert immediately" — effectively the cache is disabled even for tags with non-zero TTL. |
| `EvictionIntervalMs` | int | `5000` | Background eviction loop tick in milliseconds. Each tick scans the per-PLC caches and removes entries past their `ExpiresAtUtc`. Must be `>= 0`; values below 100 ms are clamped at 100 ms internally to avoid pathologically tight loops. |
On hot reload, `AllowLongTtl` is enforced by the next reload validation. `MaxEntriesPerPlc` applies to subsequent inserts (existing entries are not pruned). `EvictionIntervalMs` is read by each fresh eviction loop iteration.
Any tag-list change for a given PLC drops that PLC's entire cache on reload — per-tag flush granularity is intentionally not implemented. New entries re-populate on demand under the new TTL. Process restart wipes every cache; there is no persistence and no last-known-good snapshot.
See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) for the cache contract (lookup order, write-invalidation by address-range overlap, post-rewriter byte storage).
## Per-Tag `CacheTtlMs`
Per-tag opt-in to the cache. The same field appears on every `BcdTagOptions` entry — both `Mbproxy.BcdTags.Global[]` and `Mbproxy.Plcs[i].BcdTags.Add[]`.
| Value | Meaning |
|-------|---------|
| `null` (omitted) | Unset. Falls back to `Plcs[i].DefaultCacheTtlMs`. |
| `0` | Caching explicitly disabled for this tag, even if the PLC default is non-zero. |
| `1..60000` | Cache enabled with this TTL in milliseconds. |
| `> 60000` | Rejected at reload unless `Cache.AllowLongTtl = true`. |
TTL resolution order for any single tag: **explicit per-tag value → per-PLC `DefaultCacheTtlMs` → 0 (off)**.
For multi-tag read ranges, the effective TTL is `min(TTLs)` across all configured tags inside the read range. If any tag in the range has `CacheTtlMs = 0`, the entire read is uncached.
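The two rules above (per-tag fallback, then `min` across the read range) can be sketched as follows. The helper names and the dict-based tag shape are hypothetical, and treating a read that touches no configured tag as uncached is an assumption.

```python
def tag_ttl_ms(tag, plc_default_ttl_ms=0):
    """Explicit per-tag value, else the PLC default, else 0 (off)."""
    explicit = tag.get("CacheTtlMs")  # None models an omitted or null field
    return plc_default_ttl_ms if explicit is None else explicit


def read_ttl_ms(tags_in_range, plc_default_ttl_ms=0):
    """Effective TTL for one read: the minimum across configured tags in range."""
    ttls = [tag_ttl_ms(tag, plc_default_ttl_ms) for tag in tags_in_range]
    # Assumption: a read that touches no configured tag is uncached.
    return min(ttls, default=0)


counter = {"Address": 1024, "Width": 16}
setpoint = {"Address": 1088, "Width": 16, "CacheTtlMs": 1000}
pinned_off = {"Address": 1100, "Width": 16, "CacheTtlMs": 0}

print(read_ttl_ms([setpoint], plc_default_ttl_ms=500))           # 1000
print(read_ttl_ms([counter, setpoint], plc_default_ttl_ms=500))  # 500
print(read_ttl_ms([setpoint, pinned_off], plc_default_ttl_ms=500))  # 0
```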
The cache itself is described in detail in [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md). The properties most relevant to operators setting TTLs:
- **Lookup order is cache → coalesce → backend.** A cache hit short-circuits the read coalescer entirely.
- **Writes invalidate by address-range overlap.** A successful FC06 / FC16 response invalidates every cached FC03 / FC04 entry whose read range overlaps the write range — not just exact-key matches. Exception responses do not invalidate (the write did not take effect on the PLC).
- **Cache stores post-rewriter bytes.** Hits never re-invoke the BCD rewriter. Tag-list reloads flush the affected PLC's whole cache so a rewriter-relevant change cannot serve stale post-rewriter bytes from before the change.
- **Different `unitId` bytes never invalidate each other.** Invalidation is scoped to `(unitId, FC ∈ {3, 4})`.
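Write invalidation by range overlap, as described in the list above, reduces to a half-open interval test. The sketch below is a simplified model, not the cache implementation: the `(unit_id, fc, start, qty)` key shape and both helper names are assumptions made for illustration.

```python
def ranges_overlap(start_a, qty_a, start_b, qty_b):
    """True if [start_a, start_a+qty_a) and [start_b, start_b+qty_b) share a register."""
    return start_a < start_b + qty_b and start_b < start_a + qty_a


def invalidate_on_write(cache, unit_id, write_start, write_qty):
    """Drop cached reads for this unit whose range overlaps the written registers."""
    stale = [
        key for key in cache
        if key[0] == unit_id and ranges_overlap(key[2], key[3], write_start, write_qty)
    ]
    for key in stale:
        del cache[key]
    return stale


# Hypothetical cache keyed by (unit_id, fc, start, qty).
cache = {
    (1, 3, 1024, 10): b"cached-a",
    (1, 3, 1100, 4): b"cached-b",
    (2, 3, 1024, 10): b"cached-c",
}
dropped = invalidate_on_write(cache, unit_id=1, write_start=1030, write_qty=1)
print(dropped)        # [(1, 3, 1024, 10)]
print(sorted(cache))  # [(1, 3, 1100, 4), (2, 3, 1024, 10)]
```

A single-register write to `1030` evicts the cached ten-register read starting at `1024`, while the non-overlapping read and the entry for the other `unitId` survive.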
## Validation Rules
`ReloadValidator.Validate` runs on every config load (startup and hot reload) and rejects the entire snapshot if any rule fails. On rejection at startup, the service exits non-zero. On rejection at runtime, the current in-memory config stays in effect and `mbproxy.config.reload.rejected` is logged at `Error`.
Rules (in order):
1. **PLC names**: every `Plcs[i].Name` is non-empty and unique (ordinal comparison).
2. **ListenPort**: every `Plcs[i].ListenPort` is in `[1, 65535]` and unique across the array.
3. **AdminPort**: in `[1, 65535]` and does not collide with any `ListenPort`.
4. **BCD tag map** per PLC, delegated to `BcdTagMapBuilder.Build`:
   - duplicate addresses within a single PLC's resolved tag list
   - 32-bit entries whose high register (`Address + 1`) overlaps a separate 16-bit entry at that address
5. **Cache TTL bounds**:
   - any `CacheTtlMs` or `DefaultCacheTtlMs` less than 0 is rejected
   - any `CacheTtlMs` or `DefaultCacheTtlMs` greater than `60_000` is rejected unless `Cache.AllowLongTtl = true`
6. **Cache size knobs**: `Cache.MaxEntriesPerPlc >= 0`, `Cache.EvictionIntervalMs >= 0`.
7. **Width**: every `BcdTagOptions.Width` is `16` or `32` (enforced by `MbproxyOptionsValidator` at schema time).
Sample rejection messages (logged at `Error` with the structured property `errors` carrying the full list):
```text
Plc 'Line1-Mixer': Duplicate ListenPort 5020 (already used by 'Line1-Conveyor').
AdminPort 5020 collides with ListenPort of PLC 'Line1-Mixer'.
Plc 'Line1-Mixer': BCD tag map error (DuplicateAddress): address 1024 appears twice.
BcdTags.Global Address 1024: CacheTtlMs=120000 exceeds 60_000 ms without Cache.AllowLongTtl=true.
Plcs[2] (Line2-Press): DefaultCacheTtlMs must be >= 0.
```
Warning case (not a rejection):
- `Plcs[i].BcdTags.Remove[]` entries that do not match any global tag address are logged as warnings — probably stale config, but the reload proceeds.
Two additional rejection categories handled earlier in the pipeline:
- **Type-mismatched / malformed JSON.** The .NET configuration binder rejects values whose type does not match the bound property (for example, a string in `BackendConnectTimeoutMs`). At startup this aborts the host; on hot reload the binder retains the previous snapshot and the reload never reaches `ReloadValidator`.
- **Width invalid.** `MbproxyOptionsValidator` rejects any `BcdTagOptions.Width` that is not `16` or `32`. This runs as part of options validation before `ReloadValidator` and surfaces the same way as schema errors.
See [`../Features/HotReload.md`](../Features/HotReload.md) for the full reload-acceptance flow, including the log event names emitted on acceptance (`mbproxy.config.reload.applied`) and rejection (`mbproxy.config.reload.rejected`).
## Two Concrete Examples
The minimal and production examples below are both complete `appsettings.json` snippets — paste either one and the service will start without further edits beyond the addresses and ports.
### Minimal
One PLC, no BCD tags, no cache. The proxy is pure pass-through.
```jsonc
{
  "Mbproxy": {
    "BcdTags": { "Global": [] },
    "Plcs": [
      {
        "Name": "Line1-Mixer",
        "ListenPort": 5020,
        "Host": "10.0.1.1"
      }
    ],
    "AdminPort": 8080
  }
}
```
Everything else picks up defaults: `Port = 502`, `Connection.BackendConnectTimeoutMs = 3000`, `Connection.BackendRequestTimeoutMs = 3000`, `Connection.GracefulShutdownTimeoutMs = 10000`, `Resilience.BackendConnect.MaxAttempts = 3`, `Resilience.ReadCoalescing.Enabled = true`, `Cache.AllowLongTtl = false`, `Cache.MaxEntriesPerPlc = 1000`, `Cache.EvictionIntervalMs = 5000`, and so on.
Behaviour in this snapshot: every byte passes through unchanged in both directions, FC03 / FC04 reads are still subject to in-flight coalescing (the feature is on by default), and no responses are cached.
### Production
Three PLCs, a global BCD tag list, one PLC with overrides, cache enabled on hot reads.
```jsonc
{
  "Mbproxy": {
    "BcdTags": {
      "Global": [
        { "Address": 1024, "Width": 16 },  // V2000: 16-bit BCD counter
        { "Address": 1056, "Width": 32 },  // V2040: 32-bit BCD total
        { "Address": 1088, "Width": 16, "CacheTtlMs": 1000 }  // V2100: setpoint, 1 s cache
      ]
    },
    "Plcs": [
      {
        "Name": "Line1-Mixer",
        "ListenPort": 5020,
        "Host": "10.0.1.1",
        "Port": 502,
        "DefaultCacheTtlMs": 0,
        "BcdTags": {
          "Add": [ { "Address": 1200, "Width": 32 } ],
          "Remove": [ 1056 ]
        }
      },
      {
        "Name": "Line1-Conveyor",
        "ListenPort": 5021,
        "Host": "10.0.1.2"
      },
      {
        "Name": "Line2-Press",
        "ListenPort": 5022,
        "Host": "10.0.2.1",
        "DefaultCacheTtlMs": 500
      }
    ],
    "AdminPort": 8080,
    "Connection": {
      "BackendConnectTimeoutMs": 3000,
      "BackendRequestTimeoutMs": 3000,
      "GracefulShutdownTimeoutMs": 10000
    },
    "Resilience": {
      "BackendConnect": { "MaxAttempts": 3, "BackoffMs": [ 100, 500, 2000 ] },
      "ListenerRecovery": { "InitialBackoffMs": [ 1000, 2000, 5000, 15000, 30000 ], "SteadyStateMs": 30000 },
      "ReadCoalescing": { "Enabled": true, "MaxParties": 32 }
    },
    "Cache": {
      "AllowLongTtl": false,
      "MaxEntriesPerPlc": 1000,
      "EvictionIntervalMs": 5000
    }
  }
}
```
In this snapshot, `Line1-Mixer` adds a 32-bit tag at `1200` and removes the global 32-bit tag at `1056`. `Line2-Press` opts every tag whose `CacheTtlMs` is `null` into a 500 ms cache via its `DefaultCacheTtlMs`. The setpoint at `1088` already has an explicit per-tag TTL, and that value wins.
The effective tag map per PLC after resolution:
| PLC | Effective tag list |
|-----|--------------------|
| `Line1-Mixer` | `1024` (16-bit), `1088` (16-bit, `CacheTtlMs = 1000`), `1200` (32-bit). `1056` is removed. |
| `Line1-Conveyor` | `1024` (16-bit), `1056` (32-bit), `1088` (16-bit, `CacheTtlMs = 1000`). |
| `Line2-Press` | `1024` (16-bit, effective `CacheTtlMs = 500` via PLC default), `1056` (32-bit, effective `CacheTtlMs = 500`), `1088` (16-bit, effective `CacheTtlMs = 1000` from explicit per-tag value). |
On `Line2-Press`, an FC03 / FC04 read whose register range covers only tag `1088` resolves to the per-tag 1 s TTL. A read that spans tags with different TTLs takes `min(TTLs)` across the range, so a read covering both `1024` and `1088` is cached for 500 ms; a read that includes a tag with `CacheTtlMs = 0` is uncached even if every other tag in the range is opted in.
## Hot-Reload Propagation Summary
A reduced view of [`../Features/HotReload.md`](../Features/HotReload.md), restricted to the keys documented here. Every accepted reload emits `mbproxy.config.reload.applied` at `Information` with a summary of which PLCs were added or removed and the size of the tag-list delta.
| Change | Propagation |
|--------|-------------|
| `BcdTags.Global` add / remove / width | Rewriter dereferences `IOptionsMonitor` per PDU. Next PDU sees the new map; in-flight PDUs are not retroactively touched. |
| `Plcs[i].BcdTags.{Add,Remove}` | Same per-PDU resolution as above, scoped to the affected PLC. |
| New `Plcs[i]` entry | Listener supervisor binds the new port under `ListenerRecovery`. |
| `Plcs[i]` removed | Supervisor stops the listener and closes all upstream connections for that PLC. |
| `Plcs[i].ListenPort` or `Host` changed | Equivalent to remove + add. |
| `Connection.Backend*TimeoutMs` | Next backend connect or request uses the new value. |
| `AdminPort` | Requires a service restart — the Kestrel admin host is built once at startup. |
| `Resilience.ReadCoalescing.Enabled` | Hot-reloadable; in-flight coalesced entries drain naturally. |
| `BcdTags.*.CacheTtlMs`, `Plcs[i].DefaultCacheTtlMs` | Tag-map reseat for the affected PLC drops that PLC's entire cache. |
| `Cache.AllowLongTtl` / `MaxEntriesPerPlc` / `EvictionIntervalMs` | Enforced on next reload validation / next insert / next eviction tick respectively. |
## Where Options Live in Code
| Section | File | Binding class |
|---------|------|---------------|
| Root | `src/Mbproxy/Options/MbproxyOptions.cs` | `MbproxyOptions` |
| `Plcs[]` | `src/Mbproxy/Options/PlcOptions.cs` | `PlcOptions` |
| `BcdTags.Global[]` entry shape | `src/Mbproxy/Options/BcdTagOptions.cs` | `BcdTagOptions` |
| `BcdTags.Global` / `Plcs[i].BcdTags` | `src/Mbproxy/Options/BcdTagListOptions.cs` | `BcdTagListOptions`, `PlcBcdOverrides` |
| `Connection` | `src/Mbproxy/Options/ConnectionOptions.cs` | `ConnectionOptions` |
| `Resilience` | `src/Mbproxy/Options/ResilienceOptions.cs` | `ResilienceOptions`, `RetryProfile`, `RecoveryProfile`, `ReadCoalescingOptions` |
| `Cache` | `src/Mbproxy/Options/MbproxyOptions.cs` | `CacheOptions` (declared alongside `MbproxyOptions` in the same file) |
| Schema validation | `src/Mbproxy/Options/MbproxyOptions.cs` | `MbproxyOptionsValidator` |
| Reload validation | `src/Mbproxy/Configuration/ReloadValidator.cs` | `ReloadValidator` |
| Tag-map resolution | `src/Mbproxy/Bcd/BcdTagMapBuilder.cs` | `BcdTagMapBuilder` |
| Reload reconciliation | `src/Mbproxy/Configuration/ConfigReconciler.cs` | `ConfigReconciler`, `ReloadPlan` |
All option classes are registered through `services.Configure<T>(...)` against the `Mbproxy:*` section in `Program.cs`. `IOptionsMonitor<T>` is the runtime read path; direct `IOptions<T>` injection is not used because it does not propagate reloads.
## Related Documentation
- [`../Features/HotReload.md`](../Features/HotReload.md) — reload acceptance flow and per-key propagation semantics
- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — tag-list resolution, wire encoding, multi-tag overlap rules
- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — cache contract, lookup order, write invalidation
- [`./StatusPage.md`](./StatusPage.md) — schema served by `AdminPort`
- [`./Troubleshooting.md`](./Troubleshooting.md) — Serilog block and common config rejection diagnostics
- [`../../README.md`](../../README.md) — install and operational entry point
+334
@@ -0,0 +1,334 @@
# Status Page
The status page is the operator-facing view of the running service: an auto-refreshing HTML dashboard at `GET /` and a JSON twin at `GET /status.json` that monitoring scrapers consume. This document describes the endpoint surface, every wire-level field, and how counters map back to architecture decisions.
## Endpoint Surface
The admin endpoint is owned by `AdminEndpointHost` (see `src/Mbproxy/Admin/AdminEndpointHost.cs`). It exposes exactly two routes:
- `GET /` — a single self-contained HTML document with a `<meta http-equiv="refresh" content="5">` tag. The page refreshes every five seconds by reload, not by JavaScript polling. There is no JS bundle, no external CSS, no remote fonts, and no favicon fetch.
- `GET /status.json` — the same in-memory snapshot serialized as JSON via the source-generated `StatusJsonContext` (camelCase property names).
The endpoint is **read-only**. There are no admin actions exposed — no kick-client, no force-reload, no listener restart, no log download. Reload happens automatically via `IOptionsMonitor`; listener recovery is owned by the supervisor. Authentication lives at the network layer: the service binds to `IPAddress.Any` on the admin port and assumes the deployment runs in a trusted internal segment behind a firewall.
Both routes call `StatusSnapshotBuilder.Build()` for every request. The builder reads atomic counters directly from the supervisor map and per-PLC `ProxyCounters`; it holds no locks and performs no I/O.
## Port and Configuration
The listen port is read from `Mbproxy.AdminPort` and defaults to `8080`. Configuration semantics for this key live in [`./Configuration.md`](./Configuration.md).
If Kestrel cannot bind the configured port at startup (port already in use, missing permissions on a reserved range, etc.) the host logs `mbproxy.admin.bind.failed` at `Error` level with the underlying reason. The host then sets `_app = null` and returns — the rest of the service keeps running. The Modbus listener supervisors are completely independent of the admin endpoint, so a bind failure here is non-fatal for proxying. See [`../Reference/LogEvents.md`](../Reference/LogEvents.md) for the event-id catalogue.
If `Mbproxy.AdminPort` changes via hot-reload, the currently-running Kestrel app is stopped (2 s deadline) and a new one is started on the new port. Other config changes do not touch the admin endpoint.
## Service-Wide Fields
Top-level fields come from `ServiceFields` and `ListenersAggregate` in `src/Mbproxy/Admin/StatusDto.cs`.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `service.uptimeSeconds` | `long` | `ServiceFields.UptimeSeconds` | Seconds since process start, computed as `now - ServiceCounters.StartedAtUtc` at snapshot time. |
| `service.version` | `string` | `ServiceFields.Version` via `AssemblyVersionAccessor` | `AssemblyInformationalVersion` of the running assembly. Useful for confirming a deployment took effect. |
| `service.configLastReloadUtc` | `DateTimeOffset?` | `ServiceCounters.LastReloadUtc` | Wall-clock time of the most recent **accepted** hot-reload. `null` if no reload has occurred since process start. See [`../Features/HotReload.md`](../Features/HotReload.md). |
| `service.configReloadCount` | `int` | `ServiceCounters.ReloadAppliedCount` | Number of `appsettings.json` reloads that validated and applied since process start. |
| `service.configReloadRejectedCount` | `int` | `ServiceCounters.ReloadRejectedCount` | Number of reload attempts rejected by validation. A non-zero value here paired with a stale `configLastReloadUtc` indicates the operator's last edit was malformed and the service is still running the previous config. |
| `listeners.bound` | `int` | `boundCount` accumulated while iterating `opts.Plcs` | Count of PLC entries whose supervisor currently reports `SupervisorState.Bound`. |
| `listeners.configured` | `int` | `opts.Plcs.Count` | Total number of PLC entries in the active configuration. |
Operator triggers:
- `listeners.bound < listeners.configured` for more than one refresh cycle indicates one or more listeners are stuck recovering. Drill into the per-PLC `listener.state` and `listener.lastBindError` fields below.
- `configReloadRejectedCount` rising means edits are reaching the watcher but failing validation — check the live log for `mbproxy.config.reload.rejected`.
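The two triggers above can be checked mechanically against a parsed `/status.json` payload. The following is a minimal Python sketch, not part of mbproxy: the function name `service_alerts` and the convention of passing in the previous poll's rejected count are assumptions made for illustration.

```python
def service_alerts(snapshot, previous_rejected=0):
    """Return human-readable alerts derived from the service-wide fields.

    `snapshot` is the parsed /status.json document. `previous_rejected` is the
    `configReloadRejectedCount` observed on the previous poll (caller-supplied).
    """
    alerts = []

    listeners = snapshot["listeners"]
    if listeners["bound"] < listeners["configured"]:
        # Name the PLCs whose listener is not currently bound.
        stuck = [p["name"] for p in snapshot["plcs"]
                 if p["listener"]["state"] != "bound"]
        alerts.append("listeners not bound: " + ", ".join(stuck))

    rejected = snapshot["service"]["configReloadRejectedCount"]
    if rejected > previous_rejected:
        alerts.append(
            f"{rejected - previous_rejected} reload(s) rejected since last poll")

    return alerts
```

A caller that wants to honour the "more than one refresh cycle" rule would only raise the listener alert when it appears on two consecutive polls.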
## Per-PLC Fields
Each entry in `plcs[]` is a `PlcStatus` (see `src/Mbproxy/Admin/StatusDto.cs`). The builder iterates `opts.Plcs` in configured order, looks up the matching supervisor in `ProxyWorker.Supervisors`, and projects the supervisor's `CurrentCounters.Snapshot()` into wire fields.
### Identity
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `name` | `string` | `PlcOptions.Name` | Stable identifier from `appsettings.json`. Used as the dictionary key for supervisor lookup. |
| `host` | `string` | `PlcOptions.Host` | Backend PLC host (IP or DNS name) the proxy connects out to. |
| `listenPort` | `int` | `PlcOptions.ListenPort` | Local TCP port the proxy binds for upstream clients connecting *to* the proxy. |
### Listener state
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `listener.state` | `string` | `SupervisorSnapshot.State` mapped to `"bound"` / `"recovering"` / `"stopped"` | Current supervisor state. `bound` = TCP listener is accepting connections; `recovering` = Polly retry loop is trying to re-bind after a fault; `stopped` = no supervisor entry (typically a PLC that was just added and not yet started). |
| `listener.lastBindError` | `string?` | `SupervisorSnapshot.LastBindError` | Message from the last bind exception. Populated whenever `state == "recovering"`. Common values: `"Address already in use"`, `"Permission denied"`. |
| `listener.recoveryAttempts` | `int` | `SupervisorSnapshot.RecoveryAttempts` | Number of bind retries since the supervisor entered recovery. Resets on a successful bind. A monotonically rising value indicates the underlying problem is persistent. |
### Client tracking
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `clients.connected` | `int` | `clientSnapshots.Count` | Number of currently-connected upstream clients. Capped by the H2-ECOM100 four-client ceiling; values at 4 imply additional upstream connect attempts will be refused by the PLC. |
| `clients.remoteEndpoints[].remote` | `string` | `UpstreamPipe.RemoteEp` | Upstream TCP endpoint as `ip:port`. |
| `clients.remoteEndpoints[].connectedAtUtc` | `DateTimeOffset` | `UpstreamPipe.ConnectedAtUtc` | Wall-clock time the upstream socket was accepted. Useful for spotting zombie sockets that survived a network outage. |
| `clients.remoteEndpoints[].pdusForwarded` | `long` | `UpstreamPipe.PdusForwardedCount` | PDUs forwarded on this specific upstream pipe since it connected. Lets you see which client is responsible for what fraction of fleet traffic. |
### PDU traffic
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `pdus.forwarded` | `long` | `CounterSnapshot.PdusForwarded` | Total PDUs (requests + responses) that traversed the proxy for this PLC since start. Increments once per PDU handed to the rewriter. |
| `pdus.byFc.fc03` | `long` | `CounterSnapshot.Fc03` | Count of FC03 (read holding registers) requests seen. |
| `pdus.byFc.fc04` | `long` | `CounterSnapshot.Fc04` | Count of FC04 (read input registers) requests seen. |
| `pdus.byFc.fc06` | `long` | `CounterSnapshot.Fc06` | Count of FC06 (write single register) requests seen. |
| `pdus.byFc.fc16` | `long` | `CounterSnapshot.Fc16` | Count of FC16 (write multiple registers) requests seen. |
| `pdus.byFc.other` | `long` | `CounterSnapshot.FcOther` | All other function codes (FC01/02/05/15, diagnostic codes, etc.) seen. The proxy forwards these untouched. |
| `pdus.rewrittenSlots` | `long` | `CounterSnapshot.RewrittenSlots` | Number of register slots the BCD rewriter touched, counting reads and writes. Indicates how much of the traffic actually hits BCD-configured addresses. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md). |
| `pdus.partialBcdWarnings` | `long` | `CounterSnapshot.PartialBcdWarnings` | Count of requests whose `[start, start + qty)` range partially overlapped a 32-bit BCD tag without fully covering its CDAB word pair. A rising value here is an operator signal: an upstream client is requesting partial-overlap reads, which the proxy cannot rewrite safely. Review tag-list addresses or fix the client's request shape. |
### Backend health
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.connectsSuccess` | `long` | `CounterSnapshot.ConnectsSuccess` | Successful backend TCP connects since start. Increments once per accepted upstream client (the proxy opens one backend socket per upstream client). |
| `backend.connectsFailed` | `long` | `CounterSnapshot.ConnectsFailed` | Failed backend TCP connects after the Polly retry budget is exhausted (3 attempts at 100/500/2000 ms). A rising counter means the backend host is unreachable or the PLC is at its connection cap. |
| `backend.exceptionsByCode.code01` | `long` | `CounterSnapshot.BackendException01` | Count of Modbus exception responses with code 01 (Illegal Function) received from the PLC. Typically indicates a client is sending function codes the PLC does not support. |
| `backend.exceptionsByCode.code02` | `long` | `CounterSnapshot.BackendException02` | Code 02 (Illegal Data Address) — the requested register range is out of the PLC's V-memory map. |
| `backend.exceptionsByCode.code03` | `long` | `CounterSnapshot.BackendException03` | Code 03 (Illegal Data Value) — quantity exceeds the PLC's per-FC cap (FC03/04 = 128 registers, FC16 = 100). |
| `backend.exceptionsByCode.code04` | `long` | `CounterSnapshot.BackendException04` | Code 04 (Server Device Failure) — internal PLC fault, often correlated with the PLC entering STOP mode. |
| `backend.lastRoundTripMs` | `double` | `CounterSnapshot.LastRoundTripMs` | Exponentially-weighted moving average of recent successful request → response round-trip times in milliseconds. Tracks PLC responsiveness; sustained values above the historical baseline indicate backend latency degradation. |
### Multiplexer state
These five fields describe the per-PLC backend multiplexer. See [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md) for the design rationale and how transaction-id (TxId) reuse and queueing work.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.inFlight` | `long` | `CounterSnapshot.InFlightCount` | Number of MBAP transactions currently in flight on the backend socket (request sent, response pending). |
| `backend.maxInFlight` | `long` | `CounterSnapshot.MaxInFlight` | High-water mark of `inFlight` since start. Used to size the queue and to verify the multiplexer is in fact pipelining requests. |
| `backend.txIdWraps` | `long` | `CounterSnapshot.TxIdWraps` | Times the 16-bit MBAP transaction-id allocator has wrapped through `0xFFFF`. A rising rate quantifies sustained request volume. |
| `backend.disconnectCascades` | `long` | `CounterSnapshot.BackendDisconnectCascades` | Times a backend disconnect cascaded into closing all upstream pipes that were waiting on in-flight TxIds. Each cascade aborts every queued request bound for that PLC. |
| `backend.queueDepth` | `long` | `CounterSnapshot.BackendQueueDepth` | Current count of requests queued behind the multiplexer's TxId allocator and write semaphore. A sustained non-zero queue means the multiplexer is the bottleneck (backend slower than upstream demand). |
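
As a rough illustration of what `backend.txIdWraps` counts, here is a toy 16-bit allocator that increments a wrap counter each time it rolls past `0xFFFF`. The class is hypothetical and deliberately ignores the in-flight bookkeeping a real multiplexer needs; the actual design is described in `ConnectionModel.md`.

```python
class TxIdAllocator:
    """Toy 16-bit transaction-id allocator that counts wrap-arounds."""

    def __init__(self):
        self._next = 0
        self.wraps = 0

    def allocate(self):
        tx_id = self._next
        # The MBAP transaction id is 16 bits wide, so mask after incrementing.
        self._next = (self._next + 1) & 0xFFFF
        if self._next == 0:
            # The id just handed out was 0xFFFF; the next one starts over at 0.
            self.wraps += 1
        return tx_id
```

Under this model each wrap corresponds to 65 536 allocated ids, which is why the wrap rate tracks sustained request volume.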
### Coalescing counters
These fields describe duplicate-read coalescing on FC03/FC04. See [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md) for the matching criteria and lifecycle.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.coalescedHitCount` | `long` | `CounterSnapshot.CoalescedHitCount` | Reads that attached to an already-in-flight identical read instead of issuing a new backend request. |
| `backend.coalescedMissCount` | `long` | `CounterSnapshot.CoalescedMissCount` | Reads that did not find a matching in-flight request and issued their own. The dashboard-side ratio is `hit / (hit + miss)`; the wire format intentionally does **not** carry the derived ratio (consumers compute it). |
| `backend.coalescedResponseToDeadUpstream` | `long` | `CounterSnapshot.CoalescedResponseToDeadUpstream` | Coalesced responses that arrived after their attached upstream pipe had closed. Normal in bursty traffic; sustained growth indicates upstream clients are aborting too quickly. |
### Cache counters
These fields describe the short-TTL response cache for FC03/FC04. See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.cacheHitCount` | `long` | `CounterSnapshot.CacheHitCount` | Reads served from the cache without touching the backend at all. |
| `backend.cacheMissCount` | `long` | `CounterSnapshot.CacheMissCount` | Cache-eligible reads that fell through to the backend. The derived `cacheHitRatio` is `hit / (hit + miss)`; like coalescing, it is **not** carried on the wire. |
| `backend.cacheInvalidations` | `long` | `CounterSnapshot.CacheInvalidations` | Times a write (FC06/FC16) invalidated overlapping cache entries on this PLC. A high invalidation rate relative to writes means write coverage is broad and the cache is doing less work. |
### Cache memory-watch
These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is per-PLC; the dashboard aggregates these across the fleet.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. |
| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |
### Bytes
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `bytes.upstreamIn` | `long` | `CounterSnapshot.BytesUpstreamIn` | Total bytes read from upstream client sockets bound to this PLC since start. |
| `bytes.upstreamOut` | `long` | `CounterSnapshot.BytesUpstreamOut` | Total bytes written back to upstream client sockets bound to this PLC since start. |
## Counter Atomicity
All counters are `long` fields updated through `System.Threading.Interlocked`. Each read in `StatusSnapshotBuilder.Build()` is atomic per field; no locks are held across the snapshot build, and the build itself does no I/O.
The practical consequence: a single `/status.json` request returns a coherent value for any **one** counter, but the assembled response is **not** a globally consistent snapshot — different per-PLC counters may straddle increments by microseconds. For example, `pdus.forwarded` for PLC A and `pdus.forwarded` for PLC B are not guaranteed to reflect the same instant. This is acceptable for dashboards and rate calculations; do not use these counters for fine-grained accounting.
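For the rate calculations mentioned above, a consumer typically differences one counter across two polls. The sketch below assumes the caller keeps the previous parsed snapshot; `counter_rate` and its parameters are illustrative names, and using `service.uptimeSeconds` as the clock is a design choice of this example rather than something mbproxy prescribes.

```python
def counter_rate(prev, curr, plc_name, group, field):
    """Per-second rate of one per-PLC counter between two /status.json snapshots.

    Returns None when no time has elapsed or when the service appears to have
    restarted between polls (uptime or the counter went backwards).
    """
    elapsed = curr["service"]["uptimeSeconds"] - prev["service"]["uptimeSeconds"]
    if elapsed <= 0:
        return None

    def value(snap):
        plc = next(p for p in snap["plcs"] if p["name"] == plc_name)
        return plc[group][field]

    delta = value(curr) - value(prev)
    if delta < 0:
        return None  # counter reset, e.g. after a restart
    return delta / elapsed
```

Because each counter is read independently, rates for different PLCs computed from the same pair of snapshots may be offset by a few microseconds of traffic, which is negligible at dashboard resolution.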
## Example JSON Response
A representative two-PLC deployment, ~2 hours into a run:
```json
{
  "service": {
    "uptimeSeconds": 7234,
    "version": "1.0.0",
    "configLastReloadUtc": "2026-05-13T14:02:11+00:00",
    "configReloadCount": 2,
    "configReloadRejectedCount": 0
  },
  "listeners": {
    "bound": 2,
    "configured": 2
  },
  "plcs": [
    {
      "name": "line1-press",
      "host": "10.20.30.41",
      "listenPort": 5021,
      "listener": {
        "state": "bound",
        "lastBindError": null,
        "recoveryAttempts": 0
      },
      "clients": {
        "connected": 2,
        "remoteEndpoints": [
          {
            "remote": "10.20.40.10:51223",
            "connectedAtUtc": "2026-05-13T12:01:55+00:00",
            "pdusForwarded": 184213
          },
          {
            "remote": "10.20.40.11:53901",
            "connectedAtUtc": "2026-05-13T13:30:02+00:00",
            "pdusForwarded": 41008
          }
        ]
      },
      "pdus": {
        "forwarded": 225221,
        "byFc": {
          "fc03": 218904,
          "fc04": 0,
          "fc06": 12,
          "fc16": 6203,
          "other": 102
        },
        "rewrittenSlots": 1318622,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 2,
        "connectsFailed": 0,
        "exceptionsByCode": {
          "code01": 0,
          "code02": 14,
          "code03": 0,
          "code04": 0
        },
        "lastRoundTripMs": 12.4,
        "inFlight": 1,
        "maxInFlight": 4,
        "txIdWraps": 3,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 41892,
        "coalescedMissCount": 177012,
        "coalescedResponseToDeadUpstream": 7,
        "cacheHitCount": 88321,
        "cacheMissCount": 88691,
        "cacheInvalidations": 6203,
        "cacheEntryCount": 47,
        "cacheBytes": 18512
      },
      "bytes": {
        "upstreamIn": 4108290,
        "upstreamOut": 12993021
      }
    },
    {
      "name": "line2-oven",
      "host": "10.20.30.42",
      "listenPort": 5022,
      "listener": {
        "state": "recovering",
        "lastBindError": "Address already in use",
        "recoveryAttempts": 12
      },
      "clients": {
        "connected": 0,
        "remoteEndpoints": []
      },
      "pdus": {
        "forwarded": 0,
        "byFc": { "fc03": 0, "fc04": 0, "fc06": 0, "fc16": 0, "other": 0 },
        "rewrittenSlots": 0,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 0,
        "connectsFailed": 0,
        "exceptionsByCode": { "code01": 0, "code02": 0, "code03": 0, "code04": 0 },
        "lastRoundTripMs": 0.0,
        "inFlight": 0,
        "maxInFlight": 0,
        "txIdWraps": 0,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 0,
        "coalescedMissCount": 0,
        "coalescedResponseToDeadUpstream": 0,
        "cacheHitCount": 0,
        "cacheMissCount": 0,
        "cacheInvalidations": 0,
        "cacheEntryCount": 0,
        "cacheBytes": 0
      },
      "bytes": { "upstreamIn": 0, "upstreamOut": 0 }
    }
  ]
}
```
## HTML Page Layout
The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy/Admin/StatusHtmlRenderer.cs`. The page is one document, inline CSS in a `<style>` block, no external resources of any kind — operators can serve it behind a corporate firewall without whitelisting a CDN.
Structure:
1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable.
2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span.
The coalescing and cache cells each render as `<pct>% (<hits>)`. When a feature has not been exercised (`hit + miss == 0`), its cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
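The two formatting rules described above (the `Nh MMm SSs` uptime and the `<pct>% (<hits>)` ratio cell) can be reproduced by a consumer that renders its own view from `/status.json`. This is a sketch under stated assumptions: the function names are invented, hours are not folded into days, and whole-percent rounding is a guess, since the renderer's exact rounding is not specified here.

```python
def format_uptime(total_seconds):
    """Format seconds as 'Nh MMm SSs' (hours are not folded into days)."""
    hours, rem = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(rem, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"


def ratio_cell(hits, misses):
    """Render '<pct>% (<hits>)', or an em-dash when there has been no traffic."""
    total = hits + misses
    if total == 0:
        return "\u2014"  # em-dash placeholder keeps the column narrow
    pct = round(100 * hits / total)  # whole-percent rounding is an assumption
    return f"{pct}% ({hits})"
```

For the example response above, `format_uptime(7234)` gives `2h 00m 34s`.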
The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view.
## How to Scrape It
The JSON twin is plain HTTP. Any monitoring system that can curl an endpoint can scrape it.
PowerShell, pulling the cache hit ratio for the first PLC into a variable:
```powershell
$snap = Invoke-WebRequest -Uri "http://mbproxy-host:8080/status.json" -UseBasicParsing |
    Select-Object -ExpandProperty Content |
    ConvertFrom-Json
$plc = $snap.plcs[0]
$hits = $plc.backend.cacheHitCount
$total = $hits + $plc.backend.cacheMissCount
$ratio = if ($total -gt 0) { [math]::Round(100.0 * $hits / $total, 1) } else { 0.0 }
"PLC $($plc.name): cache hit ratio = $ratio% over $total reads"
```
Bash with `curl` and `jq`, fanning out across the fleet:
```bash
curl -s http://mbproxy-host:8080/status.json |
    jq -r '.plcs[] | "\(.name)\t\(.listener.state)\t\(.backend.lastRoundTripMs)"'
```
Prometheus-style scrapers should poll `/status.json` directly and translate fields into their own metric names; the service does not expose Prometheus exposition format.
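One way a scraper might do that translation is to flatten each PLC's numeric fields into name/label/value triples. The sketch below uses only the Python standard library; the `mbproxy_` prefix and the underscore-joined metric names are illustrative choices, not names the service defines.

```python
import json
from urllib.request import urlopen


def fetch_status(url, timeout=5.0):
    """Fetch and parse /status.json over plain HTTP."""
    with urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def flatten_plc_metrics(snapshot, prefix="mbproxy"):
    """Yield (metric_name, plc_name, value) for every numeric per-PLC counter."""

    def walk(node, path):
        for key, val in node.items():
            if isinstance(val, dict):
                yield from walk(val, path + [key])
            elif isinstance(val, (int, float)) and not isinstance(val, bool):
                yield "_".join([prefix] + path + [key]), val

    for plc in snapshot["plcs"]:
        # Only the counter groups are flattened; identity and listener
        # fields are strings and belong in labels instead.
        for group in ("pdus", "backend", "bytes"):
            for name, val in walk(plc[group], [group]):
                yield name, plc["name"], val
```

Typical use would be `for name, plc, value in flatten_plc_metrics(fetch_status("http://mbproxy-host:8080/status.json")): ...`, with the scraper mapping each triple onto its own metric naming scheme.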
## Where the KPIs Live
This document covers the **endpoint surface**: what is on the wire and how each field is computed. The **dashboard composition** — which counters roll up into which Grafana panels, alerting thresholds, fleet-aggregate definitions — lives in [`../kpi.md`](../kpi.md). Keep the two documents disjoint: when a new counter is added, list it here; when a new panel or rate calculation is added, add it to `kpi.md`.
## Related Documentation
- [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md) — multiplexer counter meanings (`inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades`).
- [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md) — coalescing counter meanings and matching criteria.
- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — cache counter meanings, TTL, invalidation rules.
- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — what increments `rewrittenSlots` and `partialBcdWarnings`.
- [`../Features/HotReload.md`](../Features/HotReload.md) — what increments `configReloadCount` vs. `configReloadRejectedCount`.
- [`./Configuration.md`](./Configuration.md) — `Mbproxy.AdminPort` and other option keys.
- [`./Troubleshooting.md`](./Troubleshooting.md) — using these counters to diagnose specific failure modes.
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — event-id catalogue including `mbproxy.admin.bind.failed`.
- [`../kpi.md`](../kpi.md) — dashboard catalog that consumes these counters.
@@ -0,0 +1,364 @@
# Troubleshooting
Operator diagnosis playbook for mbproxy. Each entry maps an observable symptom to the log event name and status-page counter that confirms it, then lists likely causes and remediation steps.
The rolling log lives at `C:\ProgramData\mbproxy\logs\mbproxy-<date>.log`. The live counters are at `http://<host>:<AdminPort>/status.json` (default port `8080`). Events at Error level and above are also mirrored to the Windows Application Event Log under source `mbproxy`.
## Service Startup Failures
### Listener port already in use
**Symptom.** Service starts but one or more PLCs show `listener.state = "recovering"` on the status page. `plcs[].listener.lastBindError` contains the OS error text (typically `Only one usage of each socket address ... is normally permitted`).
**Where to look.**
- Log event: `mbproxy.startup.bind.failed` (Error) with `Plc`, `Port`, `Reason` properties.
- Followed periodically by retries; success eventually logs `mbproxy.listener.recovered`.
- Status fields: `plcs[].listener.state`, `plcs[].listener.lastBindError`, `plcs[].listener.recoveryAttempts`.
**Root causes.**
- Another mbproxy instance is already running against the same `appsettings.json`.
- A stale `mbproxy` process is holding the port after a non-graceful stop.
- A different service (a previous Modbus gateway, a leftover test harness) is bound to the configured `ListenPort`.
**Remediation.**
1. Identify the process holding the port:
```powershell
netstat -ano | findstr :<port>
Get-Process -Id <pid>
```
2. Stop the conflicting process, or change `Plcs[].ListenPort` in `appsettings.json` and save. The supervisor retries on a Polly schedule (1s / 2s / 5s / 15s / 30s, then 30s indefinitely) — watch for `mbproxy.listener.recovered` to confirm.
3. If the listener never recovers, check the Event Log for the underlying reason text and verify the configured IP address is bound on the host (the proxy binds to the host's IPs, not the PLC's).
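
To estimate how long a listener has been retrying, the schedule quoted in step 2 can be modelled as an infinite sequence of delays. This is a sketch of the schedule as documented (1s / 2s / 5s / 15s / 30s, then 30s indefinitely); the function names are hypothetical, and it assumes no jitter is added to the delays.

```python
from itertools import chain, islice, repeat


def bind_retry_delays():
    """Return an iterator of retry delays in seconds: 1, 2, 5, 15, 30, then 30 forever."""
    return chain([1, 2, 5, 15, 30], repeat(30))


def seconds_waited(attempts):
    """Total delay accumulated across the first `attempts` retries."""
    return sum(islice(bind_retry_delays(), attempts))
```

For example, `seconds_waited(12)` is 263, so under these assumptions a listener showing twelve recovery attempts has been waiting a little over four minutes.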
### Admin endpoint port collision
**Symptom.** Status page is unreachable; Modbus traffic continues to flow normally.
**Where to look.**
- Log event: `mbproxy.admin.bind.failed` (Error) with `Port`, `Reason` properties.
- The matching success event `mbproxy.admin.started` is absent from the same boot.
**Root causes.**
- Another HTTP service (IIS, a sidecar dashboard, a previous mbproxy instance) is bound to `AdminPort`.
- A firewall rule is rejecting the bind on the configured port.
**Remediation.**
1. The Modbus proxy continues to forward traffic — only telemetry is affected. There is no urgency from a traffic-flow perspective.
2. Identify the process holding the admin port with the same `netstat -ano | findstr :<port>` pattern.
3. Change `Mbproxy.AdminPort` in `appsettings.json` and save. Admin rebinding is hot-reloadable; no service restart is required.
### Malformed appsettings.json at startup
**Symptom.** Service fails to enter the `RUNNING` state. `sc.exe start mbproxy` reports a startup failure.
**Where to look.**
- Rolling log at `C:\ProgramData\mbproxy\logs\` for the most recent date — startup errors include the JSON parse exception with line/column.
- Windows Event Log under source `mbproxy` for the Error-level entry mirrored from the rolling log.
**Root causes.**
- Trailing comma, unbalanced braces, or stray comment in `appsettings.json`.
- A required section (`Plcs`, `BcdTags`) is missing or has the wrong shape.
- A field that must be an integer is quoted as a string.
**Remediation.**
1. Open `C:\ProgramData\mbproxy\appsettings.json` and validate it as JSON (use any editor with JSON linting).
2. Fix the structural error reported in the log and save.
3. Start the service with `sc.exe start mbproxy`.
## Connectivity Failures Between Proxy and PLC
### Backend connect refused
**Symptom.** Upstream clients can connect to the proxy but their reads/writes either return Modbus exception 0x0B or the proxy closes the client socket. `plcs[].backend.connectsFailed` on the status page rises while `connectsSuccess` stays flat.
**Where to look.**
- Log event: `mbproxy.backend.failed` (Warning) with `Plc`, `Reason`.
- Status fields: `plcs[].backend.connectsFailed`, `plcs[].backend.connectsSuccess`.
**Root causes.**
- PLC powered off, rebooting, or its ECOM/EBC coprocessor is faulted.
- Wrong `Host` or `Port` configured for the PLC in `appsettings.json`.
- A network ACL or firewall change is blocking the proxy host from reaching the PLC on TCP 502.
- The H2-ECOM100 already has its cap of 4 simultaneous TCP clients in use — the 5th connection is refused at the device.
**Remediation.**
1. Confirm the PLC is reachable from the proxy host:
```powershell
Test-NetConnection -ComputerName <plc-ip> -Port 502
```
2. Verify the host/port in `appsettings.json` matches the PLC's actual settings (see `DL260/mbtcp_settings.JPG` for the as-deployed values).
3. If `Test-NetConnection` succeeds but the proxy still fails, inspect the upstream client count for that PLC on the status page — if it is at 4 and a new connect attempt fires, the ECOM cap is the cause.
4. If the PLC has rebooted, the supervisor retries automatically on the Polly backend-connect pipeline (3 attempts at 100ms / 500ms / 2000ms per upstream request).
### Backend disconnect cascade
**Symptom.** All upstream clients for a single PLC disconnect at the same instant. `plcs[].backend.disconnectCascades` on the status page increments by the number of upstream clients that were attached at the time. Upstream clients reconnect on their next request.
**Where to look.**
- Log event: `mbproxy.multiplex.backend.disconnected` (Warning) with `Plc`, `UpstreamCount`, `InFlightCount`, `Reason`.
- Status field: `plcs[].backend.disconnectCascades`.
**Root causes.**
- PLC rebooted, reset its ECOM, or dropped the TCP session.
- A middlebox (switch, firewall) timed out the idle connection. The DL205/DL260 family does not emit TCP keepalive, so idle paths die silently, typically after about 25 minutes.
- A network event (link flap, switch reset) closed the path.
**Remediation.**
1. Verify the upstream count on the status page returns to normal as clients reconnect — `plcs[].clients.connected` should climb again within seconds.
2. If cascades fire repeatedly against the same PLC, investigate the PLC and intermediate network for stability. The proxy itself has no state to repair.
3. If cascades correlate with idle periods, the idle middlebox-drop pattern is the likeliest cause; reduce the upstream client's poll interval below the middlebox idle timeout to keep traffic flowing.
### Request timeout watchdog firing
**Symptom.** Upstream clients receive Modbus exception `0x0B` (Gateway Target Device Failed To Respond) with the original transaction ID preserved. The backend socket stays up — only individual requests time out.
**Where to look.**
- Log event: `mbproxy.multiplex.request.timeout` (Warning) with `Plc`, `ProxyTxId`, `OriginalTxId`, `Fc`, `ElapsedMs`.
- Status field: `plcs[].backend.lastRoundTripMs` (EWMA over recent successful round-trips — climbs as the PLC slows down).
**Root causes.**
- PLC scan time has grown beyond `Connection.BackendRequestTimeoutMs` (default 3000) under load.
- A PLC firmware quirk is dropping responses or echoing the wrong MBAP transaction ID.
- In test environments only, pymodbus 3.13.0's concurrent-multiplexed-request bug delivers the response under a different `OriginalTxId` than was sent — see [`../Testing/Simulator.md`](../Testing/Simulator.md).
**Remediation.**
1. Confirm the PLC is healthy — the EWMA in `plcs[].backend.lastRoundTripMs` should sit well below the configured timeout. If it is creeping up, the PLC itself is overloaded.
2. If the PLC's scan time legitimately exceeds the default, raise `Connection.BackendRequestTimeoutMs`. The change is hot-reloadable; the next request uses the new value.
3. The proxy does not retry timed-out FC06 / FC16 — they are non-idempotent on BCD tags and a partial-applied multi-register write could leave a 32-bit pair mid-transition. Upstream clients are responsible for their own retry policy.
## Configuration Hot-Reload Failures
### Reload rejected
**Symptom.** A save to `%ProgramData%\mbproxy\appsettings.json` is not picked up. The running config stays in effect. `service.configReloadRejectedCount` on the status page increments by one; `service.configLastReloadUtc` does not advance.
**Where to look.**
- Log event: `mbproxy.config.reload.rejected` (Error) with the joined `Errors` string.
- Status fields: `service.configReloadCount`, `service.configReloadRejectedCount`, `service.configLastReloadUtc`.
**Root causes.**
- Malformed JSON (parse error).
- Duplicate `Plcs[].ListenPort` across two PLCs.
- Duplicate BCD `Address` within one tag list.
- A 32-bit BCD pair whose high register overlaps with a separate 16-bit entry at the next address.
- A `CacheTtlMs` (or per-PLC `DefaultCacheTtlMs`) exceeding 60 000 ms without `Cache.AllowLongTtl = true` to opt in.
- Schema error (a required field is missing or has the wrong type).
**Remediation.**
1. Read the `Errors` property on the rejection log event — it lists every validation failure for the rejected file as a single semicolon-separated string.
2. Fix the file and save again. The next valid write is accepted and `mbproxy.config.reload.applied` is logged.
3. Reload is all-or-nothing — there is no partial application. The previous valid config keeps running until the next valid write.
4. For the full validation rule set, see [`../Features/HotReload.md`](../Features/HotReload.md).
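
To make the all-or-nothing behaviour concrete, the sketch below checks two of the rules listed under root causes (duplicate `ListenPort` and the 60 000 ms TTL cap) and joins every failure into one semicolon-separated string. It is illustrative only: the input shape, the function name, and the error wording are assumptions, and the real validator covers more rules.

```python
MAX_TTL_MS = 60_000  # cap that applies unless long TTLs are explicitly allowed


def validate(plcs, allow_long_ttl=False):
    """Return '' when the config is acceptable, else all errors joined by '; '.

    `plcs` is a simplified, hypothetical list of dicts with `Name`,
    `ListenPort`, and an optional `DefaultCacheTtlMs`.
    """
    errors = []
    seen_ports = {}
    for plc in plcs:
        port = plc["ListenPort"]
        if port in seen_ports:
            errors.append(
                f"duplicate ListenPort {port}: {seen_ports[port]} and {plc['Name']}")
        else:
            seen_ports[port] = plc["Name"]

        ttl = plc.get("DefaultCacheTtlMs") or 0
        if ttl > MAX_TTL_MS and not allow_long_ttl:
            errors.append(
                f"{plc['Name']}: DefaultCacheTtlMs {ttl} exceeds {MAX_TTL_MS}")
    return "; ".join(errors)
```

A non-empty result means the whole file is rejected and the previous configuration keeps running.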
## BCD Rewriter Anomalies
### Partial-overlap warnings
**Symptom.** The status page's `plcs[].pdus.partialBcdWarnings` counter rises. Upstream clients see raw nibble values instead of decoded integers for a 32-bit BCD pair.
**Where to look.**
- Log event: `mbproxy.rewrite.partial_bcd` (Warning) with `Plc`, `Address`, `ClientStart`, `ClientQty`.
- Status field: `plcs[].pdus.partialBcdWarnings`.
**Root causes.**
- An upstream client requested quantity = 1 at the low address of a configured 32-bit BCD pair (the proxy will not rewrite half a pair).
- An upstream client read or wrote the high address of a 32-bit pair on its own.
- A client tag definition specifies the wrong word width for the configured BCD address.
**Remediation.**
1. Fix the client's tag definition: 32-bit BCD addresses must be accessed as quantity = 2 starting at the low address.
2. If the client genuinely intends to read half the pair (rare; not a normal workflow), remove the 32-bit entry from the BCD tag list and replace it with a single 16-bit entry at the address the client uses.
3. Background reading: [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) covers the rewriter pipeline and the partial-overlap policy.
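
The overlap cases above reduce to asking which of the pair's two words fall inside the requested range. A minimal sketch, assuming a 32-bit tag occupies `pair_low` and `pair_low + 1` and that the request covers `[start, start + qty)`; the function name and return values are illustrative.

```python
def classify_pair_access(start, qty, pair_low):
    """Classify a request range against a 32-bit BCD pair.

    Returns "full" when both words are covered, "partial" when exactly one
    is covered, and "none" when the request does not touch the pair.
    """
    end = start + qty  # exclusive upper bound
    covers_low = start <= pair_low < end
    covers_high = start <= pair_low + 1 < end
    if covers_low and covers_high:
        return "full"
    if covers_low or covers_high:
        return "partial"
    return "none"
```

With a pair at address 100, a read of quantity 1 at 100 or at 101 is `"partial"`, while quantity 2 at 100 is `"full"`.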
### Invalid BCD values
**Symptom.** The status page's `plcs[].pdus` block is normal but `mbproxy.rewrite.invalid_bcd` warnings appear in the log. The affected register passes through raw.
**Where to look.**
- Log event: `mbproxy.rewrite.invalid_bcd` (Warning) with `Plc`, `Address`, `RawValue` (hex), `Direction` (`Read` or `Write`).
- Status field: invalid-BCD warnings are folded into `plcs[].pdus.partialBcdWarnings` only when the warning is partial; pure invalid-BCD events do not have a dedicated counter — log search is the primary diagnostic.
**Root causes.**
- The tag is misconfigured as BCD when the PLC is actually storing binary at that address. `0x1A2B` is not valid BCD because the nibble `A` is outside the range 0–9.
- The PLC ladder program wrote a non-BCD value to a register configured as a BCD tag (operator bug on the PLC side).
- An upstream client is writing a value the proxy cannot encode into BCD (out-of-range decimal — for example, decimal 9 999 999 into a 16-bit BCD slot whose maximum is 9999).
**Remediation.**
1. Look at the `RawValue` in the log event. If it consistently contains nibbles `A`–`F`, the tag is almost certainly not BCD at the PLC; remove it from the BCD tag list.
2. If the value is occasional and only appears under certain operator actions, inspect the PLC ladder program for code paths that write to that address.
3. If the warning fires on writes (`Direction=Write`), the upstream client is sending a binary integer the proxy cannot represent in BCD. Validate the client-side value range.
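
When triaging a `RawValue` by hand, it helps to apply the same nibble test the root causes describe. The sketch below handles 16-bit BCD only; the function names are hypothetical, and raising `ValueError` is a choice made for this example (the proxy itself passes invalid registers through raw rather than failing).

```python
def is_valid_bcd16(raw):
    """True when every 4-bit nibble of a 16-bit register is a decimal digit."""
    return all(((raw >> shift) & 0xF) <= 9 for shift in (0, 4, 8, 12))


def decode_bcd16(raw):
    """Decode a 16-bit BCD register to an integer in 0..9999."""
    if not is_valid_bcd16(raw):
        raise ValueError(f"0x{raw:04X} is not valid BCD")
    value = 0
    for shift in (12, 8, 4, 0):  # most significant digit first
        value = value * 10 + ((raw >> shift) & 0xF)
    return value


def encode_bcd16(value):
    """Encode an integer in 0..9999 as a 16-bit BCD register."""
    if not 0 <= value <= 9999:
        raise ValueError(f"{value} does not fit in 16-bit BCD (max 9999)")
    raw = 0
    for shift in (0, 4, 8, 12):  # least significant digit first
        value, digit = divmod(value, 10)
        raw |= digit << shift
    return raw
```

Here `is_valid_bcd16(0x1A2B)` is false because of the `A` and `B` nibbles, and `encode_bcd16` rejects anything above 9999, which mirrors the write-side failure described above.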
## Performance Anomalies
### Backend EWMA latency spiking
**Symptom.** `plcs[].backend.lastRoundTripMs` on the status page climbs from its normal range (typically a few milliseconds on a healthy ECOM) toward `Connection.BackendRequestTimeoutMs`. Eventually some requests start timing out and the request-timeout symptom kicks in.
**Where to look.**
- Status field: `plcs[].backend.lastRoundTripMs` (EWMA with α = 0.2 over recent successful round-trips).
- If timeouts begin, `mbproxy.multiplex.request.timeout` events appear.
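
To build intuition for how quickly the EWMA reacts, the standard update rule with α = 0.2 can be sketched as follows. The function name is illustrative, and seeding the average with the first sample is an assumption; the proxy's actual seeding behaviour is not documented here.

```python
def ewma_update(previous, sample_ms, alpha=0.2):
    """Fold one round-trip sample into the running average.

    `previous` is None before the first sample, in which case the sample
    itself becomes the average (an assumption of this sketch).
    """
    if previous is None:
        return sample_ms
    return alpha * sample_ms + (1 - alpha) * previous
```

With α = 0.2, each sample closes 20% of the gap to the new level, so after ten consecutive slow samples the average has covered roughly 89% of a step change (1 − 0.8¹⁰). A short burst of slow responses therefore moves the field only modestly, while a sustained slowdown shows up within a few dozen requests.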
**Root causes.**
- PLC is under unusually heavy ladder load and its Modbus scan slot is starving.
- Network congestion between the proxy host and the PLC.
- The PLC is sharing its ECOM module with other Modbus clients (the proxy is not the only consumer).
**Remediation.**
1. Check the PLC's own diagnostics for scan-time growth or watchdog warnings.
2. Check whether the proxy is the only consumer. If the ECOM is also serving another upstream tool, the two are contending for the same serialised processing slot.
3. If the EWMA stabilises at a higher-but-still-safe value, consider raising `Connection.BackendRequestTimeoutMs` so individual slow responses do not start timing out.
### Multiplexer queue depth growing
**Symptom.** `plcs[].backend.queueDepth` on the status page is non-zero and trending up rather than oscillating near zero. The backend is being asked for more frames per unit time than it can serialise.
**Where to look.**
- Status field: `plcs[].backend.queueDepth` (current depth of the outbound channel feeding the backend writer task).
- Cross-reference: `plcs[].clients.connected` (upstream demand) and `plcs[].backend.lastRoundTripMs` (backend service rate).
**Root causes.**
- More concurrent upstream clients are issuing reads than the ECOM's serialised loop can satisfy. The DL205/DL260 family processes Modbus requests one at a time.
- A burst of large FC03/FC04 quantities is consuming wire time per request.
**Remediation.**
1. Enable the Phase-10 read coalescer if it is off: set `Resilience.ReadCoalescing.Enabled = true` in `appsettings.json`. Overlapping FC03/FC04 reads on the same address range fan out from a single backend round-trip — see [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md).
2. Opt heavy-read tags into the response cache by setting `CacheTtlMs > 0` per tag (or `DefaultCacheTtlMs` per PLC). Cache hits never touch the backend — see [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
3. Reduce upstream client poll rates against the affected PLC if neither feature is appropriate.
4. Background reading on the per-PLC connection model and the queue's role: [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md).
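
To tell "trending up" apart from "oscillating near zero", it can help to sample `queueDepth` at a fixed interval and look at the slope. The following is a minimal sketch, not part of mbproxy: it assumes the status JSON has already been fetched and parsed into a dict, and the helper names and the `min_slope` threshold are hypothetical values to tune per site.

```python
from typing import Any, Dict, List, Sequence


def queue_depths(status: Dict[str, Any]) -> List[int]:
    """Return backend.queueDepth for each entry in plcs[], in document order."""
    return [plc["backend"]["queueDepth"] for plc in status.get("plcs", [])]


def queue_depth_trend(samples: Sequence[int], min_slope: float = 0.5) -> str:
    """Classify queueDepth readings taken at a fixed interval.

    min_slope is a hypothetical threshold (depth units per sample).
    """
    n = len(samples)
    if n < 3:
        return "insufficient-data"
    if max(samples) == 0:
        return "idle"
    # Least-squares slope of depth against sample index.
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    slope = num / den
    if slope >= min_slope and samples[-1] > samples[0]:
        return "growing"
    return "oscillating"
```

A series such as `[0, 1, 0, 2, 0, 1]` classifies as `oscillating`, while `[1, 3, 5, 8, 12, 15]` classifies as `growing` and is the pattern this entry describes.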
### Coalescing dead-upstream count rising
**Symptom.** `plcs[].backend.coalescedResponseToDeadUpstream` rises steadily. Coalesced reads complete on the backend, but the upstream client that asked for the read has already disconnected before the response is fanned out.
**Where to look.**
- Log event: `mbproxy.coalesce.dead_upstream` (Debug) with `Plc`, `UnitId`, `Fc`, `Start`, `Qty`.
- Status field: `plcs[].backend.coalescedResponseToDeadUpstream`.
**Root causes.**
- Upstream Modbus clients are configured with a read timeout shorter than the backend's actual response time. They disconnect and reconnect before the response arrives.
- Upstream clients are deliberately short-lived (single-shot pollers that connect, send one request, close).
- Network instability is killing upstream sockets mid-request.
**Remediation.**
1. This metric is informational and often benign. A small steady rate against short-lived pollers is expected.
2. If the rate is high enough to matter, verify that the read timeouts configured on upstream clients (NModbus, generic Modbus clients) exceed `Connection.BackendRequestTimeoutMs` plus a margin for jitter.
3. Cross-check `plcs[].backend.lastRoundTripMs` — if the backend is genuinely slow, the dead-upstream metric is a follow-on symptom of the latency-spike entry above; address that first.
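
The two checks above (how fast the counter is rising, and whether upstream timeouts leave enough headroom) can be sketched as follows. This is an illustration only: the function names and the default `jitter_margin_ms` are hypothetical, and treating a decreasing counter as a reset after a service restart is an assumption about how the counter behaves.

```python
def dead_upstream_rate_per_min(prev_count: int, curr_count: int, elapsed_s: float) -> float:
    """Rate of coalescedResponseToDeadUpstream between two status snapshots."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    delta = curr_count - prev_count
    if delta < 0:
        # Assumption: the counter restarted from zero (for example after a restart).
        delta = curr_count
    return delta * 60.0 / elapsed_s


def upstream_timeout_is_safe(
    upstream_read_timeout_ms: int,
    backend_request_timeout_ms: int,
    jitter_margin_ms: int = 250,  # hypothetical margin; tune per network
) -> bool:
    """True when the upstream client waits longer than the backend timeout plus margin."""
    return upstream_read_timeout_ms > backend_request_timeout_ms + jitter_margin_ms
```

For example, an upstream read timeout of 3000 ms against a `Connection.BackendRequestTimeoutMs` of 3000 fails the check, because the client gives up at the same moment the proxy would.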
## Response Cache Anomalies
### Cache hit ratio low when expected high
**Symptom.** `plcs[].backend.cacheHitCount` is not rising even though the tag was opted into the cache. Reads are still hitting the backend.
**Where to look.**
- Status fields: `plcs[].backend.cacheHitCount`, `plcs[].backend.cacheMissCount`, `plcs[].backend.cacheEntryCount`.
- Log event: `mbproxy.cache.miss` (Debug) — turn the log level up to confirm misses are firing for the expected addresses.
**Root causes.**
- The tag's `CacheTtlMs` is unset (null) and the per-PLC `DefaultCacheTtlMs` is `0` (the default). Cache is opt-in; absent a positive TTL, every read misses.
- The last config reload was rejected, so the cache TTL change never took effect. Check `service.configReloadRejectedCount`.
- Writes to the same address range are arriving fast enough to invalidate every cached entry before it is reused. Inspect `cacheInvalidations` next.
**Remediation.**
1. Confirm the configured cache fields appear in `/status.json` for the PLC. If `cacheEntryCount` is `0` after a sustained read load, the cache is not wired for that tag.
2. Verify the most recent `service.configLastReloadUtc` matches the time you saved the file. If not, the reload was rejected — see the "Reload rejected" entry above.
3. For the cache wiring rules and per-tag-versus-per-PLC precedence, see [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
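
As a quick sanity check on the counters named above, the hit ratio can be computed from one PLC's `backend` object in the parsed status JSON. This is a minimal sketch with hypothetical helper names. It assumes the three counters are plain integers, and the "unwired" heuristic simply encodes the rule of thumb from step 1.

```python
from typing import Any, Dict, Optional


def cache_hit_ratio(backend: Dict[str, Any]) -> Optional[float]:
    """Hits divided by hits plus misses, or None when no reads were counted."""
    hits = backend.get("cacheHitCount", 0)
    misses = backend.get("cacheMissCount", 0)
    total = hits + misses
    if total == 0:
        return None
    return hits / total


def cache_looks_unwired(backend: Dict[str, Any]) -> bool:
    """True when no entries and no hits exist; meaningful only after sustained read load."""
    return backend.get("cacheEntryCount", 0) == 0 and backend.get("cacheHitCount", 0) == 0
```

A ratio of `None` means no cache traffic was counted at all, which is a different situation from a ratio of `0.0` where reads are being counted but every one misses.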
### Cache invalidations storming
**Symptom.** `plcs[].backend.cacheInvalidations` rises at a rate close to the read rate. Cache hits are happening but each one is followed quickly by an invalidation.
**Where to look.**
- Status fields: `plcs[].backend.cacheInvalidations`, `plcs[].backend.cacheHitCount`, `plcs[].pdus.byFc.fc06`, `plcs[].pdus.byFc.fc16`.
- Log event: `mbproxy.cache.invalidated` (Debug) with `Plc`, `UnitId`, `WriteStart`, `WriteQty`, `Count`.
**Root causes.**
- Frequent FC06 / FC16 writes target the same address range as the cached reads. The cache invalidates correctly on every overlapping write — but if writes outpace reads, the cache provides little net benefit.
- The cached read range is larger than necessary and overlaps with unrelated write traffic.
- The TTL is high relative to the natural data churn rate at the PLC.
**Remediation.**
1. Compare `cacheInvalidations` to `cacheHitCount`. When the invalidation rate approaches the read rate, the cache is working as designed but saving few backend round-trips.
2. Lower the TTL on the affected tag (or remove it from the cache entirely).
3. Verify the cached read range matches only the addresses the upstream actually needs — narrower ranges reduce overlap with write traffic.
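
The comparison in step 1 is easier to read as a ratio of counter deltas between two status snapshots. The sketch below is illustrative only: the function names and the `threshold` value are hypothetical, and both arguments are assumed to be the `backend` object of the same PLC taken at two points in time.

```python
from typing import Any, Dict, Optional


def invalidations_per_hit(prev: Dict[str, Any], curr: Dict[str, Any]) -> Optional[float]:
    """Invalidations per cache hit between two snapshots, or None if not computable."""
    d_inv = curr["cacheInvalidations"] - prev["cacheInvalidations"]
    d_hit = curr["cacheHitCount"] - prev["cacheHitCount"]
    if d_inv < 0 or d_hit <= 0:
        # Counters went backwards or no hits occurred; no meaningful ratio.
        return None
    return d_inv / d_hit


def is_invalidation_storm(prev: Dict[str, Any], curr: Dict[str, Any], threshold: float = 0.8) -> bool:
    """True when invalidations keep pace with hits (threshold is a hypothetical cutoff)."""
    ratio = invalidations_per_hit(prev, curr)
    return ratio is not None and ratio >= threshold
```

A ratio close to 1.0 means nearly every hit is followed by an invalidation, which is the case where lowering the TTL or removing the tag from the cache is worth considering.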
## Service Stop / Restart
### Service will not stop cleanly within the graceful drain
**Symptom.** `sc.exe stop mbproxy` takes the full `Connection.GracefulShutdownTimeoutMs` (default 10 000 ms) or longer to return. The shutdown log line indicates non-zero in-flight requests at cancel time.
**Where to look.**
- Log event: `mbproxy.shutdown.complete` (Information) with `InFlightAtCancel`, `ElapsedMs`.
- Windows Event Log for any Error-level events emitted during the shutdown window.
**Root causes.**
- `Connection.GracefulShutdownTimeoutMs` is shorter than the time the slowest in-flight request needs to complete.
- An in-flight request is stuck because the backend is unresponsive — the request will never return; only the deadline ends it.
- The fleet is in a sustained busy state at the moment of stop, with many in-flight requests, and they cannot all complete within the configured budget.
**Remediation.**
1. Inspect `InFlightAtCancel` on the shutdown log line. Zero means the drain succeeded; non-zero means that many requests were cancelled by the deadline.
2. Raise `Connection.GracefulShutdownTimeoutMs` if a slow-but-healthy backend is the cause. The change applies on the next `ApplicationStopping` event — restart the service to pick it up.
3. If non-zero `InFlightAtCancel` correlates with `mbproxy.multiplex.request.timeout` events in the same window, the backend was unresponsive at stop time and no timeout extension would have helped — the proxy correctly proceeded with shutdown.
4. The drain phase cancels remaining work cleanly; the service always reaches `STOPPED`. Persistent failure to reach `STOPPED` indicates a Windows service-control issue, not an mbproxy issue.
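
The decision logic in steps 1 to 3 can be written down as a small triage helper. This is a sketch under stated assumptions: the function and its return labels are hypothetical, and the caller is expected to have already read `InFlightAtCancel` from the `mbproxy.shutdown.complete` line and counted `mbproxy.multiplex.request.timeout` events in the same window by whatever log tooling is in use.

```python
def classify_shutdown(in_flight_at_cancel: int, request_timeout_events: int) -> str:
    """Map shutdown evidence to the remediation step it points at."""
    if in_flight_at_cancel < 0 or request_timeout_events < 0:
        raise ValueError("counts must be non-negative")
    if in_flight_at_cancel == 0:
        # Step 1: the drain succeeded; nothing was cancelled by the deadline.
        return "drained"
    if request_timeout_events > 0:
        # Step 3: the backend was unresponsive; a longer timeout would not have helped.
        return "backend-unresponsive"
    # Step 2: slow but healthy backend; a longer graceful timeout may help.
    return "consider-longer-timeout"
```

Only the last outcome suggests raising `Connection.GracefulShutdownTimeoutMs`; the other two indicate that the configured budget was not the limiting factor.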
## Related Documentation
- [Status page](./StatusPage.md)
- [Configuration](./Configuration.md)
- [Log event catalogue](../Reference/LogEvents.md)
- [Connection model](../Architecture/ConnectionModel.md)
- [Response cache](../Architecture/ResponseCache.md)
- [Hot reload validation rules](../Features/HotReload.md)
- [BCD rewriting](../Features/BcdRewriting.md)
- [Read coalescing](../Architecture/ReadCoalescing.md)
- [pymodbus simulator quirks](../Testing/Simulator.md)