wwtools/mbproxy/docs/Operations/StatusPage.md
Joseph Doherty f49e27e316 mbproxy/docs: split deep docs into focused PascalCase files per StyleGuide
Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.

Architecture/
  - Overview.md         (150 lines) — listener topology, request flow, per-PLC isolation
  - ConnectionModel.md  (247 lines) — TxId multiplexer, watchdog, disconnect cascade
  - ReadCoalescing.md   (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
  - ResponseCache.md    (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation

Features/
  - BcdRewriting.md     (252 lines) — codec, CDAB, FC scope, partial-overlap policy
  - HotReload.md        (189 lines) — IOptionsMonitor + per-change-kind reconcile rules

Operations/
  - Configuration.md    (422 lines) — every Mbproxy:* option + validation rules
  - StatusPage.md       (334 lines) — admin endpoint surface, every JSON field
  - Troubleshooting.md  (364 lines) — diagnosis playbook keyed to log events

Reference/
  - LogEvents.md        (499 lines) — 28 events across 7 categories, grep-verified

Testing/
  - Simulator.md        (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk

Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.

Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:44:34 -04:00


# Status Page
The status page is the operator-facing view of the running service: an auto-refreshing HTML dashboard at `GET /` and a JSON twin at `GET /status.json` that monitoring scrapers consume. This document describes the endpoint surface, every wire-level field, and how counters map back to architecture decisions.
## Endpoint Surface
The admin endpoint is owned by `AdminEndpointHost` (see `src/Mbproxy/Admin/AdminEndpointHost.cs`). It exposes exactly two routes:
- `GET /` — a single self-contained HTML document with a `<meta http-equiv="refresh" content="5">` tag. The page refreshes every five seconds by reload, not by JavaScript polling. There is no JS bundle, no external CSS, no remote fonts, and no favicon fetch.
- `GET /status.json` — the same in-memory snapshot serialized as JSON via the source-generated `StatusJsonContext` (camelCase property names).
The endpoint is **read-only**. No admin actions are exposed: no kick-client, no force-reload, no listener restart, no log download. Reload happens automatically via `IOptionsMonitor`; listener recovery is owned by the supervisor. The endpoint performs no authentication of its own; access control is delegated to the network layer. The service binds to `IPAddress.Any` on the admin port and assumes the deployment runs in a trusted internal segment behind a firewall.
Both routes call `StatusSnapshotBuilder.Build()` for every request. The builder reads atomic counters directly from the supervisor map and per-PLC `ProxyCounters`; it holds no locks and performs no I/O.
## Port and Configuration
The listen port is read from `Mbproxy.AdminPort` and defaults to `8080`. Configuration semantics for this key live in [`./Configuration.md`](./Configuration.md).
If Kestrel cannot bind the configured port at startup (port already in use, missing permissions on a reserved range, etc.) the host logs `mbproxy.admin.bind.failed` at `Error` level with the underlying reason. The host then sets `_app = null` and returns — the rest of the service keeps running. The Modbus listener supervisors are completely independent of the admin endpoint, so a bind failure here is non-fatal for proxying. See [`../Reference/LogEvents.md`](../Reference/LogEvents.md) for the event-id catalogue.
If `Mbproxy.AdminPort` changes via hot-reload, the currently-running Kestrel app is stopped (2 s deadline) and a new one is started on the new port. Other config changes do not touch the admin endpoint.
## Service-Wide Fields
Top-level fields come from `ServiceFields` and `ListenersAggregate` in `src/Mbproxy/Admin/StatusDto.cs`.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `service.uptimeSeconds` | `long` | `ServiceFields.UptimeSeconds` | Seconds since process start, computed as `now - ServiceCounters.StartedAtUtc` at snapshot time. |
| `service.version` | `string` | `ServiceFields.Version` via `AssemblyVersionAccessor` | `AssemblyInformationalVersion` of the running assembly. Useful for confirming a deployment took effect. |
| `service.configLastReloadUtc` | `DateTimeOffset?` | `ServiceCounters.LastReloadUtc` | Wall-clock time of the most recent **accepted** hot-reload. `null` if no reload has occurred since process start. See [`../Features/HotReload.md`](../Features/HotReload.md). |
| `service.configReloadCount` | `int` | `ServiceCounters.ReloadAppliedCount` | Number of `appsettings.json` reloads that validated and applied since process start. |
| `service.configReloadRejectedCount` | `int` | `ServiceCounters.ReloadRejectedCount` | Number of reload attempts rejected by validation. A non-zero value here paired with a stale `configLastReloadUtc` indicates the operator's last edit was malformed and the service is still running the previous config. |
| `listeners.bound` | `int` | `boundCount` accumulated while iterating `opts.Plcs` | Count of PLC entries whose supervisor currently reports `SupervisorState.Bound`. |
| `listeners.configured` | `int` | `opts.Plcs.Count` | Total number of PLC entries in the active configuration. |
Operator triggers:
- `listeners.bound < listeners.configured` for more than one refresh cycle indicates one or more listeners are stuck recovering. Drill into the per-PLC `listener.state` and `listener.lastBindError` fields below.
- `configReloadRejectedCount` rising means edits are reaching the watcher but failing validation — check the live log for `mbproxy.config.reload.rejected`.
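
The two triggers above can be checked mechanically from two consecutive `/status.json` snapshots. The following is an illustrative Python sketch, not part of the service: the function name `service_alerts` and the alert wording are hypothetical, and it assumes the caller has already parsed both responses into dictionaries.

```python
def service_alerts(prev: dict, curr: dict) -> list:
    """Compare two parsed /status.json snapshots and return alert strings.

    Hypothetical helper: field names come from the tables in this document,
    the function name and message wording are illustrative only.
    """
    alerts = []

    def shortfall(snap: dict) -> int:
        return snap["listeners"]["configured"] - snap["listeners"]["bound"]

    # Trigger 1: a listener shortfall present in both snapshots, i.e. it
    # has lasted for more than one refresh cycle.
    if shortfall(prev) > 0 and shortfall(curr) > 0:
        stuck = [p["name"] for p in curr["plcs"]
                 if p["listener"]["state"] != "bound"]
        alerts.append("listeners not bound: " + ", ".join(stuck))

    # Trigger 2: the rejected-reload counter rose between the snapshots.
    rejected_delta = (curr["service"]["configReloadRejectedCount"]
                      - prev["service"]["configReloadRejectedCount"])
    if rejected_delta > 0:
        alerts.append(f"{rejected_delta} config reload(s) rejected")

    return alerts
```

Comparing two snapshots rather than one avoids alerting on a listener that is mid-recovery for a single refresh.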
## Per-PLC Fields
Each entry in `plcs[]` is a `PlcStatus` (see `src/Mbproxy/Admin/StatusDto.cs`). The builder iterates `opts.Plcs` in configured order, looks up the matching supervisor in `ProxyWorker.Supervisors`, and projects the supervisor's `CurrentCounters.Snapshot()` into wire fields.
### Identity
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `name` | `string` | `PlcOptions.Name` | Stable identifier from `appsettings.json`. Used as the dictionary key for supervisor lookup. |
| `host` | `string` | `PlcOptions.Host` | Backend PLC host (IP or DNS name) the proxy connects out to. |
| `listenPort` | `int` | `PlcOptions.ListenPort` | Local TCP port the proxy binds for upstream clients connecting *to* the proxy. |
### Listener state
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `listener.state` | `string` | `SupervisorSnapshot.State` mapped to `"bound"` / `"recovering"` / `"stopped"` | Current supervisor state. `bound` = TCP listener is accepting connections; `recovering` = Polly retry loop is trying to re-bind after a fault; `stopped` = no supervisor entry (typically a PLC that was just added and not yet started). |
| `listener.lastBindError` | `string?` | `SupervisorSnapshot.LastBindError` | Message from the last bind exception. Populated whenever `state == "recovering"`. Common values: `"Address already in use"`, `"Permission denied"`. |
| `listener.recoveryAttempts` | `int` | `SupervisorSnapshot.RecoveryAttempts` | Number of bind retries since the supervisor entered recovery. Resets on a successful bind. A monotonically rising value indicates the underlying problem is persistent. |
### Client tracking
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `clients.connected` | `int` | `clientSnapshots.Count` | Number of currently-connected upstream clients. Capped by the H2-ECOM100 four-client ceiling; values at 4 imply additional upstream connect attempts will be refused by the PLC. |
| `clients.remoteEndpoints[].remote` | `string` | `UpstreamPipe.RemoteEp` | Upstream TCP endpoint as `ip:port`. |
| `clients.remoteEndpoints[].connectedAtUtc` | `DateTimeOffset` | `UpstreamPipe.ConnectedAtUtc` | Wall-clock time the upstream socket was accepted. Useful for spotting zombie sockets that survived a network outage. |
| `clients.remoteEndpoints[].pdusForwarded` | `long` | `UpstreamPipe.PdusForwardedCount` | PDUs forwarded on this specific upstream pipe since it connected. Lets you see which client is responsible for what fraction of this PLC's traffic. |
### PDU traffic
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `pdus.forwarded` | `long` | `CounterSnapshot.PdusForwarded` | Total PDUs (requests + responses) that traversed the proxy for this PLC since start. Increments once per PDU handed to the rewriter. |
| `pdus.byFc.fc03` | `long` | `CounterSnapshot.Fc03` | Count of FC03 (read holding registers) requests seen. |
| `pdus.byFc.fc04` | `long` | `CounterSnapshot.Fc04` | Count of FC04 (read input registers) requests seen. |
| `pdus.byFc.fc06` | `long` | `CounterSnapshot.Fc06` | Count of FC06 (write single register) requests seen. |
| `pdus.byFc.fc16` | `long` | `CounterSnapshot.Fc16` | Count of FC16 (write multiple registers) requests seen. |
| `pdus.byFc.other` | `long` | `CounterSnapshot.FcOther` | All other function codes (FC01/02/05/15, diagnostic codes, etc.) seen. The proxy forwards these untouched. |
| `pdus.rewrittenSlots` | `long` | `CounterSnapshot.RewrittenSlots` | Number of register slots the BCD rewriter touched, counting reads and writes. Indicates how much of the traffic actually hits BCD-configured addresses. See [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md). |
| `pdus.partialBcdWarnings` | `long` | `CounterSnapshot.PartialBcdWarnings` | Count of requests whose `[start, start + qty)` range partially overlapped a 32-bit BCD tag without fully covering its CDAB word pair. A rising value here is an operator signal: an upstream client is requesting partial-overlap reads, which the proxy cannot rewrite safely. Review tag-list addresses or fix the client's request shape. |
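
The partial-overlap condition counted by `partialBcdWarnings` can be illustrated with a small predicate. This is a sketch under stated assumptions, not the proxy's code: the function name is hypothetical, and it assumes a 32-bit tag occupies the two consecutive registers `tag_addr` and `tag_addr + 1`.

```python
def partially_overlaps(start: int, qty: int, tag_addr: int) -> bool:
    """True when a request covers exactly one word of a 32-bit tag.

    Assumption: the tag's word pair sits at tag_addr and tag_addr + 1,
    and the request spans the half-open range [start, start + qty).
    """
    end = start + qty                          # exclusive end of the request
    covers_first = start <= tag_addr < end
    covers_second = start <= tag_addr + 1 < end
    return covers_first != covers_second       # one word in, one word out
```

A request that covers both words or neither word returns `False`; only a request that splits the pair is flagged.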
### Backend health
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.connectsSuccess` | `long` | `CounterSnapshot.ConnectsSuccess` | Successful backend TCP connects since start. Increments once per accepted upstream client (the proxy opens one backend socket per upstream client). |
| `backend.connectsFailed` | `long` | `CounterSnapshot.ConnectsFailed` | Failed backend TCP connects after the Polly retry budget is exhausted (3 attempts at 100/500/2000 ms). A rising counter means the backend host is unreachable or the PLC is at its connection cap. |
| `backend.exceptionsByCode.code01` | `long` | `CounterSnapshot.BackendException01` | Count of Modbus exception responses with code 01 (Illegal Function) received from the PLC. Typically indicates a client is sending function codes the PLC does not support. |
| `backend.exceptionsByCode.code02` | `long` | `CounterSnapshot.BackendException02` | Code 02 (Illegal Data Address) — the requested register range is out of the PLC's V-memory map. |
| `backend.exceptionsByCode.code03` | `long` | `CounterSnapshot.BackendException03` | Code 03 (Illegal Data Value) — quantity exceeds the PLC's per-FC cap (FC03/04 = 128 registers, FC16 = 100). |
| `backend.exceptionsByCode.code04` | `long` | `CounterSnapshot.BackendException04` | Code 04 (Server Device Failure) — internal PLC fault, often correlated with the PLC entering STOP mode. |
| `backend.lastRoundTripMs` | `double` | `CounterSnapshot.LastRoundTripMs` | Exponentially-weighted moving average of recent successful request → response round-trip times in milliseconds. Tracks PLC responsiveness; sustained values above the historical baseline indicate backend latency degradation. |
### Multiplexer state
These five fields describe the per-PLC backend multiplexer. See [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md) for the design rationale and how transaction-id (TxId) reuse and queueing work.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.inFlight` | `long` | `CounterSnapshot.InFlightCount` | Number of MBAP transactions currently in flight on the backend socket (request sent, response pending). |
| `backend.maxInFlight` | `long` | `CounterSnapshot.MaxInFlight` | High-water mark of `inFlight` since start. Used to size the queue and to verify the multiplexer is in fact pipelining requests. |
| `backend.txIdWraps` | `long` | `CounterSnapshot.TxIdWraps` | Times the 16-bit MBAP transaction-id allocator has wrapped through `0xFFFF`. A rising rate quantifies sustained request volume. |
| `backend.disconnectCascades` | `long` | `CounterSnapshot.BackendDisconnectCascades` | Times a backend disconnect cascaded into closing all upstream pipes that were waiting on in-flight TxIds. Each cascade aborts every queued request bound for that PLC. |
| `backend.queueDepth` | `long` | `CounterSnapshot.BackendQueueDepth` | Current count of requests queued behind the multiplexer's TxId allocator and write semaphore. A sustained non-zero queue means the multiplexer is the bottleneck (backend slower than upstream demand). |
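
To make `txIdWraps` concrete, here is a toy Python model of a 16-bit transaction-id allocator that counts wraps. It is a hypothetical illustration: the class name, the starting id of zero, and the exact moment a wrap is counted are all assumptions, and the real allocator is described in the connection model document.

```python
class TxIdAllocator:
    """Toy 16-bit transaction-id allocator that counts wraps (illustrative)."""

    def __init__(self) -> None:
        self._next = 0      # assumed starting id
        self.wraps = 0

    def allocate(self) -> int:
        tx_id = self._next
        self._next = (self._next + 1) & 0xFFFF   # keep the id within 16 bits
        if self._next == 0:                      # the id just handed out was 0xFFFF
            self.wraps += 1
        return tx_id
```

Under this model each wrap corresponds to 65,536 allocated ids, so the wrap rate between two scrapes gives a coarse estimate of backend request volume.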
### Coalescing counters
These fields describe duplicate-read coalescing on FC03/FC04. See [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md) for the matching criteria and lifecycle.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.coalescedHitCount` | `long` | `CounterSnapshot.CoalescedHitCount` | Reads that attached to an already-in-flight identical read instead of issuing a new backend request. |
| `backend.coalescedMissCount` | `long` | `CounterSnapshot.CoalescedMissCount` | Reads that did not find a matching in-flight request and issued their own. The dashboard-side ratio is `hit / (hit + miss)`; the wire format intentionally does **not** carry the derived ratio (consumers compute it). |
| `backend.coalescedResponseToDeadUpstream` | `long` | `CounterSnapshot.CoalescedResponseToDeadUpstream` | Coalesced responses that arrived after their attached upstream pipe had closed. Normal in bursty traffic; sustained growth indicates upstream clients are aborting too quickly. |
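
Since the ratio is left to consumers, a scraper needs a small helper that also handles the case where nothing has been counted yet. A minimal Python sketch follows; the function name is hypothetical.

```python
from typing import Optional


def coalescing_hit_ratio(hits: int, misses: int) -> Optional[float]:
    """Return hits / (hits + misses), or None when both counters are zero."""
    total = hits + misses
    if total == 0:
        return None          # coalescing has not been exercised yet
    return hits / total
```

With the counters from the example response in this document (41892 hits, 177012 misses) this yields roughly 0.191.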
### Cache counters
These fields describe the short-TTL response cache for FC03/FC04. See [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md).
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.cacheHitCount` | `long` | `CounterSnapshot.CacheHitCount` | Reads served from the cache without touching the backend at all. |
| `backend.cacheMissCount` | `long` | `CounterSnapshot.CacheMissCount` | Cache-eligible reads that fell through to the backend. The derived `cacheHitRatio` is `hit / (hit + miss)`; like coalescing, it is **not** carried on the wire. |
| `backend.cacheInvalidations` | `long` | `CounterSnapshot.CacheInvalidations` | Times a write (FC06/FC16) invalidated overlapping cache entries on this PLC. A high invalidation rate relative to writes means write coverage is broad and the cache is doing less work. |
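
The comparison of invalidations to writes mentioned in the last row can be derived from fields already on the wire. This is a minimal Python sketch, assuming the write count is the sum of `pdus.byFc.fc06` and `pdus.byFc.fc16`; the helper name is hypothetical and `plc` is one parsed entry of `plcs[]`.

```python
from typing import Optional


def invalidations_per_write(plc: dict) -> Optional[float]:
    """Ratio of cacheInvalidations to FC06 + FC16 requests for one PLC."""
    by_fc = plc["pdus"]["byFc"]
    writes = by_fc["fc06"] + by_fc["fc16"]   # assumed definition of "writes"
    if writes == 0:
        return None                          # no writes seen, ratio undefined
    return plc["backend"]["cacheInvalidations"] / writes
```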
### Cache memory-watch
These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is per-PLC; the dashboard aggregates these across the fleet.
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `backend.cacheEntryCount` | `long` | `CounterSnapshot.CacheEntryCount` | Current number of cached response entries for this PLC. |
| `backend.cacheBytes` | `long` | `CounterSnapshot.CacheBytes` | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |
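
The fleet-level aggregation is a plain sum over `plcs[]`. The Python sketch below shows one way a consumer might compute it; the function name and the returned key names are hypothetical, and any memory budget to compare against is the operator's own choice.

```python
def fleet_cache_totals(snapshot: dict) -> dict:
    """Sum the per-PLC cache memory-watch fields across one snapshot."""
    plcs = snapshot["plcs"]
    return {
        "entries": sum(p["backend"]["cacheEntryCount"] for p in plcs),
        "bytes": sum(p["backend"]["cacheBytes"] for p in plcs),
    }
```

For the two-PLC example response in this document the totals are 47 entries and 18512 bytes, since the second PLC has an empty cache.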
### Bytes
| JSON path | Type | Source | Meaning |
|---|---|---|---|
| `bytes.upstreamIn` | `long` | `CounterSnapshot.BytesUpstreamIn` | Total bytes read from upstream client sockets bound to this PLC since start. |
| `bytes.upstreamOut` | `long` | `CounterSnapshot.BytesUpstreamOut` | Total bytes written back to upstream client sockets bound to this PLC since start. |
## Counter Atomicity
All counters are updated through `System.Threading.Interlocked` operations. Each read in `StatusSnapshotBuilder.Build()` is atomic per field; no locks are held across the snapshot build, and the build itself does no I/O.
The practical consequence: a single `/status.json` request returns a coherent value for any **one** counter, but the assembled response is **not** a globally consistent snapshot — different per-PLC counters may straddle increments by microseconds. For example, `pdus.forwarded` for PLC A and `pdus.forwarded` for PLC B are not guaranteed to reflect the same instant. This is acceptable for dashboards and rate calculations; do not use these counters for fine-grained accounting.
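
A rate calculation of the kind this tolerance allows can be sketched from two snapshots. The Python below is illustrative only: the function name is hypothetical, it uses `service.uptimeSeconds` as the clock (so resolution is one second), and it treats a non-advancing uptime or a shrinking counter as a process restart.

```python
from typing import Optional


def pdus_per_second(prev: dict, curr: dict, plc_name: str) -> Optional[float]:
    """Average pdus.forwarded rate for one PLC between two snapshots.

    Returns None when no rate can be derived: uptime did not advance,
    the counter went backwards (assumed restart), or the PLC is missing.
    """
    elapsed = curr["service"]["uptimeSeconds"] - prev["service"]["uptimeSeconds"]
    if elapsed <= 0:
        return None

    def forwarded(snap: dict) -> Optional[int]:
        for plc in snap["plcs"]:
            if plc["name"] == plc_name:
                return plc["pdus"]["forwarded"]
        return None

    before, after = forwarded(prev), forwarded(curr)
    if before is None or after is None or after < before:
        return None
    return (after - before) / elapsed
```

Because the counters only increase during a run, small skew between fields averages out over any interval of a few seconds or more.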
## Example JSON Response
A representative two-PLC deployment, ~2 hours into a run:
```json
{
  "service": {
    "uptimeSeconds": 7234,
    "version": "1.0.0",
    "configLastReloadUtc": "2026-05-13T14:02:11+00:00",
    "configReloadCount": 2,
    "configReloadRejectedCount": 0
  },
  "listeners": {
    "bound": 2,
    "configured": 2
  },
  "plcs": [
    {
      "name": "line1-press",
      "host": "10.20.30.41",
      "listenPort": 5021,
      "listener": {
        "state": "bound",
        "lastBindError": null,
        "recoveryAttempts": 0
      },
      "clients": {
        "connected": 2,
        "remoteEndpoints": [
          {
            "remote": "10.20.40.10:51223",
            "connectedAtUtc": "2026-05-13T12:01:55+00:00",
            "pdusForwarded": 184213
          },
          {
            "remote": "10.20.40.11:53901",
            "connectedAtUtc": "2026-05-13T13:30:02+00:00",
            "pdusForwarded": 41008
          }
        ]
      },
      "pdus": {
        "forwarded": 225221,
        "byFc": {
          "fc03": 218904,
          "fc04": 0,
          "fc06": 12,
          "fc16": 6203,
          "other": 102
        },
        "rewrittenSlots": 1318622,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 2,
        "connectsFailed": 0,
        "exceptionsByCode": {
          "code01": 0,
          "code02": 14,
          "code03": 0,
          "code04": 0
        },
        "lastRoundTripMs": 12.4,
        "inFlight": 1,
        "maxInFlight": 4,
        "txIdWraps": 3,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 41892,
        "coalescedMissCount": 177012,
        "coalescedResponseToDeadUpstream": 7,
        "cacheHitCount": 88321,
        "cacheMissCount": 88691,
        "cacheInvalidations": 6203,
        "cacheEntryCount": 47,
        "cacheBytes": 18512
      },
      "bytes": {
        "upstreamIn": 4108290,
        "upstreamOut": 12993021
      }
    },
    {
      "name": "line2-oven",
      "host": "10.20.30.42",
      "listenPort": 5022,
      "listener": {
        "state": "recovering",
        "lastBindError": "Address already in use",
        "recoveryAttempts": 12
      },
      "clients": {
        "connected": 0,
        "remoteEndpoints": []
      },
      "pdus": {
        "forwarded": 0,
        "byFc": { "fc03": 0, "fc04": 0, "fc06": 0, "fc16": 0, "other": 0 },
        "rewrittenSlots": 0,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 0,
        "connectsFailed": 0,
        "exceptionsByCode": { "code01": 0, "code02": 0, "code03": 0, "code04": 0 },
        "lastRoundTripMs": 0.0,
        "inFlight": 0,
        "maxInFlight": 0,
        "txIdWraps": 0,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 0,
        "coalescedMissCount": 0,
        "coalescedResponseToDeadUpstream": 0,
        "cacheHitCount": 0,
        "cacheMissCount": 0,
        "cacheInvalidations": 0,
        "cacheEntryCount": 0,
        "cacheBytes": 0
      },
      "bytes": { "upstreamIn": 0, "upstreamOut": 0 }
    }
  ]
}
```
## HTML Page Layout
The HTML renderer is `StatusHtmlRenderer.Render(StatusResponse)` in `src/Mbproxy/Admin/StatusHtmlRenderer.cs`. The page is one document, inline CSS in a `<style>` block, no external resources of any kind — operators can serve it behind a corporate firewall without whitelisting a CDN.
Structure:
1. **Header summary** — version, formatted uptime (`Nh MMm SSs`), `bound/configured` listener tally, last reload timestamp, reload count with a `(N rejected)` suffix when applicable.
2. **PLC table** — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — `bound` = green, `recovering` = orange, `stopped` = grey), Clients (count plus a comma-separated list of `remote (N PDUs)`), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
3. **State cell error detail** — when `state == "recovering"`, the cell also shows `lastBindError` and `(attempt N)` in a small red span.
The coalescing and cache cells each render as `<pct>% (<hits>)`. When a cell's feature has not been exercised (`hit + miss == 0`), that cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
The page does not depend on JavaScript. Refresh is driven entirely by the `<meta http-equiv="refresh" content="5">` tag, so any browser — including text-mode browsers — sees the same view.
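
The header and ratio-cell formats described above can be mirrored in a consumer script. The Python sketch below is not the C# renderer: the function names are hypothetical, zero-padding of minutes and seconds is inferred from the `Nh MMm SSs` pattern, and one decimal place for the percentage is an assumption.

```python
def format_uptime(seconds: int) -> str:
    """Format seconds as 'Nh MMm SSs' (padding inferred from the pattern)."""
    hours, remainder = divmod(seconds, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {secs:02d}s"


def ratio_cell(hits: int, misses: int) -> str:
    """Format a ratio cell as '<pct>% (<hits>)', or a dash when unused."""
    total = hits + misses
    if total == 0:
        return "\u2014"                     # placeholder for an unexercised feature
    pct = 100.0 * hits / total
    return f"{pct:.1f}% ({hits})"           # one decimal place is an assumption
```

For the example response in this document, `format_uptime(7234)` gives `2h 00m 34s` and `ratio_cell(88321, 88691)` gives `49.9% (88321)`.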
## How to Scrape It
The JSON twin is plain HTTP. Any monitoring system that can issue an HTTP GET can scrape it.
PowerShell, pulling the cache hit ratio for the first PLC into a variable:
```powershell
# Fetch the snapshot and parse the JSON body into an object.
$snap = Invoke-WebRequest -Uri "http://mbproxy-host:8080/status.json" -UseBasicParsing |
    Select-Object -ExpandProperty Content |
    ConvertFrom-Json

# The ratio is not on the wire, so derive it from the raw counters.
$plc   = $snap.plcs[0]
$hits  = $plc.backend.cacheHitCount
$total = $hits + $plc.backend.cacheMissCount
$ratio = if ($total -gt 0) { [math]::Round(100.0 * $hits / $total, 1) } else { 0.0 }

"PLC $($plc.name): cache hit ratio = $ratio% over $total reads"
```
Bash with `curl` and `jq`, fanning out across the fleet:
```bash
# -f fails on HTTP errors; -sS stays quiet but still prints transport errors.
curl -fsS http://mbproxy-host:8080/status.json |
  jq -r '.plcs[] | "\(.name)\t\(.listener.state)\t\(.backend.lastRoundTripMs)"'
```
Prometheus-style scrapers should poll `/status.json` directly and translate fields into their own metric names; the service does not expose Prometheus exposition format.
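
One possible shape for that translation step is sketched below in Python using only the standard library. Everything outside the JSON field names is an assumption: the function names, the `mbproxy` prefix, the underscore-joined metric names, and the `plc` label are illustrative choices, not a convention defined by the service.

```python
import json
from urllib.request import urlopen

# Per-PLC sections that hold numeric counters (identity fields are skipped).
SECTIONS = ("listener", "clients", "pdus", "backend", "bytes")


def fetch_status(url: str, timeout: float = 5.0) -> dict:
    """GET /status.json and parse the body."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


def flatten_plc_metrics(snapshot: dict, prefix: str = "mbproxy"):
    """Yield (metric_name, labels, value) for each numeric per-PLC field."""

    def walk(node: dict, path: list):
        for key, value in node.items():
            if isinstance(value, dict):
                yield from walk(value, path + [key])
            elif isinstance(value, (int, float)) and not isinstance(value, bool):
                yield "_".join([prefix] + path + [key]), value
            # Strings, nulls, and lists (e.g. remoteEndpoints) are skipped.

    for plc in snapshot["plcs"]:
        labels = {"plc": plc["name"]}
        for section in SECTIONS:
            for name, value in walk(plc.get(section, {}), [section]):
                yield name, labels, value
```

A caller would pass the result of `fetch_status("http://mbproxy-host:8080/status.json")` to `flatten_plc_metrics` and hand each tuple to whatever metrics client it uses.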
## Where the KPIs Live
This document covers the **endpoint surface**: what is on the wire and how each field is computed. The **dashboard composition** — which counters roll up into which Grafana panels, alerting thresholds, fleet-aggregate definitions — lives in [`../kpi.md`](../kpi.md). Keep the two documents disjoint: when a new counter is added, list it here; when a new panel or rate calculation is added, add it to `kpi.md`.
## Related Documentation
- [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md) — multiplexer counter meanings (`inFlight`, `maxInFlight`, `txIdWraps`, `queueDepth`, `disconnectCascades`).
- [`../Architecture/ReadCoalescing.md`](../Architecture/ReadCoalescing.md) — coalescing counter meanings and matching criteria.
- [`../Architecture/ResponseCache.md`](../Architecture/ResponseCache.md) — cache counter meanings, TTL, invalidation rules.
- [`../Features/BcdRewriting.md`](../Features/BcdRewriting.md) — what increments `rewrittenSlots` and `partialBcdWarnings`.
- [`../Features/HotReload.md`](../Features/HotReload.md) — what increments `configReloadCount` vs. `configReloadRejectedCount`.
- [`./Configuration.md`](./Configuration.md) — `Mbproxy.AdminPort` and other option keys.
- [`./Troubleshooting.md`](./Troubleshooting.md) — using these counters to diagnose specific failure modes.
- [`../Reference/LogEvents.md`](../Reference/LogEvents.md) — event-id catalogue including `mbproxy.admin.bind.failed`.
- [`../kpi.md`](../kpi.md) — dashboard catalog that consumes these counters.