wwtools/mbproxy/docs/Operations/StatusPage.md
Joseph Doherty 7466a46aa7 mbproxy/docs: retire superseded design/plan docs and dissolve DL260/
2026-05-15 07:37:48 -04:00

Status Page

The status page is the operator-facing view of the running service: an auto-refreshing HTML dashboard at GET / and a JSON twin at GET /status.json that monitoring scrapers consume. This document describes the endpoint surface, every wire-level field, and how counters map back to architecture decisions.

Endpoint Surface

The admin endpoint is owned by AdminEndpointHost (see src/Mbproxy/Admin/AdminEndpointHost.cs). It exposes exactly two routes:

  • GET / — a single self-contained HTML document with a <meta http-equiv="refresh" content="5"> tag. The page refreshes every five seconds by reload, not by JavaScript polling. There is no JS bundle, no external CSS, no remote fonts, and no favicon fetch.
  • GET /status.json — the same in-memory snapshot serialized as JSON via the source-generated StatusJsonContext (camelCase property names).

The endpoint is read-only. It exposes no admin actions: no kick-client, no force-reload, no listener restart, no log download. Reload happens automatically via IOptionsMonitor; listener recovery is owned by the supervisor. The endpoint performs no authentication of its own; access control is left to the network layer. The service binds to IPAddress.Any on the admin port and assumes the deployment runs in a trusted internal segment behind a firewall.

Both routes call StatusSnapshotBuilder.Build() for every request. The builder reads atomic counters directly from the supervisor map and per-PLC ProxyCounters; it holds no locks and performs no I/O.
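
A consumer of the JSON route can be sketched as follows. This is a minimal Python sketch and not part of the service: the helper names parse_status and fetch_status and the two-second timeout are assumptions for illustration. The three top-level keys it checks are the ones that appear in the example response later in this document.

```python
import json
from urllib.request import urlopen

REQUIRED_KEYS = {"service", "listeners", "plcs"}


def parse_status(body: str) -> dict:
    """Parse a /status.json body and verify the top-level keys are present."""
    snap = json.loads(body)
    missing = REQUIRED_KEYS - snap.keys()
    if missing:
        raise ValueError(f"status.json is missing keys: {sorted(missing)}")
    return snap


def fetch_status(base_url: str, timeout_s: float = 2.0) -> dict:
    """GET <base_url>/status.json, e.g. base_url = "http://mbproxy-host:8080"."""
    with urlopen(f"{base_url}/status.json", timeout=timeout_s) as resp:
        return parse_status(resp.read().decode("utf-8"))
```

Keeping the parsing step separate from the HTTP call makes it possible to replay a saved response through the same checks without a running service.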

Port and Configuration

The listen port is read from Mbproxy.AdminPort and defaults to 8080. Configuration semantics for this key live in ./Configuration.md.

If Kestrel cannot bind the configured port at startup (port already in use, missing permissions on a reserved range, etc.) the host logs mbproxy.admin.bind.failed at Error level with the underlying reason. The host then sets _app = null and returns — the rest of the service keeps running. The Modbus listener supervisors are completely independent of the admin endpoint, so a bind failure here is non-fatal for proxying. See ../Reference/LogEvents.md for the event-id catalogue.

If Mbproxy.AdminPort changes via hot-reload, the currently-running Kestrel app is stopped (2 s deadline) and a new one is started on the new port. Other config changes do not touch the admin endpoint.

Service-Wide Fields

Top-level fields come from ServiceFields and ListenersAggregate in src/Mbproxy/Admin/StatusDto.cs.

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| service.uptimeSeconds | long | ServiceFields.UptimeSeconds | Seconds since process start, computed as now - ServiceCounters.StartedAtUtc at snapshot time. |
| service.version | string | ServiceFields.Version via AssemblyVersionAccessor | AssemblyInformationalVersion of the running assembly. Useful for confirming a deployment took effect. |
| service.configLastReloadUtc | DateTimeOffset? | ServiceCounters.LastReloadUtc | Wall-clock time of the most recent accepted hot-reload. null if no reload has occurred since process start. See ../Features/HotReload.md. |
| service.configReloadCount | int | ServiceCounters.ReloadAppliedCount | Number of appsettings.json reloads that validated and applied since process start. |
| service.configReloadRejectedCount | int | ServiceCounters.ReloadRejectedCount | Number of reload attempts rejected by validation. A non-zero value here paired with a stale configLastReloadUtc indicates the operator's last edit was malformed and the service is still running the previous config. |
| listeners.bound | int | boundCount accumulated while iterating opts.Plcs | Count of PLC entries whose supervisor currently reports SupervisorState.Bound. |
| listeners.configured | int | opts.Plcs.Count | Total number of PLC entries in the active configuration. |

Operator triggers:

  • listeners.bound < listeners.configured for more than one refresh cycle indicates one or more listeners are stuck recovering. Drill into the per-PLC listener.state and listener.lastBindError fields below.
  • configReloadRejectedCount rising means edits are reaching the watcher but failing validation — check the live log for mbproxy.config.reload.rejected.
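
Both triggers compare two consecutive snapshots. The following is a minimal Python sketch under that assumption; the function names are hypothetical, and treating a decreasing rejection counter as a process restart is an assumption about how a scraper might handle counter resets, not documented service behaviour.

```python
from typing import List


def stuck_listeners(prev: dict, curr: dict) -> List[str]:
    """Names of PLCs whose listener was not bound in two consecutive snapshots."""
    listeners = curr["listeners"]
    if listeners["bound"] >= listeners["configured"]:
        return []
    not_bound_before = {
        p["name"] for p in prev["plcs"] if p["listener"]["state"] != "bound"
    }
    return [
        p["name"]
        for p in curr["plcs"]
        if p["listener"]["state"] != "bound" and p["name"] in not_bound_before
    ]


def new_reload_rejections(prev: dict, curr: dict) -> int:
    """Rejected reloads observed between the two snapshots."""
    before = prev["service"]["configReloadRejectedCount"]
    after = curr["service"]["configReloadRejectedCount"]
    # Assumption: a lower value than before means the process restarted
    # and the counter started again from zero.
    return after if after < before else after - before
```

Requiring the same PLC to be unbound in both snapshots filters out a listener that was caught mid-recovery during a single refresh.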

Per-PLC Fields

Each entry in plcs[] is a PlcStatus (see src/Mbproxy/Admin/StatusDto.cs). The builder iterates opts.Plcs in configured order, looks up the matching supervisor in ProxyWorker.Supervisors, and projects the supervisor's CurrentCounters.Snapshot() into wire fields.

Identity

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| name | string | PlcOptions.Name | Stable identifier from appsettings.json. Used as the dictionary key for supervisor lookup. |
| host | string | PlcOptions.Host | Backend PLC host (IP or DNS name) the proxy connects out to. |
| listenPort | int | PlcOptions.ListenPort | Local TCP port the proxy binds for upstream clients connecting to the proxy. |

Listener state

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| listener.state | string | SupervisorSnapshot.State mapped to "bound" / "recovering" / "stopped" | Current supervisor state. bound = TCP listener is accepting connections; recovering = Polly retry loop is trying to re-bind after a fault; stopped = no supervisor entry (typically a PLC that was just added and not yet started). |
| listener.lastBindError | string? | SupervisorSnapshot.LastBindError | Message from the last bind exception. Populated whenever state == "recovering". Common values: "Address already in use", "Permission denied". |
| listener.recoveryAttempts | int | SupervisorSnapshot.RecoveryAttempts | Number of bind retries since the supervisor entered recovery. Resets on a successful bind. A monotonically rising value indicates the underlying problem is persistent. |

Client tracking

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| clients.connected | int | clientSnapshots.Count | Number of currently-connected upstream clients. Capped by the H2-ECOM100 four-client ceiling; values at 4 imply additional upstream connect attempts will be refused by the PLC. |
| clients.remoteEndpoints[].remote | string | UpstreamPipe.RemoteEp | Upstream TCP endpoint as ip:port. |
| clients.remoteEndpoints[].connectedAtUtc | DateTimeOffset | UpstreamPipe.ConnectedAtUtc | Wall-clock time the upstream socket was accepted. Useful for spotting zombie sockets that survived a network outage. |
| clients.remoteEndpoints[].pdusForwarded | long | UpstreamPipe.PdusForwardedCount | PDUs forwarded on this specific upstream pipe since it connected. Lets you see which client is responsible for what fraction of fleet traffic. |

PDU traffic

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| pdus.forwarded | long | CounterSnapshot.PdusForwarded | Total PDUs (requests + responses) that traversed the proxy for this PLC since start. Increments once per PDU handed to the rewriter. |
| pdus.byFc.fc03 | long | CounterSnapshot.Fc03 | Count of FC03 (read holding registers) requests seen. |
| pdus.byFc.fc04 | long | CounterSnapshot.Fc04 | Count of FC04 (read input registers) requests seen. |
| pdus.byFc.fc06 | long | CounterSnapshot.Fc06 | Count of FC06 (write single register) requests seen. |
| pdus.byFc.fc16 | long | CounterSnapshot.Fc16 | Count of FC16 (write multiple registers) requests seen. |
| pdus.byFc.other | long | CounterSnapshot.FcOther | All other function codes (FC01/02/05/15, diagnostic codes, etc.) seen. The proxy forwards these untouched. |
| pdus.rewrittenSlots | long | CounterSnapshot.RewrittenSlots | Number of register slots the BCD rewriter touched, counting reads and writes. Indicates how much of the traffic actually hits BCD-configured addresses. See ../Features/BcdRewriting.md. |
| pdus.partialBcdWarnings | long | CounterSnapshot.PartialBcdWarnings | Count of requests whose [start, qty) range partially overlapped a 32-bit BCD tag without fully covering its CDAB word pair. A rising value here is an operator signal: an upstream client is requesting partial-overlap reads, which the proxy cannot rewrite safely. Review tag-list addresses or fix the client's request shape. |
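
The partial-overlap condition behind partialBcdWarnings can be illustrated with a few lines of Python. This is a sketch of the condition as described in the table, not the proxy's actual check: the function name is hypothetical, and it assumes a 32-bit BCD tag occupies two consecutive registers starting at tag_addr.

```python
def partially_overlaps(start: int, qty: int, tag_addr: int) -> bool:
    """True when a request touches exactly one word of a two-word BCD tag.

    The request covers registers [start, start + qty); the tag is assumed
    to cover [tag_addr, tag_addr + 2).
    """
    req_end = start + qty
    tag_end = tag_addr + 2
    covered = max(0, min(req_end, tag_end) - max(start, tag_addr))
    return covered == 1
```

For a tag at register 100, a two-register read starting at 100 covers the whole pair, while a two-register read starting at 99 or a one-register read at 100 covers only one word and would count as partial.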

Backend health

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| backend.connectsSuccess | long | CounterSnapshot.ConnectsSuccess | Successful backend TCP connects since start. Increments once per accepted upstream client (the proxy opens one backend socket per upstream client). |
| backend.connectsFailed | long | CounterSnapshot.ConnectsFailed | Failed backend TCP connects after the Polly retry budget is exhausted (3 attempts at 100/500/2000 ms). A rising counter means the backend host is unreachable or the PLC is at its connection cap. |
| backend.exceptionsByCode.code01 | long | CounterSnapshot.BackendException01 | Count of Modbus exception responses with code 01 (Illegal Function) received from the PLC. Typically indicates a client is sending function codes the PLC does not support. |
| backend.exceptionsByCode.code02 | long | CounterSnapshot.BackendException02 | Code 02 (Illegal Data Address): the requested register range is out of the PLC's V-memory map. |
| backend.exceptionsByCode.code03 | long | CounterSnapshot.BackendException03 | Code 03 (Illegal Data Value): quantity exceeds the PLC's per-FC cap (FC03/04 = 128 registers, FC16 = 100). |
| backend.exceptionsByCode.code04 | long | CounterSnapshot.BackendException04 | Code 04 (Server Device Failure): internal PLC fault, often correlated with the PLC entering STOP mode. |
| backend.lastRoundTripMs | double | CounterSnapshot.LastRoundTripMs | Exponentially-weighted moving average of recent successful request → response round-trip times in milliseconds. Tracks PLC responsiveness; sustained values above the historical baseline indicate backend latency degradation. |
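
For readers unfamiliar with the averaging behind lastRoundTripMs, an exponentially-weighted moving average update looks roughly like this. The smoothing factor of 0.2 and the choice to seed the average with the first sample are assumptions for illustration; this document does not state the values the service uses.

```python
from typing import Optional


def ewma_update(current: Optional[float], sample_ms: float, alpha: float = 0.2) -> float:
    """Fold one round-trip sample into the running average.

    alpha is the weight given to the newest sample (assumed value).
    """
    if current is None:
        # Assumption: the first sample seeds the average.
        return sample_ms
    return alpha * sample_ms + (1.0 - alpha) * current
```

With these assumed settings, an average of 10 ms followed by a 20 ms sample moves to 12 ms, so a single slow response nudges the value while a sustained slowdown pulls it steadily upward.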

Multiplexer state

These five fields describe the per-PLC backend multiplexer. See ../Architecture/ConnectionModel.md for the design rationale and how transaction-id (TxId) reuse and queueing work.

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| backend.inFlight | long | CounterSnapshot.InFlightCount | Number of MBAP transactions currently in flight on the backend socket (request sent, response pending). |
| backend.maxInFlight | long | CounterSnapshot.MaxInFlight | High-water mark of inFlight since start. Used to size the queue and to verify the multiplexer is in fact pipelining requests. |
| backend.txIdWraps | long | CounterSnapshot.TxIdWraps | Times the 16-bit MBAP transaction-id allocator has wrapped through 0xFFFF. A rising rate quantifies sustained request volume. |
| backend.disconnectCascades | long | CounterSnapshot.BackendDisconnectCascades | Times a backend disconnect cascaded into closing all upstream pipes that were waiting on in-flight TxIds. Each cascade aborts every queued request bound for that PLC. |
| backend.queueDepth | long | CounterSnapshot.BackendQueueDepth | Current count of requests queued behind the multiplexer's TxId allocator and write semaphore. A sustained non-zero queue means the multiplexer is the bottleneck (backend slower than upstream demand). |
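
To make txIdWraps concrete, here is a minimal Python sketch of a 16-bit sequential allocator that counts wraps. The class name and the purely sequential allocation are assumptions; the real allocator is described in ../Architecture/ConnectionModel.md and may, for example, skip identifiers that are still in flight.

```python
class TxIdAllocator:
    """Hands out 16-bit transaction ids in sequence and counts wraps."""

    def __init__(self) -> None:
        self._next = 0
        self.wraps = 0

    def allocate(self) -> int:
        tx_id = self._next
        if tx_id == 0xFFFF:
            # The id space is exhausted; start again at zero.
            self._next = 0
            self.wraps += 1
        else:
            self._next = tx_id + 1
        return tx_id
```

Under this sequential assumption each wrap corresponds to 65,536 allocated ids, which is why the wrap rate works as a coarse measure of request volume.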

Coalescing counters

These fields describe duplicate-read coalescing on FC03/FC04. See ../Architecture/ReadCoalescing.md for the matching criteria and lifecycle.

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| backend.coalescedHitCount | long | CounterSnapshot.CoalescedHitCount | Reads that attached to an already-in-flight identical read instead of issuing a new backend request. |
| backend.coalescedMissCount | long | CounterSnapshot.CoalescedMissCount | Reads that did not find a matching in-flight request and issued their own. The dashboard-side ratio is hit / (hit + miss); the wire format intentionally does not carry the derived ratio (consumers compute it). |
| backend.coalescedResponseToDeadUpstream | long | CounterSnapshot.CoalescedResponseToDeadUpstream | Coalesced responses that arrived after their attached upstream pipe had closed. Normal in bursty traffic; sustained growth indicates upstream clients are aborting too quickly. |

Cache counters

These fields describe the short-TTL response cache for FC03/FC04. See ../Architecture/ResponseCache.md.

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| backend.cacheHitCount | long | CounterSnapshot.CacheHitCount | Reads served from the cache without touching the backend at all. |
| backend.cacheMissCount | long | CounterSnapshot.CacheMissCount | Cache-eligible reads that fell through to the backend. The derived cacheHitRatio is hit / (hit + miss); like coalescing, it is not carried on the wire. |
| backend.cacheInvalidations | long | CounterSnapshot.CacheInvalidations | Times a write (FC06/FC16) invalidated overlapping cache entries on this PLC. A high invalidation rate relative to writes means write coverage is broad and the cache is doing less work. |
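
Because neither derived ratio is on the wire, consumers compute hit / (hit + miss) themselves. A minimal Python sketch follows; the function names are hypothetical, and returning None when no reads have been counted is a choice made here to avoid reporting a misleading 0%.

```python
from typing import Optional


def hit_ratio(hits: int, misses: int) -> Optional[float]:
    """hit / (hit + miss), or None when nothing has been counted yet."""
    total = hits + misses
    return hits / total if total > 0 else None


def plc_ratios(plc: dict) -> dict:
    """Coalescing and cache hit ratios for one entry of plcs[]."""
    backend = plc["backend"]
    return {
        "coalescing": hit_ratio(
            backend["coalescedHitCount"], backend["coalescedMissCount"]
        ),
        "cache": hit_ratio(backend["cacheHitCount"], backend["cacheMissCount"]),
    }
```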

Cache memory-watch

These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is per-PLC; the dashboard aggregates these across the fleet.

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| backend.cacheEntryCount | long | CounterSnapshot.CacheEntryCount | Current number of cached response entries for this PLC. |
| backend.cacheBytes | long | CounterSnapshot.CacheBytes | Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client. |
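
A scraper can reproduce the fleet-wide aggregation by summing the two per-PLC fields. The sketch below is a minimal Python illustration; the function names and the budget_bytes threshold are hypothetical, since this document does not define a specific memory budget.

```python
def fleet_cache_totals(snap: dict) -> dict:
    """Sum cache entry counts and byte costs across every PLC in a snapshot."""
    entries = sum(p["backend"]["cacheEntryCount"] for p in snap["plcs"])
    total_bytes = sum(p["backend"]["cacheBytes"] for p in snap["plcs"])
    return {"entries": entries, "bytes": total_bytes}


def cache_over_budget(snap: dict, budget_bytes: int) -> bool:
    """True when the fleet-wide cache byte cost exceeds the given budget."""
    return fleet_cache_totals(snap)["bytes"] > budget_bytes
```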

Bytes

| JSON path | Type | Source | Meaning |
| --- | --- | --- | --- |
| bytes.upstreamIn | long | CounterSnapshot.BytesUpstreamIn | Total bytes read from upstream client sockets bound to this PLC since start. |
| bytes.upstreamOut | long | CounterSnapshot.BytesUpstreamOut | Total bytes written back to upstream client sockets bound to this PLC since start. |

Counter Atomicity

All counters are long values updated through System.Threading.Interlocked. Each read in StatusSnapshotBuilder.Build() is atomic per field; no locks are held across the snapshot build, and the build itself does no I/O.

The practical consequence: a single /status.json request returns a coherent value for any one counter, but the assembled response is not a globally consistent snapshot — different per-PLC counters may straddle increments by microseconds. For example, pdus.forwarded for PLC A and pdus.forwarded for PLC B are not guaranteed to reflect the same instant. This is acceptable for dashboards and rate calculations; do not use these counters for fine-grained accounting.
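
A rate calculation that tolerates this kind of skew can be sketched in Python as follows. The function name is hypothetical, and skipping a PLC whose counter went backwards is an assumption about how a scraper might treat restarts, not documented service behaviour.

```python
from typing import Dict


def pdu_rates(prev: dict, curr: dict) -> Dict[str, float]:
    """PDUs per second for each PLC between two /status.json snapshots."""
    elapsed = curr["service"]["uptimeSeconds"] - prev["service"]["uptimeSeconds"]
    if elapsed <= 0:
        # Same second, or the process restarted between scrapes.
        return {}
    prev_by_name = {p["name"]: p for p in prev["plcs"]}
    rates = {}
    for plc in curr["plcs"]:
        before = prev_by_name.get(plc["name"])
        if before is None:
            continue  # PLC was added between the two scrapes
        delta = plc["pdus"]["forwarded"] - before["pdus"]["forwarded"]
        if delta < 0:
            continue  # counter went backwards; skip rather than guess
        rates[plc["name"]] = delta / elapsed
    return rates
```

Over a scrape interval of several seconds, a skew of a few microseconds between counters changes the computed rate by a negligible amount, which is why these counters suit dashboards but not exact accounting.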

Example JSON Response

A representative two-PLC deployment, ~2 hours into a run:

```json
{
  "service": {
    "uptimeSeconds": 7234,
    "version": "1.0.0",
    "configLastReloadUtc": "2026-05-13T14:02:11+00:00",
    "configReloadCount": 2,
    "configReloadRejectedCount": 0
  },
  "listeners": {
    "bound": 1,
    "configured": 2
  },
  "plcs": [
    {
      "name": "line1-press",
      "host": "10.20.30.41",
      "listenPort": 5021,
      "listener": {
        "state": "bound",
        "lastBindError": null,
        "recoveryAttempts": 0
      },
      "clients": {
        "connected": 2,
        "remoteEndpoints": [
          {
            "remote": "10.20.40.10:51223",
            "connectedAtUtc": "2026-05-13T12:01:55+00:00",
            "pdusForwarded": 184213
          },
          {
            "remote": "10.20.40.11:53901",
            "connectedAtUtc": "2026-05-13T13:30:02+00:00",
            "pdusForwarded": 41008
          }
        ]
      },
      "pdus": {
        "forwarded": 225221,
        "byFc": {
          "fc03": 218904,
          "fc04": 0,
          "fc06": 12,
          "fc16": 6203,
          "other": 102
        },
        "rewrittenSlots": 1318622,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 2,
        "connectsFailed": 0,
        "exceptionsByCode": {
          "code01": 0,
          "code02": 14,
          "code03": 0,
          "code04": 0
        },
        "lastRoundTripMs": 12.4,
        "inFlight": 1,
        "maxInFlight": 4,
        "txIdWraps": 3,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 41892,
        "coalescedMissCount": 177012,
        "coalescedResponseToDeadUpstream": 7,
        "cacheHitCount": 88321,
        "cacheMissCount": 88691,
        "cacheInvalidations": 6203,
        "cacheEntryCount": 47,
        "cacheBytes": 18512
      },
      "bytes": {
        "upstreamIn": 4108290,
        "upstreamOut": 12993021
      }
    },
    {
      "name": "line2-oven",
      "host": "10.20.30.42",
      "listenPort": 5022,
      "listener": {
        "state": "recovering",
        "lastBindError": "Address already in use",
        "recoveryAttempts": 12
      },
      "clients": {
        "connected": 0,
        "remoteEndpoints": []
      },
      "pdus": {
        "forwarded": 0,
        "byFc": { "fc03": 0, "fc04": 0, "fc06": 0, "fc16": 0, "other": 0 },
        "rewrittenSlots": 0,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 0,
        "connectsFailed": 0,
        "exceptionsByCode": { "code01": 0, "code02": 0, "code03": 0, "code04": 0 },
        "lastRoundTripMs": 0.0,
        "inFlight": 0,
        "maxInFlight": 0,
        "txIdWraps": 0,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 0,
        "coalescedMissCount": 0,
        "coalescedResponseToDeadUpstream": 0,
        "cacheHitCount": 0,
        "cacheMissCount": 0,
        "cacheInvalidations": 0,
        "cacheEntryCount": 0,
        "cacheBytes": 0
      },
      "bytes": { "upstreamIn": 0, "upstreamOut": 0 }
    }
  ]
}
```

HTML Page Layout

The HTML renderer is StatusHtmlRenderer.Render(StatusResponse) in src/Mbproxy/Admin/StatusHtmlRenderer.cs. The page is one document, inline CSS in a <style> block, no external resources of any kind — operators can serve it behind a corporate firewall without whitelisting a CDN.

Structure:

  1. Header summary — version, formatted uptime (Nh MMm SSs), bound/configured listener tally, last reload timestamp, reload count with a (N rejected) suffix when applicable.
  2. PLC table — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — bound = green, recovering = orange, stopped = grey), Clients (count plus a comma-separated list of remote (N PDUs)), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
  3. State cell error detail — when state == "recovering", the cell also shows lastBindError and (attempt N) in a small red span.
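
The uptime format in the header summary (Nh MMm SSs) can be reproduced outside the service with a short Python sketch. The function name is hypothetical, and leaving the hour count unpadded with no rollover into days is an assumption read from the format string, not confirmed against StatusHtmlRenderer.

```python
def format_uptime(total_seconds: int) -> str:
    """Render seconds as 'Nh MMm SSs', e.g. for service.uptimeSeconds."""
    hours, remainder = divmod(total_seconds, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"
```

For the example response in this document, format_uptime(7234) yields "2h 00m 34s".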

The coalescing and cache cells each render as <pct>% (<hits>). When neither has been exercised (hit + miss == 0), the cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).

The page does not depend on JavaScript. Refresh is driven entirely by the <meta http-equiv="refresh" content="5"> tag, so any browser — including text-mode browsers — sees the same view.

How to Scrape It

The JSON twin is plain HTTP. Any monitoring system that can curl an endpoint can scrape it.

PowerShell, pulling the cache hit ratio for the first PLC into a variable:

```powershell
# Fetch the JSON snapshot and parse it into an object.
$snap = Invoke-WebRequest -Uri "http://mbproxy-host:8080/status.json" -UseBasicParsing |
        Select-Object -ExpandProperty Content |
        ConvertFrom-Json

# Derive the cache hit ratio for the first PLC; the wire format carries only raw counts.
$plc = $snap.plcs[0]
$hits  = $plc.backend.cacheHitCount
$total = $hits + $plc.backend.cacheMissCount
$ratio = if ($total -gt 0) { [math]::Round(100.0 * $hits / $total, 1) } else { 0.0 }

"PLC $($plc.name): cache hit ratio = $ratio% over $total reads"
```

Bash with curl and jq, fanning out across the fleet:

```bash
# Print name, listener state, and round-trip time for every PLC, tab-separated.
curl -s http://mbproxy-host:8080/status.json |
  jq -r '.plcs[] | "\(.name)\t\(.listener.state)\t\(.backend.lastRoundTripMs)"'
```

Prometheus-style scrapers should poll /status.json directly and translate fields into their own metric names; the service does not expose Prometheus exposition format.

Scope of This Document

This document covers the endpoint surface: what is on the wire and how each field is computed. When a new counter is added, list it here.