Status Page

The status page is the operator-facing view of the running service: an auto-refreshing HTML dashboard at GET / and a JSON twin at GET /status.json that monitoring scrapers consume. This document describes the endpoint surface, every wire-level field, and how counters map back to architecture decisions.

Endpoint Surface

The admin endpoint is owned by AdminEndpointHost (see src/Mbproxy/Admin/AdminEndpointHost.cs). It exposes exactly two routes:

GET / — a single self-contained HTML document with a <meta http-equiv="refresh" content="5"> tag. The page refreshes every five seconds by reload, not by JavaScript polling. There is no JS bundle, no external CSS, no remote fonts, and no favicon fetch.
GET /status.json — the same in-memory snapshot serialized as JSON via the source-generated StatusJsonContext (camelCase property names).

The endpoint is read-only. There are no admin actions exposed — no kick-client, no force-reload, no listener restart, no log download. Reload happens automatically via IOptionsMonitor; listener recovery is owned by the supervisor. Authentication lives at the network layer: the service binds to IPAddress.Any on the admin port and assumes the deployment runs in a trusted internal segment behind a firewall.

Both routes call StatusSnapshotBuilder.Build() for every request. The builder reads atomic counters directly from the supervisor map and per-PLC ProxyCounters; it holds no locks and performs no I/O.
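A minimal client for the JSON route might look like the sketch below. It assumes only what is stated above (plain HTTP, `GET /status.json`, default port 8080); the function names and the `mbproxy-host` hostname are illustrative, not part of the service.

```python
import json
import urllib.request


def status_url(host: str, port: int = 8080) -> str:
    """Build the /status.json URL; 8080 is the documented default admin port."""
    return f"http://{host}:{port}/status.json"


def parse_status(body: bytes) -> dict:
    """Decode a /status.json response body into a plain dict."""
    return json.loads(body.decode("utf-8"))


def fetch_status(host: str, port: int = 8080, timeout: float = 5.0) -> dict:
    """GET /status.json once.

    Raises OSError (including urllib.error.URLError) when the admin
    endpoint cannot be reached.
    """
    with urllib.request.urlopen(status_url(host, port), timeout=timeout) as resp:
        return parse_status(resp.read())
```

Usage would be something like `snapshot = fetch_status("mbproxy-host")`. Because the endpoint is read-only, polling it repeatedly has no side effects on the service.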

Port and Configuration

The listen port is read from Mbproxy.AdminPort and defaults to 8080. Configuration semantics for this key live in ./Configuration.md.

If Kestrel cannot bind the configured port at startup (port already in use, missing permissions on a reserved range, etc.) the host logs mbproxy.admin.bind.failed at Error level with the underlying reason. The host then sets _app = null and returns — the rest of the service keeps running. The Modbus listener supervisors are completely independent of the admin endpoint, so a bind failure here is non-fatal for proxying. See ../Reference/LogEvents.md for the event-id catalogue.

If Mbproxy.AdminPort changes via hot-reload, the currently-running Kestrel app is stopped (2 s deadline) and a new one is started on the new port. Other config changes do not touch the admin endpoint.

Service-Wide Fields

Top-level fields come from ServiceFields and ListenersAggregate in src/Mbproxy/Admin/StatusDto.cs.

JSON path	Type	Source	Meaning
`service.uptimeSeconds`	`long`	`ServiceFields.UptimeSeconds`	Seconds since process start, computed as `now - ServiceCounters.StartedAtUtc` at snapshot time.
`service.version`	`string`	`ServiceFields.Version` via `AssemblyVersionAccessor`	`AssemblyInformationalVersion` of the running assembly. Useful for confirming a deployment took effect.
`service.configLastReloadUtc`	`DateTimeOffset?`	`ServiceCounters.LastReloadUtc`	Wall-clock time of the most recent accepted hot-reload. `null` if no reload has occurred since process start. See `../Features/HotReload.md`.
`service.configReloadCount`	`int`	`ServiceCounters.ReloadAppliedCount`	Number of `appsettings.json` reloads that validated and applied since process start.
`service.configReloadRejectedCount`	`int`	`ServiceCounters.ReloadRejectedCount`	Number of reload attempts rejected by validation. A non-zero value here paired with a stale `configLastReloadUtc` indicates the operator's last edit was malformed and the service is still running the previous config.
`listeners.bound`	`int`	`boundCount` accumulated while iterating `opts.Plcs`	Count of PLC entries whose supervisor currently reports `SupervisorState.Bound`.
`listeners.configured`	`int`	`opts.Plcs.Count`	Total number of PLC entries in the active configuration.

Operator triggers:

listeners.bound < listeners.configured for more than one refresh cycle indicates one or more listeners are stuck recovering. Drill into the per-PLC listener.state and listener.lastBindError fields below.
configReloadRejectedCount rising means edits are reaching the watcher but failing validation — check the live log for mbproxy.config.reload.rejected.
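The two triggers above can be checked mechanically by comparing consecutive snapshots. The following is a sketch only: it treats "more than one refresh cycle" as "short in two consecutive polls", and the function names and alert wording are assumptions made for illustration.

```python
def unbound_listeners(snapshot: dict) -> list:
    """Return (name, state, lastBindError, recoveryAttempts) for each PLC not in the 'bound' state."""
    return [
        (
            plc["name"],
            plc["listener"]["state"],
            plc["listener"]["lastBindError"],
            plc["listener"]["recoveryAttempts"],
        )
        for plc in snapshot["plcs"]
        if plc["listener"]["state"] != "bound"
    ]


def check_triggers(prev: dict, curr: dict) -> list:
    """Compare two consecutive /status.json snapshots and return alert strings."""

    def short(snapshot: dict) -> bool:
        listeners = snapshot["listeners"]
        return listeners["bound"] < listeners["configured"]

    alerts = []
    # Trigger 1: listeners still short of configured on two polls in a row.
    if short(prev) and short(curr):
        for name, state, err, attempts in unbound_listeners(curr):
            alerts.append(f"{name}: {state}, attempt {attempts}, last error: {err}")
    # Trigger 2: the rejected-reload counter rose between the two polls.
    before = prev["service"]["configReloadRejectedCount"]
    after = curr["service"]["configReloadRejectedCount"]
    if after > before:
        alerts.append("config reload rejected; check the log for mbproxy.config.reload.rejected")
    return alerts
```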

Per-PLC Fields

Each entry in plcs[] is a PlcStatus (see src/Mbproxy/Admin/StatusDto.cs). The builder iterates opts.Plcs in configured order, looks up the matching supervisor in ProxyWorker.Supervisors, and projects the supervisor's CurrentCounters.Snapshot() into wire fields.

Identity

JSON path	Type	Source	Meaning
`name`	`string`	`PlcOptions.Name`	Stable identifier from `appsettings.json`. Used as the dictionary key for supervisor lookup.
`host`	`string`	`PlcOptions.Host`	Backend PLC host (IP or DNS name) the proxy connects out to.
`listenPort`	`int`	`PlcOptions.ListenPort`	Local TCP port the proxy binds for upstream clients connecting to the proxy.

Listener state

JSON path	Type	Source	Meaning
`listener.state`	`string`	`SupervisorSnapshot.State` mapped to `"bound"` / `"recovering"` / `"stopped"`	Current supervisor state. `bound` = TCP listener is accepting connections; `recovering` = Polly retry loop is trying to re-bind after a fault; `stopped` = no supervisor entry (typically a PLC that was just added and not yet started).
`listener.lastBindError`	`string?`	`SupervisorSnapshot.LastBindError`	Message from the last bind exception. Populated whenever `state == "recovering"`. Common values: `"Address already in use"`, `"Permission denied"`.
`listener.recoveryAttempts`	`int`	`SupervisorSnapshot.RecoveryAttempts`	Number of bind retries since the supervisor entered recovery. Resets on a successful bind. A monotonically rising value indicates the underlying problem is persistent.

Client tracking

JSON path	Type	Source	Meaning
`clients.connected`	`int`	`clientSnapshots.Count`	Number of currently-connected upstream clients. Capped by the H2-ECOM100 four-client ceiling; values at 4 imply additional upstream connect attempts will be refused by the PLC.
`clients.remoteEndpoints[].remote`	`string`	`UpstreamPipe.RemoteEp`	Upstream TCP endpoint as `ip:port`.
`clients.remoteEndpoints[].connectedAtUtc`	`DateTimeOffset`	`UpstreamPipe.ConnectedAtUtc`	Wall-clock time the upstream socket was accepted. Useful for spotting zombie sockets that survived a network outage.
`clients.remoteEndpoints[].pdusForwarded`	`long`	`UpstreamPipe.PdusForwardedCount`	PDUs forwarded on this specific upstream pipe since it connected. Shows which client accounts for what fraction of this PLC's traffic.
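As an illustration of how a consumer might use the client-tracking fields, the sketch below computes each connected client's share of the PDUs forwarded by currently connected clients and flags a PLC that has reached the four-client ceiling. The helper names and the `MAX_PLC_CLIENTS` constant are hypothetical; the ceiling value is the one quoted in the table.

```python
MAX_PLC_CLIENTS = 4  # H2-ECOM100 client ceiling quoted in the table above


def client_shares(plc: dict) -> dict:
    """Map each remote endpoint to its fraction of PDUs among connected clients."""
    endpoints = plc["clients"]["remoteEndpoints"]
    total = sum(e["pdusForwarded"] for e in endpoints)
    if total == 0:
        # No traffic yet: report zero for everyone rather than dividing by zero.
        return {e["remote"]: 0.0 for e in endpoints}
    return {e["remote"]: e["pdusForwarded"] / total for e in endpoints}


def at_client_cap(plc: dict) -> bool:
    """True when further upstream connects are likely to be refused by the PLC."""
    return plc["clients"]["connected"] >= MAX_PLC_CLIENTS
```

Note that the shares cover only clients that are still connected; PDUs from clients that have since disconnected are in `pdus.forwarded` but not in any `remoteEndpoints[]` entry.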

PDU traffic

JSON path	Type	Source	Meaning
`pdus.forwarded`	`long`	`CounterSnapshot.PdusForwarded`	Total PDUs (requests + responses) that traversed the proxy for this PLC since start. Increments once per PDU handed to the rewriter.
`pdus.byFc.fc03`	`long`	`CounterSnapshot.Fc03`	Count of FC03 (read holding registers) requests seen.
`pdus.byFc.fc04`	`long`	`CounterSnapshot.Fc04`	Count of FC04 (read input registers) requests seen.
`pdus.byFc.fc06`	`long`	`CounterSnapshot.Fc06`	Count of FC06 (write single register) requests seen.
`pdus.byFc.fc16`	`long`	`CounterSnapshot.Fc16`	Count of FC16 (write multiple registers) requests seen.
`pdus.byFc.other`	`long`	`CounterSnapshot.FcOther`	All other function codes (FC01/02/05/15, diagnostic codes, etc.) seen. The proxy forwards these untouched.
`pdus.rewrittenSlots`	`long`	`CounterSnapshot.RewrittenSlots`	Number of register slots the BCD rewriter touched, counting reads and writes. Indicates how much of the traffic actually hits BCD-configured addresses. See `../Features/BcdRewriting.md`.
`pdus.partialBcdWarnings`	`long`	`CounterSnapshot.PartialBcdWarnings`	Count of requests whose `[start, start + qty)` register range partially overlapped a 32-bit BCD tag without fully covering its CDAB word pair. A rising value is an operator signal: an upstream client is requesting partial-overlap reads, which the proxy cannot rewrite safely. Review the tag-list addresses or fix the client's request shape.
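To make the partial-overlap condition concrete, here is a small sketch of the range test. It assumes a 32-bit BCD tag occupies two consecutive registers, `tag_addr` and `tag_addr + 1`; the function is illustrative and is not taken from the rewriter's source.

```python
def partially_overlaps(start: int, qty: int, tag_addr: int) -> bool:
    """True when a request covers exactly one word of a two-word tag.

    The request spans registers start .. start + qty - 1. Covering both
    words (rewritable) or neither word (irrelevant) is not a partial overlap.
    """
    end = start + qty  # exclusive upper bound
    low_in = start <= tag_addr < end
    high_in = start <= tag_addr + 1 < end
    return low_in != high_in
```

For a tag at address 100, a request for registers 98 to 100 covers only the first word and would count as partial, while a request for 100 to 101 covers the full pair and would not.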

Backend health

JSON path	Type	Source	Meaning
`backend.connectsSuccess`	`long`	`CounterSnapshot.ConnectsSuccess`	Successful backend TCP connects since start. Increments once per accepted upstream client (the proxy opens one backend socket per upstream client).
`backend.connectsFailed`	`long`	`CounterSnapshot.ConnectsFailed`	Failed backend TCP connects after the Polly retry budget is exhausted (3 attempts at 100/500/2000 ms). A rising counter means the backend host is unreachable or the PLC is at its connection cap.
`backend.exceptionsByCode.code01`	`long`	`CounterSnapshot.BackendException01`	Count of Modbus exception responses with code 01 (Illegal Function) received from the PLC. Typically indicates a client is sending function codes the PLC does not support.
`backend.exceptionsByCode.code02`	`long`	`CounterSnapshot.BackendException02`	Code 02 (Illegal Data Address) — the requested register range is out of the PLC's V-memory map.
`backend.exceptionsByCode.code03`	`long`	`CounterSnapshot.BackendException03`	Code 03 (Illegal Data Value) — quantity exceeds the PLC's per-FC cap (FC03/04 = 128 registers, FC16 = 100).
`backend.exceptionsByCode.code04`	`long`	`CounterSnapshot.BackendException04`	Code 04 (Server Device Failure) — internal PLC fault, often correlated with the PLC entering STOP mode.
`backend.lastRoundTripMs`	`double`	`CounterSnapshot.LastRoundTripMs`	Exponentially-weighted moving average of recent successful request → response round-trip times in milliseconds. Tracks PLC responsiveness; sustained values above the historical baseline indicate backend latency degradation.
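For readers unfamiliar with the smoothing behind `backend.lastRoundTripMs`, the sketch below shows the general shape of an exponentially-weighted moving average. The smoothing factor `alpha = 0.2` and the seeding rule are assumptions chosen for illustration; the service's actual parameters are not documented here.

```python
def ewma_update(current, sample_ms: float, alpha: float = 0.2) -> float:
    """Fold one round-trip sample into an exponentially-weighted average.

    `current` is the previous average, or None before the first sample.
    Larger alpha weights recent samples more heavily.
    """
    if current is None:
        return sample_ms  # seed the average with the first observation
    return alpha * sample_ms + (1.0 - alpha) * current
```

With these assumed parameters, a single slow response moves the average only part of the way toward the new value, which is why sustained elevation is a more meaningful signal than a one-off spike.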

Multiplexer state

These five fields describe the per-PLC backend multiplexer. See ../Architecture/ConnectionModel.md for the design rationale and how transaction-id (TxId) reuse and queueing work.

JSON path	Type	Source	Meaning
`backend.inFlight`	`long`	`CounterSnapshot.InFlightCount`	Number of MBAP transactions currently in flight on the backend socket (request sent, response pending).
`backend.maxInFlight`	`long`	`CounterSnapshot.MaxInFlight`	High-water mark of `inFlight` since start. Used to size the queue and to verify the multiplexer is in fact pipelining requests.
`backend.txIdWraps`	`long`	`CounterSnapshot.TxIdWraps`	Times the 16-bit MBAP transaction-id allocator has wrapped through `0xFFFF`. A rising rate quantifies sustained request volume.
`backend.disconnectCascades`	`long`	`CounterSnapshot.BackendDisconnectCascades`	Times a backend disconnect cascaded into closing all upstream pipes that were waiting on in-flight TxIds. Each cascade aborts every queued request bound for that PLC.
`backend.queueDepth`	`long`	`CounterSnapshot.BackendQueueDepth`	Current count of requests queued behind the multiplexer's TxId allocator and write semaphore. A sustained non-zero queue means the multiplexer is the bottleneck (backend slower than upstream demand).
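A consumer that polls these fields could flag the two conditions the table calls out: a sustained non-zero queue, and evidence of pipelining. This is a rough sketch; the three-sample threshold and the helper names are assumptions.

```python
def sustained_queue(depth_samples: list, min_consecutive: int = 3) -> bool:
    """True if the most recent `min_consecutive` queueDepth samples are all non-zero."""
    if len(depth_samples) < min_consecutive:
        return False
    return all(depth > 0 for depth in depth_samples[-min_consecutive:])


def is_pipelining(backend: dict) -> bool:
    """True if more than one transaction has ever been in flight at once."""
    return backend["maxInFlight"] > 1
```

`depth_samples` is expected to hold `backend.queueDepth` values from successive polls of one PLC, oldest first.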

Coalescing counters

These fields describe duplicate-read coalescing on FC03/FC04. See ../Architecture/ReadCoalescing.md for the matching criteria and lifecycle.

JSON path	Type	Source	Meaning
`backend.coalescedHitCount`	`long`	`CounterSnapshot.CoalescedHitCount`	Reads that attached to an already-in-flight identical read instead of issuing a new backend request.
`backend.coalescedMissCount`	`long`	`CounterSnapshot.CoalescedMissCount`	Reads that did not find a matching in-flight request and issued their own. The dashboard-side ratio is `hit / (hit + miss)`; the wire format intentionally does not carry the derived ratio (consumers compute it).
`backend.coalescedResponseToDeadUpstream`	`long`	`CounterSnapshot.CoalescedResponseToDeadUpstream`	Coalesced responses that arrived after their attached upstream pipe had closed. Normal in bursty traffic; sustained growth indicates upstream clients are aborting too quickly.

Cache counters

These fields describe the short-TTL response cache for FC03/FC04. See ../Architecture/ResponseCache.md.

JSON path	Type	Source	Meaning
`backend.cacheHitCount`	`long`	`CounterSnapshot.CacheHitCount`	Reads served from the cache without touching the backend at all.
`backend.cacheMissCount`	`long`	`CounterSnapshot.CacheMissCount`	Cache-eligible reads that fell through to the backend. The derived `cacheHitRatio` is `hit / (hit + miss)`; like coalescing, it is not carried on the wire.
`backend.cacheInvalidations`	`long`	`CounterSnapshot.CacheInvalidations`	Times a write (FC06/FC16) invalidated overlapping cache entries on this PLC. A high invalidation rate relative to writes means write coverage is broad and the cache is doing less work.
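Since neither derived ratio is carried on the wire, consumers compute them from the raw counters. A minimal sketch, using the `hit / (hit + miss)` formula given in the tables (the helper names are illustrative):

```python
def ratio(hits: int, misses: int):
    """Return hits / (hits + misses), or None when nothing has been counted yet."""
    total = hits + misses
    return hits / total if total else None


def read_ratios(backend: dict) -> dict:
    """Derive the coalescing and cache hit ratios from one PLC's `backend` object."""
    return {
        "coalesced": ratio(backend["coalescedHitCount"], backend["coalescedMissCount"]),
        "cache": ratio(backend["cacheHitCount"], backend["cacheMissCount"]),
    }
```

Returning `None` for an unexercised counter pair keeps "no data" distinguishable from a genuine 0% hit ratio.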

Cache memory-watch

These two fields are Tier-2 KPIs intended for memory-budget alerts. The cache is per-PLC; the dashboard aggregates these across the fleet.

JSON path	Type	Source	Meaning
`backend.cacheEntryCount`	`long`	`CounterSnapshot.CacheEntryCount`	Current number of cached response entries for this PLC.
`backend.cacheBytes`	`long`	`CounterSnapshot.CacheBytes`	Approximate byte cost of the cache entries (response payloads plus key overhead). Used to detect runaway growth from a chatty client.
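The fleet-wide aggregation mentioned above is a plain sum over `plcs[]`. The sketch below shows one way to do it from a parsed snapshot; the budget threshold is a hypothetical parameter, since no specific memory budget is stated in this document.

```python
def fleet_cache_totals(snapshot: dict) -> tuple:
    """Sum cacheEntryCount and cacheBytes across every PLC in the snapshot."""
    entries = sum(plc["backend"]["cacheEntryCount"] for plc in snapshot["plcs"])
    nbytes = sum(plc["backend"]["cacheBytes"] for plc in snapshot["plcs"])
    return entries, nbytes


def cache_over_budget(snapshot: dict, budget_bytes: int) -> bool:
    """True when the approximate fleet-wide cache size exceeds the caller's budget."""
    _, nbytes = fleet_cache_totals(snapshot)
    return nbytes > budget_bytes
```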

Bytes

JSON path	Type	Source	Meaning
`bytes.upstreamIn`	`long`	`CounterSnapshot.BytesUpstreamIn`	Total bytes read from upstream client sockets bound to this PLC since start.
`bytes.upstreamOut`	`long`	`CounterSnapshot.BytesUpstreamOut`	Total bytes written back to upstream client sockets bound to this PLC since start.

Counter Atomicity

All counters are long fields updated via System.Threading.Interlocked. Each read in StatusSnapshotBuilder.Build() is atomic per field; no locks are held across the snapshot build, and the build itself does no I/O.

The practical consequence: a single /status.json request returns a coherent value for any one counter, but the assembled response is not a globally consistent snapshot — different per-PLC counters may straddle increments by microseconds. For example, pdus.forwarded for PLC A and pdus.forwarded for PLC B are not guaranteed to reflect the same instant. This is acceptable for dashboards and rate calculations; do not use these counters for fine-grained accounting.
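A rate calculation of the kind this section has in mind can be sketched as follows. It uses `service.uptimeSeconds` from the two snapshots as the clock and treats a drop in either uptime or the counter as a process restart; both choices are assumptions made for this example.

```python
def rate_per_second(prev_value: int, curr_value: int,
                    prev_uptime: int, curr_uptime: int):
    """Average per-second increase of a counter between two snapshots.

    Returns None when no rate can be derived: no time elapsed, or the
    service restarted in between (counters and uptime both start over).
    """
    elapsed = curr_uptime - prev_uptime
    if elapsed <= 0 or curr_value < prev_value:
        return None
    return (curr_value - prev_value) / elapsed
```

Because `uptimeSeconds` has one-second resolution, rates computed over very short polling intervals will be coarse; intervals of several refresh cycles give steadier numbers.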

Example JSON Response

A representative two-PLC deployment, ~2 hours into a run:

{
  "service": {
    "uptimeSeconds": 7234,
    "version": "1.0.0",
    "configLastReloadUtc": "2026-05-13T14:02:11+00:00",
    "configReloadCount": 2,
    "configReloadRejectedCount": 0
  },
  "listeners": {
    "bound": 1,
    "configured": 2
  },
  "plcs": [
    {
      "name": "line1-press",
      "host": "10.20.30.41",
      "listenPort": 5021,
      "listener": {
        "state": "bound",
        "lastBindError": null,
        "recoveryAttempts": 0
      },
      "clients": {
        "connected": 2,
        "remoteEndpoints": [
          {
            "remote": "10.20.40.10:51223",
            "connectedAtUtc": "2026-05-13T12:01:55+00:00",
            "pdusForwarded": 184213
          },
          {
            "remote": "10.20.40.11:53901",
            "connectedAtUtc": "2026-05-13T13:30:02+00:00",
            "pdusForwarded": 41008
          }
        ]
      },
      "pdus": {
        "forwarded": 225221,
        "byFc": {
          "fc03": 218904,
          "fc04": 0,
          "fc06": 12,
          "fc16": 6203,
          "other": 102
        },
        "rewrittenSlots": 1318622,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 2,
        "connectsFailed": 0,
        "exceptionsByCode": {
          "code01": 0,
          "code02": 14,
          "code03": 0,
          "code04": 0
        },
        "lastRoundTripMs": 12.4,
        "inFlight": 1,
        "maxInFlight": 4,
        "txIdWraps": 3,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 41892,
        "coalescedMissCount": 177012,
        "coalescedResponseToDeadUpstream": 7,
        "cacheHitCount": 88321,
        "cacheMissCount": 88691,
        "cacheInvalidations": 6203,
        "cacheEntryCount": 47,
        "cacheBytes": 18512
      },
      "bytes": {
        "upstreamIn": 4108290,
        "upstreamOut": 12993021
      }
    },
    {
      "name": "line2-oven",
      "host": "10.20.30.42",
      "listenPort": 5022,
      "listener": {
        "state": "recovering",
        "lastBindError": "Address already in use",
        "recoveryAttempts": 12
      },
      "clients": {
        "connected": 0,
        "remoteEndpoints": []
      },
      "pdus": {
        "forwarded": 0,
        "byFc": { "fc03": 0, "fc04": 0, "fc06": 0, "fc16": 0, "other": 0 },
        "rewrittenSlots": 0,
        "partialBcdWarnings": 0
      },
      "backend": {
        "connectsSuccess": 0,
        "connectsFailed": 0,
        "exceptionsByCode": { "code01": 0, "code02": 0, "code03": 0, "code04": 0 },
        "lastRoundTripMs": 0.0,
        "inFlight": 0,
        "maxInFlight": 0,
        "txIdWraps": 0,
        "disconnectCascades": 0,
        "queueDepth": 0,
        "coalescedHitCount": 0,
        "coalescedMissCount": 0,
        "coalescedResponseToDeadUpstream": 0,
        "cacheHitCount": 0,
        "cacheMissCount": 0,
        "cacheInvalidations": 0,
        "cacheEntryCount": 0,
        "cacheBytes": 0
      },
      "bytes": { "upstreamIn": 0, "upstreamOut": 0 }
    }
  ]
}

HTML Page Layout

The HTML renderer is StatusHtmlRenderer.Render(StatusResponse) in src/Mbproxy/Admin/StatusHtmlRenderer.cs. The page is one document, inline CSS in a <style> block, no external resources of any kind — operators can serve it behind a corporate firewall without whitelisting a CDN.

Structure:

Header summary — version, formatted uptime (Nh MMm SSs), bound/configured listener tally, last reload timestamp, reload count with a (N rejected) suffix when applicable.
PLC table — one row per configured PLC. Columns: Name, Host, Port, State (colour-coded — bound = green, recovering = orange, stopped = grey), Clients (count plus a comma-separated list of remote (N PDUs)), PDUs forwarded, FC03/FC04/FC06/FC16/FC? counts, BCD slots, Partial BCD, exception codes 01/02/03/04, RTT (ms), bytes in/out, multiplexer columns (in-flight, max in-flight, TxId wraps, cascades, queue), coalescing ratio cell, cache ratio cell.
State cell error detail — when state == "recovering", the cell also shows lastBindError and (attempt N) in a small red span.
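The `Nh MMm SSs` uptime format in the header summary can be reproduced on the consumer side from `service.uptimeSeconds`. This sketch assumes hours are unpadded and never roll over into days; the renderer's exact behaviour for long uptimes is not specified here.

```python
def format_uptime(uptime_seconds: int) -> str:
    """Format a second count as 'Nh MMm SSs' (minutes and seconds zero-padded)."""
    hours, remainder = divmod(int(uptime_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours}h {minutes:02d}m {seconds:02d}s"
```

For the uptime of 7234 seconds used in the example response, this yields `2h 00m 34s`.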

The coalescing and cache cells each render as <pct>% (<hits>). When neither has been exercised (hit + miss == 0), the cell renders an em-dash to keep the column narrow. Page weight is bounded by the design budget (≤ 50 KB for a 54-PLC fleet).
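The ratio-cell rule described above can be sketched like this. The one-decimal precision is an assumption; the text specifies only the `<pct>% (<hits>)` shape and the em-dash fallback.

```python
EM_DASH = "\u2014"  # shown when the feature has not been exercised


def format_ratio_cell(hits: int, misses: int) -> str:
    """Render '<pct>% (<hits>)', or an em dash when hits + misses == 0."""
    total = hits + misses
    if total == 0:
        return EM_DASH
    pct = 100.0 * hits / total
    return f"{pct:.1f}% ({hits})"
```

With the coalescing counters from the example response (41892 hits, 177012 misses), the cell would read `19.1% (41892)`.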

The page does not depend on JavaScript. Refresh is driven entirely by the <meta http-equiv="refresh" content="5"> tag, so any browser — including text-mode browsers — sees the same view.

How to Scrape It

The JSON twin is plain HTTP. Any monitoring system that can curl an endpoint can scrape it.

PowerShell, pulling the cache hit ratio for the first PLC into a variable:

$snap = Invoke-WebRequest -Uri "http://mbproxy-host:8080/status.json" -UseBasicParsing |
        Select-Object -ExpandProperty Content |
        ConvertFrom-Json

$plc = $snap.plcs[0]
$hits  = $plc.backend.cacheHitCount
$total = $hits + $plc.backend.cacheMissCount
$ratio = if ($total -gt 0) { [math]::Round(100.0 * $hits / $total, 1) } else { 0.0 }

"PLC $($plc.name): cache hit ratio = $ratio% over $total reads"

Bash with curl and jq, fanning out across the fleet:

curl -s http://mbproxy-host:8080/status.json |
  jq -r '.plcs[] | "\(.name)\t\(.listener.state)\t\(.backend.lastRoundTripMs)"'

Prometheus-style scrapers should poll /status.json directly and translate fields into their own metric names; the service does not expose Prometheus exposition format.
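One possible shape for that translation step is sketched below: it walks the numeric leaves of each `plcs[]` entry and emits one labelled sample line per counter. The `mbproxy_` prefix, the snake_case naming, and the `plc` label are all hypothetical choices for this example; the service itself defines no metric names.

```python
import re

# Top-level objects inside each plcs[] entry that hold numeric fields.
SECTIONS = ("listener", "clients", "pdus", "backend", "bytes")


def _snake(name: str) -> str:
    """Convert a camelCase JSON key to snake_case (cacheHitCount -> cache_hit_count)."""
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def _numeric_leaves(node: dict, prefix: tuple):
    """Yield (metric_suffix, value) for every int/float leaf under `node`."""
    for key, value in node.items():
        path = prefix + (_snake(key),)
        if isinstance(value, dict):
            yield from _numeric_leaves(value, path)
        elif isinstance(value, (int, float)) and not isinstance(value, bool):
            yield "_".join(path), value
        # Strings, nulls, and lists (e.g. remoteEndpoints) are skipped.


def to_metric_lines(snapshot: dict) -> list:
    """Translate a parsed /status.json snapshot into 'name{plc="..."} value' lines."""
    lines = []
    for plc in snapshot["plcs"]:
        label = plc["name"].replace("\\", "\\\\").replace('"', '\\"')
        for section in SECTIONS:
            for metric, value in _numeric_leaves(plc.get(section, {}), (section,)):
                lines.append(f'mbproxy_{metric}{{plc="{label}"}} {value}')
    return lines
```

A real exporter would also need to decide which fields are counters and which are gauges (for example, `inFlight` and `queueDepth` go up and down), which this sketch leaves to the consumer.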

Scope of This Document

This document covers the endpoint surface: what is on the wire and how each field is computed. When a new counter is added, list it here. Related material lives in the following documents:

../Architecture/ConnectionModel.md — multiplexer counter meanings (inFlight, maxInFlight, txIdWraps, queueDepth, disconnectCascades).
../Architecture/ReadCoalescing.md — coalescing counter meanings and matching criteria.
../Architecture/ResponseCache.md — cache counter meanings, TTL, invalidation rules.
../Features/BcdRewriting.md — what increments rewrittenSlots and partialBcdWarnings.
../Features/HotReload.md — what increments configReloadCount vs. configReloadRejectedCount.
./Configuration.md — Mbproxy.AdminPort and other option keys.
./Troubleshooting.md — using these counters to diagnose specific failure modes.
../Reference/LogEvents.md — event-id catalogue including mbproxy.admin.bind.failed.
