wwtools/mbproxy/docs/kpi.md

# mbproxy — Dashboard KPI catalogue

Recommended additions to the `/status.json` and `/` admin endpoints to make a production fleet dashboard genuinely useful, grouped by tier. Today's `/status.json` exposes raw cumulative counters; this doc describes what's typically *also* expected when those counters land in Grafana / Wonderware / a custom HMI.

**Scope.** This is a proposal, not a contract. The endpoint shape settled in [`design.md`](design.md) → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect.

**Reading guide.** Each KPI has:
- **Name** — short identifier matching the proxy's existing camelCase convention.
- **Definition** — what the number means.
- **Source** — where the value comes from (existing counter, new counter, derived).
- **Widget** — typical dashboard visualisation.
- **Alert** — common threshold or anomaly rule (where applicable).
- **Effort** — implementation cost in hours (rough order-of-magnitude).

## What's exposed today (recap)

For context — every recommended addition below is *in addition to* this list. Today's `/status.json` carries:

| Group | Fields |
|-------|--------|
| Service | `uptimeSeconds`, `version`, `configLastReloadUtc`, `configReloadCount`, `configReloadRejectedCount` |
| Listeners | `bound`, `configured` |
| Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` |
| Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) |
| Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` |
| Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`, `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes` |
| Per-PLC bytes | `upstreamIn`, `upstreamOut` |

Counters are **cumulative since process start**. A restart resets them.

---

## Tier 1 — strongly recommended for production

These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard."

### 1.1 Rate metrics (per-PLC and fleet-wide)

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `pdus.ratePerSec.last1m` | PDU rate over the last 60 s | New per-PLC ring buffer (60 × 1 s samples) | Sparkline per PLC | None — informational | 4 h |
| `pdus.ratePerSec.last5m` | Same over 5 min | Same buffer at 300 s | Sparkline | None | shared |
| `errors.ratePerMin` | Sum of `exceptionsByCode.*` + `partialBcdWarnings` + `invalidBcdWarnings` per minute | Derived (`invalidBcdWarnings` is not in today's `/status.json`; it needs a new counter) | Stat tile per PLC | > 10/min → page | 2 h |
| `bytes.ratePerSec.up` / `.down` | Bandwidth each direction | Derived from `bytes.upstreamIn` / `bytes.upstreamOut` deltas | Stacked area | None; informational | 2 h |
| `fleet.totalPdusPerSec` | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h |

**Why this matters.** Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing `rate(pdus_forwarded[1m])` on a 54-row fleet is the single most informative widget on the dashboard.

**Implementation note.** Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want the rates in `/status.json` directly, add a per-PLC `Mbproxy.Proxy.RateTracker` with a fixed-size circular buffer of one-second samples sized for the longest window (300 samples for `last5m`; the 1-minute rate reads the newest 60) and expose `RatePerSec1m`, `RatePerSec5m`.
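A minimal sketch of such a tracker, assuming it is fed the cumulative counter value once per second from a single timer thread. The class shape, method names, and the 300-second default are illustrative assumptions, not the shipped `RateTracker`.

```csharp
using System;

// Hypothetical sketch: ring buffer of cumulative counter values, one per second.
// Not thread-safe; assumes a single sampling timer calls Sample().
public sealed class RateTracker
{
    private readonly long[] _samples; // cumulative counter value at each tick
    private int _count;               // valid samples so far (saturates at buffer length)
    private int _next;                // index the next sample is written to

    // Spanning N seconds of history needs N + 1 samples.
    public RateTracker(int capacitySeconds = 300)
    {
        if (capacitySeconds < 1) throw new ArgumentOutOfRangeException(nameof(capacitySeconds));
        _samples = new long[capacitySeconds + 1];
    }

    // Call once per second with the current cumulative counter value.
    public void Sample(long cumulative)
    {
        _samples[_next] = cumulative;
        _next = (_next + 1) % _samples.Length;
        if (_count < _samples.Length) _count++;
    }

    // Average events per second over the trailing window.
    // Uses a shorter span while the buffer is still warming up.
    public double RatePerSec(int windowSeconds)
    {
        if (windowSeconds < 1) throw new ArgumentOutOfRangeException(nameof(windowSeconds));
        if (_count < 2) return 0;
        int span = Math.Min(windowSeconds, _count - 1);
        int newest = (_next - 1 + _samples.Length) % _samples.Length;
        int oldest = (newest - span + _samples.Length) % _samples.Length;
        return (_samples[newest] - _samples[oldest]) / (double)span;
    }
}
```

With this shape, `RatePerSec(60)` and `RatePerSec(300)` both read from the same buffer, so the 5-minute rate costs no extra state beyond the larger array.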

### 1.2 Latency percentiles (replacing the bare EWMA)

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.roundTripMs.p50` | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h |
| `backend.roundTripMs.p95` | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared |
| `backend.roundTripMs.p99` | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared |
| `backend.roundTripMs.max1m` | Slowest single PDU in last 1 min | Exact running max kept alongside the reservoir (a sample can miss the slowest PDU) | Stat tile | > 5 s → page | shared |

**Why this matters.** The existing `lastRoundTripMs` is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently.

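**Implementation note.** Use `Mbproxy.Proxy.LatencyReservoir`: a 256-sample reservoir filled with Vitter's Algorithm R, which keeps a uniform sample regardless of throughput. Algorithm R samples uniformly over everything seen since the last reset, so reset the reservoir at each window boundary to keep the percentiles scoped to the last minute. Don't store every sample: a busy PLC at 100 PDU/s × 60 s = 6,000 samples/min, and × 54 PLCs = 324,000 samples/min fleet-wide, which is far more than the percentiles need.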
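As an illustration, a reservoir along these lines could look like the sketch below. The member names, the nearest-rank percentile method, and the separate exact-max field are assumptions made for the example; the caller is assumed to serialise access and to call `Reset()` once per window.

```csharp
using System;

// Hypothetical sketch of a fixed-size latency reservoir (Vitter's Algorithm R).
// Not thread-safe; the caller is assumed to hold a lock or use one writer thread.
public sealed class LatencyReservoir
{
    private readonly double[] _slots;
    private readonly Random _rng;
    private long _seen;   // samples offered since the last Reset()
    private double _max;  // exact maximum, tracked outside the sample

    public LatencyReservoir(int capacity = 256, int? seed = null)
    {
        _slots = new double[capacity];
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    public double Max => _max;

    public void Record(double ms)
    {
        _seen++;
        if (ms > _max) _max = ms;
        if (_seen <= _slots.Length) { _slots[_seen - 1] = ms; return; } // fill phase
        long j = _rng.NextInt64(_seen);          // uniform in [0, _seen)
        if (j < _slots.Length) _slots[j] = ms;   // replace with probability k / _seen
    }

    // Nearest-rank percentile over the current sample; p is in (0, 100].
    public double Percentile(double p)
    {
        int n = (int)Math.Min(_seen, _slots.Length);
        if (n == 0) return 0;
        var sorted = new double[n];
        Array.Copy(_slots, sorted, n);
        Array.Sort(sorted);
        int rank = (int)Math.Ceiling(p * n / 100.0);
        return sorted[Math.Clamp(rank, 1, n) - 1];
    }

    // Call at each window boundary so percentiles cover only the new window.
    public void Reset() { _seen = 0; _max = 0; }
}
```

Sorting 256 doubles per snapshot is cheap; if that ever shows up in a profile, the sort can move to the snapshot builder and be cached per window.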

### 1.3 Per-PLC availability ratio

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `listener.boundRatio.last1h` | Fraction of time in `bound` state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h |
| `listener.boundRatio.sinceStart` | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared |
| `listener.timeInRecoveringMs.last1h` | Total time spent recovering in last hour | Same tracker | Stat tile | > 60,000 ms (60 s) → warn | shared |

**Why this matters.** `recoveryAttempts` tells you how many times something has flapped, but not how *much* downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly.

**Implementation note.** Each `PlcListenerSupervisor` already has a state machine. Add a `StateDurationTracker` that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window.
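The accumulation step can be sketched as below. This is a since-start version only, with an injected clock so it can be tested; the `ListenerState` enum, the member names, and the locking strategy are assumptions for illustration. A sliding-window variant would additionally need per-interval buckets.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical state set mirroring the bound / recovering / stopped states in this doc.
public enum ListenerState { Bound, Recovering, Stopped }

// Hypothetical sketch: timestamps every transition and accumulates time per state.
public sealed class StateDurationTracker
{
    private readonly Func<DateTime> _utcNow;
    private readonly Dictionary<ListenerState, TimeSpan> _totals = new();
    private readonly object _gate = new();
    private readonly DateTime _startedAt;
    private ListenerState _current;
    private DateTime _enteredAt;

    public StateDurationTracker(ListenerState initial, Func<DateTime> utcNow)
    {
        _utcNow = utcNow;
        _current = initial;
        _startedAt = _enteredAt = utcNow();
    }

    // Close out the time spent in the current state, then enter the next one.
    public void Transition(ListenerState next)
    {
        lock (_gate)
        {
            var now = _utcNow();
            _totals.TryGetValue(_current, out var total);
            _totals[_current] = total + (now - _enteredAt);
            _current = next;
            _enteredAt = now;
        }
    }

    // Total time in a state, including the still-open interval if it is the current state.
    public TimeSpan TimeIn(ListenerState state)
    {
        lock (_gate)
        {
            _totals.TryGetValue(state, out var total);
            if (state == _current) total += _utcNow() - _enteredAt;
            return total;
        }
    }

    // Fraction of process lifetime spent in the given state (0 when no time has elapsed).
    public double Ratio(ListenerState state)
    {
        var lifetime = _utcNow() - _startedAt;
        return lifetime <= TimeSpan.Zero ? 0 : TimeIn(state) / lifetime;
    }
}
```

In production the clock argument would simply be `() => DateTime.UtcNow`.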

### 1.4 Liveness / staleness signals

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `pdus.lastForwardedUtc` | Wall time of the most recent forwarded PDU | New `_lastForwardedTimestamp` per PLC | Stat tile | `now - value > 5 min AND clients.connected > 0` → page | 1 h |
| `clients.lastActivityUtc` | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in remoteEndpoints | None | 1 h |
| `staleClients.count` | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h |

**Why this matters.** Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with `clients.connected = 2` but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured.

### 1.5 Service-wide fleet aggregates

These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard.

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `fleet.plcsHealthy` | Count of PLCs in `bound` state with no errors in last 5 min | Aggregate | Big number, green | < `listeners.configured - 2` → warn | 2 h |
| `fleet.plcsRecovering` | Count in `recovering` state | Aggregate | Big number, orange | > 0 → informational | shared |
| `fleet.plcsStopped` | Count in `stopped` state | Aggregate | Big number, grey | > 0 → page | shared |
| `fleet.plcsWithActiveErrors` | Count with `errors.ratePerMin > 0` | Aggregate | Big number, red | > 0 → page | shared |
| `fleet.totalClientsConnected` | Sum of `clients.connected` | Aggregate | Stat tile | None | 1 h |
| `fleet.totalRewrittenSlotsPerSec` | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared |

**Why this matters.** A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table.

### 1.6 Multiplexer state — **shipped in [Phase 9](plan/09-txid-multiplexing.md)**

The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth.

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.inFlightCount` | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) |
| `backend.maxInFlight` | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent — realistic only under pathological load) | (in Phase 9 scope) |
| `backend.txIdWraps` | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) |
| `backend.queueDepth` | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) |
| `backend.disconnectCascades` | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with `mbproxy.backend.failed` events | (in Phase 9 scope) |

**Why this matters.** Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's `lastRoundTripMs` measures wire latency only; queue depth reveals proxy-side backlog).

### 1.7 Read coalescing — **shipped in [Phase 10](plan/10-read-coalescing.md)**

Same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric. `coalescedHitCount + coalescedMissCount` equals total FC03/04 request count per snapshot — the math always balances.

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.coalescedHitCount` | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) |
| `backend.coalescedMissCount` | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) |
| `backend.coalescingRatio` | `Hit / (Hit + Miss)` over the trailing window | Derived (dashboard) | Stat tile per PLC | None; a low ratio just means clients aren't synchronised on the same registers — informational | (in Phase 10 scope) |
| `backend.coalescedResponseToDeadUpstream` | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable (Tier 2 priority) | (in Phase 10 scope) |

**Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request, which is roughly a 60% reduction in backend FC03/04 request rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.
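Because the hit and miss counters are cumulative, a ratio "over the trailing window" has to be computed from the difference between two snapshots rather than from the raw values. A small sketch of that derivation follows; the helper name and the choice to return `null` for an empty window or a counter reset are assumptions. The same arithmetic applies to the cache hit ratio.

```csharp
// Hypothetical dashboard-side helper: hit ratio between two cumulative snapshots.
public static class CounterRatios
{
    // Returns null when the window saw no requests or the counters went backwards
    // (for example after a proxy restart reset them to zero).
    public static double? HitRatio(long prevHit, long prevMiss, long hit, long miss)
    {
        long deltaHit = hit - prevHit;
        long deltaMiss = miss - prevMiss;
        if (deltaHit < 0 || deltaMiss < 0) return null;
        long total = deltaHit + deltaMiss;
        return total == 0 ? (double?)null : (double)deltaHit / total;
    }
}
```

For example, snapshots of (hit 100, miss 100) followed by (hit 160, miss 140) give 60 hits out of 100 requests in the window, a ratio of 0.6.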

### 1.8 Response cache — **shipped in [Phase 11](plan/11-response-cache.md)**

As of Phase 11, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default: operators opt tags in by setting `CacheTtlMs > 0` on a `BcdTagOptions` entry (or `DefaultCacheTtlMs > 0` on a `PlcOptions` entry).

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.cacheHitCount` | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) |
| `backend.cacheMissCount` | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) |
| `backend.cacheHitRatio` | `Hit / (Hit + Miss)` for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) |
| `backend.cacheInvalidations` | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) |

**Why this matters.** Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts.

---

## Tier 2 — nice-to-have

Reach for these once Tier 1 is solid. They add depth for specific operational scenarios.

### 2.1 Connection-cap saturation warning

> **Status: superseded by [Phase 9](plan/09-txid-multiplexing.md).** This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. Since Phase 9, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect, so the 4-client cap on the ECOM is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is `backend.inFlightCount` (Tier 1.6) against the 65,535 TxId-allocator ceiling, but that's realistically unreachable under any normal load. **Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.**

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `clients.atCapWarning` | Boolean: `clients.connected >= 3` (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h |
| `clients.atCapBlocked` | Boolean: `clients.connected >= 4` (cap reached) | Derived | Cell highlight | True → page | shared |

**Why this mattered (pre-Phase-9).** The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see [design.md](design.md) → "Connection model" and [DL260/dl205.md](../DL260/dl205.md) → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments.

### 2.2 Error breakdown / heatmap

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `partialBcd.byClient` | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h |
| `invalidBcd.byAddress` | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h |
| `exceptions.byCodeRate` | Per-exception-code rate over 5 min | Derived from `exceptionsByCode.*` | Stacked bar | Code 04 (Slave Device Failure) spike → PLC in PROGRAM mode? | 2 h |

**Why this matters.** Once you've seen `partialBcdWarnings = 1247`, the next question is *which client* and *which tag*. Without dimensional breakdown, you have to SSH into the host and search the log file to find out.

### 2.3 Hot-reload cadence

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `config.reloadsPerHour` | Reload events per hour | Derived from `configReloadCount` | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h |
| `config.lastReloadDelta` | Summary of what changed on last reload | Already in `mbproxy.config.reload.applied` event; surface here | Text snippet | None — informational | 2 h |

**Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.

### 2.4a Response-cache memory — **shipped in [Phase 11](plan/11-response-cache.md)**

When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC `MaxEntriesPerPlc` cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob.

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) |
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) |

**Why this matters.** Cache entries are short-lived (TTLs are typically seconds, not minutes). A `cacheEntryCount` that sits at `MaxEntriesPerPlc` for long stretches says "the LRU is constantly evicting": either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks in. `cacheBytes` is the memory-side gauge: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13.5 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression.

### 2.4 Memory / process health

| KPI | Definition | Source | Widget | Alert | Effort |
|-----|------------|--------|--------|-------|--------|
| `process.workingSetMb` | `Process.GetCurrentProcess().WorkingSet64 / (1024 * 1024)` | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h |
| `process.gcCollections.gen0/1/2` | GC counts per generation | `GC.CollectionCount(n)` | Sparkline | Gen-2 frequency → memory pressure | 0.5 h |
| `process.threadCount` | `Process.Threads.Count` | New | Stat tile | > 200 → leak? | 0.5 h |

**Why this matters.** A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch the most common leak patterns (working-set growth, GC pressure, thread leaks). Each is a single `Process` or `GC` API call, cheap enough to read on every snapshot.
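Reading the three values is short enough to show in full. The sketch below uses only `System.Diagnostics.Process` and `System.GC`; the `ProcessHealth` record and the reader class are hypothetical names, not existing DTOs.

```csharp
using System;
using System.Diagnostics;

// Hypothetical snapshot shape for the process.* KPIs.
public readonly record struct ProcessHealth(
    double WorkingSetMb, int Gen0, int Gen1, int Gen2, int ThreadCount);

public static class ProcessHealthReader
{
    // Reads fresh values on every call; nothing is cached between snapshots.
    public static ProcessHealth Read()
    {
        using var proc = Process.GetCurrentProcess();
        return new ProcessHealth(
            proc.WorkingSet64 / (1024.0 * 1024.0),
            GC.CollectionCount(0),
            GC.CollectionCount(1),
            GC.CollectionCount(2),
            proc.Threads.Count);
    }
}
```

`Process` is disposable, so the `using` declaration releases the handle after each read.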

---

## Real-time updates via SignalR

Today's status surface is poll-based: the HTML page uses a 5-second `meta-refresh`, and Prometheus / custom HMI scrapers hit `/status.json` on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a **live fleet dashboard with many panels open**, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping `bound → recovering`) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update.

**The recommendation is additive, not replacement.** Keep `/status.json` for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change.

### Why this is cheap to add

The `Microsoft.AspNetCore.App` framework reference that Phase 07 added to the csproj **already includes `Microsoft.AspNetCore.SignalR`** — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on `Mbproxy.AdminPort`. No additional port, no additional listener supervision, no additional shutdown path.

### Architecture

```
                                  ┌─→ Dashboard A (subscribed to "all")
ProxyWorker / Supervisors ──┐     │
ConfigReconciler ───────────┤     │
ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer")
ServiceCounters ────────────┘     (background loop +                │
                                   immediate-push paths)            └─→ Dashboard C (subscribed to "service")
```

- **`StatusHub : Hub`** — the SignalR endpoint mounted at `/hub/status` on `AdminPort`. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates.
- **`StatusBroadcaster : IHostedService`** — the background pusher. Holds a `Timer` (or `PeriodicTimer`) that ticks at `PushIntervalMs` (default 1000 ms), builds a `StatusResponse` via the existing `StatusSnapshotBuilder`, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes `PushEventAsync(name, props)` for the immediate-push paths.
- **Immediate-push wiring** — the existing log events (`mbproxy.listener.recovered`, `mbproxy.config.reload.applied`, `mbproxy.backend.failed`, `mbproxy.rewrite.partial_bcd`, etc.) gain a fan-out call to `broadcaster.PushEventAsync(...)` so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick.

### Hub contract

**Hub URL:** `http://<host>:<AdminPort>/hub/status` (`https://` when a TLS-terminating reverse proxy fronts the admin endpoint)

**Hub groups** — clients subscribe to scopes; the server broadcasts to matching groups:

| Group | Receives |
|-------|----------|
| `all` | Every update for every PLC + every service-level event |
| `service` | Service-level events only (`mbproxy.config.*`, `mbproxy.admin.*`, `mbproxy.startup.*`, `mbproxy.shutdown.*`) |
| `plc:<Name>` | One PLC's snapshots + that PLC's events |

**Server-side methods** (client → server):

| Method | Purpose |
|--------|---------|
| `Task SubscribeFleet()` | Join group `all` |
| `Task SubscribeService()` | Join group `service` |
| `Task SubscribePlc(string name)` | Join group `plc:<name>` after validating that `name` exists in current options |
| `Task Unsubscribe()` | Leave every group; the connection stays open but receives nothing |

**Client-side callbacks** (server → client; all named `On*` in this proposal):

| Callback | Payload | When |
|----------|---------|------|
| `OnSnapshot(StatusResponse snapshot)` | Full snapshot of the relevant scope (`all`, `service`, or a single PLC) | Sent once on subscribe so the dashboard has a baseline; thereafter after a reconnect and at each periodic re-baseline (`FullSnapshotIntervalMs`, default 60 s) |
| `OnPatch(StatusPatch patch)` | Delta of fields that changed since the last push | Periodic — every `PushIntervalMs` if anything changed; skipped if nothing changed |
| `OnEvent(StatusEvent ev)` | Single discrete event: `{ name, levelString, plc?, propertiesJson, timestampUtc }` | Immediately — fan-out from the existing `[LoggerMessage]` event call sites |

`StatusPatch` carries only the fields that changed since the previous push: its `Fields` member is a `Dictionary<string, JsonElement>` keyed by JSON path (e.g., `"plcs[2].pdus.forwarded"`, `"plcs[2].listener.state"`). Dashboard clients apply these to their local model, which keeps wire traffic tiny when the fleet is idle.
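One way to produce such a patch is to flatten two serialized snapshots into path-keyed leaves and keep only the leaves that differ. The sketch below does that with `System.Text.Json`; the `SnapshotDiff` class is a hypothetical helper, and it deliberately ignores removed paths and empty containers, which a real implementation would have to decide how to signal.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper: builds a path-keyed delta between two JSON snapshots.
public static class SnapshotDiff
{
    // Flattens a JSON tree into leaf values keyed like "plcs[2].pdus.forwarded".
    public static Dictionary<string, JsonElement> Flatten(JsonElement root)
    {
        var leaves = new Dictionary<string, JsonElement>();
        Walk(root, "", leaves);
        return leaves;
    }

    private static void Walk(JsonElement node, string path, Dictionary<string, JsonElement> leaves)
    {
        switch (node.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (var prop in node.EnumerateObject())
                    Walk(prop.Value, path.Length == 0 ? prop.Name : $"{path}.{prop.Name}", leaves);
                break;
            case JsonValueKind.Array:
                int index = 0;
                foreach (var item in node.EnumerateArray())
                    Walk(item, $"{path}[{index++}]", leaves);
                break;
            default:
                leaves[path] = node.Clone(); // Clone so the value outlives its JsonDocument
                break;
        }
    }

    // Keeps only leaves that are new or whose raw JSON text changed.
    public static Dictionary<string, JsonElement> Diff(
        Dictionary<string, JsonElement> previous, Dictionary<string, JsonElement> current)
    {
        var patch = new Dictionary<string, JsonElement>();
        foreach (var (path, value) in current)
            if (!previous.TryGetValue(path, out var old) || old.GetRawText() != value.GetRawText())
                patch[path] = value;
        return patch;
    }
}
```

An idle fleet yields an empty dictionary, which is the case where the broadcaster can skip the push entirely.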

### What gets pushed, and when

| Update kind | Cadence | Volume per PLC | Channel |
|-------------|---------|----------------|---------|
| Counter increments (PDUs, bytes, rewrites) | Every `PushIntervalMs` if changed; coalesced | 1 patch / push tick / subscribed group | `OnPatch` |
| State transitions (`bound ↔ recovering ↔ stopped`) | Immediate | 1 event + 1 patch | `OnEvent` + `OnPatch` |
| Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | `OnEvent` |
| Hot-reload applied / rejected | Immediate | 1 event with `propertiesJson` summary | `OnEvent` |
| Periodic full snapshot | Every 60 s | 1 full snapshot | `OnSnapshot` |

The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth.

### Configuration

Extend `appsettings.json` with:

```jsonc
"Mbproxy": {
  // ... existing keys ...
  "Admin": {
    "SignalR": {
      "Enabled": true,
      "PushIntervalMs": 1000,            // patch cadence
      "FullSnapshotIntervalMs": 60000,   // periodic re-baseline
      "MaxConcurrentClients": 32,        // refuse new connections beyond this
      "MaxGroupsPerClient": 8            // anti-runaway-subscription guard
    }
  }
}
```

The feature is meant to be opt-in: if the `SignalR` section is omitted or `SignalR.Enabled = false`, the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality; the first version can ship as restart-required.

### Implementation outline

1. **Hub class** — `src/Mbproxy/Admin/StatusHub.cs`. Inherits `Hub`. Implements the three `Subscribe*` methods and `Unsubscribe`. `OnConnectedAsync` calls `Context.Abort()` when the number of tracked connections has already reached `MaxConcurrentClients`; track connections in a static `ConcurrentDictionary<string, byte>` keyed by `ConnectionId` and remove the entry in `OnDisconnectedAsync`.
2. **Broadcaster** — `src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService`. Constructor takes `IHubContext<StatusHub>`, `StatusSnapshotBuilder`, `IOptionsMonitor<MbproxyOptions>`. The push loop is a `while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... }` body — wins over `Timer` for cancellation correctness.
3. **DTOs** — `StatusPatch` and `StatusEvent` records added to `StatusDto.cs`, registered with the source-gen `StatusJsonContext`.
4. **Event fan-out** — the existing `[LoggerMessage]` partial methods stay; add a thin `RealtimeLogEvents` wrapper class that logs AND calls `broadcaster.PushEventAsync(...)`. Call sites in supervisors / pipelines / reconciler swap to the wrapper. Keeps log-only call sites and broadcast-too call sites both readable.
5. **Hub mapping** — `AdminEndpointHost` registers SignalR with `AddSignalR()` and maps `app.MapHub<StatusHub>("/hub/status")` when `SignalR.Enabled` is true. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint.
6. **Shutdown** — `StatusBroadcaster.StopAsync` cancels its pump and the hub's `Dispose` chain handles connection teardown. The existing `ShutdownCoordinator` deadline applies.
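The connection-cap guard from step 1 can be sketched without any SignalR dependency. `ConnectionLimiter` below is a hypothetical helper: the hub would call `TryAdmit(Context.ConnectionId)` in `OnConnectedAsync` (aborting the connection when it returns `false`) and `Release(...)` in `OnDisconnectedAsync`.

```csharp
using System.Collections.Concurrent;

// Hypothetical sketch of the MaxConcurrentClients guard, keyed by connection ID.
public sealed class ConnectionLimiter
{
    private readonly ConcurrentDictionary<string, byte> _connections = new();
    private readonly int _max;

    public ConnectionLimiter(int max) => _max = max;

    public int Count => _connections.Count;

    // Adds first, then checks, so the tracked set never settles above the cap.
    // Under a burst of simultaneous connects this may reject slightly more
    // clients than strictly necessary, which is the safe direction.
    public bool TryAdmit(string connectionId)
    {
        if (!_connections.TryAdd(connectionId, 0)) return true; // already admitted
        if (_connections.Count <= _max) return true;
        _connections.TryRemove(connectionId, out _);
        return false;
    }

    public void Release(string connectionId) => _connections.TryRemove(connectionId, out _);
}
```

Keeping the guard in its own class also makes the capacity test independent of a running host.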

### Test approach

Use the **`Microsoft.AspNetCore.SignalR.Client`** package (NuGet) in the test csproj only. Pattern:

```csharp
// Usings the snippet relies on (xUnit v3 for TestContext.Current, Shouldly, SignalR .NET client).
using System.Collections.Concurrent;
using System.Linq;
using Microsoft.AspNetCore.SignalR.Client;
using Shouldly;
using Xunit;

[Fact]
[Trait("Category", "E2E")]
public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException()
{
    // Arrange: start host on a random AdminPort with PushIntervalMs well below 500 ms
    // (the 1000 ms default would make the deadline below flaky), then build a SignalR client.
    // `adminPort` comes from the test fixture that started the host.
    await using var connection = new HubConnectionBuilder()
        .WithUrl($"http://localhost:{adminPort}/hub/status")
        .Build();

    // Collect every patch the server pushes to this connection.
    var patches = new ConcurrentQueue<StatusPatch>();
    connection.On<StatusPatch>("OnPatch", patches.Enqueue);
    await connection.StartAsync(TestContext.Current.CancellationToken);
    await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken);

    // Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1).
    // ... drive request through proxy ...

    // Assert: a patch with backend.connectsFailed != 0 arrives within 500 ms.
    var deadline = DateTime.UtcNow.AddMilliseconds(500);
    while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")))
        await Task.Delay(20, TestContext.Current.CancellationToken);

    patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"));
}
```

Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly.

Coverage targets for the new tests:
1. `SignalR_Subscribe_DeliversInitialSnapshot`
2. `SignalR_Patch_FiresWithinPushInterval_AfterCounterChange`
3. `SignalR_Event_FiresWithin_100ms_OfListenerRecovered`
4. `SignalR_SubscribePlc_OnlyDeliversThatPlcEvents` — verifies group filtering
5. `SignalR_MaxConcurrentClients_RefusesExcess` — capacity guard
6. `SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs`

### Operational considerations

- **Authentication / authorisation.** Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy.
- **Transport.** SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport.
- **Backpressure.** `Hub.Clients.Group("all").SendAsync` does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy.
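- **Reconnection.** The .NET and browser SignalR clients reconnect automatically only when built with `WithAutomaticReconnect()` (`withAutomaticReconnect()` in JavaScript); the default retry delays are 0, 2, 10, and 30 s. A reconnected client gets a new connection ID and its group memberships are not preserved, so the dashboard must call `Subscribe*` again in its reconnected handler, which also delivers a fresh `OnSnapshot` baseline. The periodic full snapshot every 60 s covers patches missed while the connection stayed up.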
- **Cardinality at scale.** 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The `MaxConcurrentClients` guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy.
- **CORS.** If dashboards run on a different origin (likely), enable CORS on the admin app for `/hub/status` only. Add `AdminCors.AllowedOrigins` to `appsettings.json` as an array of allowed origin strings; an empty array means same-origin only.
- **Logging.** SignalR's internal logs are noisy at Information. In `appsettings.json`, set the `Microsoft.AspNetCore.SignalR` category to `Warning` and `Microsoft.AspNetCore.Http.Connections` to `Warning` so the proxy's own event stream isn't drowned out.

### Effort estimate

| Work | Hours |
|------|-------|
| Hub + DTOs + broadcaster | 6 h |
| Event fan-out wiring (existing log events) | 3 h |
| AdminEndpointHost integration + appsettings binding | 2 h |
| E2E test suite (6 tests using SignalR .NET client) | 4 h |
| Documentation (this section graduates from proposal to fact; design.md update) | 1 h |
| **Total** | **~16 h** |

This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production.

---

## Implementation notes

### Where rates and percentiles should live

Two reasonable answers:

1. **Compute in the proxy, expose pre-computed values in `/status.json`.** Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change.
2. **Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates.** Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar.

**Recommendation:** ship Tier 1 rate metrics computed in-process for the operator who just opens `http://<host>:8080/` in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative.

### Counter additions vs computed values

A few proposed KPIs require **new counters in `ProxyCounters` or `ServiceCounters`**, not just derivations:

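- `pdus.lastForwardedUtc` — new `long _lastForwardedTicks` on `ProxyCounters`, written with `Interlocked.Exchange` and read with `Interlocked.Read` (C# does not allow `volatile` on a `long` field).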
- `listener.boundRatio.*` — new `StateDurationTracker` on `PlcListenerSupervisor`.
- `partialBcd.byClient` / `invalidBcd.byAddress` — new `ConcurrentDictionary<string,long>` / `ConcurrentDictionary<ushort,long>` on `PerPlcContext`. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases).
- `process.*` — read fresh on every snapshot from `Process.GetCurrentProcess()` — no stored state.
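For the first bullet, a lock-free timestamp can be kept as UTC ticks in a plain `long` and accessed through `Interlocked`. The sketch below also includes the staleness rule from section 1.4; the class name, the "0 means never" convention, and the decision to treat a PLC that has never forwarded anything as not stale are assumptions.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of the last-forwarded timestamp counter.
public sealed class LastForwardedStamp
{
    private long _lastForwardedTicks; // UTC ticks; 0 means nothing forwarded yet

    // Called on the hot path after each forwarded PDU.
    public void Mark(DateTime utcNow) => Interlocked.Exchange(ref _lastForwardedTicks, utcNow.Ticks);

    public DateTime? LastForwardedUtc
    {
        get
        {
            long ticks = Interlocked.Read(ref _lastForwardedTicks);
            return ticks == 0 ? (DateTime?)null : new DateTime(ticks, DateTimeKind.Utc);
        }
    }

    // Mirrors the alert rule: clients are connected but no PDU for longer than the threshold.
    public bool IsStale(DateTime utcNow, int connectedClients, TimeSpan threshold)
    {
        var last = LastForwardedUtc;
        return connectedClients > 0 && last.HasValue && utcNow - last.Value > threshold;
    }
}
```

`Interlocked.Read` matters on 32-bit hosts, where a plain 64-bit read can tear; on 64-bit hosts it is effectively free.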

### Snapshot serialization cost

`StatusResponse` is built per-request to `/status.json`. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., `invalidBcd.byAddress`) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep `/status.json` under a few hundred KB even when something goes badly wrong.
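The top-50 cap can be applied at snapshot time rather than on the hot path. A minimal sketch, assuming the per-address or per-client counts are available as key/value pairs (a `ConcurrentDictionary` enumerates that way); the `TopN` helper name is hypothetical.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: keeps only the highest-count entries for serialization.
public static class TopN
{
    public static Dictionary<TKey, long> Cap<TKey>(
        IEnumerable<KeyValuePair<TKey, long>> counts, int limit = 50)
        where TKey : notnull =>
        counts.OrderByDescending(kv => kv.Value)
              .Take(limit)
              .ToDictionary(kv => kv.Key, kv => kv.Value);
}
```

This bounds the serialized size only; bounding the live dictionary itself still needs the cardinality guard mentioned in the counter list above.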

### Dashboard widget mapping (Grafana-style cheat sheet)

| Widget | Use for |
|--------|---------|
| **Stat (big number)** | Service-wide aggregates, counts, latest timestamps |
| **Gauge** | Ratios (availability, success rate, queue depth) |
| **Sparkline** | Rates, percentiles, time-series trends |
| **Stacked area** | Bandwidth, PDU-by-FC breakdown over time |
| **Heatmap** | Per-address / per-client dimensional breakdowns |
| **Cell-coloured table** | Per-PLC status (54 rows, one per PLC, columns of KPIs) |

### Backwards-compat policy

The fields currently in `/status.json` are **frozen** — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in [`design.md`](design.md) → "Status page" as the contract; new fields ship via PRs that update the contract first.

## Cross-references

- Field tables for what ships today: [`design.md`](design.md) → "Status page".
- Stable log event names (some KPIs are derivable by tailing these): [`design.md`](design.md) → "Logging" event-name table.
- Per-counter wiring lives in `src/Mbproxy/Proxy/ProxyCounters.cs` and `src/Mbproxy/ServiceCounters.cs`.
- The status HTML page is rendered by `src/Mbproxy/Admin/StatusHtmlRenderer.cs`; the JSON DTOs and source-gen context live in `src/Mbproxy/Admin/StatusDto.cs`.