# mbproxy — Dashboard KPI catalogue Recommended additions to the `/status.json` and `/` admin endpoint to make a production fleet dashboard genuinely useful, grouped by tier. Today's `/status.json` exposes raw cumulative counters; this doc describes what's typically *also* expected when those counters land in Grafana / Wonderware / a custom HMI. **Scope.** This is a proposal, not a contract. The endpoint shape settled in [`design.md`](design.md) → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect. **Reading guide.** Each KPI has: - **Name** — short identifier matching the proxy's existing camelCase convention. - **Definition** — what the number means. - **Source** — where the value comes from (existing counter, new counter, derived). - **Widget** — typical dashboard visualisation. - **Alert** — common threshold or anomaly rule (where applicable). - **Effort** — implementation cost in hours (rough order-of-magnitude). ## What's exposed today (recap) For context — every recommended addition below is *in addition to* this list. Today's `/status.json` carries: | Group | Fields | |-------|--------| | Service | `uptimeSeconds`, `version`, `configLastReloadUtc`, `configReloadCount`, `configReloadRejectedCount` | | Listeners | `bound`, `configured` | | Per-PLC listener | `state`, `lastBindError`, `recoveryAttempts` | | Per-PLC clients | `connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded) | | Per-PLC PDUs | `forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings` | | Per-PLC backend | `connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`, `inFlight`, `maxInFlight`, `txIdWraps`, `disconnectCascades`, `queueDepth`, `coalescedHitCount`, `coalescedMissCount`, `coalescedResponseToDeadUpstream`, `cacheHitCount`, `cacheMissCount`, `cacheInvalidations`, `cacheEntryCount`, `cacheBytes` | | Per-PLC bytes | `upstreamIn`, `upstreamOut` | Counters are **cumulative since process start**. A restart resets them. --- ## Tier 1 — strongly recommended for production These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard." ### 1.1 Rate metrics (per-PLC and fleet-wide) | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `pdus.ratePerSec.last1m` | PDU rate over the last 60 s | New per-PLC ring buffer (60 × 1 s samples) | Sparkline per PLC | None — informational | 4 h | | `pdus.ratePerSec.last5m` | Same over 5 min | Same buffer at 300 s | Sparkline | None | shared | | `errors.ratePerMin` | Sum of `exceptionsByCode.*` + `partialBcdWarnings` + `invalidBcdWarnings` per minute | Derived | Stat tile per PLC | > 10/min → page | 2 h | | `bytes.ratePerSec.up` / `.down` | Bandwidth each direction | Derived from `bytesUpstreamIn/Out` deltas | Stacked area | None — informational | 2 h | | `fleet.totalPdusPerSec` | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h | **Why this matters.** Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A grafana panel computing `rate(pdus_forwarded[1m])` on a 54-row fleet is the single most informative widget on the dashboard. **Implementation note.** Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want them in `/status.json` directly, add a per-PLC `Mbproxy.Proxy.RateTracker` with a fixed-size circular buffer of 60 one-second samples and expose `RatePerSec1m`, `RatePerSec5m`. ### 1.2 Latency percentiles (replacing the bare EWMA) | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `backend.roundTripMs.p50` | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h | | `backend.roundTripMs.p95` | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared | | `backend.roundTripMs.p99` | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared | | `backend.roundTripMs.max1m` | Slowest single PDU in last 1 min | Same reservoir | Stat tile | > 5 s → page | shared | **Why this matters.** The existing `lastRoundTripMs` is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently. **Implementation note.** Use `Mbproxy.Proxy.LatencyReservoir` — a 256-sample reservoir with Vitter's Algorithm R for unbiased sampling under arbitrary throughput. Don't store every sample (a busy PLC at 100 PDU/s × 60 s = 6,000 samples/min × 54 PLCs = 324K samples/min, too much). ### 1.3 Per-PLC availability ratio | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `listener.boundRatio.last1h` | Fraction of time in `bound` state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h | | `listener.boundRatio.sinceStart` | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared | | `listener.timeInRecoveringMs.last1h` | Total time spent recovering in last hour | Same tracker | Stat tile | > 60s → warn | shared | **Why this matters.** `recoveryAttempts` tells you how many times something has flapped, but not how *much* downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly. **Implementation note.** Each `PlcListenerSupervisor` already has a state machine. Add a `StateDurationTracker` that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window. ### 1.4 Liveness / staleness signals | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `pdus.lastForwardedUtc` | Wall time of the most recent forwarded PDU | New `_lastForwardedTimestamp` per PLC | Stat tile | `now - value > 5 min AND clients.connected > 0` → page | 1 h | | `clients.lastActivityUtc` | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in remoteEndpoints | None | 1 h | | `staleClients.count` | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h | **Why this matters.** Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with `clients.connected = 2` but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured. ### 1.5 Service-wide fleet aggregates These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard. | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `fleet.plcsHealthy` | Count of PLCs in `bound` state with no errors in last 5 min | Aggregate | Big number, green | < `listeners.configured - 2` → warn | 2 h | | `fleet.plcsRecovering` | Count in `recovering` state | Aggregate | Big number, orange | > 0 → informational | shared | | `fleet.plcsStopped` | Count in `stopped` state | Aggregate | Big number, grey | > 0 → page | shared | | `fleet.plcsWithActiveErrors` | Count with `errors.ratePerMin > 0` | Aggregate | Big number, red | > 0 → page | shared | | `fleet.totalClientsConnected` | Sum of `clients.connected` | Aggregate | Stat tile | None | 1 h | | `fleet.totalRewrittenSlotsPerSec` | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared | **Why this matters.** A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table. ### 1.6 Multiplexer state — **shipped in [Phase 9](plan/09-txid-multiplexing.md)** The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth. | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `backend.inFlightCount` | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) | | `backend.maxInFlight` | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent — realistic only under pathological load) | (in Phase 9 scope) | | `backend.txIdWraps` | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) | | `backend.queueDepth` | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) | | `backend.disconnectCascades` | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with `mbproxy.backend.failed` events | (in Phase 9 scope) | **Why this matters.** Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's `lastRoundTripMs` measures wire latency only; queue depth reveals proxy-side backlog). ### 1.7 Read coalescing — **shipped in [Phase 10](plan/10-read-coalescing.md)** Same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric. `coalescedHitCount + coalescedMissCount` equals total FC03/04 request count per snapshot — the math always balances. | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `backend.coalescedHitCount` | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) | | `backend.coalescedMissCount` | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None — trend-watch | (in Phase 10 scope) | | `backend.coalescingRatio` | `Hit / (Hit + Miss)` over the trailing window | Derived (dashboard) | Stat tile per PLC | None; a low ratio just means clients aren't synchronised on the same registers — informational | (in Phase 10 scope) | | `backend.coalescedResponseToDeadUpstream` | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable (Tier 2 priority) | (in Phase 10 scope) | **Why this matters.** Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric. ### 1.8 Response cache — **shipped in [Phase 11](plan/11-response-cache.md)** After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default — operators opt tags in by setting `CacheTtlMs > 0` on a `BcdTagOptions` entry (or `DefaultCacheTtlMs > 0` on a `PlcOptions` entry). | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `backend.cacheHitCount` | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) | | `backend.cacheMissCount` | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None — informational | (in Phase 11 scope) | | `backend.cacheHitRatio` | `Hit / (Hit + Miss)` for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) | | `backend.cacheInvalidations` | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) | **Why this matters.** Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts. --- ## Tier 2 — nice-to-have Reach for these once Tier 1 is solid. They add depth for specific operational scenarios. ### 2.1 Connection-cap saturation warning > **Status: superseded by [Phase 9](plan/09-txid-multiplexing.md).** This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. After Phase 9 ships, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect — the 4-client cap on the ECOM is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is `backend.inFlightCount` (Tier 1.6) against the 65,535 TxId-allocator ceiling, but that's realistically unreachable under any normal load. **Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.** | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `clients.atCapWarning` | Boolean: `clients.connected >= 3` (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h | | `clients.atCapBlocked` | Boolean: `clients.connected >= 4` (cap reached) | Derived | Cell highlight | True → page | shared | **Why this mattered (pre-Phase-9).** The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see [design.md](design.md) → "Connection model" and [DL260/dl205.md](../DL260/dl205.md) → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments. ### 2.2 Error breakdown / heatmap | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `partialBcd.byClient` | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h | | `invalidBcd.byAddress` | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h | | `exceptions.byCodeRate` | Per-exception-code rate over 5 min | Derived from `exceptionsByCode.*` | Stacked bar | Code 04 (Slave Failure) spike → PLC in PROGRAM mode? | 2 h | **Why this matters.** Once you've seen `partialBcdWarnings = 1247`, the next question is *which client* and *which tag*. Without dimensional breakdown, you have to ssh into the log file to find out. ### 2.3 Hot-reload cadence | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `config.reloadsPerHour` | Reload events per hour | Derived from `configReloadCount` | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h | | `config.lastReloadDelta` | Summary of what changed on last reload | Already in `mbproxy.config.reload.applied` event; surface here | Text snippet | None — informational | 2 h | **Why this matters.** Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured. ### 2.4a Response-cache memory — **shipped in [Phase 11](plan/11-response-cache.md)** When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC `MaxEntriesPerPlc` cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob. | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) | | `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) | **Why this matters.** Cache entries are short-lived (TTLs are typically seconds, not minutes). A `cacheEntryCount` that sits at `MaxEntriesPerPlc` for long stretches says "the LRU is constantly evicting" — either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks. `cacheBytes` is the memory-side counter: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression. ### 2.4 Memory / process health | KPI | Definition | Source | Widget | Alert | Effort | |-----|------------|--------|--------|-------|--------| | `process.workingSetMb` | `Process.GetCurrentProcess().WorkingSet64 / 1MB` | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h | | `process.gcCollections.gen0/1/2` | GC counts per generation | `GC.CollectionCount(n)` | Sparkline | Gen-2 frequency → memory pressure | 0.5 h | | `process.threadCount` | `Process.Threads.Count` | New | Stat tile | > 200 → leak? | 0.5 h | **Why this matters.** A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch 90 % of common leak patterns. Each is one `Process` API call, no perf overhead. --- ## Real-time updates via SignalR Today's status surface is poll-based: the HTML page uses a 5-second `meta-refresh`, and Prometheus / custom HMI scrapers hit `/status.json` on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a **live fleet dashboard with many panels open**, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping `bound → recovering`) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update. **The recommendation is additive, not replacement.** Keep `/status.json` for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change. ### Why this is cheap to add The `Microsoft.AspNetCore.App` framework reference that Phase 07 added to the csproj **already includes `Microsoft.AspNetCore.SignalR`** — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on `Mbproxy.AdminPort`. No additional port, no additional listener supervision, no additional shutdown path. ### Architecture ``` ┌─→ Dashboard A (subscribed to "all") ProxyWorker / Supervisors ──┐ │ ConfigReconciler ───────────┤ │ ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer") ServiceCounters ────────────┘ (background loop + │ immediate-push paths) └─→ Dashboard C (subscribed to "service") ``` - **`StatusHub : Hub`** — the SignalR endpoint mounted at `/hub/status` on `AdminPort`. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates. - **`StatusBroadcaster : IHostedService`** — the background pusher. Holds a `Timer` (or `PeriodicTimer`) that ticks at `PushIntervalMs` (default 1000 ms), builds a `StatusResponse` via the existing `StatusSnapshotBuilder`, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes `PushEventAsync(name, props)` for the immediate-push paths. - **Immediate-push wiring** — the existing log events (`mbproxy.listener.recovered`, `mbproxy.config.reload.applied`, `mbproxy.backend.failed`, `mbproxy.rewrite.partial_bcd`, etc.) gain a fan-out call to `broadcaster.PushEventAsync(...)` so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick. ### Hub contract **Hub URL:** `https://:/hub/status` **Hub groups** — clients subscribe to scopes; the server broadcasts to matching groups: | Group | Receives | |-------|----------| | `all` | Every update for every PLC + every service-level event | | `service` | Service-level events only (`mbproxy.config.*`, `mbproxy.admin.*`, `mbproxy.startup.*`, `mbproxy.shutdown.*`) | | `plc:` | One PLC's snapshots + that PLC's events | **Server-side methods** (client → server): | Method | Purpose | |--------|---------| | `Task SubscribeFleet()` | Join group `all` | | `Task SubscribeService()` | Join group `service` | | `Task SubscribePlc(string name)` | Join group `plc:` after validating that `name` exists in current options | | `Task Unsubscribe()` | Leave every group; the connection stays open but receives nothing | **Client-side callbacks** (server → client, named `On*` per SignalR convention): | Callback | Payload | When | |----------|---------|------| | `OnSnapshot(StatusResponse snapshot)` | Full snapshot of the relevant scope (`all`, `service`, or a single PLC) | Sent once on subscribe so the dashboard has a baseline; thereafter only on initial reconnect | | `OnPatch(StatusPatch patch)` | Delta of fields that changed since the last push | Periodic — every `PushIntervalMs` if anything changed; skipped if nothing changed | | `OnEvent(StatusEvent ev)` | Single discrete event: `{ name, levelString, plc?, propertiesJson, timestampUtc }` | Immediately — fan-out from the existing `[LoggerMessage]` event call sites | `StatusPatch` carries only the fields that changed since the previous push: it's a `Dictionary` keyed by JSON path (e.g., `"plcs[2].pdus.forwarded"`, `"plcs[2].listener.state"`). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle. ### What gets pushed, and when | Update kind | Cadence | Volume per PLC | Channel | |-------------|---------|----------------|---------| | Counter increments (PDUs, bytes, rewrites) | Every `PushIntervalMs` if changed; coalesced | 1 patch / push tick / subscribed group | `OnPatch` | | State transitions (`bound ↔ recovering ↔ stopped`) | Immediate | 1 event + 1 patch | `OnEvent` + `OnPatch` | | Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | `OnEvent` | | Hot-reload applied / rejected | Immediate | 1 event with `propertiesJson` summary | `OnEvent` | | Periodic full snapshot | Every 60 s | 1 full snapshot | `OnSnapshot` | The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth. ### Configuration Extend `appsettings.json` with: ```jsonc "Mbproxy": { // ... existing keys ... "Admin": { "SignalR": { "Enabled": true, "PushIntervalMs": 1000, // patch cadence "FullSnapshotIntervalMs": 60000, // periodic re-baseline "MaxConcurrentClients": 32, // refuse new connections beyond this "MaxGroupsPerClient": 8 // anti-runaway-subscription guard } } } ``` Defaults make the feature opt-in-able-by-omission: if `SignalR.Enabled = false`, the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality — first ship with restart-required. ### Implementation outline 1. **Hub class** — `src/Mbproxy/Admin/StatusHub.cs`. Inherits `Hub`. Implements the four `Subscribe*` / `Unsubscribe` methods. `OnConnectedAsync` rejects if `Context.Items.Count > MaxConcurrentClients` (track in a static `ConcurrentDictionary` indexed by `ConnectionId`). 2. **Broadcaster** — `src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService`. Constructor takes `IHubContext`, `StatusSnapshotBuilder`, `IOptionsMonitor`. The push loop is a `while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... }` body — wins over `Timer` for cancellation correctness. 3. **DTOs** — `StatusPatch` and `StatusEvent` records added to `StatusDto.cs`, registered with the source-gen `StatusJsonContext`. 4. **Event fan-out** — the existing `[LoggerMessage]` partial methods stay; add a thin `RealtimeLogEvents` wrapper class that logs AND calls `broadcaster.PushEventAsync(...)`. Call sites in supervisors / pipelines / reconciler swap to the wrapper. Keeps log-only call sites and broadcast-too call sites both readable. 5. **Hub mapping** — `AdminEndpointHost` adds `app.MapHub("/hub/status")` if `SignalR.Enabled`. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint. 6. **Shutdown** — `StatusBroadcaster.StopAsync` cancels its pump and the hub's `Dispose` chain handles connection teardown. The existing `ShutdownCoordinator` deadline applies. ### Test approach Use the **`Microsoft.AspNetCore.SignalR.Client`** package (NuGet) in the test csproj only. Pattern: ```csharp [Fact] [Trait("Category", "E2E")] public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException() { // Arrange: start host on a random AdminPort, build a SignalR client. var connection = new HubConnectionBuilder() .WithUrl($"http://localhost:{adminPort}/hub/status") .Build(); var patches = new ConcurrentQueue(); connection.On("OnPatch", patches.Enqueue); await connection.StartAsync(TestContext.Current.CancellationToken); await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken); // Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1). // ... drive request through proxy ... // Assert: a patch with backend.connectsFailed != 0 arrives within 500 ms. var deadline = DateTime.UtcNow.AddMilliseconds(500); while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"))) await Task.Delay(20, TestContext.Current.CancellationToken); patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")); } ``` Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly. Coverage targets for the new tests: 1. `SignalR_Subscribe_DeliversInitialSnapshot` 2. `SignalR_Patch_FiresWithinPushInterval_AfterCounterChange` 3. `SignalR_Event_FiresWithin_100ms_OfListenerRecovered` 4. `SignalR_SubscribePlc_OnlyDeliversThatPlcEvents` — verifies group filtering 5. `SignalR_MaxConcurrentClients_RefusesExcess` — capacity guard 6. `SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs` ### Operational considerations - **Authentication / authorisation.** Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy. - **Transport.** SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport. - **Backpressure.** `Hub.Clients.Group("all").SendAsync` does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy. - **Reconnection.** The .NET / browser SignalR clients reconnect automatically with exponential backoff. The periodic full snapshot every 60 s ensures the dashboard re-baselines after a reconnect even without explicit re-subscription logic on the client side. - **Cardinality at scale.** 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The `MaxConcurrentClients` guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy. - **CORS.** If dashboards run on a different origin (likely), enable CORS on the admin app for `/hub/status` only. Add `AdminCors.AllowedOrigins` to `appsettings.json` as an array of allowed origin strings; an empty array means same-origin only. - **Logging.** SignalR's internal logs are noisy at Information. In `appsettings.json`, set the `Microsoft.AspNetCore.SignalR` category to `Warning` and `Microsoft.AspNetCore.Http.Connections` to `Warning` so the proxy's own event stream isn't drowned out. ### Effort estimate | Work | Hours | |------|-------| | Hub + DTOs + broadcaster | 6 h | | Event fan-out wiring (existing log events) | 3 h | | AdminEndpointHost integration + appsettings binding | 2 h | | E2E test suite (6 tests using SignalR .NET client) | 4 h | | Documentation (this section graduates from proposal to fact; design.md update) | 1 h | | **Total** | **~16 h** | This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production. --- ## Implementation notes ### Where rates and percentiles should live Two reasonable answers: 1. **Compute in the proxy, expose pre-computed values in `/status.json`.** Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change. 2. **Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates.** Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar. **Recommendation:** ship Tier 1 rate metrics computed in-process for the operator who just opens `http://:8080/` in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative. ### Counter additions vs computed values A few proposed KPIs require **new counters in `ProxyCounters` or `ServiceCounters`**, not just derivations: - `pdus.lastForwardedUtc` — new `volatile long _lastForwardedTicks` on `ProxyCounters`. - `listener.boundRatio.*` — new `StateDurationTracker` on `PlcListenerSupervisor`. - `partialBcd.byClient` / `invalidBcd.byAddress` — new `ConcurrentDictionary` / `ConcurrentDictionary` on `PerPlcContext`. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases). - `process.*` — read fresh on every snapshot from `Process.GetCurrentProcess()` — no stored state. ### Snapshot serialization cost `StatusResponse` is built per-request to `/status.json`. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., `invalidBcd.byAddress`) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep `/status.json` under a few hundred KB even when something goes badly wrong. ### Dashboard widget mapping (Grafana-style cheat sheet) | Widget | Use for | |--------|---------| | **Stat (big number)** | Service-wide aggregates, counts, latest timestamps | | **Gauge** | Ratios (availability, success rate, queue depth) | | **Sparkline** | Rates, percentiles, time-series trends | | **Stacked area** | Bandwidth, PDU-by-FC breakdown over time | | **Heatmap** | Per-address / per-client dimensional breakdowns | | **Cell-coloured table** | Per-PLC status (54 rows, one per PLC, columns of KPIs) | ### Backwards-compat policy The fields currently in `/status.json` are **frozen** — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in [`design.md`](design.md) → "Status page" as the contract; new fields ship via PRs that update the contract first. ## Cross-references - Field tables for what ships today: [`design.md`](design.md) → "Status page". - Stable log event names (some KPIs are derivable by tailing these): [`design.md`](design.md) → "Logging" event-name table. - Per-counter wiring lives in `src/Mbproxy/Proxy/ProxyCounters.cs` and `src/Mbproxy/ServiceCounters.cs`. - The status HTML page is rendered by `src/Mbproxy/Admin/StatusHtmlRenderer.cs`; the JSON DTOs and source-gen context live in `src/Mbproxy/Admin/StatusDto.cs`.