wwtools/mbproxy/docs/kpi.md
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00


mbproxy — Dashboard KPI catalogue

Recommended additions to the /status.json and / admin endpoints to make a production fleet dashboard genuinely useful, grouped by tier. Today's /status.json exposes raw cumulative counters; this doc describes what operators typically also expect when those counters land in Grafana / Wonderware / a custom HMI.

Scope. This is a proposal, not a contract. The endpoint shape settled in design.md → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect.

Reading guide. Each KPI has:

  • Name — short identifier matching the proxy's existing camelCase convention.
  • Definition — what the number means.
  • Source — where the value comes from (existing counter, new counter, derived).
  • Widget — typical dashboard visualisation.
  • Alert — common threshold or anomaly rule (where applicable).
  • Effort — implementation cost in hours (rough order-of-magnitude).

What's exposed today (recap)

For context: every recommendation below adds to this list. Today's /status.json carries:

| Group | Fields |
|---|---|
| Service | uptimeSeconds, version, configLastReloadUtc, configReloadCount, configReloadRejectedCount |
| Listeners | bound, configured |
| Per-PLC listener | state, lastBindError, recoveryAttempts |
| Per-PLC clients | connected, remoteEndpoints[] (remote, connectedAtUtc, pdusForwarded) |
| Per-PLC PDUs | forwarded, byFc.{fc03,fc04,fc06,fc16,other}, rewrittenSlots, partialBcdWarnings |
| Per-PLC backend | connectsSuccess, connectsFailed, exceptionsByCode.{code01..code04}, lastRoundTripMs |
| Per-PLC bytes | upstreamIn, upstreamOut |

Counters are cumulative since process start. A restart resets them.

Tier 1

These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard."

1.1 Rate metrics (per-PLC and fleet-wide)

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| pdus.ratePerSec.last1m | PDU rate over the last 60 s | New per-PLC ring buffer (60 × 1 s samples) | Sparkline per PLC | None (informational) | 4 h |
| pdus.ratePerSec.last5m | Same over 5 min | Same buffer at 300 s | Sparkline | None | shared |
| errors.ratePerMin | Sum of exceptionsByCode.* + partialBcdWarnings + invalidBcdWarnings per minute | Derived | Stat tile per PLC | > 10/min → page | 2 h |
| bytes.ratePerSec.up / .down | Bandwidth each direction | Derived from bytesUpstreamIn/Out deltas | Stacked area | None (informational) | 2 h |
| fleet.totalPdusPerSec | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h |

Why this matters. Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing rate(pdus_forwarded[1m]) across the 54-PLC fleet is the single most informative widget on the dashboard.

Implementation note. Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want the rates in /status.json directly, add a per-PLC Mbproxy.Proxy.RateTracker with a fixed-size circular buffer of one-second samples and expose RatePerSec1m and RatePerSec5m. The buffer needs 300 samples to cover the 5-minute window; 60 samples are enough only for the 1-minute rate.
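
A minimal sketch of such a tracker, assuming a once-per-second Tick() driven by a timer elsewhere in the service. The member names other than RatePerSec1m / RatePerSec5m are hypothetical, and the class is not thread-safe as written:

```csharp
using System;

// Hypothetical sketch: members other than RatePerSec1m/RatePerSec5m are assumptions.
public sealed class RateTracker
{
    private readonly long[] _slots;  // one counter per second, plus the slot being filled
    private int _head;               // index of the slot currently being filled
    private int _completed;          // completed slots available, at most _slots.Length - 1

    public RateTracker(int windowSeconds = 300)
    {
        if (windowSeconds <= 0) throw new ArgumentOutOfRangeException(nameof(windowSeconds));
        _slots = new long[windowSeconds + 1];
    }

    // Forwarding path: count one PDU in the current second.
    // Not thread-safe as written; a real counter would use Interlocked.Increment.
    public void Increment() => _slots[_head]++;

    // Called once per second: close the current slot and start a fresh one.
    public void Tick()
    {
        _head = (_head + 1) % _slots.Length;
        _slots[_head] = 0;
        if (_completed < _slots.Length - 1) _completed++;
    }

    // Mean PDUs per second over the most recent completed seconds.
    public double RatePerSec(int seconds)
    {
        int n = Math.Min(seconds, _completed);
        if (n <= 0) return 0;
        long sum = 0;
        for (int i = 1; i <= n; i++)
            sum += _slots[(_head - i + _slots.Length) % _slots.Length];
        return (double)sum / n;
    }

    public double RatePerSec1m => RatePerSec(60);
    public double RatePerSec5m => RatePerSec(300);
}
```

Until a full window has elapsed, the rate is averaged over the seconds observed so far, so a freshly started proxy does not report an artificially low rate.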

1.2 Latency percentiles (replacing the bare EWMA)

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.roundTripMs.p50 | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h |
| backend.roundTripMs.p95 | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared |
| backend.roundTripMs.p99 | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared |
| backend.roundTripMs.max1m | Slowest single PDU in last 1 min | Running max kept alongside the reservoir (a sampled reservoir can drop the true maximum) | Stat tile | > 5 s → page | shared |

Why this matters. The existing lastRoundTripMs is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently.

Implementation note. Use Mbproxy.Proxy.LatencyReservoir — a 256-sample reservoir with Vitter's Algorithm R for unbiased sampling under arbitrary throughput. Don't store every sample (a busy PLC at 100 PDU/s × 60 s = 6,000 samples/min × 54 PLCs = 324K samples/min, too much).
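
A sketch of the reservoir under stated assumptions: the method names, the nearest-rank percentile rule, and the per-window Reset() are illustrative choices, not the shipped API. Record() follows Algorithm R (keep the first k samples, then replace a random slot with probability k/n):

```csharp
using System;

// Hypothetical sketch of a latency reservoir; names and percentile rule are assumptions.
public sealed class LatencyReservoir
{
    private readonly double[] _samples;
    private readonly Random _rng;
    private long _seen;  // observations offered since the last Reset

    public LatencyReservoir(int size = 256, int? seed = null)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException(nameof(size));
        _samples = new double[size];
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    // Algorithm R: fill the reservoir first, then overwrite a uniformly chosen
    // slot with probability size / seen, so every observation is equally likely to survive.
    public void Record(double roundTripMs)
    {
        _seen++;
        if (_seen <= _samples.Length)
        {
            _samples[_seen - 1] = roundTripMs;
            return;
        }
        long j = _rng.NextInt64(_seen);  // uniform in [0, _seen)
        if (j < _samples.Length) _samples[j] = roundTripMs;
    }

    // Nearest-rank percentile over the retained samples; NaN when empty.
    public double Percentile(double p)
    {
        int count = (int)Math.Min(_seen, _samples.Length);
        if (count == 0) return double.NaN;
        var sorted = new double[count];
        Array.Copy(_samples, sorted, count);
        Array.Sort(sorted);
        int rank = (int)Math.Ceiling(p / 100.0 * count);
        return sorted[Math.Clamp(rank, 1, count) - 1];
    }

    // Start a new measurement window (for example once per minute).
    public void Reset() => _seen = 0;
}
```

Sorting 256 doubles per snapshot is cheap, which is the point of bounding the sample: the cost is fixed regardless of PDU rate.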

1.3 Per-PLC availability ratio

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| listener.boundRatio.last1h | Fraction of time in bound state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h |
| listener.boundRatio.sinceStart | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared |
| listener.timeInRecoveringMs.last1h | Total time spent recovering in last hour | Same tracker | Stat tile | > 60 s → warn | shared |

Why this matters. recoveryAttempts tells you how many times something has flapped, but not how much downtime that represented. A PLC that recovers in 1 s once an hour is healthy (bound ratio ≈ 0.9997); one that takes 90 s to recover every 10 min is degraded (bound ratio ≈ 0.85). The ratio captures this directly.

Implementation note. Each PlcListenerSupervisor already has a state machine. Add a StateDurationTracker that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window.
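
A sketch of the accumulation step, assuming a three-state enum and caller-supplied timestamps (both hypothetical). It computes the since-start ratio only; the sliding one-hour window would additionally need a bounded log of recent transitions:

```csharp
using System;
using System.Collections.Generic;

// Hypothetical state names; the real supervisor's state machine may differ.
public enum ListenerState { Bound, Recovering, Stopped }

public sealed class StateDurationTracker
{
    private readonly Dictionary<ListenerState, TimeSpan> _totals = new();
    private readonly DateTime _startedUtc;
    private ListenerState _current;
    private DateTime _enteredUtc;

    public StateDurationTracker(ListenerState initial, DateTime nowUtc)
    {
        _current = initial;
        _startedUtc = nowUtc;
        _enteredUtc = nowUtc;
    }

    // Called by the supervisor on every state transition.
    public void Transition(ListenerState next, DateTime nowUtc)
    {
        Accumulate(nowUtc);
        _current = next;
    }

    // Total time spent in a state, including the still-open current interval.
    public TimeSpan TimeIn(ListenerState state, DateTime nowUtc)
    {
        Accumulate(nowUtc);
        return _totals.GetValueOrDefault(state);
    }

    // Fraction of the tracker's lifetime spent in the given state.
    public double RatioSinceStart(ListenerState state, DateTime nowUtc)
    {
        TimeSpan lifetime = nowUtc - _startedUtc;
        if (lifetime <= TimeSpan.Zero) return 0;
        return TimeIn(state, nowUtc).TotalSeconds / lifetime.TotalSeconds;
    }

    // Credit the elapsed time to the current state and restart its interval.
    private void Accumulate(DateTime nowUtc)
    {
        _totals[_current] = _totals.GetValueOrDefault(_current) + (nowUtc - _enteredUtc);
        _enteredUtc = nowUtc;
    }
}
```

Passing the clock in as a parameter keeps the tracker deterministic under test; a production version might take a TimeProvider instead.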

1.4 Liveness / staleness signals

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| pdus.lastForwardedUtc | Wall time of the most recent forwarded PDU | New _lastForwardedTimestamp per PLC | Stat tile | now - value > 5 min AND clients.connected > 0 → page | 1 h |
| clients.lastActivityUtc | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in remoteEndpoints | None | 1 h |
| staleClients.count | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h |

Why this matters. Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with clients.connected = 2 but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured.

1.5 Service-wide fleet aggregates

These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard.

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| fleet.plcsHealthy | Count of PLCs in bound state with no errors in last 5 min | Aggregate | Big number, green | < listeners.configured - 2 → warn | 2 h |
| fleet.plcsRecovering | Count in recovering state | Aggregate | Big number, orange | > 0 → informational | shared |
| fleet.plcsStopped | Count in stopped state | Aggregate | Big number, grey | > 0 → page | shared |
| fleet.plcsWithActiveErrors | Count with errors.ratePerMin > 0 | Aggregate | Big number, red | > 0 → page | shared |
| fleet.totalClientsConnected | Sum of clients.connected | Aggregate | Stat tile | None | 1 h |
| fleet.totalRewrittenSlotsPerSec | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared |

Why this matters. A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table.
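
As an illustration, the header numbers reduce to a handful of counts over the per-PLC rows. The record shapes below are assumptions made for the sketch, not the shipped StatusResponse DTOs:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-PLC row; the shipped DTO has a different, richer shape.
public sealed record PlcStatus(
    string Name, string ListenerState, long ErrorsLast5Min, double ErrorsPerMin, int ClientsConnected);

public sealed record FleetSummary(
    int PlcsHealthy, int PlcsRecovering, int PlcsStopped, int PlcsWithActiveErrors, int TotalClientsConnected);

public static class FleetAggregator
{
    // One pass per KPI keeps the mapping to the table above obvious;
    // at 54 rows the repeated enumeration is negligible.
    public static FleetSummary Summarise(IReadOnlyCollection<PlcStatus> plcs) => new(
        PlcsHealthy: plcs.Count(p => p.ListenerState == "bound" && p.ErrorsLast5Min == 0),
        PlcsRecovering: plcs.Count(p => p.ListenerState == "recovering"),
        PlcsStopped: plcs.Count(p => p.ListenerState == "stopped"),
        PlcsWithActiveErrors: plcs.Count(p => p.ErrorsPerMin > 0),
        TotalClientsConnected: plcs.Sum(p => p.ClientsConnected));
}
```

A PLC can be bound and still count under plcsWithActiveErrors, so the tiles are not guaranteed to sum to listeners.configured.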

1.6 Multiplexer state — shipped in Phase 9

The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth.

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.inFlightCount | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) |
| backend.maxInFlight | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent; realistic only under pathological load) | (in Phase 9 scope) |
| backend.txIdWraps | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase in wrap rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) |
| backend.queueDepth | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) |
| backend.disconnectCascades | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with mbproxy.backend.failed events | (in Phase 9 scope) |

Why this matters. Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's lastRoundTripMs measures wire latency only; queue depth reveals proxy-side backlog).

1.7 Read coalescing — requires Phase 10

After Phase 10 ships, same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric.

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.coalescedHitCount | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None (trend-watch) | (in Phase 10 scope) |
| backend.coalescedMissCount | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None (trend-watch) | (in Phase 10 scope) |
| backend.coalescingRatio | Hit / (Hit + Miss) over the trailing window | Derived (dashboard) | Stat tile per PLC | None; a low ratio just means clients aren't synchronised on the same registers (informational) | (in Phase 10 scope) |
| backend.coalescedResponseToDeadUpstream | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable | (in Phase 10 scope) |

Why this matters. The coalescing ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request, which is roughly a 60% reduction in backend FC03/04 PDU rate compared with the pre-Phase-10 model. The dead-upstream counter is a churn indicator that is invisible in any other metric.
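
Because the hit and miss counters are cumulative, the trailing-window ratio comes from the difference between two scrapes. A small sketch, with hypothetical type and method names, that also guards against a counter reset between scrapes:

```csharp
// Hypothetical helper; on a dashboard the same arithmetic is usually a query expression.
public readonly record struct HitMiss(long Hits, long Misses);

public static class RatioMath
{
    // Returns Hit / (Hit + Miss) for the interval between two cumulative samples,
    // or null when there was no traffic or the counters went backwards (process restart).
    public static double? WindowRatio(HitMiss earlier, HitMiss later)
    {
        long hits = later.Hits - earlier.Hits;
        long misses = later.Misses - earlier.Misses;
        long total = hits + misses;
        if (hits < 0 || misses < 0 || total == 0) return null;
        return (double)hits / total;
    }
}
```

For example, samples of (100 hits, 100 misses) and then (160, 140) give 60 hits out of 100 reads in the window, a ratio of 0.6.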

1.8 Response cache — requires Phase 11

After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries.

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.cacheHitCount | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None (informational) | (in Phase 11 scope) |
| backend.cacheMissCount | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None (informational) | (in Phase 11 scope) |
| backend.cacheHitRatio | Hit / (Hit + Miss) for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) |
| backend.cacheInvalidations | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) |

Why this matters. Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts.


Tier 2 — nice-to-have

Reach for these once Tier 1 is solid. They add depth for specific operational scenarios.

2.1 Connection-cap saturation warning

Status: superseded by Phase 9. This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. Since Phase 9, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect, so the ECOM's 4-client cap is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is backend.inFlightCount (Tier 1.6) measured against the 16-bit TxId space (65,536 values), a ceiling that is realistically unreachable under normal load. Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| clients.atCapWarning | Boolean: clients.connected >= 3 (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h |
| clients.atCapBlocked | Boolean: clients.connected >= 4 (cap reached) | Derived | Cell highlight | True → page | shared |

Why this mattered (pre-Phase-9). The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see design.md → "Connection model" and DL260/dl205.md → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments.

2.2 Error breakdown / heatmap

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| partialBcd.byClient | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h |
| invalidBcd.byAddress | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h |
| exceptions.byCodeRate | Per-exception-code rate over 5 min | Derived from exceptionsByCode.* | Stacked bar | Code 04 (Slave Device Failure) spike → PLC in PROGRAM mode? | 2 h |

Why this matters. Once you've seen partialBcdWarnings = 1247, the next question is which client and which tag. Without a dimensional breakdown, you have to SSH into the host and search the log file to find out.

2.3 Hot-reload cadence

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| config.reloadsPerHour | Reload events per hour | Derived from configReloadCount | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h |
| config.lastReloadDelta | Summary of what changed on last reload | Already in mbproxy.config.reload.applied event; surface here | Text snippet | None (informational) | 2 h |

Why this matters. Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.

2.4 Memory / process health

| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| process.workingSetMb | Process.GetCurrentProcess().WorkingSet64 / 1MB | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h |
| process.gcCollections.gen0/1/2 | GC counts per generation | GC.CollectionCount(n) | Sparkline | Gen-2 frequency → memory pressure | 0.5 h |
| process.threadCount | Process.Threads.Count | New | Stat tile | > 200 → leak? | 0.5 h |

Why this matters. A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch the most common leak patterns (managed heap growth, GC pressure, thread leaks). Each is a single Process or GC API call, with negligible overhead at snapshot cadence.
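
A sketch of reading all three values fresh on each snapshot. The ProcessHealth record and ProcessHealthReader class are hypothetical names; the Process and GC calls are the standard .NET APIs named in the table:

```csharp
using System;
using System.Diagnostics;

// Hypothetical DTO for the process.* KPIs.
public sealed record ProcessHealth(double WorkingSetMb, int Gen0, int Gen1, int Gen2, int ThreadCount);

public static class ProcessHealthReader
{
    public static ProcessHealth Read()
    {
        // Process holds an OS handle, so dispose it after reading.
        using var proc = Process.GetCurrentProcess();
        return new ProcessHealth(
            WorkingSetMb: proc.WorkingSet64 / (1024.0 * 1024.0),
            Gen0: GC.CollectionCount(0),
            Gen1: GC.CollectionCount(1),
            Gen2: GC.CollectionCount(2),
            ThreadCount: proc.Threads.Count);
    }
}
```

The GC counts are cumulative since process start, so the dashboard should chart their rate of change rather than the raw value.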


Real-time updates via SignalR

Today's status surface is poll-based: the HTML page uses a 5-second meta-refresh, and Prometheus / custom HMI scrapers hit /status.json on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a live fleet dashboard with many panels open, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping bound → recovering) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update.

The recommendation is additive, not replacement. Keep /status.json for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change.

Why this is cheap to add

The Microsoft.AspNetCore.App framework reference that Phase 07 added to the csproj already includes Microsoft.AspNetCore.SignalR — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on Mbproxy.AdminPort. No additional port, no additional listener supervision, no additional shutdown path.

Architecture

                                                                   ┌─→ Dashboard A (subscribed to "all")
ProxyWorker / Supervisors ──┐                                      │
ConfigReconciler ───────────┤                                      │
ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer")
ServiceCounters ────────────┘    (background loop +                │
                                  immediate-push paths)            └─→ Dashboard C (subscribed to "service")

  • StatusHub : Hub — the SignalR endpoint mounted at /hub/status on AdminPort. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates.
  • StatusBroadcaster : IHostedService — the background pusher. Holds a Timer (or PeriodicTimer) that ticks at PushIntervalMs (default 1000 ms), builds a StatusResponse via the existing StatusSnapshotBuilder, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes PushEventAsync(name, props) for the immediate-push paths.
  • Immediate-push wiring — the existing log events (mbproxy.listener.recovered, mbproxy.config.reload.applied, mbproxy.backend.failed, mbproxy.rewrite.partial_bcd, etc.) gain a fan-out call to broadcaster.PushEventAsync(...) so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick.

Hub contract

Hub URL: https://<host>:<AdminPort>/hub/status

Hub groups — clients subscribe to scopes; the server broadcasts to matching groups:

| Group | Receives |
|---|---|
| all | Every update for every PLC + every service-level event |
| service | Service-level events only (mbproxy.config.*, mbproxy.admin.*, mbproxy.startup.*, mbproxy.shutdown.*) |
| plc:<Name> | One PLC's snapshots + that PLC's events |

Server-side methods (client → server):

| Method | Purpose |
|---|---|
| Task SubscribeFleet() | Join group all |
| Task SubscribeService() | Join group service |
| Task SubscribePlc(string name) | Join group plc:<name> after validating that name exists in current options |
| Task Unsubscribe() | Leave every group; the connection stays open but receives nothing |

Client-side callbacks (server → client, named On* per SignalR convention):

| Callback | Payload | When |
|---|---|---|
| OnSnapshot(StatusResponse snapshot) | Full snapshot of the relevant scope (all, service, or a single PLC) | Sent once on subscribe so the dashboard has a baseline, again when a client re-subscribes after a reconnect, and at the periodic re-baseline (every FullSnapshotIntervalMs) |
| OnPatch(StatusPatch patch) | Delta of fields that changed since the last push | Periodic: every PushIntervalMs if anything changed; skipped if nothing changed |
| OnEvent(StatusEvent ev) | Single discrete event: { name, levelString, plc?, propertiesJson, timestampUtc } | Immediately: fan-out from the existing [LoggerMessage] event call sites |

StatusPatch carries only the fields that changed since the previous push: it's a Dictionary<string, JsonElement> keyed by JSON path (e.g., "plcs[2].pdus.forwarded", "plcs[2].listener.state"). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle.
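
One way to produce such a patch is to flatten the previous and current snapshots into path-to-leaf maps and keep only the leaves whose JSON text differs. The sketch below assumes both snapshots are available as JsonElement trees; the StatusDiff class is hypothetical, and removed paths (for example a PLC dropped by a hot-reload) are left to the periodic full snapshot:

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical helper; the real broadcaster may diff typed DTOs instead of JSON trees.
public static class StatusDiff
{
    // Flattens a JSON tree into path -> leaf, e.g. "plcs[2].pdus.forwarded".
    public static Dictionary<string, JsonElement> Flatten(JsonElement root)
    {
        var leaves = new Dictionary<string, JsonElement>();
        Walk(root, "", leaves);
        return leaves;
    }

    private static void Walk(JsonElement el, string path, Dictionary<string, JsonElement> acc)
    {
        switch (el.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (var prop in el.EnumerateObject())
                    Walk(prop.Value, path.Length == 0 ? prop.Name : $"{path}.{prop.Name}", acc);
                break;
            case JsonValueKind.Array:
                int i = 0;
                foreach (var item in el.EnumerateArray())
                    Walk(item, $"{path}[{i++}]", acc);
                break;
            default:
                acc[path] = el;  // string, number, bool, or null leaf
                break;
        }
    }

    // Leaves that are new or whose raw JSON text changed since the previous snapshot.
    public static Dictionary<string, JsonElement> Diff(JsonElement previous, JsonElement current)
    {
        var before = Flatten(previous);
        var patch = new Dictionary<string, JsonElement>();
        foreach (var (path, value) in Flatten(current))
            if (!before.TryGetValue(path, out var old) || old.GetRawText() != value.GetRawText())
                patch[path] = value;
        return patch;
    }
}
```

An empty result is the "nothing changed" case in which the push tick is skipped.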

What gets pushed, and when

| Update kind | Cadence | Volume per PLC | Channel |
|---|---|---|---|
| Counter increments (PDUs, bytes, rewrites) | Every PushIntervalMs if changed; coalesced | 1 patch / push tick / subscribed group | OnPatch |
| State transitions (bound ↔ recovering ↔ stopped) | Immediate | 1 event + 1 patch | OnEvent + OnPatch |
| Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | OnEvent |
| Hot-reload applied / rejected | Immediate | 1 event with propertiesJson summary | OnEvent |
| Periodic full snapshot | Every 60 s | 1 full snapshot | OnSnapshot |

The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth.

Configuration

Extend appsettings.json with:

"Mbproxy": {
  // ... existing keys ...
  "Admin": {
    "SignalR": {
      "Enabled": true,
      "PushIntervalMs": 1000,            // patch cadence
      "FullSnapshotIntervalMs": 60000,   // periodic re-baseline
      "MaxConcurrentClients": 32,        // refuse new connections beyond this
      "MaxGroupsPerClient": 8            // anti-runaway-subscription guard
    }
  }
}

The feature is meant to be opt-in: if SignalR.Enabled = false, the hub is not mapped, the broadcaster is not started, and there is no runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality; the first version can ship as restart-required.

Implementation outline

  1. Hub class: src/Mbproxy/Admin/StatusHub.cs. Inherits Hub. Implements the three Subscribe* methods and Unsubscribe. OnConnectedAsync rejects the connection once the number of tracked connections has reached MaxConcurrentClients; track connections in a static ConcurrentDictionary<string, byte> keyed by ConnectionId and remove entries in OnDisconnectedAsync (Context.Items is per-connection state, so its Count does not reflect the number of clients).
  2. Broadcaster: src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService. Constructor takes IHubContext<StatusHub>, StatusSnapshotBuilder, IOptionsMonitor<MbproxyOptions>. The push loop is a PeriodicTimer body of the form while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... }, which is preferable to Timer for cancellation correctness.
  3. DTOs: StatusPatch and StatusEvent records added to StatusDto.cs, registered with the source-gen StatusJsonContext.
  4. Event fan-out: the existing [LoggerMessage] partial methods stay; add a thin RealtimeLogEvents wrapper class that logs and also calls broadcaster.PushEventAsync(...). Call sites in supervisors / pipelines / reconciler that should broadcast swap to the wrapper, which keeps log-only and broadcast-too call sites both readable.
  5. Hub mapping: AdminEndpointHost adds app.MapHub<StatusHub>("/hub/status") if SignalR.Enabled. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint.
  6. Shutdown: StatusBroadcaster.StopAsync cancels its pump and the hub's Dispose chain handles connection teardown. The existing ShutdownCoordinator deadline applies.
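
The capacity guard from step 1 can be isolated from SignalR so it is testable on its own. A sketch with a hypothetical ConnectionGate class: the hub would call TryAdmit(Context.ConnectionId) in OnConnectedAsync, call Context.Abort() on refusal, and call Release in OnDisconnectedAsync:

```csharp
using System.Collections.Concurrent;

// Hypothetical helper; the hub wiring described above is an assumption about usage.
public sealed class ConnectionGate
{
    private readonly ConcurrentDictionary<string, byte> _connections = new();
    private readonly int _maxConcurrentClients;

    public ConnectionGate(int maxConcurrentClients) => _maxConcurrentClients = maxConcurrentClients;

    public int Count => _connections.Count;

    // Registers the connection, then backs out if that pushed the count past the cap.
    // Under a race two simultaneous arrivals may both be refused, which errs on the safe side.
    public bool TryAdmit(string connectionId)
    {
        if (!_connections.TryAdd(connectionId, 0)) return true;  // already admitted
        if (_connections.Count <= _maxConcurrentClients) return true;
        _connections.TryRemove(connectionId, out _);
        return false;
    }

    // Safe to call for connections that were refused or never admitted.
    public void Release(string connectionId) => _connections.TryRemove(connectionId, out _);
}
```

Registering the gate as a singleton (rather than a static field on the hub) keeps hub instances stateless, since SignalR creates a new hub object per invocation.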

Test approach

Use the Microsoft.AspNetCore.SignalR.Client package (NuGet) in the test csproj only. Pattern:

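// Assumed usings: System.Collections.Concurrent, System.Linq, Microsoft.AspNetCore.SignalR.Client, Shouldly, Xunit.
[Fact]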
[Trait("Category", "E2E")]
public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException()
{
    // Arrange: start host on a random AdminPort, build a SignalR client.
    await using var connection = new HubConnectionBuilder()
        .WithUrl($"http://localhost:{adminPort}/hub/status")
        .Build();

    var patches = new ConcurrentQueue<StatusPatch>();
    connection.On<StatusPatch>("OnPatch", patches.Enqueue);
    await connection.StartAsync(TestContext.Current.CancellationToken);
    await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken);

    // Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1).
    // ... drive request through proxy ...

    // Assert: a patch with backend.connectsFailed != 0 arrives within 500 ms.
    var deadline = DateTime.UtcNow.AddMilliseconds(500);
    while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")))
        await Task.Delay(20, TestContext.Current.CancellationToken);

    patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"));
}

Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly.

Coverage targets for the new tests:

  1. SignalR_Subscribe_DeliversInitialSnapshot
  2. SignalR_Patch_FiresWithinPushInterval_AfterCounterChange
  3. SignalR_Event_FiresWithin_100ms_OfListenerRecovered
  4. SignalR_SubscribePlc_OnlyDeliversThatPlcEvents — verifies group filtering
  5. SignalR_MaxConcurrentClients_RefusesExcess — capacity guard
  6. SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs

Operational considerations

  • Authentication / authorisation. Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy.
  • Transport. SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport.
  • Backpressure. Hub.Clients.Group("all").SendAsync does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy.
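  • Reconnection. The .NET and browser SignalR clients reconnect automatically only when built with WithAutomaticReconnect() / withAutomaticReconnect(); the default retry delays are 0, 2, 10, and 30 seconds, after which the client stops trying. A reconnected client gets a new connection ID and is no longer in any group, so the dashboard should call its Subscribe* method again from the client's reconnected callback; the snapshot sent on subscribe then re-baselines the local model.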
  • Cardinality at scale. 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch = 864,000 bytes/s, or roughly 0.86 MB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The MaxConcurrentClients guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy.
  • CORS. If dashboards run on a different origin (likely), enable CORS on the admin app for /hub/status only. Add AdminCors.AllowedOrigins to appsettings.json as an array of allowed origin strings; an empty array means same-origin only.
  • Logging. SignalR's internal logs are noisy at Information. In appsettings.json, set the Microsoft.AspNetCore.SignalR category to Warning and Microsoft.AspNetCore.Http.Connections to Warning so the proxy's own event stream isn't drowned out.

Effort estimate

| Work | Hours |
|---|---|
| Hub + DTOs + broadcaster | 6 h |
| Event fan-out wiring (existing log events) | 3 h |
| AdminEndpointHost integration + appsettings binding | 2 h |
| E2E test suite (6 tests using SignalR .NET client) | 4 h |
| Documentation (this section graduates from proposal to fact; design.md update) | 1 h |
| Total | ~16 h |

This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production.


Implementation notes

Where rates and percentiles should live

Two reasonable answers:

  1. Compute in the proxy, expose pre-computed values in /status.json. Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change.
  2. Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates. Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar.

Recommendation: ship Tier 1 rate metrics computed in-process for the operator who just opens http://<host>:8080/ in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative.

Counter additions vs computed values

A few proposed KPIs require new counters in ProxyCounters or ServiceCounters, not just derivations:

  • pdus.lastForwardedUtc: new long _lastForwardedTicks on ProxyCounters, written with Interlocked.Exchange and read with Interlocked.Read (C# does not allow the volatile modifier on a long field).
  • listener.boundRatio.*: new StateDurationTracker on PlcListenerSupervisor.
  • partialBcd.byClient / invalidBcd.byAddress: new ConcurrentDictionary<string,long> / ConcurrentDictionary<ushort,long> on PerPlcContext. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases).
  • process.*: read fresh on every snapshot from Process.GetCurrentProcess(); no stored state.

Snapshot serialization cost

StatusResponse is built per-request to /status.json. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., invalidBcd.byAddress) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep /status.json under a few hundred KB even when something goes badly wrong.
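
A sketch of the serialization-side cap for invalidBcd.byAddress, using a hypothetical BoundedAddressCounter. It bounds what is written into /status.json; the in-memory map is already bounded by the 16-bit address space (at most 65,536 keys):

```csharp
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Linq;

// Hypothetical counter; the name and the tie-break rule are assumptions.
public sealed class BoundedAddressCounter
{
    private readonly ConcurrentDictionary<ushort, long> _counts = new();

    // Hot path: one atomic add-or-increment per invalid-BCD event.
    public void Increment(ushort address) => _counts.AddOrUpdate(address, 1, (_, n) => n + 1);

    // Snapshot path: the n busiest addresses, ties broken by address for stable output.
    public IReadOnlyList<KeyValuePair<ushort, long>> TopN(int n = 50) =>
        _counts.OrderByDescending(kv => kv.Value)
               .ThenBy(kv => kv.Key)
               .Take(n)
               .ToList();
}
```

The per-client map keyed by remote endpoint has no such natural bound, so it would also need eviction on the write path, not just a cap at serialization time.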

Dashboard widget mapping (Grafana-style cheat sheet)

| Widget | Use for |
|---|---|
| Stat (big number) | Service-wide aggregates, counts, latest timestamps |
| Gauge | Ratios (availability, success rate, queue depth) |
| Sparkline | Rates, percentiles, time-series trends |
| Stacked area | Bandwidth, PDU-by-FC breakdown over time |
| Heatmap | Per-address / per-client dimensional breakdowns |
| Cell-coloured table | Per-PLC status (54 rows, one per PLC, columns of KPIs) |

Backwards-compat policy

The fields currently in /status.json are frozen — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in design.md → "Status page" as the contract; new fields ship via PRs that update the contract first.

Cross-references

  • Field tables for what ships today: design.md → "Status page".
  • Stable log event names (some KPIs are derivable by tailing these): design.md → "Logging" event-name table.
  • Per-counter wiring lives in src/Mbproxy/Proxy/ProxyCounters.cs and src/Mbproxy/ServiceCounters.cs.
  • The status HTML page is rendered by src/Mbproxy/Admin/StatusHtmlRenderer.cs; the JSON DTOs and source-gen context live in src/Mbproxy/Admin/StatusDto.cs.