mbproxy — Dashboard KPI catalogue
Recommended additions to the /status.json and / admin endpoints to make a production fleet dashboard genuinely useful, grouped by tier. Today's /status.json exposes raw cumulative counters; this doc describes what's typically also expected when those counters land in Grafana / Wonderware / a custom HMI.
Scope. This is a proposal, not a contract. The endpoint shape settled in design.md → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect.
Reading guide. Each KPI has:
- Name — short identifier matching the proxy's existing camelCase convention.
- Definition — what the number means.
- Source — where the value comes from (existing counter, new counter, derived).
- Widget — typical dashboard visualisation.
- Alert — common threshold or anomaly rule (where applicable).
- Effort — implementation cost in hours (rough order-of-magnitude).
What's exposed today (recap)
For context — every recommended addition below is in addition to this list. Today's /status.json carries:
| Group | Fields |
|---|---|
| Service | uptimeSeconds, version, configLastReloadUtc, configReloadCount, configReloadRejectedCount |
| Listeners | bound, configured |
| Per-PLC listener | state, lastBindError, recoveryAttempts |
| Per-PLC clients | connected, remoteEndpoints[] (remote, connectedAtUtc, pdusForwarded) |
| Per-PLC PDUs | forwarded, byFc.{fc03,fc04,fc06,fc16,other}, rewrittenSlots, partialBcdWarnings |
| Per-PLC backend | connectsSuccess, connectsFailed, exceptionsByCode.{code01..code04}, lastRoundTripMs, inFlight, maxInFlight, txIdWraps, disconnectCascades, queueDepth, coalescedHitCount, coalescedMissCount, coalescedResponseToDeadUpstream, cacheHitCount, cacheMissCount, cacheInvalidations, cacheEntryCount, cacheBytes |
| Per-PLC bytes | upstreamIn, upstreamOut |
Counters are cumulative since process start. A restart resets them.
Tier 1 — strongly recommended for production
These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard."
1.1 Rate metrics (per-PLC and fleet-wide)
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `pdus.ratePerSec.last1m` | PDU rate over the last 60 s | New per-PLC ring buffer (60 × 1 s samples) | Sparkline per PLC | None (informational) | 4 h |
| `pdus.ratePerSec.last5m` | Same over 5 min | Same buffer at 300 s | Sparkline | None | shared |
| `errors.ratePerMin` | Sum of `exceptionsByCode.*` + `partialBcdWarnings` + `invalidBcdWarnings` per minute | Derived | Stat tile per PLC | > 10/min → page | 2 h |
| `bytes.ratePerSec.up` / `.down` | Bandwidth each direction | Derived from `bytesUpstreamIn/Out` deltas | Stacked area | None (informational) | 2 h |
| `fleet.totalPdusPerSec` | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h |
Why this matters. Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing rate(pdus_forwarded[1m]) on a 54-row fleet is the single most informative widget on the dashboard.
Implementation note. Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want them in /status.json directly, add a per-PLC Mbproxy.Proxy.RateTracker with a fixed-size circular buffer of 60 one-second samples and expose RatePerSec1m, RatePerSec5m.
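A minimal sketch of what such a tracker could look like. The class shape, the caller-supplied second counter, and the choice of 300 one-second buckets (so one buffer can answer both the 1 min and 5 min windows) are assumptions for illustration, not the shipped `Mbproxy.Proxy.RateTracker`.

```csharp
using System;

// Hypothetical sketch: fixed-size ring of one-second buckets, one instance per PLC.
public sealed class RateTracker
{
    private readonly long[] _buckets;       // PDU count per one-second slot
    private readonly object _gate = new();
    private long _headSecond;               // newest second the ring has been advanced to

    public RateTracker(int windowSeconds, long startSecond)
    {
        _buckets = new long[windowSeconds];
        _headSecond = startSecond;
    }

    // Record `count` PDUs observed during the given second.
    public void Record(long nowSecond, long count = 1)
    {
        lock (_gate)
        {
            Advance(nowSecond);
            _buckets[Index(nowSecond)] += count;
        }
    }

    // Average PDUs per second over the trailing `seconds` seconds (current second included).
    public double RatePerSec(long nowSecond, int seconds)
    {
        lock (_gate)
        {
            Advance(nowSecond);
            seconds = Math.Min(seconds, _buckets.Length);
            long sum = 0;
            for (int i = 0; i < seconds; i++)
                sum += _buckets[Index(nowSecond - i)];
            return (double)sum / seconds;
        }
    }

    // Zero every bucket between the previous head and now so stale counts never leak in.
    private void Advance(long nowSecond)
    {
        long gap = Math.Min(nowSecond - _headSecond, _buckets.Length);
        for (long i = 1; i <= gap; i++)
            _buckets[Index(_headSecond + i)] = 0;
        if (nowSecond > _headSecond) _headSecond = nowSecond;
    }

    private int Index(long second) =>
        (int)(((second % _buckets.Length) + _buckets.Length) % _buckets.Length);
}
```

With this shape, `RatePerSec1m` would be `RatePerSec(now, 60)` and `RatePerSec5m` would be `RatePerSec(now, 300)`. Passing the clock in as a parameter keeps the tracker deterministic under test.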
1.2 Latency percentiles (replacing the bare EWMA)
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `backend.roundTripMs.p50` | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h |
| `backend.roundTripMs.p95` | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared |
| `backend.roundTripMs.p99` | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared |
| `backend.roundTripMs.max1m` | Slowest single PDU in last 1 min | Same reservoir | Stat tile | > 5 s → page | shared |
Why this matters. The existing lastRoundTripMs is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently.
Implementation note. Use Mbproxy.Proxy.LatencyReservoir — a 256-sample reservoir with Vitter's Algorithm R for unbiased sampling under arbitrary throughput. Don't store every sample (a busy PLC at 100 PDU/s × 60 s = 6,000 samples/min × 54 PLCs = 324K samples/min, too much).
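The reservoir described above can be sketched as follows. This is an illustrative take on Algorithm R with a nearest-rank percentile; the member names, the optional seed, and the `Reset` hook (which a caller might invoke once per minute to bound the window) are assumptions rather than the real `LatencyReservoir` API.

```csharp
using System;

// Hypothetical sketch: fixed-capacity reservoir sample (Vitter's Algorithm R).
public sealed class LatencyReservoir
{
    private readonly double[] _samples;
    private readonly Random _rng;
    private readonly object _gate = new();
    private long _seen;                      // total samples offered since the last reset

    public LatencyReservoir(int capacity = 256, int? seed = null)
    {
        _samples = new double[capacity];
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    public void Add(double roundTripMs)
    {
        lock (_gate)
        {
            _seen++;
            if (_seen <= _samples.Length)
            {
                _samples[_seen - 1] = roundTripMs;   // fill phase: keep everything
                return;
            }
            // Replacement phase: the new sample survives with probability capacity / _seen.
            long j = _rng.NextInt64(_seen);          // uniform in [0, _seen)
            if (j < _samples.Length) _samples[j] = roundTripMs;
        }
    }

    // Nearest-rank percentile over the retained samples; NaN when empty.
    public double Percentile(double p)
    {
        lock (_gate)
        {
            int n = (int)Math.Min(_seen, _samples.Length);
            if (n == 0) return double.NaN;
            var sorted = new double[n];
            Array.Copy(_samples, sorted, n);
            Array.Sort(sorted);
            int rank = (int)Math.Ceiling(p / 100.0 * n);
            return sorted[Math.Clamp(rank, 1, n) - 1];
        }
    }

    // Start a fresh window.
    public void Reset()
    {
        lock (_gate) _seen = 0;
    }
}
```

Sorting 256 doubles per snapshot is cheap next to the JSON serialization that follows it. Note that `max1m` taken from a reservoir is only the maximum of the retained samples; an exact maximum would need a separate running max.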
1.3 Per-PLC availability ratio
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `listener.boundRatio.last1h` | Fraction of time in `bound` state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h |
| `listener.boundRatio.sinceStart` | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared |
| `listener.timeInRecoveringMs.last1h` | Total time spent `recovering` in last hour | Same tracker | Stat tile | > 60 s → warn | shared |
Why this matters. recoveryAttempts tells you how many times something has flapped, but not how much downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly.
Implementation note. Each PlcListenerSupervisor already has a state machine. Add a StateDurationTracker that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window.
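As a rough illustration of the accumulation step, the sketch below covers only the since-start ratio; the sliding one-hour window would additionally need time-bucketed totals. The class shape and the string state names are assumptions, not the supervisor's actual types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: accumulates total time spent in each listener state.
public sealed class StateDurationTracker
{
    private readonly Dictionary<string, TimeSpan> _totals = new();
    private readonly object _gate = new();
    private string _state;
    private DateTime _sinceUtc;

    public StateDurationTracker(string initialState, DateTime nowUtc)
    {
        _state = initialState;
        _sinceUtc = nowUtc;
    }

    // Call on every state-machine transition.
    public void Transition(string newState, DateTime nowUtc)
    {
        lock (_gate)
        {
            Accumulate(nowUtc);
            _state = newState;
        }
    }

    // Fraction of observed time spent in `state` since the tracker was created.
    public double Ratio(string state, DateTime nowUtc)
    {
        lock (_gate)
        {
            Accumulate(nowUtc);
            double total = 0;
            foreach (var span in _totals.Values) total += span.TotalMilliseconds;
            if (total <= 0) return 0;
            return _totals.TryGetValue(state, out var inState)
                ? inState.TotalMilliseconds / total
                : 0;
        }
    }

    // Credit the elapsed time to the current state and move the marker forward.
    private void Accumulate(DateTime nowUtc)
    {
        var elapsed = nowUtc - _sinceUtc;
        if (elapsed <= TimeSpan.Zero) return;
        _totals[_state] = _totals.GetValueOrDefault(_state) + elapsed;
        _sinceUtc = nowUtc;
    }
}
```

A listener that spends 10 s recovering out of 1000 s observed yields a `bound` ratio of 0.99, which sits exactly on the warn threshold in the table above.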
1.4 Liveness / staleness signals
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `pdus.lastForwardedUtc` | Wall time of the most recent forwarded PDU | New `_lastForwardedTimestamp` per PLC | Stat tile | `now - value > 5 min AND clients.connected > 0` → page | 1 h |
| `clients.lastActivityUtc` | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in `remoteEndpoints` | None | 1 h |
| `staleClients.count` | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h |
Why this matters. Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with clients.connected = 2 but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured.
1.5 Service-wide fleet aggregates
These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `fleet.plcsHealthy` | Count of PLCs in `bound` state with no errors in last 5 min | Aggregate | Big number, green | `< listeners.configured - 2` → warn | 2 h |
| `fleet.plcsRecovering` | Count in `recovering` state | Aggregate | Big number, orange | > 0 → informational | shared |
| `fleet.plcsStopped` | Count in `stopped` state | Aggregate | Big number, grey | > 0 → page | shared |
| `fleet.plcsWithActiveErrors` | Count with `errors.ratePerMin > 0` | Aggregate | Big number, red | > 0 → page | shared |
| `fleet.totalClientsConnected` | Sum of `clients.connected` | Aggregate | Stat tile | None | 1 h |
| `fleet.totalRewrittenSlotsPerSec` | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared |
Why this matters. A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table.
1.6 Multiplexer state — shipped in Phase 9
The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `backend.inFlightCount` | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) |
| `backend.maxInFlight` | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent; realistic only under pathological load) | (in Phase 9 scope) |
| `backend.txIdWraps` | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) |
| `backend.queueDepth` | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) |
| `backend.disconnectCascades` | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with `mbproxy.backend.failed` events | (in Phase 9 scope) |
Why this matters. Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's lastRoundTripMs measures wire latency only; queue depth reveals proxy-side backlog).
1.7 Read coalescing — shipped in Phase 10
Same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric. coalescedHitCount + coalescedMissCount equals total FC03/04 request count per snapshot — the math always balances.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `backend.coalescedHitCount` | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None (trend-watch) | (in Phase 10 scope) |
| `backend.coalescedMissCount` | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None (trend-watch) | (in Phase 10 scope) |
| `backend.coalescingRatio` | `Hit / (Hit + Miss)` over the trailing window | Derived (dashboard) | Stat tile per PLC | None; a low ratio just means clients aren't synchronised on the same registers (informational) | (in Phase 10 scope) |
| `backend.coalescedResponseToDeadUpstream` | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable (Tier 2 priority) | (in Phase 10 scope) |
Why this matters. Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.
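Because the hit and miss counters are cumulative, a trailing-window ratio has to be computed from the difference between two snapshots. A small sketch of that arithmetic follows; the `HitMissSample` type and `WindowRatio` helper are hypothetical names introduced here, and the same calculation would apply to the cache hit ratio.

```csharp
using System;

// Hypothetical pair of cumulative counters taken from one /status.json snapshot.
public readonly record struct HitMissSample(long Hits, long Misses);

public static class RatioMath
{
    // Hit / (Hit + Miss) between two cumulative samples.
    // Returns null when the window saw no requests or the counters went
    // backwards (for example after a process restart reset them).
    public static double? WindowRatio(HitMissSample older, HitMissSample newer)
    {
        long hits = newer.Hits - older.Hits;
        long misses = newer.Misses - older.Misses;
        if (hits < 0 || misses < 0) return null;

        long total = hits + misses;
        if (total == 0) return null;
        return (double)hits / total;
    }
}
```

For example, snapshots of (100 hits, 100 misses) and (160 hits, 140 misses) give 60 hits out of 100 requests in the window, the 60% case described above.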
1.8 Response cache — shipped in Phase 11
As of Phase 11, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries. The cache is OFF by default: operators opt tags in by setting CacheTtlMs > 0 on a BcdTagOptions entry (or DefaultCacheTtlMs > 0 on a PlcOptions entry).
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `backend.cacheHitCount` | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None (informational) | (in Phase 11 scope) |
| `backend.cacheMissCount` | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None (informational) | (in Phase 11 scope) |
| `backend.cacheHitRatio` | `Hit / (Hit + Miss)` for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) |
| `backend.cacheInvalidations` | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) |
Why this matters. Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts.
Tier 2 — nice-to-have
Reach for these once Tier 1 is solid. They add depth for specific operational scenarios.
2.1 Connection-cap saturation warning
Status: superseded by Phase 9. This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. Since Phase 9, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect, so the 4-client cap on the ECOM is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is backend.inFlightCount (Tier 1.6) against the 65,535 TxId-allocator ceiling, but that's realistically unreachable under any normal load. Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `clients.atCapWarning` | Boolean: `clients.connected >= 3` (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h |
| `clients.atCapBlocked` | Boolean: `clients.connected >= 4` (cap reached) | Derived | Cell highlight | True → page | shared |
Why this mattered (pre-Phase-9). The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see design.md → "Connection model" and DL260/dl205.md → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments.
2.2 Error breakdown / heatmap
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `partialBcd.byClient` | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h |
| `invalidBcd.byAddress` | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h |
| `exceptions.byCodeRate` | Per-exception-code rate over 5 min | Derived from `exceptionsByCode.*` | Stacked bar | Code 04 (Slave Failure) spike → PLC in PROGRAM mode? | 2 h |
Why this matters. Once you've seen partialBcdWarnings = 1247, the next question is which client and which tag. Without dimensional breakdown, you have to ssh into the host and search the log file to find out.
2.3 Hot-reload cadence
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `config.reloadsPerHour` | Reload events per hour | Derived from `configReloadCount` | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h |
| `config.lastReloadDelta` | Summary of what changed on last reload | Already in `mbproxy.config.reload.applied` event; surface here | Text snippet | None (informational) | 2 h |
Why this matters. Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.
2.4a Response-cache memory — shipped in Phase 11
When the Phase-11 response cache is enabled on a busy PLC, operators want to know how much in-process memory the cache is consuming and whether the per-PLC MaxEntriesPerPlc cap is being exercised. Both are operator-actionable tuning signals for the cache capacity knob.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `backend.cacheEntryCount` | Current per-PLC cache entry count (point-in-time) | Phase-11 snapshot | Sparkline per PLC | Sustained = `MaxEntriesPerPlc` → consider raising the cap | (in Phase 11 scope) |
| `backend.cacheBytes` | Approximation of cached PDU bytes for this PLC | Phase-11 snapshot | Sparkline per PLC | Trending up on a steady-state poll cadence → unbounded growth bug; investigate | (in Phase 11 scope) |
Why this matters. Cache entries are short-lived (TTLs are typically seconds, not minutes). A cacheEntryCount that sits at MaxEntriesPerPlc for long stretches says "the LRU is constantly evicting" — either the workload has more distinct keys than the cap, or the TTL is so long that nothing expires before the LRU kicks. cacheBytes is the memory-side counter: a 54-PLC fleet at 1000 entries × 250 bytes/PDU ≈ 13 MB total cache, easily within budget; surfacing the number lets operators raise the cap confidently or notice a regression.
2.4 Memory / process health
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| `process.workingSetMb` | `Process.GetCurrentProcess().WorkingSet64 / 1MB` | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h |
| `process.gcCollections.gen0/1/2` | GC counts per generation | `GC.CollectionCount(n)` | Sparkline | Gen-2 frequency → memory pressure | 0.5 h |
| `process.threadCount` | `Process.Threads.Count` | New | Stat tile | > 200 → leak? | 0.5 h |
Why this matters. A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch the most common leak patterns (managed heap growth, native memory growth, thread leaks). Each is a single Process or GC API call with negligible overhead.
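A sketch of reading the three values at snapshot time, assuming a hypothetical `ProcessHealth` record and `ProcessHealthReader` helper. Only `Process` and `GC` calls from the base class library are used.

```csharp
using System;
using System.Diagnostics;

// Hypothetical DTO for the Tier 2.4 fields.
public sealed record ProcessHealth(
    double WorkingSetMb, int Gen0Collections, int Gen1Collections, int Gen2Collections, int ThreadCount);

public static class ProcessHealthReader
{
    // Reads fresh values on every call; nothing is cached between snapshots.
    public static ProcessHealth Read()
    {
        using var process = Process.GetCurrentProcess();
        return new ProcessHealth(
            WorkingSetMb: process.WorkingSet64 / (1024.0 * 1024.0),
            Gen0Collections: GC.CollectionCount(0),
            Gen1Collections: GC.CollectionCount(1),
            Gen2Collections: GC.CollectionCount(2),
            ThreadCount: process.Threads.Count);
    }
}
```

Disposing the `Process` instance after each read avoids holding an OS handle per snapshot.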
Real-time updates via SignalR
Today's status surface is poll-based: the HTML page uses a 5-second meta-refresh, and Prometheus / custom HMI scrapers hit /status.json on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a live fleet dashboard with many panels open, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping bound → recovering) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update.
The recommendation is additive, not replacement. Keep /status.json for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change.
Why this is cheap to add
The Microsoft.AspNetCore.App framework reference that Phase 07 added to the csproj already includes Microsoft.AspNetCore.SignalR — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on Mbproxy.AdminPort. No additional port, no additional listener supervision, no additional shutdown path.
Architecture
```
                                                                   ┌─→ Dashboard A (subscribed to "all")
ProxyWorker / Supervisors ──┐                                      │
ConfigReconciler ───────────┤                                      │
ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer")
ServiceCounters ────────────┘    (background loop +                │
                                  immediate-push paths)            └─→ Dashboard C (subscribed to "service")
```
- `StatusHub : Hub` is the SignalR endpoint mounted at `/hub/status` on `AdminPort`. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates.
- `StatusBroadcaster : IHostedService` is the background pusher. It holds a `Timer` (or `PeriodicTimer`) that ticks at `PushIntervalMs` (default 1000 ms), builds a `StatusResponse` via the existing `StatusSnapshotBuilder`, diffs it against the previous snapshot, and pushes only the changed pieces. It also exposes `PushEventAsync(name, props)` for the immediate-push paths.
- Immediate-push wiring: the existing log events (`mbproxy.listener.recovered`, `mbproxy.config.reload.applied`, `mbproxy.backend.failed`, `mbproxy.rewrite.partial_bcd`, etc.) gain a fan-out call to `broadcaster.PushEventAsync(...)` so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick.
Hub contract
Hub URL: https://<host>:<AdminPort>/hub/status
Hub groups — clients subscribe to scopes; the server broadcasts to matching groups:
| Group | Receives |
|---|---|
| `all` | Every update for every PLC + every service-level event |
| `service` | Service-level events only (`mbproxy.config.*`, `mbproxy.admin.*`, `mbproxy.startup.*`, `mbproxy.shutdown.*`) |
| `plc:<Name>` | One PLC's snapshots + that PLC's events |
Server-side methods (client → server):
| Method | Purpose |
|---|---|
| `Task SubscribeFleet()` | Join group `all` |
| `Task SubscribeService()` | Join group `service` |
| `Task SubscribePlc(string name)` | Join group `plc:<name>` after validating that `name` exists in current options |
| `Task Unsubscribe()` | Leave every group; the connection stays open but receives nothing |
Client-side callbacks (server → client, named On* per SignalR convention):
| Callback | Payload | When |
|---|---|---|
| `OnSnapshot(StatusResponse snapshot)` | Full snapshot of the relevant scope (`all`, `service`, or a single PLC) | Sent once on subscribe so the dashboard has a baseline; thereafter on re-subscribe after a reconnect and at the periodic re-baseline (every `FullSnapshotIntervalMs`) |
| `OnPatch(StatusPatch patch)` | Delta of fields that changed since the last push | Periodic: every `PushIntervalMs` if anything changed; skipped if nothing changed |
| `OnEvent(StatusEvent ev)` | Single discrete event: `{ name, levelString, plc?, propertiesJson, timestampUtc }` | Immediately, as fan-out from the existing `[LoggerMessage]` event call sites |
StatusPatch carries only the fields that changed since the previous push: it's a Dictionary<string, JsonElement> keyed by JSON path (e.g., "plcs[2].pdus.forwarded", "plcs[2].listener.state"). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle.
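One way the diff step could work is to flatten both snapshots into path/value pairs and keep the pairs whose serialized value differs. The sketch below assumes the snapshots are available as `JsonElement` trees; `StatusDiff` and its method names are hypothetical, and fields that disappear between snapshots are not reported.

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical sketch: builds a path-keyed patch from two JSON snapshots.
public static class StatusDiff
{
    // Flattens a JSON tree into leaf paths such as "plcs[2].pdus.forwarded".
    public static Dictionary<string, JsonElement> Flatten(JsonElement root)
    {
        var leaves = new Dictionary<string, JsonElement>();
        Walk(root, "", leaves);
        return leaves;
    }

    private static void Walk(JsonElement node, string path, Dictionary<string, JsonElement> sink)
    {
        switch (node.ValueKind)
        {
            case JsonValueKind.Object:
                foreach (var prop in node.EnumerateObject())
                    Walk(prop.Value, path.Length == 0 ? prop.Name : path + "." + prop.Name, sink);
                break;
            case JsonValueKind.Array:
                int index = 0;
                foreach (var item in node.EnumerateArray())
                    Walk(item, $"{path}[{index++}]", sink);
                break;
            default:
                sink[path] = node.Clone();   // Clone so the value outlives its JsonDocument
                break;
        }
    }

    // Returns only the leaves that are new or whose raw JSON text changed.
    public static Dictionary<string, JsonElement> Diff(JsonElement previous, JsonElement current)
    {
        var before = Flatten(previous);
        var patch = new Dictionary<string, JsonElement>();
        foreach (var (path, value) in Flatten(current))
        {
            if (!before.TryGetValue(path, out var old) || old.GetRawText() != value.GetRawText())
                patch[path] = value;
        }
        return patch;
    }
}
```

An idle fleet produces an empty dictionary, which is the case where the broadcaster would skip the push entirely.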
What gets pushed, and when
| Update kind | Cadence | Volume per PLC | Channel |
|---|---|---|---|
| Counter increments (PDUs, bytes, rewrites) | Every `PushIntervalMs` if changed; coalesced | 1 patch / push tick / subscribed group | `OnPatch` |
| State transitions (`bound` ↔ `recovering` ↔ `stopped`) | Immediate | 1 event + 1 patch | `OnEvent` + `OnPatch` |
| Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | `OnEvent` |
| Hot-reload applied / rejected | Immediate | 1 event with `propertiesJson` summary | `OnEvent` |
| Periodic full snapshot | Every 60 s | 1 full snapshot | `OnSnapshot` |
The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth.
Configuration
Extend appsettings.json with:

```json
"Mbproxy": {
  // ... existing keys ...
  "Admin": {
    "SignalR": {
      "Enabled": true,
      "PushIntervalMs": 1000,            // patch cadence
      "FullSnapshotIntervalMs": 60000,   // periodic re-baseline
      "MaxConcurrentClients": 32,        // refuse new connections beyond this
      "MaxGroupsPerClient": 8            // anti-runaway-subscription guard
    }
  }
}
```
The feature is opt-in: when the SignalR block is omitted or SignalR.Enabled = false, the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality; the first version can ship as restart-required.
Implementation outline
- Hub class: `src/Mbproxy/Admin/StatusHub.cs`. Inherits `Hub`. Implements the four `Subscribe*` / `Unsubscribe` methods. `OnConnectedAsync` rejects the connection when the number of tracked connections has already reached `MaxConcurrentClients` (track them in a static `ConcurrentDictionary<string, byte>` keyed by `ConnectionId`; `Context.Items` is per-connection state and cannot serve as a connection count).
- Broadcaster: `src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService`. Constructor takes `IHubContext<StatusHub>`, `StatusSnapshotBuilder`, `IOptionsMonitor<MbproxyOptions>`. The push loop is a `while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... }` body over a `PeriodicTimer`, which wins over `Timer` for cancellation correctness.
- DTOs: `StatusPatch` and `StatusEvent` records added to `StatusDto.cs`, registered with the source-gen `StatusJsonContext`.
- Event fan-out: the existing `[LoggerMessage]` partial methods stay; add a thin `RealtimeLogEvents` wrapper class that logs AND calls `broadcaster.PushEventAsync(...)`. Call sites in supervisors / pipelines / reconciler swap to the wrapper. Keeps log-only call sites and broadcast-too call sites both readable.
- Hub mapping: `AdminEndpointHost` adds `app.MapHub<StatusHub>("/hub/status")` if `SignalR.Enabled`. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint.
- Shutdown: `StatusBroadcaster.StopAsync` cancels its pump and the hub's `Dispose` chain handles connection teardown. The existing `ShutdownCoordinator` deadline applies.
Test approach
Use the Microsoft.AspNetCore.SignalR.Client package (NuGet) in the test csproj only. Pattern:
```csharp
using System.Collections.Concurrent;
using System.Linq;
using Microsoft.AspNetCore.SignalR.Client;
using Shouldly;
using Xunit;

[Fact]
[Trait("Category", "E2E")]
public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException()
{
    // Arrange: start host on a random AdminPort (adminPort below), build a SignalR client.
    await using var connection = new HubConnectionBuilder()
        .WithUrl($"http://localhost:{adminPort}/hub/status")
        .Build();
    var patches = new ConcurrentQueue<StatusPatch>();
    connection.On<StatusPatch>("OnPatch", patches.Enqueue);
    await connection.StartAsync(TestContext.Current.CancellationToken);
    await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken);

    // Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1).
    // ... drive request through proxy ...

    // Assert: a patch touching backend.connectsFailed arrives within 500 ms.
    var deadline = DateTime.UtcNow.AddMilliseconds(500);
    while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")))
        await Task.Delay(20, TestContext.Current.CancellationToken);
    patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"));
}
```
Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly.
Coverage targets for the new tests:
- `SignalR_Subscribe_DeliversInitialSnapshot`
- `SignalR_Patch_FiresWithinPushInterval_AfterCounterChange`
- `SignalR_Event_FiresWithin_100ms_OfListenerRecovered`
- `SignalR_SubscribePlc_OnlyDeliversThatPlcEvents`: verifies group filtering
- `SignalR_MaxConcurrentClients_RefusesExcess`: capacity guard
- `SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs`
Operational considerations
- Authentication / authorisation. Same network-trust assumption as the rest of the admin endpoint: none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy.
- Transport. SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport.
- Backpressure. `Hub.Clients.Group("all").SendAsync` does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy.
- Reconnection. The .NET / browser SignalR clients reconnect automatically when built with automatic reconnect enabled (`WithAutomaticReconnect()` / `withAutomaticReconnect()`). Group membership is not preserved across a reconnect, so the client must call its `Subscribe*` method again; the snapshot sent on subscribe then re-baselines the dashboard, and the periodic full snapshot every 60 s covers patches missed while connected.
- Cardinality at scale. 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The `MaxConcurrentClients` guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy.
- CORS. If dashboards run on a different origin (likely), enable CORS on the admin app for `/hub/status` only. Add `AdminCors.AllowedOrigins` to `appsettings.json` as an array of allowed origin strings; an empty array means same-origin only.
- Logging. SignalR's internal logs are noisy at Information. In `appsettings.json`, set the `Microsoft.AspNetCore.SignalR` category to `Warning` and `Microsoft.AspNetCore.Http.Connections` to `Warning` so the proxy's own event stream isn't drowned out.
Effort estimate
| Work | Hours |
|---|---|
| Hub + DTOs + broadcaster | 6 h |
| Event fan-out wiring (existing log events) | 3 h |
| AdminEndpointHost integration + appsettings binding | 2 h |
| E2E test suite (6 tests using SignalR .NET client) | 4 h |
| Documentation (this section graduates from proposal to fact; design.md update) | 1 h |
| Total | ~16 h |
This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production.
Implementation notes
Where rates and percentiles should live
Two reasonable answers:
- Compute in the proxy, expose pre-computed values in `/status.json`. Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change.
- Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates. Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar.
Recommendation: ship Tier 1 rate metrics computed in-process for the operator who just opens http://<host>:8080/ in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative.
Counter additions vs computed values
A few proposed KPIs require new counters in ProxyCounters or ServiceCounters, not just derivations:
- `pdus.lastForwardedUtc`: new `long _lastForwardedTicks` on `ProxyCounters`, read and written via `Volatile.Read` / `Volatile.Write` or `Interlocked` (C# does not allow the `volatile` modifier on a `long` field).
- `listener.boundRatio.*`: new `StateDurationTracker` on `PlcListenerSupervisor`.
- `partialBcd.byClient` / `invalidBcd.byAddress`: new `ConcurrentDictionary<string,long>` / `ConcurrentDictionary<ushort,long>` on `PerPlcContext`. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases).
- `process.*`: read fresh on every snapshot from `Process.GetCurrentProcess()`; no stored state.
Snapshot serialization cost
StatusResponse is built per-request to /status.json. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., invalidBcd.byAddress) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep /status.json under a few hundred KB even when something goes badly wrong.
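The top-50 cap can be applied at snapshot time with a short LINQ pipeline. This is a sketch under the assumption that the per-address counters are exposed as key/value pairs; `SnapshotCaps.TopByCount` is a hypothetical helper name.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class SnapshotCaps
{
    // Keeps the `limit` entries with the highest counts; ties break on the lower address
    // so the output is stable between snapshots.
    public static Dictionary<ushort, long> TopByCount(
        IEnumerable<KeyValuePair<ushort, long>> counts, int limit = 50) =>
        counts.OrderByDescending(kv => kv.Value)
              .ThenBy(kv => kv.Key)
              .Take(limit)
              .ToDictionary(kv => kv.Key, kv => kv.Value);
}
```

Capping only the serialized view leaves the underlying counters intact, so a later snapshot can still surface an address that climbs into the top 50.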
Dashboard widget mapping (Grafana-style cheat sheet)
| Widget | Use for |
|---|---|
| Stat (big number) | Service-wide aggregates, counts, latest timestamps |
| Gauge | Ratios (availability, success rate, queue depth) |
| Sparkline | Rates, percentiles, time-series trends |
| Stacked area | Bandwidth, PDU-by-FC breakdown over time |
| Heatmap | Per-address / per-client dimensional breakdowns |
| Cell-coloured table | Per-PLC status (54 rows, one per PLC, columns of KPIs) |
Backwards-compat policy
The fields currently in /status.json are frozen — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in design.md → "Status page" as the contract; new fields ship via PRs that update the contract first.
Cross-references
- Field tables for what ships today: `design.md` → "Status page".
- Stable log event names (some KPIs are derivable by tailing these): `design.md` → "Logging" event-name table.
- Per-counter wiring lives in `src/Mbproxy/Proxy/ProxyCounters.cs` and `src/Mbproxy/ServiceCounters.cs`.
- The status HTML page is rendered by `src/Mbproxy/Admin/StatusHtmlRenderer.cs`; the JSON DTOs and source-gen context live in `src/Mbproxy/Admin/StatusDto.cs`.