When two or more upstream clients send the same FC03/FC04 read while a matching request is already in flight on the same PLC's multiplexed backend socket, attach the late arrivals to the existing InFlightRequest.InterestedParties list instead of opening a second backend round-trip. The single backend response fans out to every attached party with each party's original MBAP TxId restored individually. Zero post-response staleness: coalescing operates entirely within the in-flight window (microseconds to ~10 ms typical); the proxy is NOT a cache layer.

Headline mechanism:
- New record struct CoalescingKey(UnitId, Fc, StartAddress, Qty) keys the per-PLC InFlightByKeyMap. FC03 and FC04 are separate Modbus tables and never share a key; different unit IDs never coalesce; writes (FC06/FC16) bypass the coalescing path entirely.
- InFlightByKeyMap uses a simple lock around a Dictionary; atomic TryAttachOrCreate either appends a new party to the in-flight request's mutable List<InterestedParty> or invokes a factory to build a fresh entry. Per-entry MaxParties cap (default 32) bounds fan-out cost; past the cap, the next arrival opens a new entry.
- PlcMultiplexer.OnUpstreamFrameAsync takes the coalescing path for FC03/FC04 when Mbproxy.Resilience.ReadCoalescing.Enabled. The factory closure does the Phase-9 work (allocate TxId, add to CorrelationMap); the channel send happens AFTER returning from TryAttachOrCreate so the map lock is not held across the async send.
- Response fan-out in RunBackendReaderAsync removes the entry from InFlightByKeyMap before iterating InterestedParties, ensuring no concurrent attach can mutate the list during iteration.
- Cascade + watchdog paths also drain the key map so a stale entry cannot outlive its backend round-trip.

Counter accounting balance (per snapshot): CoalescedHitCount + CoalescedMissCount equals total FC03 + FC04 requests since startup. Even with coalescing disabled, every read still bumps Miss so dashboard math stays balanced.

New surface (additive only):
- src/Mbproxy/Proxy/Multiplexing/CoalescingKey.cs
- src/Mbproxy/Proxy/Multiplexing/InFlightByKeyMap.cs
- src/Mbproxy/Proxy/Multiplexing/CoalescingLogEvents.cs
- ReadCoalescingOptions on ResilienceOptions
- CoalescedHitCount / CoalescedMissCount / CoalescedResponseToDeadUpstream counters surfaced on /status.json per PLC and as a compact "Coal" cell on the HTML status page.

Phase 9 test patch: TwoUpstreams_ProxyTxIds_AreDistinct_OnTheWire previously read the same register from both clients (which now coalesces). Patched to read two different addresses so the test still proves distinct backend TxIds without violating the coalescing contract.

Tests added: 24 new (19 unit + 5 E2E):
- CoalescingKeyTests (5)
- InFlightByKeyMapTests (6, includes concurrent stress)
- ReadCoalescingTests (8, stub-backend with deterministic delay)
- ReadCoalescingE2ETests (5, pymodbus simulator; coalescing-active during overlap is proven against the stub, not the sim, due to pymodbus 3.13's known concurrent-frame bug)

Total: 325 tests passing (282 unit + 43 E2E).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mbproxy — Dashboard KPI catalogue
Recommended additions to the /status.json and / admin endpoint to make a production fleet dashboard genuinely useful, grouped by tier. Today's /status.json exposes raw cumulative counters; this doc describes what's typically also expected when those counters land in Grafana / Wonderware / a custom HMI.
Scope. This is a proposal, not a contract. The endpoint shape settled in design.md → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect.
Reading guide. Each KPI has:
- Name — short identifier matching the proxy's existing camelCase convention.
- Definition — what the number means.
- Source — where the value comes from (existing counter, new counter, derived).
- Widget — typical dashboard visualisation.
- Alert — common threshold or anomaly rule (where applicable).
- Effort — implementation cost in hours (rough order-of-magnitude).
What's exposed today (recap)
For context — every recommended addition below is in addition to this list. Today's /status.json carries:
| Group | Fields |
|---|---|
| Service | uptimeSeconds, version, configLastReloadUtc, configReloadCount, configReloadRejectedCount |
| Listeners | bound, configured |
| Per-PLC listener | state, lastBindError, recoveryAttempts |
| Per-PLC clients | connected, remoteEndpoints[] (remote, connectedAtUtc, pdusForwarded) |
| Per-PLC PDUs | forwarded, byFc.{fc03,fc04,fc06,fc16,other}, rewrittenSlots, partialBcdWarnings |
| Per-PLC backend | connectsSuccess, connectsFailed, exceptionsByCode.{code01..code04}, lastRoundTripMs, inFlight, maxInFlight, txIdWraps, disconnectCascades, queueDepth, coalescedHitCount, coalescedMissCount, coalescedResponseToDeadUpstream |
| Per-PLC bytes | upstreamIn, upstreamOut |
Counters are cumulative since process start. A restart resets them.
Tier 1 — strongly recommended for production
These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard."
1.1 Rate metrics (per-PLC and fleet-wide)
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| pdus.ratePerSec.last1m | PDU rate over the last 60 s | New per-PLC ring buffer (60 × 1 s samples) | Sparkline per PLC | None (informational) | 4 h |
| pdus.ratePerSec.last5m | Same over 5 min | Same buffer at 300 s | Sparkline | None | shared |
| errors.ratePerMin | Sum of exceptionsByCode.* + partialBcdWarnings + invalidBcdWarnings per minute | Derived | Stat tile per PLC | > 10/min → page | 2 h |
| bytes.ratePerSec.up / .down | Bandwidth each direction | Derived from bytesUpstreamIn/Out deltas | Stacked area | None (informational) | 2 h |
| fleet.totalPdusPerSec | Sum of all PLCs' rates | Aggregate | Single number, big | None | 1 h |
Why this matters. Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing rate(pdus_forwarded[1m]) on a 54-row fleet is the single most informative widget on the dashboard.
Implementation note. Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want the rates in /status.json directly, add a per-PLC Mbproxy.Proxy.RateTracker with a fixed-size circular buffer of one-second samples (60 slots cover the 1-minute rate; 300 slots are needed if the 5-minute rate shares the buffer) and expose RatePerSec1m, RatePerSec5m.
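The circular buffer described above could look roughly like the sketch below. It is an illustration only, not the shipped RateTracker: the member names (Tick, RatePerSec), the 300-slot default, and the assumption that a single timer thread calls Tick once per second are all hypothetical.

```csharp
using System;

// Hypothetical sketch of a per-PLC rate tracker; names and sizes are assumptions.
public sealed class RateTracker
{
    private readonly long[] _samples;   // per-second increments, used as a ring
    private int _next;                  // slot the next sample will be written to
    private int _filled;                // number of valid slots (<= capacity)
    private long _lastCounter;          // cumulative counter value at the previous tick

    public RateTracker(int windowSeconds = 300) => _samples = new long[windowSeconds];

    // Call once per second with the current cumulative counter value.
    // Assumes the tracker is created while the counter is still 0.
    public void Tick(long cumulativeCounter)
    {
        _samples[_next] = cumulativeCounter - _lastCounter;
        _lastCounter = cumulativeCounter;
        _next = (_next + 1) % _samples.Length;
        if (_filled < _samples.Length) _filled++;
    }

    // Average events per second over the most recent `seconds` samples.
    public double RatePerSec(int seconds)
    {
        int n = Math.Min(seconds, _filled);
        if (n == 0) return 0;
        long sum = 0;
        for (int i = 1; i <= n; i++)
            sum += _samples[(_next - i + _samples.Length) % _samples.Length];
        return (double)sum / n;
    }
}
```

With this shape, RatePerSec1m would be RatePerSec(60) and RatePerSec5m would be RatePerSec(300). The sketch is not thread-safe; if snapshots are built on a different thread from the ticker, the reads would need a lock.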
1.2 Latency percentiles (replacing the bare EWMA)
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.roundTripMs.p50 | Median backend round-trip over last 1 min | New per-PLC reservoir sample (size 256) | Line chart, per-PLC | None | 6 h |
| backend.roundTripMs.p95 | 95th percentile | Same reservoir | Line chart | > 500 ms sustained 5 min → warn | shared |
| backend.roundTripMs.p99 | 99th percentile | Same reservoir | Line chart | > 2 s sustained 5 min → page | shared |
| backend.roundTripMs.max1m | Slowest single PDU in last 1 min | Running maximum kept alongside the reservoir (a sampled reservoir can miss the true maximum) | Stat tile | > 5 s → page | shared |
Why this matters. The existing lastRoundTripMs is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently.
Implementation note. Use Mbproxy.Proxy.LatencyReservoir — a 256-sample reservoir with Vitter's Algorithm R for unbiased sampling under arbitrary throughput. Don't store every sample (a busy PLC at 100 PDU/s × 60 s = 6,000 samples/min × 54 PLCs = 324K samples/min, too much).
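A minimal sketch of such a reservoir follows, assuming a nearest-rank percentile and a simple lock for thread safety. The member names (Record, Percentile, Reset) and the optional seed parameter are hypothetical, and Random.NextInt64 requires .NET 6 or later.

```csharp
using System;

// Hypothetical sketch of a fixed-size latency reservoir using Algorithm R.
public sealed class LatencyReservoir
{
    private readonly double[] _samples;
    private readonly Random _rng;
    private readonly object _gate = new object();
    private long _seen;   // total samples offered since the last Reset

    public LatencyReservoir(int capacity = 256, int? seed = null)
    {
        _samples = new double[capacity];
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    public void Record(double roundTripMs)
    {
        lock (_gate)
        {
            _seen++;
            if (_seen <= _samples.Length)
            {
                _samples[_seen - 1] = roundTripMs;   // fill phase: keep everything
                return;
            }
            // Algorithm R: the i-th item replaces a random slot with probability k/i.
            long j = _rng.NextInt64(_seen);          // uniform in [0, _seen)
            if (j < _samples.Length) _samples[j] = roundTripMs;
        }
    }

    // Nearest-rank percentile over the retained samples; p is in [0, 100].
    public double Percentile(double p)
    {
        lock (_gate)
        {
            int n = (int)Math.Min(_seen, _samples.Length);
            if (n == 0) return double.NaN;
            var sorted = new double[n];
            Array.Copy(_samples, sorted, n);
            Array.Sort(sorted);
            int rank = (int)Math.Ceiling(p * n / 100.0);
            return sorted[Math.Clamp(rank - 1, 0, n - 1)];
        }
    }

    // Call at each window boundary (for example once per minute) to start a fresh sample.
    public void Reset() { lock (_gate) _seen = 0; }
}
```

Sorting 256 doubles per snapshot is cheap, so percentiles can be computed on demand rather than maintained incrementally. A true one-minute maximum would need a separate running max, because sampling can drop the slowest PDU.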
1.3 Per-PLC availability ratio
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| listener.boundRatio.last1h | Fraction of time in bound state over last hour | New per-supervisor state-time tracker | Gauge per PLC | < 0.99 → warn, < 0.95 → page | 4 h |
| listener.boundRatio.sinceStart | Fraction over process lifetime | Same tracker | Gauge | < 0.999 → warn | shared |
| listener.timeInRecoveringMs.last1h | Total time spent recovering in last hour | Same tracker | Stat tile | > 60 s → warn | shared |
Why this matters. recoveryAttempts tells you how many times something has flapped, but not how much downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly.
Implementation note. Each PlcListenerSupervisor already has a state machine. Add a StateDurationTracker that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window.
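As a rough illustration of the tracker, the sketch below accumulates time per state and reports the since-start ratio. The ListenerState enum and the method names are assumptions, the clock is passed in explicitly to keep the class testable, and the sliding one-hour window is deliberately left out.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical state names mirroring the bound / recovering / stopped states in this document.
public enum ListenerState { Bound, Recovering, Stopped }

// Hypothetical sketch: accumulates total time spent in each state since construction.
public sealed class StateDurationTracker
{
    private readonly Dictionary<ListenerState, TimeSpan> _totals = new Dictionary<ListenerState, TimeSpan>();
    private ListenerState _current;
    private DateTimeOffset _since;

    public StateDurationTracker(ListenerState initial, DateTimeOffset now)
    {
        _current = initial;
        _since = now;
    }

    // Call from the supervisor's state machine on every transition.
    public void Transition(ListenerState next, DateTimeOffset now)
    {
        Accumulate(now);
        _current = next;
    }

    // Fraction of tracked time spent in `state`, including the still-open interval.
    public double Ratio(ListenerState state, DateTimeOffset now)
    {
        Accumulate(now);
        double total = 0;
        foreach (var t in _totals.Values) total += t.TotalMilliseconds;
        if (total == 0) return 0;
        return _totals.TryGetValue(state, out var inState) ? inState.TotalMilliseconds / total : 0;
    }

    // Closes the current interval at `now` and credits it to the current state.
    private void Accumulate(DateTimeOffset now)
    {
        _totals.TryGetValue(_current, out var soFar);
        _totals[_current] = soFar + (now - _since);
        _since = now;
    }
}
```

A listener that is bound for 54 minutes and recovering for 6 would report a bound ratio of 0.9. A last1h variant could keep a short queue of timestamped transitions and trim entries older than one hour before summing.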
1.4 Liveness / staleness signals
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| pdus.lastForwardedUtc | Wall time of the most recent forwarded PDU | New _lastForwardedTimestamp per PLC | Stat tile | now - value > 5 min AND clients.connected > 0 → page | 1 h |
| clients.lastActivityUtc | Per-client last-PDU timestamp | Already implicit; expose explicitly | Per-row in remoteEndpoints | None | 1 h |
| staleClients.count | Connected clients with no PDUs in last 5 min | Derived | Stat tile | > 0 → informational | 1 h |
Why this matters. Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with clients.connected = 2 but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured.
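The staleness rule from the table (now - value > 5 min AND clients.connected > 0) can be sketched as follows. The class and member names are hypothetical, and treating a PLC that has never forwarded a PDU as not stale is an assumption the document does not settle.

```csharp
using System;
using System.Threading;

// Hypothetical sketch: last-forwarded timestamp stored as UTC ticks, updated lock-free.
public sealed class ForwardActivity
{
    private long _lastForwardedTicks;   // 0 means "no PDU forwarded yet"

    // Call on the hot path whenever a PDU is forwarded.
    public void MarkForwarded(DateTime utcNow) =>
        Interlocked.Exchange(ref _lastForwardedTicks, utcNow.Ticks);

    public DateTime? LastForwardedUtc
    {
        get
        {
            long ticks = Interlocked.Read(ref _lastForwardedTicks);
            return ticks == 0 ? (DateTime?)null : new DateTime(ticks, DateTimeKind.Utc);
        }
    }

    // True when clients are connected but nothing has been forwarded within `threshold`.
    // Assumption: a PLC that has never forwarded anything is reported as not stale.
    public bool IsStale(DateTime utcNow, int clientsConnected, TimeSpan threshold)
    {
        if (clientsConnected <= 0) return false;
        DateTime? last = LastForwardedUtc;
        return last.HasValue && utcNow - last.Value > threshold;
    }
}
```

Interlocked is used here because a 64-bit field cannot be declared volatile in C#, and plain reads of a long are not guaranteed atomic on 32-bit runtimes.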
1.5 Service-wide fleet aggregates
These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| fleet.plcsHealthy | Count of PLCs in bound state with no errors in last 5 min | Aggregate | Big number, green | < listeners.configured - 2 → warn | 2 h |
| fleet.plcsRecovering | Count in recovering state | Aggregate | Big number, orange | > 0 → informational | shared |
| fleet.plcsStopped | Count in stopped state | Aggregate | Big number, grey | > 0 → page | shared |
| fleet.plcsWithActiveErrors | Count with errors.ratePerMin > 0 | Aggregate | Big number, red | > 0 → page | shared |
| fleet.totalClientsConnected | Sum of clients.connected | Aggregate | Stat tile | None | 1 h |
| fleet.totalRewrittenSlotsPerSec | Sum of rewrite rates | Aggregate + derived | Sparkline | None | shared |
Why this matters. A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table.
1.6 Multiplexer state — shipped in Phase 9
The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.inFlightCount | Current in-flight Modbus requests on this PLC's backend connection | Phase-9 counter | Sparkline per PLC | Sustained > 100 → investigate (high churn or slow backend) | (in Phase 9 scope) |
| backend.maxInFlight | Peak in-flight count observed since process start | Phase-9 counter | Stat tile per PLC | Approaches 65,000 → page (TxId saturation imminent; realistic only under pathological load) | (in Phase 9 scope) |
| backend.txIdWraps | Times the TxId allocator has wrapped 0xFFFF → 0x0000 | Phase-9 counter | Stat tile per PLC | Sudden increase rate → very high in-flight churn; investigate fairness | (in Phase 9 scope) |
| backend.queueDepth | Current outbound channel depth (frames queued for the backend writer) | Phase-9 counter | Sparkline per PLC | Sustained > 50 → backend is slower than upstream demand; latency rising | (in Phase 9 scope) |
| backend.disconnectCascades | Total upstream clients closed due to backend disconnects | Phase-9 counter | Stat tile per PLC | Spike → network instability; correlate with mbproxy.backend.failed events | (in Phase 9 scope) |
Why this matters. Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's lastRoundTripMs measures wire latency only; queue depth reveals proxy-side backlog).
1.7 Read coalescing — shipped in Phase 10
Same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric. coalescedHitCount + coalescedMissCount equals total FC03/04 request count per snapshot — the math always balances.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.coalescedHitCount | FC03/04 requests attached to an already-in-flight peer | Phase-10 counter | Sparkline | None (trend-watch) | (in Phase 10 scope) |
| backend.coalescedMissCount | FC03/04 requests that created a fresh backend round-trip | Phase-10 counter | Sparkline | None (trend-watch) | (in Phase 10 scope) |
| backend.coalescingRatio | Hit / (Hit + Miss) over the trailing window | Derived (dashboard) | Stat tile per PLC | None (informational); a low ratio just means clients aren't synchronised on the same registers | (in Phase 10 scope) |
| backend.coalescedResponseToDeadUpstream | Fan-out responses dropped because the attached upstream disconnected mid-flight | Phase-10 counter | Stat tile per PLC | Spike → client churn during traffic burst; usually not actionable (Tier 2 priority) | (in Phase 10 scope) |
Why this matters. Coalescing-ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads landed on an existing in-flight request — that's roughly 60% reduction in backend PDU rate vs the pre-Phase-10 model. The dead-upstream counter is a churn indicator that's invisible in any other metric.
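Because the counters are cumulative and reset on restart, a trailing-window ratio has to be derived from the difference between two snapshots. The helper below is a sketch of that derivation under assumed names (CoalescingMath, Ratio); it returns null when no FC03/04 traffic occurred in the window, and treats a negative delta as a process restart.

```csharp
// Hypothetical dashboard-side helper; not part of the proxy.
public static class CoalescingMath
{
    // Hit / (Hit + Miss) between two snapshots of the cumulative counters.
    public static double? Ratio(long hitPrev, long missPrev, long hitNow, long missNow)
    {
        long hits = hitNow - hitPrev;
        long misses = missNow - missPrev;

        // A negative delta means the counters were reset by a restart;
        // in that case everything counted since the restart is the delta.
        if (hits < 0 || misses < 0)
        {
            hits = hitNow;
            misses = missNow;
        }

        long total = hits + misses;
        return total == 0 ? (double?)null : (double)hits / total;
    }
}
```

For example, a window with 60 new hits and 40 new misses gives 0.6, meaning the backend saw 40 read round-trips where it would otherwise have seen 100.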
1.8 Response cache — requires Phase 11
After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| backend.cacheHitCount | FC03/04 requests served from the cache | Phase-11 counter | Sparkline per PLC | None (informational) | (in Phase 11 scope) |
| backend.cacheMissCount | FC03/04 requests that fell through to the backend (or coalescing) | Phase-11 counter | Sparkline per PLC | None (informational) | (in Phase 11 scope) |
| backend.cacheHitRatio | Hit / (Hit + Miss) for cache-eligible reads | Derived (dashboard) | Stat tile per PLC | None; informs whether TTL tuning is worthwhile | (in Phase 11 scope) |
| backend.cacheInvalidations | Cache entries invalidated by FC06/FC16 write responses | Phase-11 counter | Stat tile per PLC | High rate → many writes to cached addresses; consider reducing TTL on those tags | (in Phase 11 scope) |
Why this matters. Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts.
Tier 2 — nice-to-have
Reach for these once Tier 1 is solid. They add depth for specific operational scenarios.
2.1 Connection-cap saturation warning
Status: superseded by Phase 9. This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. Since Phase 9, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect, so the 4-client cap on the ECOM is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is backend.inFlightCount (Tier 1.6) against the 65,535 TxId-allocator ceiling, but that's realistically unreachable under any normal load. Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| clients.atCapWarning | Boolean: clients.connected >= 3 (1 short of ECOM100's 4-client cap) | Derived | Cell highlight | True → warn | 1 h |
| clients.atCapBlocked | Boolean: clients.connected >= 4 (cap reached) | Derived | Cell highlight | True → page | shared |
Why this mattered (pre-Phase-9). The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see design.md → "Connection model" and DL260/dl205.md → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments.
2.2 Error breakdown / heatmap
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| partialBcd.byClient | Count of partial-BCD warnings grouped by client remote endpoint | New per-client counter | Top-N list | Top-1 > 100/hr → ops should check the client's tag definition | 3 h |
| invalidBcd.byAddress | Count of invalid-BCD events grouped by Modbus address | New per-address counter (small map) | Heatmap | Single address with persistent rate → broken PLC logic | 4 h |
| exceptions.byCodeRate | Per-exception-code rate over 5 min | Derived from exceptionsByCode.* | Stacked bar | Code 04 (Slave Device Failure) spike → PLC in PROGRAM mode? | 2 h |
Why this matters. Once you've seen partialBcdWarnings = 1247, the next question is which client and which tag. Without dimensional breakdown, you have to ssh into the host and search the log file to find out.
2.3 Hot-reload cadence
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| config.reloadsPerHour | Reload events per hour | Derived from configReloadCount | Sparkline | > 10/hr → unusual; misconfig loop? | 1 h |
| config.lastReloadDelta | Summary of what changed on last reload | Already in mbproxy.config.reload.applied event; surface here | Text snippet | None (informational) | 2 h |
Why this matters. Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.
2.4 Memory / process health
| KPI | Definition | Source | Widget | Alert | Effort |
|---|---|---|---|---|---|
| process.workingSetMb | Process.GetCurrentProcess().WorkingSet64 / 1MB | New | Stat tile | > 1024 MB → warn (54 PLCs shouldn't need that much) | 0.5 h |
| process.gcCollections.gen0/1/2 | GC counts per generation | GC.CollectionCount(n) | Sparkline | Gen-2 frequency → memory pressure | 0.5 h |
| process.threadCount | Process.Threads.Count | New | Stat tile | > 200 → leak? | 0.5 h |
Why this matters. A long-running service in a 24/7 plant needs to prove it's not leaking. Together these three numbers surface the most common leak patterns. Each is a single Process or GC API call made at snapshot time, so the overhead is negligible.
Real-time updates via SignalR
Today's status surface is poll-based: the HTML page uses a 5-second meta-refresh, and Prometheus / custom HMI scrapers hit /status.json on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a live fleet dashboard with many panels open, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping bound → recovering) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update.
The recommendation is additive, not replacement. Keep /status.json for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change.
Why this is cheap to add
The Microsoft.AspNetCore.App framework reference that Phase 07 added to the csproj already includes Microsoft.AspNetCore.SignalR — no new NuGet, no version pinning, no AOT concerns. The hub mounts on the existing Kestrel server that runs on Mbproxy.AdminPort. No additional port, no additional listener supervision, no additional shutdown path.
Architecture
                                                                   ┌─→ Dashboard A (subscribed to "all")
ProxyWorker / Supervisors ──┐                                      │
ConfigReconciler ───────────┤                                      │
ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer")
ServiceCounters ────────────┘    (background loop +                │
                                  immediate-push paths)            └─→ Dashboard C (subscribed to "service")
- StatusHub : Hub is the SignalR endpoint mounted at /hub/status on AdminPort. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates.
- StatusBroadcaster : IHostedService is the background pusher. Holds a Timer (or PeriodicTimer) that ticks at PushIntervalMs (default 1000 ms), builds a StatusResponse via the existing StatusSnapshotBuilder, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes PushEventAsync(name, props) for the immediate-push paths.
- Immediate-push wiring: the existing log events (mbproxy.listener.recovered, mbproxy.config.reload.applied, mbproxy.backend.failed, mbproxy.rewrite.partial_bcd, etc.) gain a fan-out call to broadcaster.PushEventAsync(...) so subscribers see them inside ~10 ms of occurrence rather than at the next poll tick.
Hub contract
Hub URL: https://<host>:<AdminPort>/hub/status
Hub groups — clients subscribe to scopes; the server broadcasts to matching groups:
| Group | Receives |
|---|---|
| all | Every update for every PLC + every service-level event |
| service | Service-level events only (mbproxy.config.*, mbproxy.admin.*, mbproxy.startup.*, mbproxy.shutdown.*) |
| plc:<Name> | One PLC's snapshots + that PLC's events |
Server-side methods (client → server):
| Method | Purpose |
|---|---|
| Task SubscribeFleet() | Join group all |
| Task SubscribeService() | Join group service |
| Task SubscribePlc(string name) | Join group plc:<name> after validating that name exists in current options |
| Task Unsubscribe() | Leave every group; the connection stays open but receives nothing |
Client-side callbacks (server → client, named On* per SignalR convention):
| Callback | Payload | When |
|---|---|---|
| OnSnapshot(StatusResponse snapshot) | Full snapshot of the relevant scope (all, service, or a single PLC) | Sent once on subscribe so the dashboard has a baseline; thereafter on each re-subscribe after a reconnect and at every FullSnapshotIntervalMs re-baseline |
| OnPatch(StatusPatch patch) | Delta of fields that changed since the last push | Periodic: every PushIntervalMs if anything changed; skipped if nothing changed |
| OnEvent(StatusEvent ev) | Single discrete event: { name, levelString, plc?, propertiesJson, timestampUtc } | Immediately: fan-out from the existing [LoggerMessage] event call sites |
StatusPatch carries only the fields that changed since the previous push: it's a Dictionary<string, JsonElement> keyed by JSON path (e.g., "plcs[2].pdus.forwarded", "plcs[2].listener.state"). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle.
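One way to produce such a path-keyed delta is to flatten both snapshots into leaf values and keep the leaves whose raw JSON differs. The sketch below uses System.Text.Json; the StatusPatchDiff name is hypothetical, and it only reports added or changed leaves (fields that disappear between snapshots would need separate handling).

```csharp
using System.Collections.Generic;
using System.Text.Json;

// Hypothetical sketch of snapshot diffing into a JSON-path-keyed patch.
public static class StatusPatchDiff
{
    public static Dictionary<string, JsonElement> Diff(JsonElement previous, JsonElement current)
    {
        var before = new Dictionary<string, JsonElement>();
        var after = new Dictionary<string, JsonElement>();
        Flatten(previous, "", before);
        Flatten(current, "", after);

        var patch = new Dictionary<string, JsonElement>();
        foreach (var pair in after)
        {
            // Keep leaves that are new or whose serialized value changed.
            if (!before.TryGetValue(pair.Key, out var old) || old.GetRawText() != pair.Value.GetRawText())
                patch[pair.Key] = pair.Value.Clone();   // Clone so the patch outlives the source document
        }
        return patch;
    }

    // Walks objects and arrays, recording scalar leaves under paths like "plcs[2].pdus.forwarded".
    private static void Flatten(JsonElement node, string path, Dictionary<string, JsonElement> leaves)
    {
        if (node.ValueKind == JsonValueKind.Object)
        {
            foreach (var prop in node.EnumerateObject())
                Flatten(prop.Value, path.Length == 0 ? prop.Name : path + "." + prop.Name, leaves);
        }
        else if (node.ValueKind == JsonValueKind.Array)
        {
            int index = 0;
            foreach (var item in node.EnumerateArray())
                Flatten(item, path + "[" + (index++) + "]", leaves);
        }
        else
        {
            leaves[path] = node;
        }
    }
}
```

Index-based paths assume the PLC order is stable between snapshots; if a hot-reload can reorder the list, keying by PLC name would be safer.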
What gets pushed, and when
| Update kind | Cadence | Volume per PLC | Channel |
|---|---|---|---|
| Counter increments (PDUs, bytes, rewrites) | Every PushIntervalMs if changed; coalesced | 1 patch / push tick / subscribed group | OnPatch |
| State transitions (bound ↔ recovering ↔ stopped) | Immediate | 1 event + 1 patch | OnEvent + OnPatch |
| Discrete log events at level ≥ Info from the stable vocabulary | Immediate | 1 event per occurrence | OnEvent |
| Hot-reload applied / rejected | Immediate | 1 event with propertiesJson summary | OnEvent |
| Periodic full snapshot | Every 60 s | 1 full snapshot | OnSnapshot |
The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth.
Configuration
Extend appsettings.json with:
"Mbproxy": {
  // ... existing keys ...
  "Admin": {
    "SignalR": {
      "Enabled": true,
      "PushIntervalMs": 1000,           // patch cadence
      "FullSnapshotIntervalMs": 60000,  // periodic re-baseline
      "MaxConcurrentClients": 32,       // refuse new connections beyond this
      "MaxGroupsPerClient": 8           // anti-runaway-subscription guard
    }
  }
}
The feature can be switched off entirely: if SignalR.Enabled = false, the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality, so the first release can require a restart to apply changes.
Implementation outline
- Hub class: src/Mbproxy/Admin/StatusHub.cs. Inherits Hub. Implements the four Subscribe*/Unsubscribe methods. OnConnectedAsync rejects the connection (Context.Abort()) when the number of tracked connections has already reached MaxConcurrentClients; track connections in a static ConcurrentDictionary<string, byte> keyed by ConnectionId and remove them in OnDisconnectedAsync. Context.Items is a per-connection property bag, so its Count cannot serve as the connection total.
- Broadcaster: src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService. Constructor takes IHubContext<StatusHub>, StatusSnapshotBuilder, IOptionsMonitor<MbproxyOptions>. The push loop is a while (!ct.IsCancellationRequested) { await timer.WaitForNextTickAsync(ct); ... } body over a PeriodicTimer, which is preferred over Timer for cancellation correctness.
- DTOs: StatusPatch and StatusEvent records added to StatusDto.cs, registered with the source-gen StatusJsonContext.
- Event fan-out: the existing [LoggerMessage] partial methods stay; add a thin RealtimeLogEvents wrapper class that logs AND calls broadcaster.PushEventAsync(...). Call sites in supervisors / pipelines / reconciler swap to the wrapper. Keeps log-only call sites and broadcast-too call sites both readable.
- Hub mapping: AdminEndpointHost adds app.MapHub<StatusHub>("/hub/status") if SignalR.Enabled. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint.
- Shutdown: StatusBroadcaster.StopAsync cancels its pump and the hub's Dispose chain handles connection teardown. The existing ShutdownCoordinator deadline applies.
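The MaxConcurrentClients guard from the hub-class step can be isolated from SignalR itself, which keeps it unit-testable. The sketch below is one possible shape under assumed names (ConnectionLimiter, TryAdmit, Release): the hub's OnConnectedAsync would call TryAdmit(Context.ConnectionId) and abort the connection on false, and OnDisconnectedAsync would call Release.

```csharp
using System.Collections.Concurrent;

// Hypothetical helper for the hub's capacity guard; not a SignalR type.
public sealed class ConnectionLimiter
{
    private readonly ConcurrentDictionary<string, byte> _connections =
        new ConcurrentDictionary<string, byte>();
    private readonly int _max;

    public ConnectionLimiter(int maxConcurrentClients) => _max = maxConcurrentClients;

    public int Count => _connections.Count;

    // Returns false when admitting this connection would exceed the cap.
    public bool TryAdmit(string connectionId)
    {
        if (!_connections.TryAdd(connectionId, 0)) return true;   // already tracked
        if (_connections.Count <= _max) return true;

        // Over the cap: undo the registration and refuse.
        _connections.TryRemove(connectionId, out _);
        return false;
    }

    // Call when the connection closes so its slot becomes available again.
    public void Release(string connectionId) => _connections.TryRemove(connectionId, out _);
}
```

Adding first and checking afterwards means two simultaneous arrivals at the boundary may both be refused, but the cap is never exceeded. For a guard whose purpose is to stop runaway deployments, erring on the side of refusal seems an acceptable trade.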
Test approach
Use the Microsoft.AspNetCore.SignalR.Client package (NuGet) in the test csproj only. Pattern:
[Fact]
[Trait("Category", "E2E")]
public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException()
{
    // Arrange: start host on a random AdminPort, build a SignalR client.
    var connection = new HubConnectionBuilder()
        .WithUrl($"http://localhost:{adminPort}/hub/status")
        .Build();
    var patches = new ConcurrentQueue<StatusPatch>();
    connection.On<StatusPatch>("OnPatch", patches.Enqueue);
    await connection.StartAsync(TestContext.Current.CancellationToken);
    await connection.InvokeAsync("SubscribePlc", "TestPLC", TestContext.Current.CancellationToken);

    // Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1).
    // ... drive request through proxy ...

    // Assert: a patch with backend.connectsFailed != 0 arrives within 500 ms.
    var deadline = DateTime.UtcNow.AddMilliseconds(500);
    while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed")))
        await Task.Delay(20, TestContext.Current.CancellationToken);
    patches.ShouldContain(p => p.Fields.ContainsKey("plcs[0].backend.connectsFailed"));
}
Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly.
Coverage targets for the new tests:
- SignalR_Subscribe_DeliversInitialSnapshot
- SignalR_Patch_FiresWithinPushInterval_AfterCounterChange
- SignalR_Event_FiresWithin_100ms_OfListenerRecovered
- SignalR_SubscribePlc_OnlyDeliversThatPlcEvents (verifies group filtering)
- SignalR_MaxConcurrentClients_RefusesExcess (capacity guard)
- SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs
Operational considerations
- Authentication / authorisation. Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy.
- Transport. SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport.
- Backpressure. Hub.Clients.Group("all").SendAsync does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy.
- Reconnection. The .NET and browser SignalR clients reconnect automatically only when built with WithAutomaticReconnect() / withAutomaticReconnect(); the default retry delays are 0, 2, 10 and 30 seconds, and a custom retry policy is needed for true exponential backoff. A reconnected client receives a new connection ID and loses its group memberships, so it must call its Subscribe* method again from its reconnected handler; the snapshot sent on subscribe then re-baselines the dashboard.
- Cardinality at scale. 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The MaxConcurrentClients guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy.
- CORS. If dashboards run on a different origin (likely), enable CORS on the admin app for /hub/status only. Add AdminCors.AllowedOrigins to appsettings.json as an array of allowed origin strings; an empty array means same-origin only.
- Logging. SignalR's internal logs are noisy at Information. In appsettings.json, set the Microsoft.AspNetCore.SignalR category to Warning and Microsoft.AspNetCore.Http.Connections to Warning so the proxy's own event stream isn't drowned out.
Effort estimate
| Work | Hours |
|---|---|
| Hub + DTOs + broadcaster | 6 h |
| Event fan-out wiring (existing log events) | 3 h |
| AdminEndpointHost integration + appsettings binding | 2 h |
| E2E test suite (6 tests using SignalR .NET client) | 4 h |
| Documentation (this section graduates from proposal to fact; design.md update) | 1 h |
| Total | ~16 h |
This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production.
Implementation notes
Where rates and percentiles should live
Two reasonable answers:
- Compute in the proxy, expose pre-computed values in /status.json. Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change.
- Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates. Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar.
Recommendation: ship Tier 1 rate metrics computed in-process for the operator who just opens http://<host>:8080/ in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative.
Counter additions vs computed values
A few proposed KPIs require new counters in ProxyCounters or ServiceCounters, not just derivations:
- pdus.lastForwardedUtc: new long _lastForwardedTicks on ProxyCounters, read and written via Interlocked.Read / Interlocked.Exchange (C# does not permit volatile on a 64-bit long).
- listener.boundRatio.*: new StateDurationTracker on PlcListenerSupervisor.
- partialBcd.byClient / invalidBcd.byAddress: new ConcurrentDictionary<string,long> / ConcurrentDictionary<ushort,long> on PerPlcContext. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases).
- process.*: read fresh on every snapshot from Process.GetCurrentProcess(); no stored state.
Snapshot serialization cost
StatusResponse is built per-request to /status.json. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., invalidBcd.byAddress) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep /status.json under a few hundred KB even when something goes badly wrong.
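The top-50 cap mentioned above can be applied at snapshot time without touching the live counters. The following is a small sketch under assumed names (TopN, Cap); ties are broken by address so the output is deterministic.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical snapshot-time helper for bounding dimensional maps.
public static class TopN
{
    // Keeps the `limit` entries with the highest counts and drops the rest.
    public static Dictionary<ushort, long> Cap(IReadOnlyDictionary<ushort, long> counts, int limit = 50) =>
        counts.OrderByDescending(kv => kv.Value)
              .ThenBy(kv => kv.Key)
              .Take(limit)
              .ToDictionary(kv => kv.Key, kv => kv.Value);
}
```

A ConcurrentDictionary<ushort, long> can be passed directly because it implements IReadOnlyDictionary. This bounds only the serialized payload; the live map would still need its own cardinality limit.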
Dashboard widget mapping (Grafana-style cheat sheet)
| Widget | Use for |
|---|---|
| Stat (big number) | Service-wide aggregates, counts, latest timestamps |
| Gauge | Ratios (availability, success rate, queue depth) |
| Sparkline | Rates, percentiles, time-series trends |
| Stacked area | Bandwidth, PDU-by-FC breakdown over time |
| Heatmap | Per-address / per-client dimensional breakdowns |
| Cell-coloured table | Per-PLC status (54 rows, one per PLC, columns of KPIs) |
Backwards-compat policy
The fields currently in /status.json are frozen — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in design.md → "Status page" as the contract; new fields ship via PRs that update the contract first.
Cross-references
- Field tables for what ships today: design.md → "Status page".
- Stable log event names (some KPIs are derivable by tailing these): design.md → "Logging" event-name table.
- Per-counter wiring lives in src/Mbproxy/Proxy/ProxyCounters.cs and src/Mbproxy/ServiceCounters.cs.
- The status HTML page is rendered by src/Mbproxy/Admin/StatusHtmlRenderer.cs; the JSON DTOs and source-gen context live in src/Mbproxy/Admin/StatusDto.cs.