Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)

Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-14 01:49:35 -04:00

mbproxy — Dashboard KPI catalogue

Recommended additions to the /status.json and / admin endpoints to make a production fleet dashboard genuinely useful, grouped by tier. Today's /status.json exposes raw cumulative counters; this doc describes what is typically also expected when those counters land in Grafana / Wonderware / a custom HMI.

Scope. This is a proposal, not a contract. The endpoint shape settled in design.md → "Status page" is what ships today; the items below are dashboard-side derivatives or new counters that operators of comparable Modbus / SCADA proxy fleets typically expect.

Reading guide. Each KPI has:

Name — short identifier matching the proxy's existing camelCase convention.
Definition — what the number means.
Source — where the value comes from (existing counter, new counter, derived).
Widget — typical dashboard visualisation.
Alert — common threshold or anomaly rule (where applicable).
Effort — implementation cost in hours (rough order-of-magnitude).

What's exposed today (recap)

For context: everything recommended below comes on top of this list. Today's /status.json carries:

Group	Fields
Service	`uptimeSeconds`, `version`, `configLastReloadUtc`, `configReloadCount`, `configReloadRejectedCount`
Listeners	`bound`, `configured`
Per-PLC listener	`state`, `lastBindError`, `recoveryAttempts`
Per-PLC clients	`connected`, `remoteEndpoints[]` (remote, connectedAtUtc, pdusForwarded)
Per-PLC PDUs	`forwarded`, `byFc.{fc03,fc04,fc06,fc16,other}`, `rewrittenSlots`, `partialBcdWarnings`
Per-PLC backend	`connectsSuccess`, `connectsFailed`, `exceptionsByCode.{code01..code04}`, `lastRoundTripMs`
Per-PLC bytes	`upstreamIn`, `upstreamOut`

Counters are cumulative since process start. A restart resets them.

Tier 1 — strongly recommended for production

These are the additions that, in practice, are the difference between "I can see the proxy is up" and "I can run a 54-PLC fleet from this dashboard."

1.1 Rate metrics (per-PLC and fleet-wide)

KPI	Definition	Source	Widget	Alert	Effort
`pdus.ratePerSec.last1m`	PDU rate over the last 60 s	New per-PLC ring buffer (60 × 1 s samples)	Sparkline per PLC	None — informational	4 h
`pdus.ratePerSec.last5m`	Same over 5 min	Same buffer at 300 s	Sparkline	None	shared
`errors.ratePerMin`	Sum of `exceptionsByCode.*` + `partialBcdWarnings` + `invalidBcdWarnings` per minute	Derived	Stat tile per PLC	> 10/min → page	2 h
`bytes.ratePerSec.up` / `.down`	Bandwidth each direction	Derived from deltas of the per-PLC `upstreamIn` / `upstreamOut` byte counters	Stacked area	None — informational	2 h
`fleet.totalPdusPerSec`	Sum of all PLCs' rates	Aggregate	Single number, big	None	1 h

Why this matters. Cumulative counters answer "did anything ever happen" but not "is anything happening right now." A Grafana panel computing rate(pdus_forwarded[1m]) on a 54-row fleet is the single most informative widget on the dashboard.

Implementation note. Rate-from-counter computation can live entirely on the dashboard side (Prometheus/Grafana handles it natively). If we want the rates in /status.json directly, add a per-PLC Mbproxy.Proxy.RateTracker with a fixed-size circular buffer of 300 one-second samples (the 1-minute rate reads the newest 60, the 5-minute rate reads all 300) and expose RatePerSec1m and RatePerSec5m.
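
A minimal sketch of such a tracker, assuming one-second buckets and a caller-supplied clock in whole seconds. The member names (Record, RatePerSec) and the 300-slot default are illustrative assumptions, not the shipped API; 300 slots are used so that one buffer can serve both the 1-minute and 5-minute rates.

```csharp
using System;

// Hypothetical sketch of a per-PLC rate tracker; not the shipped implementation.
public sealed class RateTracker
{
    private readonly long[] _buckets;      // PDU count per one-second slot
    private readonly object _gate = new();
    private long _newestSecond;            // most recent second seen so far

    public RateTracker(int windowSeconds = 300)
    {
        if (windowSeconds <= 0) throw new ArgumentOutOfRangeException(nameof(windowSeconds));
        _buckets = new long[windowSeconds];
    }

    // Record `count` PDUs observed during second `nowSeconds`.
    public void Record(long nowSeconds, long count = 1)
    {
        lock (_gate)
        {
            Advance(nowSeconds);
            if (nowSeconds > _newestSecond - _buckets.Length)   // drop samples older than the window
                _buckets[Slot(nowSeconds)] += count;
        }
    }

    // Average PDUs per second over the trailing `seconds`, including the current second.
    public double RatePerSec(long nowSeconds, int seconds)
    {
        if (seconds <= 0 || seconds > _buckets.Length) throw new ArgumentOutOfRangeException(nameof(seconds));
        lock (_gate)
        {
            Advance(nowSeconds);
            long sum = 0;
            for (int i = 0; i < seconds; i++) sum += _buckets[Slot(_newestSecond - i)];
            return (double)sum / seconds;
        }
    }

    // Zero every slot that belongs to a second we skipped over.
    private void Advance(long nowSeconds)
    {
        if (nowSeconds <= _newestSecond) return;
        long stale = Math.Min(nowSeconds - _newestSecond, _buckets.Length);
        for (long s = nowSeconds - stale + 1; s <= nowSeconds; s++) _buckets[Slot(s)] = 0;
        _newestSecond = nowSeconds;
    }

    private int Slot(long second) =>
        (int)(((second % _buckets.Length) + _buckets.Length) % _buckets.Length);
}
```

With this shape, RatePerSec(now, 60) and RatePerSec(now, 300) would back `pdus.ratePerSec.last1m` and `.last5m` from the same buffer. Feeding it from a monotonic clock rather than wall time avoids jumps when the system clock is adjusted.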

1.2 Latency percentiles (replacing the bare EWMA)

KPI	Definition	Source	Widget	Alert	Effort
`backend.roundTripMs.p50`	Median backend round-trip over last 1 min	New per-PLC reservoir sample (size 256)	Line chart, per-PLC	None	6 h
`backend.roundTripMs.p95`	95th percentile	Same reservoir	Line chart	> 500 ms sustained 5 min → warn	shared
`backend.roundTripMs.p99`	99th percentile	Same reservoir	Line chart	> 2 s sustained 5 min → page	shared
`backend.roundTripMs.max1m`	Slowest single PDU in last 1 min	Running max tracked alongside the reservoir (a sampled reservoir can miss the true maximum)	Stat tile	> 5 s → page	shared

Why this matters. The existing lastRoundTripMs is an EWMA — useful, but it smooths away tail events. A single PLC misbehaving with bursty 5-second responses won't show up in EWMA but is obvious in p99. Modbus clients have hard timeouts (typically 3 s); knowing p99 lets you set them confidently.

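Implementation note. Use Mbproxy.Proxy.LatencyReservoir: a 256-sample reservoir filled with Vitter's Algorithm R, which keeps an unbiased sample regardless of throughput. Reset it at each window boundary so the percentiles describe the last minute rather than the whole process lifetime. Don't store every sample: a busy PLC at 100 PDU/s produces 6,000 samples/min, and across 54 PLCs that is 324,000 samples/min, which is too much.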
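
The reservoir idea can be sketched as follows. This is an illustration under assumptions: the member names, the nearest-rank percentile method, and the separately tracked exact maximum are choices made for the sketch, not details confirmed by the codebase. Random.NextInt64 requires .NET 6 or later.

```csharp
using System;

// Hypothetical sketch: fixed-size reservoir (Algorithm R) plus an exact running max.
public sealed class LatencyReservoir
{
    private readonly double[] _samples;
    private readonly Random _rng;
    private readonly object _gate = new();
    private long _seen;      // observations since the last Reset
    private double _max;     // exact maximum; sampling alone could miss it

    public LatencyReservoir(int capacity = 256, int? seed = null)
    {
        if (capacity <= 0) throw new ArgumentOutOfRangeException(nameof(capacity));
        _samples = new double[capacity];
        _rng = seed.HasValue ? new Random(seed.Value) : new Random();
    }

    public void Add(double roundTripMs)
    {
        lock (_gate)
        {
            _seen++;
            if (roundTripMs > _max) _max = roundTripMs;
            if (_seen <= _samples.Length)
            {
                _samples[_seen - 1] = roundTripMs;        // fill phase
            }
            else
            {
                long j = _rng.NextInt64(_seen);            // uniform in [0, _seen)
                if (j < _samples.Length) _samples[j] = roundTripMs;
            }
        }
    }

    // Nearest-rank percentile over the current sample; NaN when empty.
    public double Percentile(double p)
    {
        lock (_gate)
        {
            int n = (int)Math.Min(_seen, _samples.Length);
            if (n == 0) return double.NaN;
            var sorted = new double[n];
            Array.Copy(_samples, sorted, n);
            Array.Sort(sorted);
            int rank = (int)Math.Ceiling(p * n / 100.0);
            return sorted[Math.Clamp(rank, 1, n) - 1];
        }
    }

    public double Max { get { lock (_gate) return _max; } }

    // Call at each window boundary (for example once a minute).
    public void Reset() { lock (_gate) { _seen = 0; _max = 0; } }
}
```

Sorting 256 doubles per snapshot is cheap enough to do on demand, which keeps the hot path (Add) to a counter increment and at most one array write.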

1.3 Per-PLC availability ratio

KPI	Definition	Source	Widget	Alert	Effort
`listener.boundRatio.last1h`	Fraction of time in `bound` state over last hour	New per-supervisor state-time tracker	Gauge per PLC	< 0.99 → warn, < 0.95 → page	4 h
`listener.boundRatio.sinceStart`	Fraction over process lifetime	Same tracker	Gauge	< 0.999 → warn	shared
`listener.timeInRecoveringMs.last1h`	Total time spent recovering in last hour	Same tracker	Stat tile	> 60 s → warn	shared

Why this matters. recoveryAttempts tells you how many times something has flapped, but not how much downtime that represented. A PLC that recovers in 1 s once an hour is healthy; one that recovers in 90 s every 10 min is degraded. The ratio captures this directly.

Implementation note. Each PlcListenerSupervisor already has a state machine. Add a StateDurationTracker that timestamps every state transition and accumulates total time in each state. Surface the ratio over a sliding window.
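
A sketch of the tracker under simplifying assumptions: it covers only the since-start ratio, the ListenerState enum and member names are hypothetical, and timestamps are passed in by the caller so the logic is testable. A sliding one-hour window would additionally need per-interval buckets or a pruned transition log.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical state names mirroring the bound / recovering / stopped states in the text.
public enum ListenerState { Bound, Recovering, Stopped }

public sealed class StateDurationTracker
{
    private readonly Dictionary<ListenerState, TimeSpan> _closed = new();
    private readonly object _gate = new();
    private readonly DateTimeOffset _start;
    private ListenerState _current;
    private DateTimeOffset _enteredAt;

    public StateDurationTracker(ListenerState initial, DateTimeOffset now)
    {
        _start = _enteredAt = now;
        _current = initial;
    }

    // Close out the time spent in the current state and enter the next one.
    public void Transition(ListenerState next, DateTimeOffset now)
    {
        lock (_gate)
        {
            _closed[_current] = _closed.GetValueOrDefault(_current) + (now - _enteredAt);
            _current = next;
            _enteredAt = now;
        }
    }

    // Accumulated time in `state`, including the still-open interval if it is current.
    public TimeSpan TimeIn(ListenerState state, DateTimeOffset now)
    {
        lock (_gate)
        {
            var total = _closed.GetValueOrDefault(state);
            return state == _current ? total + (now - _enteredAt) : total;
        }
    }

    // Fraction of the tracker's lifetime spent in `state`.
    public double Ratio(ListenerState state, DateTimeOffset now)
    {
        var lifetime = now - _start;
        return lifetime <= TimeSpan.Zero ? 0.0 : TimeIn(state, now) / lifetime;
    }
}
```

For example, a listener that is bound for 54 minutes, recovers for 6, and is bound again yields a since-start bound ratio of 0.9 at the one-hour mark.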

1.4 Liveness / staleness signals

KPI	Definition	Source	Widget	Alert	Effort
`pdus.lastForwardedUtc`	Wall time of the most recent forwarded PDU	New `_lastForwardedTicks` per PLC	Stat tile	`now - value > 5 min AND clients.connected > 0` → page	1 h
`clients.lastActivityUtc`	Per-client last-PDU timestamp	Already implicit; expose explicitly	Per-row in remoteEndpoints	None	1 h
`staleClients.count`	Connected clients with no PDUs in last 5 min	Derived	Stat tile	> 0 → informational	1 h

Why this matters. Operators want to know "is this PLC actually doing anything?" not just "is the listener bound?" A PLC with clients.connected = 2 but no PDU in 10 minutes is suspicious — either the clients are dead, the network is broken, or the HMI is misconfigured.
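
The paging rule for `pdus.lastForwardedUtc` can be expressed as a small predicate. This is a sketch; the Liveness class and the choice not to alert before the first PDU has ever been forwarded are assumptions.

```csharp
using System;

public static class Liveness
{
    // True when clients are attached but nothing has been forwarded within `threshold`.
    public static bool IsSilent(DateTimeOffset? lastForwardedUtc, int clientsConnected,
                                DateTimeOffset now, TimeSpan threshold)
    {
        if (clientsConnected <= 0) return false;      // nobody attached: silence is expected
        if (lastForwardedUtc is null) return false;   // assumption: no baseline yet, do not alert
        return now - lastForwardedUtc.Value > threshold;
    }
}
```

A dashboard would call this with a five-minute threshold per PLC and page when it returns true.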

1.5 Service-wide fleet aggregates

These are single-number widgets that surface fleet health at a glance, typically rendered as large stat tiles in the header of the dashboard.

KPI	Definition	Source	Widget	Alert	Effort
`fleet.plcsHealthy`	Count of PLCs in `bound` state with no errors in last 5 min	Aggregate	Big number, green	< `listeners.configured - 2` → warn	2 h
`fleet.plcsRecovering`	Count in `recovering` state	Aggregate	Big number, orange	> 0 → informational	shared
`fleet.plcsStopped`	Count in `stopped` state	Aggregate	Big number, grey	> 0 → page	shared
`fleet.plcsWithActiveErrors`	Count with `errors.ratePerMin > 0`	Aggregate	Big number, red	> 0 → page	shared
`fleet.totalClientsConnected`	Sum of `clients.connected`	Aggregate	Stat tile	None	1 h
`fleet.totalRewrittenSlotsPerSec`	Sum of rewrite rates	Aggregate + derived	Sparkline	None	shared

Why this matters. A 54-row table is hard to scan. A "47 healthy / 5 recovering / 2 errors" header lets the operator know whether to even look at the table.
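
The header aggregates are plain reductions over the per-PLC rows. A sketch, assuming a hypothetical PlcHealth row type whose fields stand in for values the snapshot would already carry:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical per-PLC row; field names are illustrative only.
public sealed record PlcHealth(string Name, string State, long ErrorsLast5Min,
                               double ErrorRatePerMin, int ClientsConnected);

public sealed record FleetSummary(int Healthy, int Recovering, int Stopped,
                                  int WithActiveErrors, int TotalClients);

public static class FleetAggregator
{
    public static FleetSummary Summarise(IReadOnlyCollection<PlcHealth> plcs) => new(
        Healthy: plcs.Count(p => p.State == "bound" && p.ErrorsLast5Min == 0),
        Recovering: plcs.Count(p => p.State == "recovering"),
        Stopped: plcs.Count(p => p.State == "stopped"),
        WithActiveErrors: plcs.Count(p => p.ErrorRatePerMin > 0),
        TotalClients: plcs.Sum(p => p.ClientsConnected));
}
```

Note that the categories are not mutually exclusive: a bound PLC with recent errors counts under WithActiveErrors but not under Healthy.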

1.6 Multiplexer state — shipped in Phase 9

The proxy holds one backend socket per PLC and multiplexes upstream clients via MBAP TxId rewriting. The 4-client ECOM cap is no longer a meaningful operational concern; the new saturation surface is the 16-bit TxId space and the per-PLC outbound queue depth.

KPI	Definition	Source	Widget	Alert	Effort
`backend.inFlightCount`	Current in-flight Modbus requests on this PLC's backend connection	Phase-9 counter	Sparkline per PLC	Sustained > 100 → investigate (high churn or slow backend)	(in Phase 9 scope)
`backend.maxInFlight`	Peak in-flight count observed since process start	Phase-9 counter	Stat tile per PLC	Approaches 65,000 → page (TxId saturation imminent — realistic only under pathological load)	(in Phase 9 scope)
`backend.txIdWraps`	Times the TxId allocator has wrapped 0xFFFF → 0x0000	Phase-9 counter	Stat tile per PLC	Sudden increase in wrap rate → very high in-flight churn; investigate fairness	(in Phase 9 scope)
`backend.queueDepth`	Current outbound channel depth (frames queued for the backend writer)	Phase-9 counter	Sparkline per PLC	Sustained > 50 → backend is slower than upstream demand; latency rising	(in Phase 9 scope)
`backend.disconnectCascades`	Total upstream clients closed due to backend disconnects	Phase-9 counter	Stat tile per PLC	Spike → network instability; correlate with `mbproxy.backend.failed` events	(in Phase 9 scope)

Why this matters. Multiplexing concentrates connection risk: a single backend disconnect now cascades to every attached upstream client. The cascade counter quantifies that blast radius. Queue depth is the new latency leading indicator (today's lastRoundTripMs measures wire latency only; queue depth reveals proxy-side backlog).
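
To make the relationship between inFlightCount, maxInFlight, and txIdWraps concrete, here is a toy allocator. It is an illustration only and makes no claim about how the shipped TxIdAllocator works; the class name and members are hypothetical, and the sketch is not thread-safe.

```csharp
using System;
using System.Collections.Generic;

// Toy model of a 16-bit transaction-id allocator; not the shipped implementation.
public sealed class TxIdAllocatorSketch
{
    private readonly HashSet<ushort> _inFlight = new();
    private ushort _next;

    public long Wraps { get; private set; }        // times the cursor passed 0xFFFF -> 0x0000
    public int MaxInFlight { get; private set; }   // peak concurrent ids since construction
    public int InFlightCount => _inFlight.Count;

    // Returns false only when all 65,536 ids are in flight (saturation).
    public bool TryAllocate(out ushort txId)
    {
        txId = 0;
        if (_inFlight.Count == 65536) return false;
        while (true)
        {
            ushort candidate = _next;
            if (_next == ushort.MaxValue) { _next = 0; Wraps++; } else { _next++; }
            if (_inFlight.Add(candidate))          // skip ids still awaiting a response
            {
                txId = candidate;
                MaxInFlight = Math.Max(MaxInFlight, _inFlight.Count);
                return true;
            }
        }
    }

    // Call when the response (or watchdog timeout) for `txId` arrives.
    public bool Release(ushort txId) => _inFlight.Remove(txId);
}
```

In this model the wrap counter climbs with total request volume, while saturation depends only on how many ids are outstanding at once, which is why the two are tracked separately.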

1.7 Read coalescing — requires Phase 10

After Phase 10 ships, same-key FC03/04 reads within the in-flight window attach to one another instead of generating duplicate backend requests. The coalescing ratio is the headline metric.

KPI	Definition	Source	Widget	Alert	Effort
`backend.coalescedHitCount`	FC03/04 requests attached to an already-in-flight peer	Phase-10 counter	Sparkline	None — trend-watch	(in Phase 10 scope)
`backend.coalescedMissCount`	FC03/04 requests that created a fresh backend round-trip	Phase-10 counter	Sparkline	None — trend-watch	(in Phase 10 scope)
`backend.coalescingRatio`	`Hit / (Hit + Miss)` over the trailing window	Derived (dashboard)	Stat tile per PLC	None; a low ratio just means clients aren't synchronised on the same registers — informational	(in Phase 10 scope)
`backend.coalescedResponseToDeadUpstream`	Fan-out responses dropped because the attached upstream disconnected mid-flight	Phase-10 counter	Stat tile per PLC	Spike → client churn during traffic burst; usually not actionable	(in Phase 10 scope)

Why this matters. The coalescing ratio is the "how much PLC traffic did we save" metric. A 60% ratio means 60% of FC03/04 reads attached to an existing in-flight request, which is a 60% reduction in backend FC03/04 requests compared with the pre-Phase-10 model (writes and other function codes are unaffected, so the reduction in total PDU rate is smaller). The dead-upstream counter is a churn indicator that is invisible in any other metric.
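
As a worked example of the ratio, with a guard for the empty window (the Coalescing helper is a hypothetical name): 60 hits and 40 misses give a ratio of 0.6 and 40 backend round-trips instead of 100.

```csharp
public static class Coalescing
{
    // Hit / (Hit + Miss); defined as 0 when no FC03/04 reads were seen in the window.
    public static double Ratio(long hits, long misses)
    {
        long total = hits + misses;
        return total == 0 ? 0.0 : (double)hits / total;
    }

    // Every miss creates one backend round-trip; hits ride along on an existing one.
    public static long BackendRequests(long hits, long misses) => misses;
}
```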

1.8 Response cache — requires Phase 11

After Phase 11 ships, FC03/04 responses for opt-in tags are cached with a per-tag TTL. Cache hits serve from in-process memory without backend traffic; FC06/FC16 write responses invalidate overlapping entries.

KPI	Definition	Source	Widget	Alert	Effort
`backend.cacheHitCount`	FC03/04 requests served from the cache	Phase-11 counter	Sparkline per PLC	None — informational	(in Phase 11 scope)
`backend.cacheMissCount`	FC03/04 requests that fell through to the backend (or coalescing)	Phase-11 counter	Sparkline per PLC	None — informational	(in Phase 11 scope)
`backend.cacheHitRatio`	`Hit / (Hit + Miss)` for cache-eligible reads	Derived (dashboard)	Stat tile per PLC	None; informs whether TTL tuning is worthwhile	(in Phase 11 scope)
`backend.cacheInvalidations`	Cache entries invalidated by FC06/FC16 write responses	Phase-11 counter	Stat tile per PLC	High rate → many writes to cached addresses; consider reducing TTL on those tags	(in Phase 11 scope)

Why this matters. Cache-hit-ratio is the operator's ROI metric — TTLs that yield low hit-ratios are wasted staleness. The invalidation counter reveals writes-to-cached-reads churn: a high rate suggests the cache is invalidating itself constantly, meaning the TTL configuration isn't matching real access patterns. Both are operational tuning signals, not alerts.

Tier 2 — nice-to-have

Reach for these once Tier 1 is solid. They add depth for specific operational scenarios.

2.1 Connection-cap saturation warning

Status: superseded by Phase 9. This KPI tracked the H2-ECOM100's 4-concurrent-TCP-client cap, which was the headline operational ceiling under the pre-Phase-9 1:1 connection model. Since Phase 9, the proxy holds exactly one backend socket per PLC regardless of how many upstream clients connect, so the 4-client cap on the ECOM is no longer reachable from the upstream side. The closest post-Phase-9 equivalent is backend.inFlightCount (Tier 1.6) against the 65,536-value ceiling of the 16-bit TxId space, but that is realistically unreachable under any normal load. Keep this section as historical context only; do not implement it on a Phase-9 (or later) deployment.

KPI	Definition	Source	Widget	Alert	Effort
`clients.atCapWarning`	Boolean: `clients.connected >= 3` (1 short of ECOM100's 4-client cap)	Derived	Cell highlight	True → warn	1 h
`clients.atCapBlocked`	Boolean: `clients.connected >= 4` (cap reached)	Derived	Cell highlight	True → page	shared

Why this mattered (pre-Phase-9). The H2-ECOM100's 4-simultaneous-TCP-client cap was a documented operational ceiling (see design.md → "Connection model" and DL260/dl205.md → "Behavioral Oddities"). When 4 clients were connected, the 5th would see backend connect failures. Surfacing this proactively let ops kick a stale client before incoming clients failed. Phase 9 eliminates the underlying problem; this KPI exists in the catalogue only as a historical reference for pre-Phase-9 deployments.

2.2 Error breakdown / heatmap

KPI	Definition	Source	Widget	Alert	Effort
`partialBcd.byClient`	Count of partial-BCD warnings grouped by client remote endpoint	New per-client counter	Top-N list	Top-1 > 100/hr → ops should check the client's tag definition	3 h
`invalidBcd.byAddress`	Count of invalid-BCD events grouped by Modbus address	New per-address counter (small map)	Heatmap	Single address with persistent rate → broken PLC logic	4 h
`exceptions.byCodeRate`	Per-exception-code rate over 5 min	Derived from `exceptionsByCode.*`	Stacked bar	Code 04 (Slave Device Failure) spike → PLC in PROGRAM mode?	2 h

Why this matters. Once you've seen partialBcdWarnings = 1247, the next question is which client and which tag. Without a dimensional breakdown, you have to SSH into the host and search the log file to find out.

2.3 Hot-reload cadence

KPI	Definition	Source	Widget	Alert	Effort
`config.reloadsPerHour`	Reload events per hour	Derived from `configReloadCount`	Sparkline	> 10/hr → unusual; misconfig loop?	1 h
`config.lastReloadDelta`	Summary of what changed on last reload	Already in `mbproxy.config.reload.applied` event; surface here	Text snippet	None — informational	2 h

Why this matters. Config thrashing is a smell — usually means an automation tool is fighting with a manual edit or a CI deploy is misconfigured.

2.4 Memory / process health

KPI	Definition	Source	Widget	Alert	Effort
`process.workingSetMb`	`Process.GetCurrentProcess().WorkingSet64 / 1MB`	New	Stat tile	> 1024 MB → warn (54 PLCs shouldn't need that much)	0.5 h
`process.gcCollections.gen0/1/2`	GC counts per generation	`GC.CollectionCount(n)`	Sparkline	Gen-2 frequency → memory pressure	0.5 h
`process.threadCount`	`Process.Threads.Count`	New	Stat tile	> 200 → leak?	0.5 h

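Why this matters. A long-running service in a 24/7 plant needs to prove it's not leaking. These three numbers catch the most common leak patterns (working-set growth, Gen-2 GC pressure, thread leaks). Each is a single Process or GC API call made at snapshot time, so the overhead is negligible.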

Real-time updates via SignalR

Today's status surface is poll-based: the HTML page uses a 5-second meta-refresh, and Prometheus / custom HMI scrapers hit /status.json on their own cadence. For a glance dashboard or a TSDB scrape that's fine. For a live fleet dashboard with many panels open, polling 54 PLCs at 1 Hz means ~54 HTTP round-trips per second from the dashboard backend, and a state transition (e.g., a listener flipping bound → recovering) is invisible until the next poll window. SignalR addresses both: one persistent connection per dashboard client, server pushes counter deltas and discrete events at the cadence that makes sense for each kind of update.

The recommendation is additive, not a replacement. Keep /status.json for scrapers and the meta-refresh HTML for the operator-with-a-browser case. Add a SignalR hub for full-screen live dashboards. Existing consumers do not change.

Why this is cheap to add

The Microsoft.AspNetCore.App framework reference that Phase 07 added to the csproj already includes Microsoft.AspNetCore.SignalR, so there is no new NuGet package and no version pinning. If the service is ever published with Native AOT or trimming, check SignalR's support level on the target .NET version first. The hub mounts on the existing Kestrel server that runs on Mbproxy.AdminPort. No additional port, no additional listener supervision, no additional shutdown path.

Architecture

                                  ┌─→ Dashboard A (subscribed to "all")
ProxyWorker / Supervisors ──┐     │
ConfigReconciler ───────────┤     │
ProxyCounters ──────────────┼──→ StatusBroadcaster ──→ StatusHub ──┼─→ Dashboard B (subscribed to "plc:Line1-Mixer")
ServiceCounters ────────────┘     (background loop +                │
                                   immediate-push paths)            └─→ Dashboard C (subscribed to "service")

StatusHub : Hub — the SignalR endpoint mounted at /hub/status on AdminPort. Clients call its methods to subscribe; the server invokes client-side callbacks to deliver updates.
StatusBroadcaster : IHostedService — the background pusher. Holds a Timer (or PeriodicTimer) that ticks at PushIntervalMs (default 1000 ms), builds a StatusResponse via the existing StatusSnapshotBuilder, diffs it against the previous snapshot, and pushes only the changed pieces. Also exposes PushEventAsync(name, props) for the immediate-push paths.
Immediate-push wiring — the existing log events (mbproxy.listener.recovered, mbproxy.config.reload.applied, mbproxy.backend.failed, mbproxy.rewrite.partial_bcd, etc.) gain a fan-out call to broadcaster.PushEventAsync(...) so subscribers see them within ~10 ms of occurrence rather than at the next push tick.

Hub contract

Hub URL: https://<host>:<AdminPort>/hub/status

Hub groups — clients subscribe to scopes; the server broadcasts to matching groups:

Group	Receives
`all`	Every update for every PLC + every service-level event
`service`	Service-level events only (`mbproxy.config.`, `mbproxy.admin.`, `mbproxy.startup.`, `mbproxy.shutdown.`)
`plc:<Name>`	One PLC's snapshots + that PLC's events

Server-side methods (client → server):

Method	Purpose
`Task SubscribeFleet()`	Join group `all`
`Task SubscribeService()`	Join group `service`
`Task SubscribePlc(string name)`	Join group `plc:<name>` after validating that `name` exists in current options
`Task Unsubscribe()`	Leave every group; the connection stays open but receives nothing

Client-side callbacks (server → client; the On* prefix is this proposal's naming convention):

Callback	Payload	When
`OnSnapshot(StatusResponse snapshot)`	Full snapshot of the relevant scope (`all`, `service`, or a single PLC)	Sent once on subscribe so the dashboard has a baseline (including a re-subscribe after a reconnect), then every `FullSnapshotIntervalMs` (default 60 s) as a re-baseline
`OnPatch(StatusPatch patch)`	Delta of fields that changed since the last push	Periodic — every `PushIntervalMs` if anything changed; skipped if nothing changed
`OnEvent(StatusEvent ev)`	Single discrete event: `{ name, levelString, plc?, propertiesJson, timestampUtc }`	Immediately — fan-out from the existing `[LoggerMessage]` event call sites

StatusPatch carries only the fields that changed since the previous push: it's a Dictionary<string, JsonElement> keyed by JSON path (e.g., "plcs[2].pdus.forwarded", "plcs[2].listener.state"). Dashboard clients apply these to their local model. Keeps wire traffic tiny when the fleet is idle.
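
One way to produce such a path-keyed delta is to flatten both snapshots into path → value maps and keep the entries that differ. The sketch below works on System.Text.Json.Nodes trees and stores each leaf as raw JSON text rather than JsonElement to stay short; the StatusDiff name is hypothetical, and removed paths (for example a PLC dropped by a hot reload) are deliberately not handled.

```csharp
using System.Collections.Generic;
using System.Text.Json.Nodes;

public static class StatusDiff
{
    // Flatten a JSON tree into path -> raw JSON text for every leaf value.
    public static Dictionary<string, string> Flatten(JsonNode? root)
    {
        var leaves = new Dictionary<string, string>();
        Walk(root, "", leaves);
        return leaves;
    }

    private static void Walk(JsonNode? node, string path, Dictionary<string, string> leaves)
    {
        switch (node)
        {
            case JsonObject obj:
                foreach (var (key, child) in obj)
                    Walk(child, path.Length == 0 ? key : $"{path}.{key}", leaves);
                break;
            case JsonArray arr:
                for (int i = 0; i < arr.Count; i++)
                    Walk(arr[i], $"{path}[{i}]", leaves);
                break;
            default:
                leaves[path] = node?.ToJsonString() ?? "null";   // scalar or JSON null
                break;
        }
    }

    // Paths whose value is new or changed in `current` relative to `previous`.
    public static Dictionary<string, string> Diff(JsonNode? previous, JsonNode? current)
    {
        var before = Flatten(previous);
        var patch = new Dictionary<string, string>();
        foreach (var (path, value) in Flatten(current))
            if (!before.TryGetValue(path, out var old) || old != value)
                patch[path] = value;
        return patch;
    }
}
```

An idle fleet produces an empty dictionary, which is the case where the broadcaster would skip the push entirely.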

What gets pushed, and when

Update kind	Cadence	Volume per PLC	Channel
Counter increments (PDUs, bytes, rewrites)	Every `PushIntervalMs` if changed; coalesced	1 patch / push tick / subscribed group	`OnPatch`
State transitions (`bound ↔ recovering ↔ stopped`)	Immediate	1 event + 1 patch	`OnEvent` + `OnPatch`
Discrete log events at level ≥ Info from the stable vocabulary	Immediate	1 event per occurrence	`OnEvent`
Hot-reload applied / rejected	Immediate	1 event with `propertiesJson` summary	`OnEvent`
Periodic full snapshot	Every 60 s	1 full snapshot	`OnSnapshot`

The periodic full snapshot every 60 s is a self-healing measure: if a patch is missed (rare with SignalR but possible on transport hiccups), the next minute resets the dashboard's local model to ground truth.

Configuration

Extend appsettings.json with:

"Mbproxy": {
  // ... existing keys ...
  "Admin": {
    "SignalR": {
      "Enabled": true,
      "PushIntervalMs": 1000,            // patch cadence
      "FullSnapshotIntervalMs": 60000,   // periodic re-baseline
      "MaxConcurrentClients": 32,        // refuse new connections beyond this
      "MaxGroupsPerClient": 8            // anti-runaway-subscription guard
    }
  }
}

The feature is opt-in: when the SignalR section is omitted, Enabled defaults to false. With SignalR.Enabled = false the hub is not mapped, the broadcaster is not started, and there is zero runtime cost. Hot-reload of these keys is desirable but lower priority than core functionality; the first release can require a restart.

Implementation outline

Hub class — src/Mbproxy/Admin/StatusHub.cs. Inherits Hub. Implements the three Subscribe* methods and Unsubscribe. OnConnectedAsync calls Context.Abort() when accepting the connection would exceed MaxConcurrentClients; track live connections in a static ConcurrentDictionary<string, byte> keyed by Context.ConnectionId and remove entries in OnDisconnectedAsync. (Context.Items is a per-connection property bag, not a connection count.)
Broadcaster — src/Mbproxy/Admin/StatusBroadcaster.cs : IHostedService. Constructor takes IHubContext<StatusHub>, StatusSnapshotBuilder, IOptionsMonitor<MbproxyOptions>. The push loop is a PeriodicTimer loop of the form while (await timer.WaitForNextTickAsync(ct)) { ... }. This is preferable to System.Threading.Timer because ticks never overlap and cancellation surfaces as an OperationCanceledException at a single await point.
DTOs — StatusPatch and StatusEvent records added to StatusDto.cs, registered with the source-gen StatusJsonContext.
Event fan-out — the existing [LoggerMessage] partial methods stay; add a thin RealtimeLogEvents wrapper class that logs AND calls broadcaster.PushEventAsync(...). Call sites in supervisors / pipelines / reconciler swap to the wrapper. Keeps log-only call sites and broadcast-too call sites both readable.
Hub mapping — AdminEndpointHost adds app.MapHub<StatusHub>("/hub/status") if SignalR.Enabled. The Kestrel pipeline stays minimal: the hub is the only WebSocket-capable endpoint.
Shutdown — StatusBroadcaster.StopAsync cancels its pump and the hub's Dispose chain handles connection teardown. The existing ShutdownCoordinator deadline applies.
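
The connection cap from the hub-class step can be isolated in a small helper so it is testable without a running hub. This is a sketch: the ConnectionLimiter name and its methods are hypothetical, and the hub is assumed to call TryEnter(Context.ConnectionId) from OnConnectedAsync (calling Context.Abort() on false) and Leave(...) from OnDisconnectedAsync.

```csharp
using System.Collections.Concurrent;

// Hypothetical helper; the hub would hold one static (or singleton-injected) instance.
public sealed class ConnectionLimiter
{
    private readonly ConcurrentDictionary<string, byte> _connections = new();
    private readonly int _max;

    public ConnectionLimiter(int max) => _max = max;

    public int Count => _connections.Count;

    // Returns false when admitting this connection would exceed the cap.
    public bool TryEnter(string connectionId)
    {
        if (!_connections.TryAdd(connectionId, 0)) return true;   // already tracked
        if (_connections.Count <= _max) return true;
        _connections.TryRemove(connectionId, out _);               // over the cap: roll back
        return false;
    }

    // Safe to call for ids that were never admitted.
    public void Leave(string connectionId) => _connections.TryRemove(connectionId, out _);
}
```

Under a burst of simultaneous connects at the boundary this add-then-check approach can reject one connection more than strictly necessary, which errs on the safe side for a capacity guard.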

Test approach

Use the Microsoft.AspNetCore.SignalR.Client package (NuGet) in the test csproj only. Pattern:

[Fact]
[Trait("Category", "E2E")]
public async Task SignalR_StatePatchFiresWithin_500ms_OfBackendException()
{
    var ct = TestContext.Current.CancellationToken;

    // Arrange: start host on a random AdminPort with PushIntervalMs well below
    // 500 ms (the 1000 ms default would make this assertion flaky), then build
    // a SignalR client. `adminPort` comes from the test fixture.
    await using var connection = new HubConnectionBuilder()
        .WithUrl($"http://localhost:{adminPort}/hub/status")
        .Build();

    var patches = new ConcurrentQueue<StatusPatch>();
    connection.On<StatusPatch>("OnPatch", patches.Enqueue);
    await connection.StartAsync(ct);
    await connection.InvokeAsync("SubscribePlc", "TestPLC", ct);

    // Act: induce a backend exception (e.g., point a configured PLC at 127.0.0.1:1).
    // ... drive request through proxy ...

    // Assert: a patch touching backend.connectsFailed arrives within 500 ms.
    // "plcs[0]" assumes TestPLC is the first configured PLC.
    const string path = "plcs[0].backend.connectsFailed";
    var deadline = DateTime.UtcNow.AddMilliseconds(500);
    while (DateTime.UtcNow < deadline && !patches.Any(p => p.Fields.ContainsKey(path)))
        await Task.Delay(20, ct);

    patches.ShouldContain(p => p.Fields.ContainsKey(path));
}

Skip-safe like the existing E2E suite: if the simulator isn't available, the test skips cleanly.

Coverage targets for the new tests:

SignalR_Subscribe_DeliversInitialSnapshot
SignalR_Patch_FiresWithinPushInterval_AfterCounterChange
SignalR_Event_FiresWithin_100ms_OfListenerRecovered
SignalR_SubscribePlc_OnlyDeliversThatPlcEvents — verifies group filtering
SignalR_MaxConcurrentClients_RefusesExcess — capacity guard
SignalR_FullSnapshotReBaseline_FiresEvery_FullSnapshotIntervalMs

Operational considerations

Authentication / authorisation. Same network-trust assumption as the rest of the admin endpoint — none in-process. If a hostile network is in scope, terminate at a reverse proxy that enforces auth (IIS, nginx) and treat SignalR like any other HTTP path through that proxy.
Transport. SignalR negotiates: WebSocket first, then Server-Sent Events, then long polling. The 0/1/2-RTT cost difference matters only for the first connection; subsequent updates are push regardless of transport.
Backpressure. Hub.Clients.Group("all").SendAsync does not buffer per-client. If a dashboard is slow, SignalR slows its writes; the broadcaster's push tick still runs at 1 Hz to all healthy clients. A slow client does not block the proxy.
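Reconnection. The .NET and browser SignalR clients reconnect automatically only when built with WithAutomaticReconnect() (default retry delays of 0, 2, 10, and 30 s). A reconnected client gets a new connection ID and its group memberships are not preserved, so the dashboard must call its Subscribe* method again from the Reconnected handler (onreconnected in the browser client); the snapshot sent on subscribe then re-baselines the local model. The periodic full snapshot every 60 s covers patches missed on a connection that stayed up.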
Cardinality at scale. 32 concurrent clients × 54 PLC subscriptions × 1 Hz patches × ~500 bytes / patch ≈ 850 KB/s outbound at saturation. Well within Kestrel's capacity on commodity hardware. The MaxConcurrentClients guard exists to prevent a misconfigured deploy from accidentally pointing 1000 dashboards at the same proxy.
CORS. If dashboards run on a different origin (likely), enable CORS on the admin app for /hub/status only. Add AdminCors.AllowedOrigins to appsettings.json as an array of allowed origin strings; an empty array means same-origin only.
Logging. SignalR's internal logs are noisy at Information. In appsettings.json, set the Microsoft.AspNetCore.SignalR category to Warning and Microsoft.AspNetCore.Http.Connections to Warning so the proxy's own event stream isn't drowned out.

Effort estimate

Work	Hours
Hub + DTOs + broadcaster	6 h
Event fan-out wiring (existing log events)	3 h
AdminEndpointHost integration + appsettings binding	2 h
E2E test suite (6 tests using SignalR .NET client)	4 h
Documentation (this section graduates from proposal to fact; design.md update)	1 h
Total	~16 h

This is comparable to Phase 07's status-page implementation (~14 hours) and slots well as a follow-on phase if SignalR turns out to be wanted in production.

Implementation notes

Where rates and percentiles should live

Two reasonable answers:

Compute in the proxy, expose pre-computed values in /status.json. Pro: dashboard tools don't need anything beyond raw HTTP scraping. Con: we own the windowing logic; choosing the wrong window sizes is annoying to change.
Expose raw cumulative counters; let the dashboard tool (Prometheus, Grafana) compute rates. Pro: zero in-process state; dashboard tooling does this natively and well. Con: requires a real TSDB sidecar.

Recommendation: ship Tier 1 rate metrics computed in-process for the operator who just opens http://<host>:8080/ in a browser, AND keep the raw counters so a real TSDB can scrape them too. The in-process windowed values are best-effort; the raw counters are authoritative.

Counter additions vs computed values

A few proposed KPIs require new counters in ProxyCounters or ServiceCounters, not just derivations:

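pdus.lastForwardedUtc — new long _lastForwardedTicks on ProxyCounters, written with Interlocked.Exchange and read with Interlocked.Read (C# does not allow the volatile modifier on a 64-bit long).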
listener.boundRatio.* — new StateDurationTracker on PlcListenerSupervisor.
partialBcd.byClient / invalidBcd.byAddress — new ConcurrentDictionary<string,long> / ConcurrentDictionary<ushort,long> on PerPlcContext. Keep cardinality bounded (cap to top-N or use a count-min sketch for very high-cardinality cases).
process.* — read fresh on every snapshot from Process.GetCurrentProcess() — no stored state.
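
For the timestamp counter, a minimal sketch of a tear-free 64-bit stamp. The LastForwardedStamp type is hypothetical; on ProxyCounters the field would presumably sit next to the existing counters.

```csharp
using System;
using System.Threading;

public sealed class LastForwardedStamp
{
    private long _lastForwardedTicks;   // 0 means "never"; otherwise UTC ticks

    // Called on the forwarding path; a single atomic store.
    public void Mark(DateTime utcNow) =>
        Interlocked.Exchange(ref _lastForwardedTicks, utcNow.Ticks);

    // Called by the snapshot builder; atomic even on 32-bit runtimes.
    public DateTime? Read()
    {
        long ticks = Interlocked.Read(ref _lastForwardedTicks);
        return ticks == 0 ? (DateTime?)null : new DateTime(ticks, DateTimeKind.Utc);
    }
}
```

Storing ticks rather than a DateTime keeps the field a plain long, which is what makes the Interlocked operations applicable.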

Snapshot serialization cost

StatusResponse is built per-request to /status.json. The current shape allocates one record per PLC plus nested children. Adding the Tier 1 fields adds ~6 longs per PLC = trivial allocation cost. Adding Tier 2 dimensional maps (e.g., invalidBcd.byAddress) adds a small dictionary serialization per PLC — fine for 54 PLCs × a few unique error addresses, but cap the dictionary size in code (top-50 by count, drop the rest) to keep /status.json under a few hundred KB even when something goes badly wrong.
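
The size cap can be applied at snapshot time with a simple top-N projection. A sketch, assuming the per-address counts are available as key/value pairs; the BoundedBreakdown name is hypothetical and the default of 50 follows the figure in the paragraph above.

```csharp
using System.Collections.Generic;
using System.Linq;

public static class BoundedBreakdown
{
    // Keep the n highest-count addresses; ties break on the lower address for determinism.
    public static Dictionary<ushort, long> TopN(
        IEnumerable<KeyValuePair<ushort, long>> counts, int n = 50) =>
        counts.OrderByDescending(kv => kv.Value)
              .ThenBy(kv => kv.Key)
              .Take(n)
              .ToDictionary(kv => kv.Key, kv => kv.Value);
}
```

This bounds the serialized size but not the in-memory map; bounding the live counter map itself is a separate concern.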

Widget	Use for
Stat (big number)	Service-wide aggregates, counts, latest timestamps
Gauge	Ratios (availability, success rate, queue depth)
Sparkline	Rates, percentiles, time-series trends
Stacked area	Bandwidth, PDU-by-FC breakdown over time
Heatmap	Per-address / per-client dimensional breakdowns
Cell-coloured table	Per-PLC status (54 rows, one per PLC, columns of KPIs)

Backwards-compat policy

The fields currently in /status.json are frozen — adding fields is fine, removing or renaming is a breaking change. Treat the field-name table in design.md → "Status page" as the contract; new fields ship via PRs that update the contract first.

Cross-references

Field tables for what ships today: design.md → "Status page".
Stable log event names (some KPIs are derivable by tailing these): design.md → "Logging" event-name table.
Per-counter wiring lives in src/Mbproxy/Proxy/ProxyCounters.cs and src/Mbproxy/ServiceCounters.cs.
The status HTML page is rendered by src/Mbproxy/Admin/StatusHtmlRenderer.cs; the JSON DTOs and source-gen context live in src/Mbproxy/Admin/StatusDto.cs.
