wwtools/mbproxy/docs/plan/07-status-page.md
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00


Phase 07 — Status page

Stand up the read-only Kestrel-hosted admin endpoint on Mbproxy.AdminPort. Two routes — GET / (self-contained HTML, meta-refresh 5 s) and GET /status.json (the same data as JSON). No admin actions, no auth.

Depends on: Phase 05 (supervisor snapshots), Phase 06 (config reload counters). Parallel-safe with: nothing (touches DI registration + needs counters from both 05 and 06).

Goal

A single port that an operator can open in a browser and see, at a glance:

  • Service uptime, version, last-reload timestamp + counts.
  • Every configured PLC's listener state (bound / recovering / stopped), last bind error, currently connected clients and their per-client PDU counts, PDU counts by function code, BCD slots rewritten, partial-overlap warnings, backend exception counts by code, last round-trip ms, bytes upstream/downstream.

The same data is exposed at /status.json for scraping (Prometheus textfile, custom Nagios check, etc.).

Outputs

src/Mbproxy/Admin/AdminEndpointHost.cs            # owns the Kestrel server lifecycle
src/Mbproxy/Admin/StatusSnapshotBuilder.cs        # composes per-PLC + service-wide snapshots
src/Mbproxy/Admin/StatusDto.cs                    # the wire DTOs for /status.json
src/Mbproxy/Admin/StatusHtmlRenderer.cs           # builds the single-page HTML
src/Mbproxy/Admin/AssemblyVersionAccessor.cs      # cached version string

tests/Mbproxy.Tests/Admin/StatusSnapshotBuilderTests.cs
tests/Mbproxy.Tests/Admin/AdminEndpointTests.cs   # HTTP-level; live Kestrel + HttpClient

Modifications:

  • src/Mbproxy/Mbproxy.csproj — add Microsoft.AspNetCore.App framework reference (the Worker SDK doesn't include ASP.NET Core by default).
  • src/Mbproxy/Program.cs — register AdminEndpointHost as a hosted service; wire it through DI alongside the proxy worker. AdminPort comes from IOptionsMonitor<MbproxyOptions>.
  • src/Mbproxy/Proxy/ProxyCounters.cs — extend with per-client counters: IReadOnlyList<ClientCounterSnapshot> Snapshot() includes connected clients with Remote, ConnectedAtUtc, PdusForwarded, LastRoundTripMs.
  • src/Mbproxy/Proxy/PlcConnectionPair.cs — record connect time, expose RemoteEndpoint, track round-trip time per request (EWMA via LastRoundTripMs field).
  • Service-wide counters introduced here: ServiceCounters with UptimeStartedAtUtc, LastReloadUtc, ReloadCount, ReloadRejectedCount. Wired into ConfigReconciler (bump on apply / reject) and the service start path (set started-at).

Tasks

  1. StatusDto.cs — record types matching the design's per-PLC + service-wide field tables verbatim. Use System.Text.Json source generation (JsonSerializerContext) to keep the response allocation-light:
    [JsonSerializable(typeof(StatusResponse))]
    internal partial class StatusJsonContext : JsonSerializerContext;
    
  2. StatusSnapshotBuilder.cs: pulls from the injected ProxyWorker (or a slim view of it), ConfigReconciler, ServiceCounters, and each PlcListenerSupervisor, and builds a StatusResponse record. Pure logic; no I/O. The builder is a sealed class constructed once via DI; Build() is its only operation.
  3. StatusHtmlRenderer.cs — pure function string Render(StatusResponse status). Produces a single HTML document with:
    • <meta http-equiv="refresh" content="5"> for auto-refresh.
    • A header line with service version + uptime + last-reload info.
    • A table per PLC. Columns match the per-PLC field set; listener.state is colour-coded inline (CSS in a <style> block — no external assets).
    • Total page weight under 50 KB for typical fleets; the design's 54-PLC count puts the table at ~54 rows.
  4. AssemblyVersionAccessor.cs — reads AssemblyInformationalVersionAttribute once at startup, caches it as a string. Used for the service.version field.
  5. AdminEndpointHost.cs: an IHostedService that:
    • On start: builds a WebApplication (Kestrel) configured to listen on AdminPort. Maps GET / to a handler that calls StatusSnapshotBuilder.Build() then StatusHtmlRenderer.Render(), returning text/html. Maps GET /status.json to a handler returning JsonSerializer.Serialize(snapshot, StatusJsonContext.Default.StatusResponse). NO other routes.
    • If AdminPort is in use at startup: log mbproxy.admin.bind.failed (new event) at Error, do not throw. The proxy listeners continue to run; only the admin endpoint is missing. Operators see this in logs.
    • On hot-reload of AdminPort: stop and restart the Kestrel server bound to the new port.
    • On stop: call StopAsync on the Kestrel app so it shuts down gracefully, with a 2 s deadline.
  6. ServiceCounters.cs (under src/Mbproxy/) — a singleton DI service holding the service-wide counters. Initialize(DateTimeOffset startedAtUtc); RecordReloadApplied(DateTimeOffset); RecordReloadRejected(). Snapshot returns an immutable record.
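
As a rough illustration of task 6, the sketch below shows one way ServiceCounters could look. The method names and counter fields come from this plan; the snapshot record name (ServiceCountersSnapshot) and the use of a plain lock are assumptions made for the example, not the settled design.

```csharp
using System;

// Immutable view handed to the status builder.
// The record name is hypothetical; the fields follow the plan.
public sealed record ServiceCountersSnapshot(
    DateTimeOffset UptimeStartedAtUtc,
    DateTimeOffset? LastReloadUtc,
    int ReloadCount,
    int ReloadRejectedCount);

public sealed class ServiceCounters
{
    // Assumption: a single lock is enough because reloads are rare (cold path).
    private readonly object _gate = new();
    private DateTimeOffset _startedAtUtc;
    private DateTimeOffset? _lastReloadUtc;
    private int _reloadCount;
    private int _reloadRejectedCount;

    // Called once from the service start path.
    public void Initialize(DateTimeOffset startedAtUtc)
    {
        lock (_gate) { _startedAtUtc = startedAtUtc; }
    }

    // Called by the reconciler when a reload is applied.
    public void RecordReloadApplied(DateTimeOffset appliedAtUtc)
    {
        lock (_gate)
        {
            _lastReloadUtc = appliedAtUtc;
            _reloadCount++;
        }
    }

    // Called by the reconciler when a reload is rejected.
    public void RecordReloadRejected()
    {
        lock (_gate) { _reloadRejectedCount++; }
    }

    // Copies all fields under the lock so the snapshot is internally consistent.
    public ServiceCountersSnapshot Snapshot()
    {
        lock (_gate)
        {
            return new ServiceCountersSnapshot(
                _startedAtUtc, _lastReloadUtc, _reloadCount, _reloadRejectedCount);
        }
    }
}
```

Taking the lock in Snapshot() keeps LastReloadUtc and ReloadCount from being observed mid-update; an Interlocked-based variant would also work if the fields are allowed to be read independently.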

Public surface declared in this phase

namespace Mbproxy.Admin;

internal sealed class AdminEndpointHost : IHostedService { /* ... */ }

public sealed record StatusResponse(
    ServiceFields Service,
    ListenersAggregate Listeners,
    IReadOnlyList<PlcStatus> Plcs);

public sealed record ServiceFields(
    long UptimeSeconds, string Version,
    DateTimeOffset? ConfigLastReloadUtc, int ConfigReloadCount, int ConfigReloadRejectedCount);

public sealed record ListenersAggregate(int Bound, int Configured);

public sealed record PlcStatus(
    string Name, string Host, int ListenPort,
    PlcListenerStatus Listener,
    PlcClientsStatus Clients,
    PlcPdusStatus Pdus,
    PlcBackendStatus Backend,
    PlcBytesStatus Bytes);

public sealed record PlcListenerStatus(string State, string? LastBindError, int RecoveryAttempts);
public sealed record PlcClientsStatus(int Connected, IReadOnlyList<ClientSnapshot> RemoteEndpoints);
public sealed record ClientSnapshot(string Remote, DateTimeOffset ConnectedAtUtc, long PdusForwarded);
public sealed record PlcPdusStatus(long Forwarded, FcCounts ByFc, long RewrittenSlots, long PartialBcdWarnings);
public sealed record FcCounts(long Fc03, long Fc04, long Fc06, long Fc16, long Other);
public sealed record PlcBackendStatus(long ConnectsSuccess, long ConnectsFailed, ExceptionCounts ExceptionsByCode, double LastRoundTripMs);
public sealed record ExceptionCounts(long Code01, long Code02, long Code03, long Code04);
public sealed record PlcBytesStatus(long UpstreamIn, long UpstreamOut);
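
To make the source-generated serialization path concrete, here is a small sketch that serializes one of the records above through a JsonSerializerContext. ListenersAggregate is repeated so the sketch compiles on its own. SketchJsonContext and StatusJsonSketch are hypothetical names, and the camelCase naming policy is an assumption; the authoritative field casing is whatever ../design.md specifies.

```csharp
using System.Text.Json;
using System.Text.Json.Serialization;

// Same shape as the record declared in this phase.
public sealed record ListenersAggregate(int Bound, int Configured);

// Assumption: camelCase property names on the wire.
[JsonSourceGenerationOptions(PropertyNamingPolicy = JsonKnownNamingPolicy.CamelCase)]
[JsonSerializable(typeof(ListenersAggregate))]
internal partial class SketchJsonContext : JsonSerializerContext
{
}

public static class StatusJsonSketch
{
    // Uses the generated JsonTypeInfo, so no reflection-based serializer is involved.
    public static string Serialize(ListenersAggregate listeners) =>
        JsonSerializer.Serialize(listeners, SketchJsonContext.Default.ListenersAggregate);
}
```

Under those assumptions, StatusJsonSketch.Serialize(new ListenersAggregate(3, 54)) yields {"bound":3,"configured":54}. The real context would register StatusResponse as the root type instead.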

Tests required

Unit (Category = Unit)

StatusSnapshotBuilderTests (≥ 6 tests):

  1. Build_NoPlcsConfigured_ReturnsEmptyPlcList
  2. Build_OnePlcBound_PopulatesListenerState_Bound
  3. Build_PlcRecovering_PopulatesLastBindError_AndAttempts
  4. Build_AggregatesListenersBoundAndConfigured
  5. Build_PerClientSnapshot_Includes_RemoteAndConnectedAt_AndPduCount
  6. Build_ServiceFields_IncludeUptime_Version_AndLastReload

StatusHtmlRendererTests (≥ 3 tests):

  1. Render_OnePlc_ProducesValidHtml_WithMetaRefresh
  2. Render_RecoveringPlc_HighlightsState
  3. Render_PageWeightUnder50KB_For54Plcs — assert character length.
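
The renderer tests above can be pictured against a trimmed renderer like the one below. It is only a sketch: the input is a hypothetical (name, state) pair per PLC rather than the full StatusResponse, the CSS class names are invented, and PageWeightBytes is a helper added for the example. It measures UTF-8 bytes, which is slightly stricter than the character-length assertion named in the test list.

```csharp
using System.Collections.Generic;
using System.Net;
using System.Text;

public static class StatusHtmlSketch
{
    // Hypothetical trimmed input: one (name, listener state) pair per PLC.
    public static string Render(IReadOnlyList<(string Name, string State)> plcs)
    {
        var sb = new StringBuilder();
        sb.Append("<!DOCTYPE html><html><head><meta charset=\"utf-8\">");
        sb.Append("<meta http-equiv=\"refresh\" content=\"5\">");
        // Inline CSS only; no external assets.
        sb.Append("<style>.bound{color:green}.recovering{color:orange}.stopped{color:red}</style>");
        sb.Append("</head><body><table><tr><th>PLC</th><th>Listener</th></tr>");
        foreach (var (name, state) in plcs)
        {
            // Encode everything that originates from config or runtime state.
            var safeState = WebUtility.HtmlEncode(state);
            sb.Append("<tr><td>").Append(WebUtility.HtmlEncode(name))
              .Append("</td><td class=\"").Append(safeState).Append("\">")
              .Append(safeState).Append("</td></tr>");
        }
        sb.Append("</table></body></html>");
        return sb.ToString();
    }

    // Page weight as it would go over the wire (UTF-8 bytes, uncompressed).
    public static int PageWeightBytes(string html) => Encoding.UTF8.GetByteCount(html);
}
```

A page-weight test would build 54 entries, call Render, and assert that PageWeightBytes stays below 50 * 1024.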

E2E (Category = E2E)

AdminEndpointTests (≥ 5 tests, against a live in-process Kestrel + simulator):

  1. Get_StatusJson_ReturnsValidShape
  2. Get_StatusJson_AfterReadFC03_ShowsPduCountIncreased
  3. Get_StatusJson_AfterPartialBcdWrite_ShowsPartialBcdWarning
  4. Get_Root_ReturnsHtml_WithMetaRefresh
  5. AdminPort_BindFailure_ServiceStaysUp_AndLogsBindFailed — pre-bind the AdminPort, start the service, assert proxy listeners come up and the admin endpoint logs the failure.
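
The behaviour exercised by the bind-failure test can be sketched as a small wrapper around the admin start call. This is an illustration under assumptions: AdminStartSketch, startAdmin, and logError are hypothetical names, and the sketch assumes Kestrel surfaces an address-in-use failure as an IOException thrown from the start call. Only the event name mbproxy.admin.bind.failed comes from this plan.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;

public static class AdminStartSketch
{
    // Returns true if the admin endpoint started, false if the bind failed.
    // It does not rethrow on bind failure, so the proxy listeners are unaffected.
    public static async Task<bool> TryStartAsync(
        Func<CancellationToken, Task> startAdmin,
        Action<string, Exception> logError,
        CancellationToken ct = default)
    {
        try
        {
            await startAdmin(ct);
            return true;
        }
        catch (IOException ex)
        {
            // Assumption: a bind failure arrives as IOException.
            logError("mbproxy.admin.bind.failed", ex);
            return false;
        }
    }
}
```

The E2E test would then assert two things: the proxy listeners accept connections, and the log sink saw the bind-failed event at Error level.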

Phase gate

  • Zero-warnings build.
  • All phase 00-06 tests still green.
  • All new unit + e2e tests green.
  • /status.json shape matches the field tables in ../design.md → "Status page" exactly (field names, casing, nesting).
  • Counters updated on the hot path (PdusForwarded, etc.) remain allocation-free; Snapshot() is the only allocating call and it's on the cold path.
  • AdminPort collision is logged but does NOT take down the proxy.
  • Hot-reload of AdminPort works (verified by adding a test in this phase or extending one of phase 06's e2e tests).
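
One way to satisfy the allocation-free counter gate is to keep plain long fields updated with Interlocked and allocate only when a snapshot is taken. The sketch below assumes that approach. PduCounters is a hypothetical name, and FcCounts is repeated from the public surface so the example compiles on its own.

```csharp
using System.Threading;

// Same shape as the record declared in this phase.
public sealed record FcCounts(long Fc03, long Fc04, long Fc06, long Fc16, long Other);

public sealed class PduCounters
{
    private long _fc03, _fc04, _fc06, _fc16, _other;

    // Hot path: one atomic increment, no allocation, no lock.
    public void Record(byte functionCode)
    {
        switch (functionCode)
        {
            case 0x03: Interlocked.Increment(ref _fc03); break;
            case 0x04: Interlocked.Increment(ref _fc04); break;
            case 0x06: Interlocked.Increment(ref _fc06); break;
            case 0x10: Interlocked.Increment(ref _fc16); break; // FC16 is 0x10
            default: Interlocked.Increment(ref _other); break;
        }
    }

    // Cold path: the only allocation is the returned record.
    public FcCounts Snapshot() => new(
        Interlocked.Read(ref _fc03),
        Interlocked.Read(ref _fc04),
        Interlocked.Read(ref _fc06),
        Interlocked.Read(ref _fc16),
        Interlocked.Read(ref _other));
}
```

Each field is read atomically, but the snapshot as a whole is not a single atomic cut across all counters. That is usually acceptable for a status page that refreshes every 5 s.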

Out of scope

  • Authentication / authorisation on the admin port. Design explicitly defers to network-layer trust.
  • Prometheus exposition format. The /status.json shape is the contract; downstream tools can transform.
  • WebSocket push of counters. Meta-refresh is good enough at 54 PLCs.
  • Historical counter retention (rolling windows, time series). Counters are cumulative since process start; restart resets.
  • Per-tag-level telemetry (which BCD addresses got rewritten how often). The per-PLC RewrittenSlots total is enough; finer granularity goes in a future phase if needed.

Notes for the subagent

  • Use the minimal-API style for the two endpoints; no controllers. The whole admin endpoint is ~50 lines of map / handler code.
  • System.Text.Json source generation needs a [JsonSerializable] attribute on the JsonSerializerContext naming the root DTO type; the nested DTO types are picked up transitively. Don't use reflection-based serialization in this codebase: it is not AOT-safe and it is slower for this simple shape.
  • For the HTML page, embed CSS in a <style> block. Do not link external stylesheets — the admin endpoint must work over a firewalled network with no internet egress.
  • Test 3 of AdminEndpointTests requires triggering a partial-BCD warning, which means configuring a 32-bit BCD tag and reading only one half of it through the proxy. This is the same scenario phase 04's e2e test 5 exercised; reuse the setup.
  • The admin port collision test is important: an operator misconfiguration must not take down the proxy itself. Log Error, continue running.
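
For orientation, the minimal-API wiring described in the first note could look roughly like the sketch below. It is not the planned AdminEndpointHost: AdminAppSketch and the two delegates are hypothetical stand-ins for the snapshot builder, HTML renderer, and JSON serializer, and binding with ListenAnyIP is an assumption, since this plan does not say which interface the admin port listens on. The project needs the Microsoft.AspNetCore.App framework reference for this to compile.

```csharp
using System;
using Microsoft.AspNetCore.Builder;
using Microsoft.AspNetCore.Hosting;
using Microsoft.AspNetCore.Http;

public static class AdminAppSketch
{
    // renderHtml / renderJson stand in for
    // StatusHtmlRenderer.Render(builder.Build()) and the source-generated serializer.
    public static WebApplication Build(int adminPort, Func<string> renderHtml, Func<string> renderJson)
    {
        var builder = WebApplication.CreateBuilder();

        // Assumption: listen on all interfaces; the plan only names the port.
        builder.WebHost.ConfigureKestrel(kestrel => kestrel.ListenAnyIP(adminPort));

        var app = builder.Build();

        // Exactly two read-only routes; nothing else is mapped.
        app.MapGet("/", () => Results.Content(renderHtml(), "text/html; charset=utf-8"));
        app.MapGet("/status.json", () => Results.Content(renderJson(), "application/json"));

        return app;
    }
}
```

The hosted service would call StartAsync on the returned app when the service starts and StopAsync when it stops or when AdminPort changes.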