mbproxy: initial commit through Phase 9 (TxId multiplexing)

Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-14 01:49:35 -04:00
parent 2e937228a0
commit 56eee3c563
105 changed files with 18430 additions and 0 deletions
@@ -0,0 +1,147 @@
# Phase 07 — Status page
Stand up the read-only Kestrel-hosted admin endpoint on `Mbproxy.AdminPort`. Two routes — `GET /` (self-contained HTML, meta-refresh 5 s) and `GET /status.json` (the same data as JSON). No admin actions, no auth.
**Depends on:** Phase 05 (supervisor snapshots), Phase 06 (config reload counters).
**Parallel-safe with:** nothing (touches DI registration + needs counters from both 05 and 06).
## Goal
A single port that an operator can open in a browser and see, at a glance:
- Service uptime, version, last-reload timestamp + counts.
- Every configured PLC's listener state (`bound` / `recovering` / `stopped`), last bind error, currently connected clients and their per-client PDU counts, PDU counts by function code, BCD slots rewritten, partial-overlap warnings, backend exception counts by code, last round-trip ms, bytes upstream/downstream.
Same data is exposed as `/status.json` for scraping (Prometheus textfile, custom Nagios check, etc.).
## Outputs
```
src/Mbproxy/Admin/AdminEndpointHost.cs # owns the Kestrel server lifecycle
src/Mbproxy/Admin/StatusSnapshotBuilder.cs # composes per-PLC + service-wide snapshots
src/Mbproxy/Admin/StatusDto.cs # the wire DTOs for /status.json
src/Mbproxy/Admin/StatusHtmlRenderer.cs # builds the single-page HTML
src/Mbproxy/Admin/AssemblyVersionAccessor.cs # cached version string
tests/Mbproxy.Tests/Admin/StatusSnapshotBuilderTests.cs
tests/Mbproxy.Tests/Admin/AdminEndpointTests.cs # HTTP-level; live Kestrel + HttpClient
```
Modifications:
- `src/Mbproxy/Mbproxy.csproj` — add `Microsoft.AspNetCore.App` framework reference (the Worker SDK doesn't include ASP.NET Core by default).
- `src/Mbproxy/Program.cs` — register `AdminEndpointHost` as a hosted service; wire it through DI alongside the proxy worker. AdminPort comes from `IOptionsMonitor<MbproxyOptions>`.
- `src/Mbproxy/Proxy/ProxyCounters.cs` — extend with per-client counters: `IReadOnlyList<ClientCounterSnapshot> Snapshot()` includes connected clients with `Remote`, `ConnectedAtUtc`, `PdusForwarded`, `LastRoundTripMs`.
- `src/Mbproxy/Proxy/PlcConnectionPair.cs` — record connect time, expose `RemoteEndpoint`, and track per-request round-trip time, smoothed as an EWMA and exposed through the `LastRoundTripMs` field.
- Service-wide counters introduced here: `ServiceCounters` with `UptimeStartedAtUtc`, `LastReloadUtc`, `ReloadCount`, `ReloadRejectedCount`. Wired into `ConfigReconciler` (bump on apply / reject) and the service start path (set started-at).
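The per-client counter changes above split into a hot path and a cold path. The forwarding path bumps counters without allocating, and `Snapshot()` allocates only on the cold path. Below is a minimal sketch of that split. `ClientCounters`, the 0.2 smoothing factor, and the NaN "no sample yet" sentinel are assumptions for illustration. Only the `ClientCounterSnapshot` field set comes from this plan.

```csharp
using System;
using System.Threading;

// Cold-path view; field set as listed for ProxyCounters.Snapshot().
public sealed record ClientCounterSnapshot(
    string Remote, DateTimeOffset ConnectedAtUtc, long PdusForwarded, double LastRoundTripMs);

// Hypothetical per-client holder: the Record* methods never allocate.
public sealed class ClientCounters
{
    private const double Alpha = 0.2; // assumed EWMA smoothing factor
    private readonly string _remote;
    private readonly DateTimeOffset _connectedAtUtc;
    private long _pdusForwarded;
    // The double is stored as raw bits so it can be swapped atomically; NaN means "no sample yet".
    private long _rttBits = BitConverter.DoubleToInt64Bits(double.NaN);

    public ClientCounters(string remote, DateTimeOffset connectedAtUtc)
    {
        _remote = remote;
        _connectedAtUtc = connectedAtUtc;
    }

    // Hot path: a single interlocked increment.
    public void RecordPduForwarded() => Interlocked.Increment(ref _pdusForwarded);

    // Hot path: fold one round-trip sample into the EWMA with a compare-and-swap loop.
    public void RecordRoundTrip(double sampleMs)
    {
        while (true)
        {
            long oldBits = Interlocked.Read(ref _rttBits);
            double old = BitConverter.Int64BitsToDouble(oldBits);
            double next = double.IsNaN(old) ? sampleMs : old + Alpha * (sampleMs - old);
            long nextBits = BitConverter.DoubleToInt64Bits(next);
            if (Interlocked.CompareExchange(ref _rttBits, nextBits, oldBits) == oldBits)
            {
                return;
            }
        }
    }

    // Cold path: the only allocating call.
    public ClientCounterSnapshot Snapshot()
    {
        double rtt = BitConverter.Int64BitsToDouble(Interlocked.Read(ref _rttBits));
        return new ClientCounterSnapshot(
            _remote, _connectedAtUtc, Interlocked.Read(ref _pdusForwarded), double.IsNaN(rtt) ? 0 : rtt);
    }
}
```

The first sample seeds the average rather than starting from zero, so `LastRoundTripMs` is meaningful right after a client connects. `Interlocked` has no `Add` for `double`, so the sketch keeps the bits in a `long` and updates them with a compare-and-swap loop.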
## Tasks
1. **`StatusDto.cs`** — record types matching the design's per-PLC + service-wide field tables verbatim. Use `System.Text.Json` source generation (`JsonSerializerContext`) to keep the response allocation-light:
```csharp
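using System.Text.Json.Serialization;
[JsonSourceGenerationOptions(PropertyNamingPolicy = JsonKnownNamingPolicy.CamelCase)] // design field names are camelCase (`listener.state`)
[JsonSerializable(typeof(StatusResponse))]
internal partial class StatusJsonContext : JsonSerializerContext;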
```
2. **`StatusSnapshotBuilder.cs`** — pulls from injected `ProxyWorker` (or a slim view of it), `ConfigReconciler`, `ServiceCounters`, and each `PlcListenerSupervisor`. Builds a `StatusResponse` record. Pure logic; no I/O. The builder is a `sealed` class constructed once via DI; calling `Build()` is the only operation.
3. **`StatusHtmlRenderer.cs`** — pure function `string Render(StatusResponse status)`. Produces a single HTML document with:
- `<meta http-equiv="refresh" content="5">` for auto-refresh.
- A header line with service version + uptime + last-reload info.
- One table with a row per PLC. Columns match the per-PLC field set; `listener.state` is colour-coded inline (CSS in a `<style>` block, no external assets).
- Total page weight under 50 KB for typical fleets; at the design's 54 PLCs the table is ~54 rows.
4. **`AssemblyVersionAccessor.cs`** — reads `AssemblyInformationalVersionAttribute` once at startup, caches it as a string. Used for the `service.version` field.
5. **`AdminEndpointHost.cs`** — `IHostedService` that:
- On start: builds a `WebApplication` (Kestrel) configured to listen on `AdminPort`. Maps `GET /` to a handler that calls `StatusSnapshotBuilder.Build()` then `StatusHtmlRenderer.Render()`, returning `text/html`. Maps `GET /status.json` to a handler returning `JsonSerializer.Serialize(snapshot, StatusJsonContext.Default.StatusResponse)`. NO other routes.
- If `AdminPort` is in use at startup: log `mbproxy.admin.bind.failed` (new event) at Error, do not throw. The proxy listeners continue to run; only the admin endpoint is missing. Operators see this in logs.
- On hot-reload of `AdminPort`: stop and restart the Kestrel server bound to the new port.
- On stop: call `StopAsync()` on the Kestrel app for a graceful shutdown, bounded by a 2 s deadline.
6. **`ServiceCounters.cs`** (under `src/Mbproxy/`) — a singleton DI service holding the service-wide counters. `Initialize(DateTimeOffset startedAtUtc)`; `RecordReloadApplied(DateTimeOffset)`; `RecordReloadRejected()`. Snapshot returns an immutable record.
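Task 6 could look roughly like the sketch below. Reloads are rare, so it uses a plain lock instead of `Interlocked`. The lock also keeps `LastReloadUtc` and `ReloadCount` consistent with each other inside one snapshot. The `ServiceCountersSnapshot` record name is hypothetical. The method names follow the task list.

```csharp
using System;

// Hypothetical name for the immutable snapshot record.
public sealed record ServiceCountersSnapshot(
    DateTimeOffset UptimeStartedAtUtc, DateTimeOffset? LastReloadUtc, int ReloadCount, int ReloadRejectedCount);

// Singleton DI service; every mutation and the snapshot take the same lock.
public sealed class ServiceCounters
{
    private readonly object _gate = new();
    private DateTimeOffset _startedAtUtc;
    private DateTimeOffset? _lastReloadUtc;
    private int _reloadCount;
    private int _reloadRejectedCount;

    // Called once from the service start path.
    public void Initialize(DateTimeOffset startedAtUtc)
    {
        lock (_gate) _startedAtUtc = startedAtUtc;
    }

    // Called by ConfigReconciler when a reload is applied.
    public void RecordReloadApplied(DateTimeOffset appliedAtUtc)
    {
        lock (_gate)
        {
            _lastReloadUtc = appliedAtUtc;
            _reloadCount++;
        }
    }

    // Called by ConfigReconciler when a reload is rejected.
    public void RecordReloadRejected()
    {
        lock (_gate) _reloadRejectedCount++;
    }

    public ServiceCountersSnapshot Snapshot()
    {
        lock (_gate) return new ServiceCountersSnapshot(_startedAtUtc, _lastReloadUtc, _reloadCount, _reloadRejectedCount);
    }
}
```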
## Public surface declared in this phase
```csharp
namespace Mbproxy.Admin;
internal sealed class AdminEndpointHost : IHostedService { /* ... */ }
public sealed record StatusResponse(
    ServiceFields Service,
    ListenersAggregate Listeners,
    IReadOnlyList<PlcStatus> Plcs);
public sealed record ServiceFields(
    long UptimeSeconds, string Version,
    DateTimeOffset? ConfigLastReloadUtc, int ConfigReloadCount, int ConfigReloadRejectedCount);
public sealed record ListenersAggregate(int Bound, int Configured);
public sealed record PlcStatus(
    string Name, string Host, int ListenPort,
    PlcListenerStatus Listener,
    PlcClientsStatus Clients,
    PlcPdusStatus Pdus,
    PlcBackendStatus Backend,
    PlcBytesStatus Bytes);
public sealed record PlcListenerStatus(string State, string? LastBindError, int RecoveryAttempts);
public sealed record PlcClientsStatus(int Connected, IReadOnlyList<ClientSnapshot> RemoteEndpoints);
public sealed record ClientSnapshot(string Remote, DateTimeOffset ConnectedAtUtc, long PdusForwarded);
public sealed record PlcPdusStatus(long Forwarded, FcCounts ByFc, long RewrittenSlots, long PartialBcdWarnings);
public sealed record FcCounts(long Fc03, long Fc04, long Fc06, long Fc16, long Other);
public sealed record PlcBackendStatus(long ConnectsSuccess, long ConnectsFailed, ExceptionCounts ExceptionsByCode, double LastRoundTripMs);
public sealed record ExceptionCounts(long Code01, long Code02, long Code03, long Code04);
public sealed record PlcBytesStatus(long UpstreamIn, long UpstreamOut);
```
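The `ServiceFields` values could be derived along the lines of the sketch below. It assumes the entry assembly is passed to `AssemblyVersionAccessor` and that the current time is passed in, so uptime is testable. The constructor shape, the `"unknown"` fallback, and `ServiceFieldsFactory` are illustrative, not declared surface. `ServiceFields` is repeated so the block compiles on its own.

```csharp
using System;
using System.Reflection;

public sealed record ServiceFields(
    long UptimeSeconds, string Version,
    DateTimeOffset? ConfigLastReloadUtc, int ConfigReloadCount, int ConfigReloadRejectedCount);

// Hypothetical shape: the attribute is read once in the constructor and cached.
public sealed class AssemblyVersionAccessor
{
    public AssemblyVersionAccessor(Assembly assembly)
    {
        Version = assembly.GetCustomAttribute<AssemblyInformationalVersionAttribute>()?.InformationalVersion
                  ?? assembly.GetName().Version?.ToString()
                  ?? "unknown";
    }

    public string Version { get; }
}

// Hypothetical helper that assembles the service-wide block of /status.json.
public static class ServiceFieldsFactory
{
    public static ServiceFields Create(
        DateTimeOffset startedAtUtc, DateTimeOffset nowUtc, string version,
        DateTimeOffset? lastReloadUtc, int reloadCount, int reloadRejectedCount)
    {
        // Whole seconds since start, clamped so a clock step backwards never reports negative uptime.
        long uptime = Math.Max(0L, (long)(nowUtc - startedAtUtc).TotalSeconds);
        return new ServiceFields(uptime, version, lastReloadUtc, reloadCount, reloadRejectedCount);
    }
}
```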
## Tests required
### Unit (`Category = Unit`)
`StatusSnapshotBuilderTests` (≥ 6 tests):
1. `Build_NoPlcsConfigured_ReturnsEmptyPlcList`
2. `Build_OnePlcBound_PopulatesListenerState_Bound`
3. `Build_PlcRecovering_PopulatesLastBindError_AndAttempts`
4. `Build_AggregatesListenersBoundAndConfigured`
5. `Build_PerClientSnapshot_Includes_RemoteAndConnectedAt_AndPduCount`
6. `Build_ServiceFields_IncludeUptime_Version_AndLastReload`
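Because `Build()` is pure, tests 1 to 4 reduce to feeding it supervisor snapshots and comparing records. The sketch below covers the listener part. `SupervisorSnapshot` is a hypothetical stand-in for whatever `PlcListenerSupervisor` exposes. The state strings follow the `bound` / `recovering` / `stopped` vocabulary from the Goal section.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Subset of the declared DTOs needed for this sketch.
public sealed record ListenersAggregate(int Bound, int Configured);
public sealed record PlcListenerStatus(string State, string? LastBindError, int RecoveryAttempts);

// Hypothetical input: what a supervisor snapshot might carry.
public sealed record SupervisorSnapshot(string Name, string State, string? LastBindError, int RecoveryAttempts);

public static class ListenerAggregation
{
    // Every configured PLC counts toward Configured; only "bound" listeners count toward Bound.
    public static ListenersAggregate Aggregate(IReadOnlyList<SupervisorSnapshot> supervisors) =>
        new(Bound: supervisors.Count(s => s.State == "bound"), Configured: supervisors.Count);

    // Straight field mapping into the per-PLC listener DTO.
    public static PlcListenerStatus ToStatus(SupervisorSnapshot s) =>
        new(s.State, s.LastBindError, s.RecoveryAttempts);
}
```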
`StatusHtmlRendererTests` (≥ 3 tests):
1. `Render_OnePlc_ProducesValidHtml_WithMetaRefresh`
2. `Render_RecoveringPlc_HighlightsState`
3. `Render_PageWeightUnder50KB_For54Plcs` — assert character length.
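A `StringBuilder` sketch of the renderer follows. It uses a flattened, hypothetical `PlcRow` in place of the full `PlcStatus` tree. It shows the meta-refresh, the inline `<style>` block, the listener state reused as a CSS class for colour coding, and HTML-encoding of config-supplied strings. The output is ASCII, so `string.Length` is a fair proxy for byte weight in a test like `Render_PageWeightUnder50KB_For54Plcs`.

```csharp
using System;
using System.Collections.Generic;
using System.Net;
using System.Text;

// Hypothetical flattened row; the real renderer takes the whole StatusResponse.
public sealed record PlcRow(string Name, string Host, int ListenPort, string State, int ClientsConnected, long PdusForwarded);

public static class StatusHtmlSketch
{
    public static string Render(string version, long uptimeSeconds, IReadOnlyList<PlcRow> rows)
    {
        var sb = new StringBuilder(capacity: 4096 + rows.Count * 256);
        sb.Append("<!DOCTYPE html><html><head><meta charset=\"utf-8\">")
          .Append("<meta http-equiv=\"refresh\" content=\"5\"><title>mbproxy</title><style>")
          .Append("td,th{padding:2px 6px}.bound{color:#070}.recovering{color:#b60;font-weight:bold}.stopped{color:#a00}")
          .Append("</style></head><body>")
          .Append("<p>mbproxy ").Append(WebUtility.HtmlEncode(version))
          .Append(" &middot; up ").Append(uptimeSeconds).Append(" s</p>")
          .Append("<table><tr><th>PLC</th><th>Host</th><th>Port</th><th>State</th><th>Clients</th><th>PDUs</th></tr>");
        foreach (var r in rows)
        {
            // Config-derived strings are HTML-encoded; the state doubles as the CSS class.
            string state = WebUtility.HtmlEncode(r.State);
            sb.Append("<tr><td>").Append(WebUtility.HtmlEncode(r.Name))
              .Append("</td><td>").Append(WebUtility.HtmlEncode(r.Host))
              .Append("</td><td>").Append(r.ListenPort)
              .Append("</td><td class=\"").Append(state).Append("\">").Append(state)
              .Append("</td><td>").Append(r.ClientsConnected)
              .Append("</td><td>").Append(r.PdusForwarded)
              .Append("</td></tr>");
        }
        return sb.Append("</table></body></html>").ToString();
    }
}
```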
### E2E (`Category = E2E`)
`AdminEndpointTests` (≥ 5 tests, against a live in-process Kestrel + simulator):
1. `Get_StatusJson_ReturnsValidShape`
2. `Get_StatusJson_AfterReadFC03_ShowsPduCountIncreased`
3. `Get_StatusJson_AfterPartialBcdWrite_ShowsPartialBcdWarning`
4. `Get_Root_ReturnsHtml_WithMetaRefresh`
5. `AdminPort_BindFailure_ServiceStaysUp_AndLogsBindFailed` — pre-bind the AdminPort, start the service, assert proxy listeners come up and the admin endpoint logs the failure.
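Test 5 depends on the admin bind failing cleanly. The sketch below reproduces the pre-bind with a raw `TcpListener` and shows the catch-log-continue shape. It is an illustration only. The real host binds through Kestrel, which typically surfaces the collision as an `IOException` from `StartAsync` rather than a bare `SocketException`. `AdminBindSketch` and the log message format are hypothetical.

```csharp
using System;
using System.Net;
using System.Net.Sockets;

public static class AdminBindSketch
{
    // Tries to bind the admin port. On a collision it reports through logError and returns null
    // instead of throwing, so the caller can leave the proxy listeners running.
    public static TcpListener? TryBind(int port, Action<string> logError)
    {
        var listener = new TcpListener(IPAddress.Loopback, port);
        try
        {
            listener.Start();
            return listener;
        }
        catch (SocketException ex) when (ex.SocketErrorCode == SocketError.AddressAlreadyInUse)
        {
            listener.Stop(); // release the half-initialised socket
            logError($"mbproxy.admin.bind.failed port={port} error={ex.SocketErrorCode}");
            return null;
        }
    }
}
```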
## Phase gate
- [ ] Zero-warnings build.
- [ ] All tests from phases 00–06 still green.
- [ ] All new unit + e2e tests green.
- [ ] `/status.json` shape matches the field tables in [`../design.md`](../design.md) → "Status page" exactly (field names, casing, nesting).
- [ ] Counters on the read path (`PdusForwarded`, etc.) remain allocation-free; `Snapshot()` is the only allocating call and it's on the cold path.
- [ ] AdminPort collision is logged but does NOT take down the proxy.
- [ ] Hot-reload of `AdminPort` works (verified by adding a test in this phase or extending one of phase 06's e2e tests).
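The hot-reload gate, together with task 5's restart-on-reload and 2 s stop rules, can be exercised without Kestrel. The sketch below assumes a small coordinator invoked from an options-change callback, such as one registered with `IOptionsMonitor<MbproxyOptions>.OnChange`. The two delegates stand in for "stop the current app" and "start one on this port". `AdminPortReloader` and its members are hypothetical. The semaphore serialises callbacks, because a single config edit can raise more than one change notification.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical coordinator: restarts the admin endpoint only when the port actually changes.
public sealed class AdminPortReloader
{
    private static readonly TimeSpan StopDeadline = TimeSpan.FromSeconds(2);
    private readonly SemaphoreSlim _gate = new(1, 1);
    private readonly Func<CancellationToken, Task> _stop;
    private readonly Func<int, CancellationToken, Task> _start;
    private int? _currentPort;

    public AdminPortReloader(Func<CancellationToken, Task> stop, Func<int, CancellationToken, Task> start)
    {
        _stop = stop;
        _start = start;
    }

    // Number of stop-then-start cycles (the initial start is not counted).
    public int Restarts { get; private set; }

    public async Task OnPortChangedAsync(int newPort, CancellationToken ct = default)
    {
        await _gate.WaitAsync(ct);
        try
        {
            if (_currentPort == newPort)
            {
                return; // unchanged port: nothing to do
            }
            if (_currentPort is not null)
            {
                // Graceful stop, bounded by the 2 s deadline.
                using var deadline = new CancellationTokenSource(StopDeadline);
                await _stop(deadline.Token);
                Restarts++;
            }
            await _start(newPort, ct);
            _currentPort = newPort;
        }
        finally
        {
            _gate.Release();
        }
    }
}
```

This sketch does not cover a bind failure on the new port. Under task 5's rules, that case would be logged as `mbproxy.admin.bind.failed` and would leave the proxy running.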
## Out of scope
- Authentication / authorisation on the admin port. Design explicitly defers to network-layer trust.
- Prometheus exposition format. The `/status.json` shape is the contract; downstream tools can transform.
- WebSocket push of counters. Meta-refresh is good enough at 54 PLCs.
- Historical counter retention (rolling windows, time series). Counters are cumulative since process start; restart resets.
- Per-tag-level telemetry (which BCD addresses got rewritten how often). The per-PLC `RewrittenSlots` total is enough; finer granularity goes in a future phase if needed.
## Notes for the subagent
- Use the minimal-API style for the two endpoints; no controllers. The whole admin endpoint is ~50 lines of map / handler code.
- `System.Text.Json` source generation needs `[JsonSerializable]` on the DTO chain. Don't use reflection-based serialization in this codebase — it adds AOT-unsafety and is slower for the simple shape.
- For the HTML page, embed CSS in a `<style>` block. Do not link external stylesheets — the admin endpoint must work over a firewalled network with no internet egress.
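- Test 3 of `AdminEndpointTests` requires triggering a partial-BCD warning, which means configuring a 32-bit BCD tag and writing only one half of it through the proxy (for example, an FC06 to one register of the pair), as the test name `Get_StatusJson_AfterPartialBcdWrite_ShowsPartialBcdWarning` implies. This is the same scenario phase 04's e2e test 5 exercised; reuse the setup.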
- The admin port collision test is important: an operator misconfiguration must not take down the proxy itself. Log Error, continue running.
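For test 3, it helps to be precise about what counts as "one half". The sketch below is a hypothetical restatement of the overlap condition; the real check lives in phase 04's rewrite code. It assumes a 32-bit BCD tag occupies two consecutive 16-bit registers starting at the tag address. The register addresses in any usage are made up.

```csharp
using System;

public static class PartialBcdSketch
{
    // Assumed layout: a 32-bit BCD tag occupies registers tagStart and tagStart + 1.
    // A request covering [reqStart, reqStart + reqCount) partially overlaps the tag
    // when it touches exactly one of those two registers.
    public static bool IsPartialOverlap(int tagStart, int reqStart, int reqCount)
    {
        int reqEnd = reqStart + reqCount; // exclusive
        bool coversLow = tagStart >= reqStart && tagStart < reqEnd;
        bool coversHigh = tagStart + 1 >= reqStart && tagStart + 1 < reqEnd;
        return coversLow ^ coversHigh;
    }
}
```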