wwtools/mbproxy/docs/plan/08-service-hardening.md
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00


Phase 08 — Windows service hardening

Install / uninstall scripts, graceful shutdown, Windows Event Log integration, and the public-facing README.md that the root wwtools/CLAUDE.md index points at. This is the "ship it" phase.

Depends on: Phase 04 (rewriter), Phase 07 (status page). Parallel-safe with: nothing.

Goal

After this phase, an operator can:

  1. dotnet publish the service into a self-contained folder.
  2. Run install.ps1 to register it as a Windows service.
  3. See it appear in services.msc running as Local System (default — overridable to a managed service account).
  4. Stop it cleanly via sc.exe stop mbproxy; the service finishes all in-flight PDUs and exits within 10 s.
  5. Read crash reasons from the Windows Event Log alongside the Serilog rolling-file output.
  6. Read ../../mbproxy/README.md to figure all of this out without needing to talk to a developer.

Outputs

mbproxy/README.md                            # tool-level human entry point (per DOCS-GUIDE Layer 2)
mbproxy/install/install.ps1                  # registers the service
mbproxy/install/uninstall.ps1                # removes it
mbproxy/install/mbproxy.config.template.json # commented appsettings.json for ops
mbproxy/docs/operations.md                   # ops runbook (install, upgrade, troubleshooting)

src/Mbproxy/Diagnostics/ShutdownCoordinator.cs   # graceful-shutdown helper
src/Mbproxy/Diagnostics/EventLogBridge.cs        # logs critical events to Windows Event Log

tests/Mbproxy.Tests/Diagnostics/ShutdownCoordinatorTests.cs

Modifications:

  • src/Mbproxy/Program.cs — wire ShutdownCoordinator into the host-stop signal. Wire EventLogBridge as a Serilog sub-sink for events at Error and above when running under Windows Service (WindowsServiceHelpers.IsWindowsService() true).
  • mbproxy/Mbproxy.csproj: set <PublishSingleFile>true</PublishSingleFile> and <SelfContained>true</SelfContained> for the publish profile.
  • ../CLAUDE.md (the root wwtools/CLAUDE.md) — update the mbproxy index row to point at the new mbproxy/README.md (per the maintenance note in mbproxy/CLAUDE.md).
  • mbproxy/CLAUDE.md — update the "Current state" section to reflect the post-implementation state (no longer "no code yet"), and the Maintenance section to note that the README is now the canonical human entry point.

Tasks

  1. mbproxy/README.md — follows the DOCS-GUIDE Layer-2 template exactly. Required sections in order: one-sentence identification, hard constraints / prerequisites, layout, resource index, build & run, install. Cross-link to docs/design.md, docs/plan/README.md, docs/operations.md, CLAUDE.md. No deep prose tutorials; the README routes.
  2. mbproxy/install/install.ps1 — parameters: -InstallPath <path> (default C:\Program Files\Mbproxy), -ServiceName <name> (default mbproxy), -DisplayName <text>, -Account <managed-service-account> (default LocalSystem). Behaviour:
    • Verifies admin rights; fails with a clear message if not elevated.
    • Copies the publish output (passed via -PublishOutput <path>) to InstallPath.
    • Runs sc.exe create <ServiceName> binPath= "<InstallPath>\Mbproxy.exe" start= auto displayName= "<DisplayName>" obj= <Account>.
    • Sets the failure-action policy: restart after 60 s on first/second failure, no restart on subsequent (sc.exe failure ...).
    • Creates %ProgramData%\mbproxy\logs\ with appropriate ACLs.
    • Copies mbproxy.config.template.json to %ProgramData%\mbproxy\appsettings.json if no config exists.
    • Optionally starts the service if -Start flag is passed.
  3. mbproxy/install/uninstall.ps1 — stops the service if running, sc.exe delete <ServiceName>, removes InstallPath (with -KeepConfig flag to preserve %ProgramData%\mbproxy\appsettings.json).
  4. mbproxy/install/mbproxy.config.template.json — a fully commented appsettings.json showing the full schema with example values and inline // comments describing every field. (Use appsettings.jsonc semantics; .NET's configuration loader tolerates // comments when configured to.)
  5. ShutdownCoordinator.cs — orchestrates graceful shutdown on IHostApplicationLifetime.ApplicationStopping:
    • Stop accepting new upstream connections on all PlcListenerSupervisors.
    • Wait for in-flight PDUs to complete with a 10 s deadline (configurable via Connection.GracefulShutdownTimeoutMs, default 10000).
    • Stop the admin endpoint.
    • Cancel all remaining work. Log mbproxy.shutdown.complete with InFlightAtCancel count.
  6. EventLogBridge.cs — adds a Serilog sub-sink that writes events with level >= Error to the Windows Event Log under source mbproxy. Only enabled when running as a Windows Service. The install script creates the event source.
  7. mbproxy/docs/operations.md — operations runbook:
    • Install / uninstall steps (mirror to README.md).
    • Upgrade procedure (stop service, copy new binaries, start).
    • Where logs live, how to roll them, retention defaults.
    • Common failure modes (port already in use, PLC unreachable, BCD validation reject) with the relevant log event names and what to check.
    • The services.msc / sc.exe / Get-Service commands operators will actually use.
    • How to safely edit appsettings.json for hot-reload (with the rejection-keeps-old-config promise).
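The shutdown ordering in task 5 can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the planned implementation: the class name, the two stop delegates, the polling interval, and the Task<int> return type (used here to surface the InFlightAtCancel count) are all hypothetical.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch of the four shutdown steps. Only the ShutdownAsync
// name and its (timeoutMs, hostCt) parameters come from the plan.
internal sealed class ShutdownCoordinatorSketch
{
    private readonly Func<Task> _stopListeners;      // assumed hook: stop accepting upstream connections
    private readonly Func<Task> _stopAdminEndpoint;  // assumed hook: stop the admin endpoint
    private readonly CancellationTokenSource _forceCancel = new();
    private int _inFlight;

    public ShutdownCoordinatorSketch(Func<Task> stopListeners, Func<Task> stopAdminEndpoint)
    {
        _stopListeners = stopListeners;
        _stopAdminEndpoint = stopAdminEndpoint;
    }

    // Token the request pipeline would observe to abandon remaining work.
    public CancellationToken ForceCancelToken => _forceCancel.Token;

    public void RequestStarted() => Interlocked.Increment(ref _inFlight);
    public void RequestCompleted() => Interlocked.Decrement(ref _inFlight);

    // Returns the number of PDUs still in flight when work was cancelled
    // (0 means a clean drain).
    public async Task<int> ShutdownAsync(int timeoutMs, CancellationToken hostCt)
    {
        // Step 1: stop accepting new upstream connections.
        await _stopListeners();

        // Step 2: wait for in-flight PDUs, bounded by the deadline.
        using var deadline = CancellationTokenSource.CreateLinkedTokenSource(hostCt);
        deadline.CancelAfter(timeoutMs);
        try
        {
            while (Volatile.Read(ref _inFlight) > 0)
                await Task.Delay(25, deadline.Token);
        }
        catch (OperationCanceledException)
        {
            // Deadline reached or host token fired: continue to the cancel step.
        }

        // Step 3: stop the admin endpoint only after the listeners.
        await _stopAdminEndpoint();

        // Step 4: cancel whatever is left and report the count.
        int inFlightAtCancel = Volatile.Read(ref _inFlight);
        _forceCancel.Cancel();
        return inFlightAtCancel;
    }
}
```

Passing the stop actions as delegates keeps the ordering observable from a unit test, which is what Shutdown_AdminEndpointStopped_AfterListenersStopped needs; the real coordinator may well take the supervisors directly.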

Public surface declared in this phase

namespace Mbproxy.Diagnostics;

internal sealed class ShutdownCoordinator {
    public Task ShutdownAsync(int timeoutMs, CancellationToken hostCt);
}

internal sealed class EventLogBridge { /* Serilog sub-sink */ }

No additional public types are needed; all surfaces from previous phases remain stable.
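For orientation, one possible shape of the EventLogBridge sub-sink is sketched below. It assumes the Serilog and System.Diagnostics.EventLog packages are referenced and that install.ps1 has already registered the mbproxy event source; the class name and the swallow-on-failure policy are illustrative choices, not something the plan prescribes.

```csharp
using System;
using System.Diagnostics;
using Serilog.Core;
using Serilog.Events;

// Sketch only. Assumes the "mbproxy" event source already exists
// (created by install.ps1); the sink never tries to create it.
internal sealed class EventLogBridgeSketch : ILogEventSink
{
    private const string Source = "mbproxy";

    public void Emit(LogEvent logEvent)
    {
        // Only Error and above are forwarded to the Windows Event Log.
        if (logEvent.Level < LogEventLevel.Error) return;

        // The Event Log API is Windows-only.
        if (!OperatingSystem.IsWindows()) return;

        try
        {
            EventLog.WriteEntry(Source, logEvent.RenderMessage(), EventLogEntryType.Error);
        }
        catch (Exception)
        {
            // Assumption: a logging failure must never take the service down,
            // so write errors (for example a missing source) are swallowed.
        }
    }
}
```

Registration could then look like `.WriteTo.Sink(new EventLogBridgeSketch(), restrictedToMinimumLevel: LogEventLevel.Error)`, added to the logger configuration only when WindowsServiceHelpers.IsWindowsService() returns true.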

Tests required

Unit (Category = Unit)

ShutdownCoordinatorTests (≥ 4 tests):

  1. Shutdown_NoActiveConnections_CompletesImmediately
  2. Shutdown_OneActiveConnection_WaitsForCompletion
  3. Shutdown_TimeoutExceeded_CancelsRemainingWork_AndReportsCount
  4. Shutdown_AdminEndpointStopped_AfterListenersStopped — ordering test.

E2E (Category = E2E)

ShutdownE2ETests (≥ 2 tests, against simulator):

  1. E2E_StopHost_WithConnectedClient_DrainsCleanlyWithin10s — start host, connect NModbus, issue 5 back-to-back FC03 reads, signal host stop, assert all 5 complete and the client's TCP socket is closed cleanly.
  2. E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout — same but with a Connection.BackendRequestTimeoutMs that exceeds the shutdown deadline; assert shutdown completes within the deadline and the in-flight request was cancelled.

Manual / smoke

  • Install the service via install.ps1 on a clean test VM; confirm it appears in services.msc with Local System identity.
  • sc.exe start mbproxy — service starts, admin endpoint at http://localhost:8080/ shows the proxy is up.
  • Send sc.exe stop mbproxy — service stops within 10 s.
  • Trigger a crash by killing the process with Task Manager (corrupting appsettings.json and reloading does not work as a crash trigger, because a bad reload is rejected gracefully). Confirm an entry appears in the Windows Event Log under source mbproxy.
  • Run uninstall.ps1; the service is removed cleanly, and %ProgramData%\mbproxy\ is preserved only when -KeepConfig was passed.

The manual smoke results go into docs/operations.md as a "first install" verification checklist.

Phase gate

  • Zero-warnings build.
  • All phase 00-07 tests still green.
  • All new unit tests green.
  • All e2e shutdown tests green.
  • mbproxy/README.md exists, follows the DOCS-GUIDE Layer-2 template, and routes into deep docs without duplicating their content.
  • Root wwtools/CLAUDE.md index row for mbproxy points at mbproxy/README.md (was previously pointing into the design plan or the bare folder).
  • install.ps1 and uninstall.ps1 are idempotent — re-running install when the service already exists is a clean no-op or update, not a hard error.
  • Windows Event Log source is created during install and removed during uninstall.
  • dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true /p:PublishSingleFile=true produces a single executable under 50 MB.
  • Manual smoke checklist in docs/operations.md has been executed on at least one fresh VM and the result documented.

Out of scope

  • Linux / Docker packaging. The design fixes Windows Service as the deployment target.
  • Centralised log aggregation (Splunk forwarder config, Elastic agent, etc.). Document where the logs are; let ops integrate.
  • A signed installer (MSI / setup.exe). PowerShell-driven install is the contract; an MSI can be added later if procurement demands it.
  • Metric exposition for Prometheus / OpenTelemetry. The status page's /status.json is sufficient for the operational needs declared in the design.

Notes for the subagent

  • The Windows Event Log source creation requires admin rights — that's already a precondition for install.ps1. Do not try to create the source at runtime from the service itself (it would fail when the service runs as a non-admin account).
  • Single-file publish makes Assembly.GetExecutingAssembly().Location return an empty string. If AssemblyVersionAccessor (phase 07) used that, swap to Assembly.GetExecutingAssembly().GetCustomAttribute<AssemblyInformationalVersionAttribute>()?.InformationalVersion.
  • The mbproxy/README.md is what an operator reads first. Be ruthless about length — aim for under 100 lines. The DOCS-GUIDE says routes, not tutorials.
  • After this phase merges, the project is feature-complete against ../design.md. Any further work belongs in a NEW design revision (dated, in the same doc) and a new phase plan.
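
The single-file note above could translate into something like the following. This is a small sketch: the class and method names and the "unknown" fallback are assumptions, and the real AssemblyVersionAccessor from phase 07 may have a different shape.

```csharp
using System.Reflection;

// Illustrative only: reads the version from assembly metadata instead of
// from the file path, so it keeps working under single-file publish.
internal static class VersionAccessorSketch
{
    public static string GetInformationalVersion()
    {
        // Does not touch Assembly.Location, which is empty in a
        // single-file bundle.
        var attribute = Assembly.GetExecutingAssembly()
            .GetCustomAttribute<AssemblyInformationalVersionAttribute>();

        // Fallback value is an assumption for assemblies without the attribute.
        return attribute?.InformationalVersion ?? "unknown";
    }
}
```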