wwtools/mbproxy/docs/plan/08-service-hardening.md
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00

# Phase 08 — Windows service hardening
Install / uninstall scripts, graceful shutdown, Windows Event Log integration, and the public-facing `README.md` that the root `wwtools/CLAUDE.md` index points at. This is the "ship it" phase.
**Depends on:** Phase 04 (rewriter), Phase 07 (status page).
**Parallel-safe with:** nothing.
## Goal
After this phase, an operator can:
1. `dotnet publish` the service into a self-contained folder.
2. Run `install.ps1` to register it as a Windows service.
3. See it appear in `services.msc` running as `Local System` (default — overridable to a managed service account).
4. Stop it cleanly via `sc.exe stop mbproxy`; the service finishes all in-flight PDUs and exits within 10 s.
5. Read crash reasons from the Windows Event Log alongside the Serilog rolling-file output.
6. Read [`../../mbproxy/README.md`](../../mbproxy/README.md) to figure all of this out without needing to talk to a developer.
## Outputs
```
mbproxy/README.md # tool-level human entry point (per DOCS-GUIDE Layer 2)
mbproxy/install/install.ps1 # registers the service
mbproxy/install/uninstall.ps1 # removes it
mbproxy/install/mbproxy.config.template.json # commented appsettings.json for ops
mbproxy/docs/operations.md # ops runbook (install, upgrade, troubleshooting)
src/Mbproxy/Diagnostics/ShutdownCoordinator.cs # graceful-shutdown helper
src/Mbproxy/Diagnostics/EventLogBridge.cs # logs critical events to Windows Event Log
tests/Mbproxy.Tests/Diagnostics/ShutdownCoordinatorTests.cs
```
Modifications:
- `src/Mbproxy/Program.cs` — wire `ShutdownCoordinator` into the host-stop signal. Wire `EventLogBridge` as a Serilog sub-sink for events at Error and above when running under Windows Service (`WindowsServiceHelpers.IsWindowsService()` true).
- `mbproxy/Mbproxy.csproj`: set `<PublishSingleFile>true</PublishSingleFile>` and `<SelfContained>true</SelfContained>` for the publish profile.
- `../CLAUDE.md` (the root `wwtools/CLAUDE.md`) — update the `mbproxy` index row to point at the new `mbproxy/README.md` (per the maintenance note in `mbproxy/CLAUDE.md`).
- `mbproxy/CLAUDE.md` — update the "Current state" section to reflect the post-implementation state (no longer "no code yet"), and the Maintenance section to note that the README is now the canonical human entry point.
## Tasks
1. **`mbproxy/README.md`** — follows the DOCS-GUIDE Layer-2 template exactly. Required sections in order: one-sentence identification, hard constraints / prerequisites, layout, resource index, build & run, install. Cross-link to `docs/design.md`, `docs/plan/README.md`, `docs/operations.md`, `CLAUDE.md`. No deep prose tutorials; the README routes.
2. **`mbproxy/install/install.ps1`** — parameters: `-InstallPath <path>` (default `C:\Program Files\Mbproxy`), `-ServiceName <name>` (default `mbproxy`), `-DisplayName <text>`, `-Account <managed-service-account>` (default `LocalSystem`). Behaviour:
- Verifies admin rights; fails with a clear message if not elevated.
- Copies the publish output (passed via `-PublishOutput <path>`) to `InstallPath`.
- Runs `sc.exe create <ServiceName> binPath= "<InstallPath>\Mbproxy.exe" start= auto displayName= "<DisplayName>" obj= <Account>`.
- Sets the failure-action policy: restart after 60 s on first/second failure, no restart on subsequent (`sc.exe failure ...`).
- Creates `%ProgramData%\mbproxy\logs\` with appropriate ACLs.
- Copies `mbproxy.config.template.json` to `%ProgramData%\mbproxy\appsettings.json` if no config exists.
- Optionally starts the service if `-Start` flag is passed.
3. **`mbproxy/install/uninstall.ps1`** — stops the service if running, `sc.exe delete <ServiceName>`, removes `InstallPath` (with `-KeepConfig` flag to preserve `%ProgramData%\mbproxy\appsettings.json`).
4. **`mbproxy/install/mbproxy.config.template.json`** — a fully commented `appsettings.json` showing the full schema with example values and inline `//` comments describing every field. (Use `appsettings.jsonc` semantics; the .NET JSON configuration provider skips `//` comments when it parses the file, so the commented template loads as-is.)
5. **`ShutdownCoordinator.cs`** — orchestrates graceful shutdown on `IHostApplicationLifetime.ApplicationStopping`:
- Stop accepting new upstream connections on all `PlcListenerSupervisor`s.
- Wait for in-flight PDUs to complete with a `10 s` deadline (configurable via `Connection.GracefulShutdownTimeoutMs`, default 10000).
- Stop the admin endpoint.
- Cancel all remaining work. Log `mbproxy.shutdown.complete` with `InFlightAtCancel` count.
6. **`EventLogBridge.cs`** — adds a Serilog sub-sink that writes events with level >= Error to the Windows Event Log under source `mbproxy`. Only enabled when running as a Windows Service. The install script creates the event source.
7. **`mbproxy/docs/operations.md`** — operations runbook:
- Install / uninstall steps (mirror to `README.md`).
- Upgrade procedure (stop service, copy new binaries, start).
- Where logs live, how to roll them, retention defaults.
- Common failure modes (port already in use, PLC unreachable, BCD validation reject) with the relevant log event names and what to check.
- The `services.msc` / `sc.exe` / `Get-Service` commands operators will actually use.
- How to safely edit `appsettings.json` for hot-reload (with the rejection-keeps-old-config promise).
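Task 4 depends on a commented template still being machine-readable. The following is a minimal sketch of that idea using `System.Text.Json` with comment skipping enabled. The class name `CommentedConfigSketch` and the JSON shape are illustrative assumptions; only `Connection.GracefulShutdownTimeoutMs` is named in this plan, and the real service would load the file through its configuration pipeline rather than parse it by hand.

```csharp
using System;
using System.Text.Json;

// Sketch only: the JSON shape is illustrative, not the real mbproxy schema.
public static class CommentedConfigSketch
{
    public const string Template = @"{
  // Settings for backend and upstream connections.
  ""Connection"": {
    // Drain deadline used during graceful shutdown, in milliseconds.
    ""GracefulShutdownTimeoutMs"": 10000
  }
}";

    // Parses JSON that contains // comments and returns the drain deadline.
    public static int ReadGracefulShutdownTimeoutMs(string json)
    {
        var options = new JsonDocumentOptions
        {
            CommentHandling = JsonCommentHandling.Skip, // ignore comments while parsing
            AllowTrailingCommas = true
        };
        using JsonDocument doc = JsonDocument.Parse(json, options);
        return doc.RootElement
            .GetProperty("Connection")
            .GetProperty("GracefulShutdownTimeoutMs")
            .GetInt32();
    }
}
```

A quick check of this kind in the install tests would catch a template that ops can no longer load after someone edits the comments.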
## Public surface declared in this phase
```csharp
namespace Mbproxy.Diagnostics;

internal sealed class ShutdownCoordinator
{
    // Drains in-flight PDUs for up to timeoutMs, then cancels remaining work.
    public Task ShutdownAsync(int timeoutMs, CancellationToken hostCt);
}

internal sealed class EventLogBridge { /* Serilog sub-sink */ }
```
No additional public types are needed; all surfaces from previous phases remain stable.
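The drain-then-cancel behaviour behind `ShutdownAsync` can be sketched as below. This is a sketch under stated assumptions, not the real implementation: `DrainSketch`, `DrainAsync`, and its parameters are hypothetical, and the actual coordinator would also stop listeners and the admin endpoint around this step.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading;
using System.Threading.Tasks;

// Sketch only: DrainAsync is a hypothetical helper, not the ShutdownCoordinator API.
public static class DrainSketch
{
    // Waits up to timeoutMs for the in-flight tasks to finish.
    // If any are still running at the deadline, cancels the shared token source.
    // Returns the number of tasks still running (the InFlightAtCancel count).
    public static async Task<int> DrainAsync(
        IReadOnlyCollection<Task> inFlight, CancellationTokenSource workCts, int timeoutMs)
    {
        Task all = Task.WhenAll(inFlight);
        await Task.WhenAny(all, Task.Delay(timeoutMs));

        int stillRunning = inFlight.Count(t => !t.IsCompleted);
        if (stillRunning > 0)
        {
            workCts.Cancel(); // deadline hit: cancel all remaining work
        }
        return stillRunning;
    }
}
```

Returning the count rather than logging it inside the helper keeps the sketch easy to assert on in a unit test; the caller would attach it to the `mbproxy.shutdown.complete` event.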
## Tests required
### Unit (`Category = Unit`)
`ShutdownCoordinatorTests` (≥ 4 tests):
1. `Shutdown_NoActiveConnections_CompletesImmediately`
2. `Shutdown_OneActiveConnection_WaitsForCompletion`
3. `Shutdown_TimeoutExceeded_CancelsRemainingWork_AndReportsCount`
4. `Shutdown_AdminEndpointStopped_AfterListenersStopped` — ordering test.
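One way to make the ordering test straightforward is to have the shutdown steps run strictly in sequence and record the order they finished in. The sketch below illustrates that shape; `OrderedShutdownSketch`, the step names, and the delegate-based design are all hypothetical and are not taken from the actual `ShutdownCoordinator`.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Sketch only: step names and the delegate-based shape are hypothetical.
public static class OrderedShutdownSketch
{
    // Runs the shutdown steps one after another and returns their completion order.
    public static async Task<IReadOnlyList<string>> RunAsync(
        IEnumerable<(string Name, Func<Task> Step)> steps)
    {
        var completed = new List<string>();
        foreach (var (name, step) in steps)
        {
            await step();        // the next step starts only after this one finishes
            completed.Add(name);
        }
        return completed;
    }
}
```

An ordering test could then pass in fake steps for the listeners, the drain, and the admin endpoint, and assert on the returned list.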
### E2E (`Category = E2E`)
`ShutdownE2ETests` (≥ 2 tests, against simulator):
1. `E2E_StopHost_WithConnectedClient_DrainsCleanlyWithin10s` — start host, connect NModbus, issue 5 back-to-back FC03 reads, signal host stop, assert all 5 complete and the client's TCP socket is closed cleanly.
2. `E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout` — same but with a `Connection.BackendRequestTimeoutMs` that exceeds the shutdown deadline; assert shutdown completes within the deadline and the in-flight request was cancelled.
### Manual / smoke
- Install the service via `install.ps1` on a clean test VM; confirm it appears in `services.msc` with `Local System` identity.
- `sc.exe start mbproxy` — service starts, admin endpoint at `http://localhost:8080/` shows the proxy is up.
- Send `sc.exe stop mbproxy` — service stops within 10 s.
- Trigger a crash by killing the process with Task Manager, then confirm an entry appears in Windows Event Log under source `mbproxy`. (Corrupting `appsettings.json` and reloading is not a useful crash trigger, because a bad config is rejected gracefully.)
- Run `uninstall.ps1`; the service is removed cleanly, and `%ProgramData%\mbproxy\` is preserved only if `-KeepConfig` was passed.
The manual smoke results go into `docs/operations.md` as a "first install" verification checklist.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-07 tests still green.
- [ ] All new unit tests green.
- [ ] All e2e shutdown tests green.
- [ ] `mbproxy/README.md` exists, follows the DOCS-GUIDE Layer-2 template, and routes into deep docs without duplicating their content.
- [ ] Root `wwtools/CLAUDE.md` index row for `mbproxy` points at `mbproxy/README.md` (was previously pointing into the design plan or the bare folder).
- [ ] `install.ps1` and `uninstall.ps1` are idempotent — re-running install when the service already exists is a clean no-op or update, not a hard error.
- [ ] Windows Event Log source is created during install and removed during uninstall.
- [ ] `dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true /p:PublishSingleFile=true` produces a single executable under 50 MB.
- [ ] Manual smoke checklist in `docs/operations.md` has been executed on at least one fresh VM and the result documented.
## Out of scope
- Linux / Docker packaging. The design fixes Windows Service as the deployment target.
- Centralised log aggregation (Splunk forwarder config, Elastic agent, etc.). Document where the logs are; let ops integrate.
- A signed installer (MSI / setup.exe). PowerShell-driven install is the contract; an MSI can be added later if procurement demands it.
- Metric exposition for Prometheus / OpenTelemetry. The status page's `/status.json` is sufficient for the operational needs declared in the design.
## Notes for the subagent
- The Windows Event Log source creation requires admin rights — that's already a precondition for `install.ps1`. Do not try to create the source at runtime from the service itself (it would fail when the service runs as a non-admin account).
- Single-file publish makes `Assembly.GetExecutingAssembly().Location` return an empty string. If `AssemblyVersionAccessor` (phase 07) used that, swap to `Assembly.GetExecutingAssembly().GetCustomAttribute<AssemblyInformationalVersionAttribute>()?.InformationalVersion`.
- The `mbproxy/README.md` is what an operator reads first. Be ruthless about length — aim for under 100 lines. The DOCS-GUIDE says routes, not tutorials.
- After this phase merges, the project is feature-complete against [`../design.md`](../design.md). Any further work belongs in a NEW design revision (dated, in the same doc) and a new phase plan.
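
The single-file note above can be sketched as follows. This is only an illustration: `VersionSketch` is a hypothetical name, the fallback chain is an assumption, and the real `AssemblyVersionAccessor` from phase 07 may be shaped differently.

```csharp
using System;
using System.Reflection;

// Sketch only: the class name and fallback order are assumptions.
public static class VersionSketch
{
    // Reads the version from assembly metadata instead of the file path,
    // so it does not depend on Assembly.Location being populated.
    public static string GetInformationalVersion(Assembly assembly)
    {
        var attr = assembly.GetCustomAttribute<AssemblyInformationalVersionAttribute>();
        return attr?.InformationalVersion
            ?? assembly.GetName().Version?.ToString()
            ?? "unknown";
    }
}
```

Calling `VersionSketch.GetInformationalVersion(Assembly.GetExecutingAssembly())` from the status page code would give the same answer in both folder-based and single-file deployments.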