# Phase 08: Windows service hardening

Install / uninstall scripts, graceful shutdown, Windows Event Log integration, and the public-facing `README.md` that the root `wwtools/CLAUDE.md` index points at. This is the "ship it" phase.

**Depends on:** Phase 04 (rewriter), Phase 07 (status page).
**Parallel-safe with:** nothing.

## Goal

After this phase, an operator can:

1. `dotnet publish` the service into a self-contained folder.
2. Run `install.ps1` to register it as a Windows service.
3. See it appear in `services.msc` running as `Local System` (the default, overridable to a managed service account).
4. Stop it cleanly via `sc.exe stop mbproxy`; the service drains its in-flight PDUs and exits within 10 s.
5. Read crash reasons from the Windows Event Log alongside the Serilog rolling-file output.
6. Read [`../../mbproxy/README.md`](../../mbproxy/README.md) and work all of this out without talking to a developer.

## Outputs

```
mbproxy/README.md                               # tool-level human entry point (per DOCS-GUIDE Layer 2)
mbproxy/install/install.ps1                     # registers the service
mbproxy/install/uninstall.ps1                   # removes it
mbproxy/install/mbproxy.config.template.json    # commented appsettings.json for ops
mbproxy/docs/operations.md                      # ops runbook (install, upgrade, troubleshooting)

src/Mbproxy/Diagnostics/ShutdownCoordinator.cs  # graceful-shutdown helper
src/Mbproxy/Diagnostics/EventLogBridge.cs       # logs critical events to Windows Event Log

tests/Mbproxy.Tests/Diagnostics/ShutdownCoordinatorTests.cs
```

Modifications:

- `src/Mbproxy/Program.cs`: wire `ShutdownCoordinator` into the host-stop signal. Wire `EventLogBridge` as a Serilog sub-sink for events at Error and above when running as a Windows service (`WindowsServiceHelpers.IsWindowsService()` returns true).
- `src/Mbproxy/Mbproxy.csproj`: set `<PublishSingleFile>true</PublishSingleFile>` and `<SelfContained>true</SelfContained>` for the publish profile.
- `../CLAUDE.md` (the root `wwtools/CLAUDE.md`): update the `mbproxy` index row to point at the new `mbproxy/README.md` (per the maintenance note in `mbproxy/CLAUDE.md`).
- `mbproxy/CLAUDE.md`: update the "Current state" section to reflect the post-implementation state (no longer "no code yet"), and the Maintenance section to note that the README is now the canonical human entry point.

## Tasks

1. **`mbproxy/README.md`**: follows the DOCS-GUIDE Layer-2 template exactly. Required sections in order: one-sentence identification, hard constraints / prerequisites, layout, resource index, build & run, install. Cross-link to `docs/design.md`, `docs/plan/README.md`, `docs/operations.md`, `CLAUDE.md`. No deep prose tutorials; the README routes.
2. **`mbproxy/install/install.ps1`**: parameters are `-PublishOutput <path>` (the `dotnet publish` output to install), `-InstallPath <path>` (default `C:\Program Files\Mbproxy`), `-ServiceName <name>` (default `mbproxy`), `-DisplayName <text>`, `-Account <managed-service-account>` (default `LocalSystem`), and an optional `-Start` switch. Behaviour:
   - Verifies admin rights; fails with a clear message if not elevated.
   - Copies the publish output (passed via `-PublishOutput <path>`) to `InstallPath`.
   - Runs `sc.exe create <ServiceName> binPath= "<InstallPath>\Mbproxy.exe" start= auto displayName= "<DisplayName>" obj= <Account>`.
   - Sets the failure-action policy: restart after 60 s on the first and second failure, no restart on subsequent failures (`sc.exe failure ...`).
   - Creates the Windows Event Log source `mbproxy` (see task 6).
   - Creates `%ProgramData%\mbproxy\logs\` with appropriate ACLs.
   - Copies `mbproxy.config.template.json` to `%ProgramData%\mbproxy\appsettings.json` if no config exists.
   - Starts the service if the `-Start` flag is passed.
3. **`mbproxy/install/uninstall.ps1`**: stops the service if running, runs `sc.exe delete <ServiceName>`, removes the Windows Event Log source, and removes `InstallPath` (pass `-KeepConfig` to preserve `%ProgramData%\mbproxy\appsettings.json`).
4. **`mbproxy/install/mbproxy.config.template.json`**: a fully commented `appsettings.json` showing the full schema with example values and inline `//` comments describing every field. (Use `appsettings.jsonc` semantics; .NET's configuration loader tolerates `//` comments when configured to.)
5. **`ShutdownCoordinator.cs`**: orchestrates graceful shutdown on `IHostApplicationLifetime.ApplicationStopping`, in this order:
   - Stop accepting new upstream connections on all `PlcListenerSupervisor`s.
   - Wait for in-flight PDUs to complete with a `10 s` deadline (configurable via `Connection.GracefulShutdownTimeoutMs`, default 10000).
   - Stop the admin endpoint.
   - Cancel all remaining work. Log `mbproxy.shutdown.complete` with the `InFlightAtCancel` count.
6. **`EventLogBridge.cs`**: adds a Serilog sub-sink that writes events with level >= Error to the Windows Event Log under source `mbproxy`. Only enabled when running as a Windows service. The install script creates the event source.
7. **`mbproxy/docs/operations.md`**: operations runbook:
   - Install / uninstall steps (mirror to `README.md`).
   - Upgrade procedure (stop service, copy new binaries, start).
   - Where logs live, how to roll them, retention defaults.
   - Common failure modes (port already in use, PLC unreachable, BCD validation reject) with the relevant log event names and what to check.
   - The `services.msc` / `sc.exe` / `Get-Service` commands operators will actually use.
   - How to safely edit `appsettings.json` for hot-reload (with the rejection-keeps-old-config promise).

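
The shutdown ordering in task 5 can be sketched as below. This is a minimal illustration under stated assumptions, not the real `ShutdownCoordinator`: the class name `ShutdownSketch` and its delegate parameters are hypothetical stand-ins for the listener supervisors, the in-flight PDU tracker, and the admin endpoint, and it returns the `InFlightAtCancel` count instead of logging it (the declared surface returns a plain `Task`).

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical sketch, not the real ShutdownCoordinator: the delegates stand in
// for the listener supervisors, the in-flight PDU tracker, and the admin endpoint.
public sealed class ShutdownSketch
{
    private readonly Func<Task> _stopListeners;
    private readonly Func<Task> _drainInFlight;   // completes once no PDUs remain
    private readonly Func<int> _inFlightCount;
    private readonly Func<Task> _stopAdmin;
    private readonly CancellationTokenSource _remainingWork;

    public ShutdownSketch(
        Func<Task> stopListeners,
        Func<Task> drainInFlight,
        Func<int> inFlightCount,
        Func<Task> stopAdmin,
        CancellationTokenSource remainingWork)
    {
        _stopListeners = stopListeners;
        _drainInFlight = drainInFlight;
        _inFlightCount = inFlightCount;
        _stopAdmin = stopAdmin;
        _remainingWork = remainingWork;
    }

    // Returns the InFlightAtCancel count that the real coordinator would log.
    public async Task<int> ShutdownAsync(int timeoutMs, CancellationToken hostCt)
    {
        // 1. Stop accepting new upstream connections.
        await _stopListeners();

        // 2. Wait for in-flight PDUs, but never longer than the deadline.
        Task drain = _drainInFlight();
        using var timer = CancellationTokenSource.CreateLinkedTokenSource(hostCt);
        await Task.WhenAny(drain, Task.Delay(timeoutMs, timer.Token));
        timer.Cancel(); // release the delay timer if the drain finished first

        // 3. Stop the admin endpoint only after the listeners are down.
        await _stopAdmin();

        // 4. Cancel whatever is left and report how much that was.
        int inFlightAtCancel = drain.IsCompletedSuccessfully ? 0 : _inFlightCount();
        _remainingWork.Cancel();
        return inFlightAtCancel;
    }
}
```

Racing the drain task against `Task.Delay` with `Task.WhenAny` keeps the deadline independent of how well the in-flight work honours cancellation, which is the case the timeout test is meant to cover.
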
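
For the commented config template in task 4, a quick syntax check might look like the sketch below. It uses `System.Text.Json` directly with comment skipping enabled, so it only shows that a `//`-commented file parses as a JSON object; it is not the configuration loader itself, and the helper name `TemplateCheck` is hypothetical.

```csharp
using System.Text.Json;

// Hypothetical helper: checks that a commented template is syntactically valid JSON
// once // comments and trailing commas are tolerated.
public static class TemplateCheck
{
    public static bool ParsesWithComments(string text)
    {
        var options = new JsonDocumentOptions
        {
            CommentHandling = JsonCommentHandling.Skip, // ignore // and /* */ comments
            AllowTrailingCommas = true,
        };

        try
        {
            using JsonDocument doc = JsonDocument.Parse(text, options);
            // A settings file is expected to be a JSON object at the root.
            return doc.RootElement.ValueKind == JsonValueKind.Object;
        }
        catch (JsonException)
        {
            return false; // malformed even with comments allowed
        }
    }
}
```

A check like this could run in a unit test against `mbproxy.config.template.json` so that a stray edit to the template is caught before it ships to operators.
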
## Public surface declared in this phase

```csharp
namespace Mbproxy.Diagnostics;

// Surface declaration only; method bodies are omitted.
internal sealed class ShutdownCoordinator {
    // Drains in-flight work for up to timeoutMs, then cancels whatever remains.
    public Task ShutdownAsync(int timeoutMs, CancellationToken hostCt);
}

internal sealed class EventLogBridge { /* Serilog sub-sink */ }
```

No additional public types are needed; all surfaces from previous phases remain stable.

## Tests required

### Unit (`Category = Unit`)

`ShutdownCoordinatorTests` (≥ 4 tests):

1. `Shutdown_NoActiveConnections_CompletesImmediately`
2. `Shutdown_OneActiveConnection_WaitsForCompletion`
3. `Shutdown_TimeoutExceeded_CancelsRemainingWork_AndReportsCount`
4. `Shutdown_AdminEndpointStopped_AfterListenersStopped` (ordering test).

### E2E (`Category = E2E`)

`ShutdownE2ETests` (≥ 2 tests, against simulator):

1. `E2E_StopHost_WithConnectedClient_DrainsCleanlyWithin10s`: start the host, connect NModbus, issue 5 back-to-back FC03 reads, signal host stop, then assert that all 5 complete and the client's TCP socket is closed cleanly.
2. `E2E_StopHost_DuringInFlightRequest_CancelsAfterTimeout`: same setup, but with a `Connection.BackendRequestTimeoutMs` that exceeds the shutdown deadline; assert that shutdown completes within the deadline and the in-flight request was cancelled.

### Manual / smoke

- Install the service via `install.ps1` on a clean test VM; confirm it appears in `services.msc` with `Local System` identity.
- `sc.exe start mbproxy`: the service starts, and the admin endpoint at `http://localhost:8080/` shows the proxy is up.
- `sc.exe stop mbproxy`: the service stops within 10 s.
- Trigger a crash by killing the process with Task Manager (corrupting `appsettings.json` and reloading is not a crash trigger, because the bad config is rejected gracefully); confirm an entry appears in the Windows Event Log under source `mbproxy`. Note that a hard kill gives the process no chance to write its own entry, so the record may come from the Service Control Manager instead.
- `uninstall.ps1`: the service is removed cleanly; `%ProgramData%\mbproxy\` is preserved only if `-KeepConfig` was passed.

The manual smoke results go into `docs/operations.md` as a "first install" verification checklist.

## Phase gate

- [ ] Zero-warnings build.
- [ ] All phase 00–07 tests still green.
- [ ] All new unit tests green.
- [ ] All E2E shutdown tests green.
- [ ] `mbproxy/README.md` exists, follows the DOCS-GUIDE Layer-2 template, and routes into deep docs without duplicating their content.
- [ ] Root `wwtools/CLAUDE.md` index row for `mbproxy` points at `mbproxy/README.md` (was previously pointing into the design plan or the bare folder).
- [ ] `install.ps1` and `uninstall.ps1` are idempotent: re-running install when the service already exists is a clean no-op or update, not a hard error.
- [ ] Windows Event Log source is created during install and removed during uninstall.
- [ ] `dotnet publish src/Mbproxy/Mbproxy.csproj -c Release -r win-x64 --self-contained true /p:PublishSingleFile=true` produces a single executable under 50 MB.
- [ ] Manual smoke checklist in `docs/operations.md` has been executed on at least one fresh VM and the result documented.

## Out of scope

- Linux / Docker packaging. The design fixes Windows Service as the deployment target.
- Centralised log aggregation (Splunk forwarder config, Elastic agent, etc.). Document where the logs are; let ops integrate.
- A signed installer (MSI / setup.exe). PowerShell-driven install is the contract; an MSI can be added later if procurement demands it.
- Metric exposition for Prometheus / OpenTelemetry. The status page's `/status.json` is sufficient for the operational needs declared in the design.

## Notes for the subagent

- Creating the Windows Event Log source requires admin rights, which is already a precondition for `install.ps1`. Do not try to create the source at runtime from the service itself (it would fail when the service runs as a non-admin account).
- Single-file publish makes `Assembly.GetExecutingAssembly().Location` empty. If `AssemblyVersionAccessor` (phase 07) used that, swap to `Assembly.GetExecutingAssembly().GetCustomAttribute<AssemblyInformationalVersionAttribute>()` and read its `InformationalVersion` property.
- The `mbproxy/README.md` is what an operator reads first. Be ruthless about length and aim for under 100 lines. The DOCS-GUIDE says routes, not tutorials.
- After this phase merges, the project is feature-complete against [`../design.md`](../design.md). Any further work belongs in a NEW design revision (dated, in the same doc) and a new phase plan.

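
The single-file note above can be sketched as follows. This is a hedged illustration, not the actual `AssemblyVersionAccessor` from phase 07: the class name `VersionSketch` and the fallback chain are assumptions made for the example.

```csharp
using System.Reflection;

// Hypothetical sketch: reads the version from assembly metadata instead of the
// file path, so it still works when Assembly.Location is an empty string.
public static class VersionSketch
{
    public static string GetInformationalVersion(Assembly assembly)
    {
        var attribute = assembly.GetCustomAttribute<AssemblyInformationalVersionAttribute>();

        // Fall back to the plain assembly version, then to a placeholder,
        // so callers such as a status page always get a non-empty string.
        return attribute?.InformationalVersion
            ?? assembly.GetName().Version?.ToString()
            ?? "unknown";
    }
}
```

A caller would typically pass `Assembly.GetExecutingAssembly()` or `typeof(SomeType).Assembly`; neither depends on the executable's location on disk.
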