wwtools/mbproxy/docs/plan/06-hot-reload.md
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00

# Phase 06 — Configuration hot-reload
Subscribe to `IOptionsMonitor<MbproxyOptions>.OnChange` and reconcile the running supervisors + per-PLC tag maps + connection settings against the new config — without restarting the host.
**Depends on:** Phase 05 (supervisor lifecycle).
**Parallel-safe with:** nothing (touches the widest cross-cut: supervisors + tag maps + counters + DI options).
## Goal
An `appsettings.json` save propagates per the design's reconcile table:
| Change | Action |
|--------|--------|
| `BcdTags.Global` add/remove/width | Rebuild every PLC's `BcdTagMap`, swap atomically. Next PDU sees it. |
| `Plcs[i].BcdTags.{Add,Remove}` | Rebuild that PLC's `BcdTagMap` only. |
| New `Plcs[i]` | Create supervisor + context, start it. |
| Removed `Plcs[i]` | Stop supervisor, close all client connections to it. |
| Changed `ListenPort` / `Host` | Stop + start the supervisor (remove + add semantics). |
| `Connection.Backend*TimeoutMs` | Take effect on the next backend connect / request. |
| Invalid reload | Reject as a whole; keep current state; log `mbproxy.config.reload.rejected`. |

Validation runs FIRST. A reload that would produce duplicate `ListenPort` values, or a `BcdTagMapBuilder.Build` error for any PLC, is rejected atomically before any state mutates.
## Outputs
```
src/Mbproxy/Configuration/ConfigReconciler.cs # OnChange handler; orchestrates the apply
src/Mbproxy/Configuration/ReloadValidator.cs # cross-PLC validation (duplicate ports, etc.)
src/Mbproxy/Configuration/ReloadPlan.cs # immutable diff record between current and new
tests/Mbproxy.Tests/Configuration/ReloadValidatorTests.cs
tests/Mbproxy.Tests/Configuration/ReloadPlanTests.cs
tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs
tests/Mbproxy.Tests/Configuration/HotReloadE2ETests.cs # real appsettings.json mutation, real host
```
Modifications:
- `src/Mbproxy/Proxy/ProxyWorker.cs` — accept a `ConfigReconciler` and forward `IOptionsMonitor.OnChange` to it; on startup, also seed the reconciler with the initial snapshot.
- `src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs` — expose a `Task ReplaceContextAsync(PerPlcContext newCtx, CancellationToken ct)` that atomically swaps the BCD tag map and counters without restarting the listener. Old in-flight connections finish on the old map; new connections use the new map. (Document the brief transition window in comments.)
- Add `mbproxy.config.reload.applied` and `mbproxy.config.reload.rejected` `[LoggerMessage]` events.
- `src/Mbproxy/Options/MbproxyOptions.cs` — wire `IValidateOptions<MbproxyOptions>` to call the schema-level validator only. Cross-PLC validation (duplicate ports, etc.) is handled by `ReloadValidator` because it requires inspecting multiple `Plcs[i]` together, which `IValidateOptions` doesn't naturally express.
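
One way to get the atomic swap described for `ReplaceContextAsync` is to keep the context immutable and publish it through a single reference. The following is a minimal sketch under that assumption; `ContextHolder` and the one-field `PerPlcContext` record are hypothetical stand-ins, not the project's real types.

```csharp
using System;
using System.Collections.Generic;
using System.Threading;

// Hypothetical, simplified stand-in for the real PerPlcContext.
public sealed record PerPlcContext(IReadOnlySet<int> BcdAddresses);

// Holds the current context behind one reference, so a swap is a single atomic publish.
public sealed class ContextHolder
{
    private PerPlcContext _current;

    public ContextHolder(PerPlcContext initial) => _current = initial;

    // A reader that captured an earlier reference keeps using that snapshot.
    public PerPlcContext Current => Volatile.Read(ref _current);

    // Publishes the new context and returns the one it replaced.
    public PerPlcContext Replace(PerPlcContext next) =>
        Interlocked.Exchange(ref _current, next);
}
```

Because readers only ever see a fully built context, there is no moment at which a connection can observe a half-updated tag map. The transition window is the time during which connections that captured the old reference are still running.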
## Tasks
1. **`ReloadPlan.cs`** — immutable record describing the diff:
```csharp
public sealed record ReloadPlan(
    IReadOnlyList<PlcOptions> ToAdd,
    IReadOnlyList<string> ToRemove,                            // PLC names
    IReadOnlyList<(string Name, PlcOptions New)> ToRestart,    // port or host changed
    IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat,   // tag map changed
    ConnectionOptions Connection);
```
Computed by a pure function `ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next)`; PLC identity is keyed on `Name` (NOT on `ListenPort`, which is mutable).
2. **`ReloadValidator.cs`** — single static method `Validate(MbproxyOptions next, out IReadOnlyList<string> errors)`:
- PLC names are unique and non-empty.
- `ListenPort` values are unique.
- For each PLC, `BcdTagMapBuilder.Build(global, perPlc).Errors` is empty.
- `AdminPort` doesn't collide with any `Plcs[i].ListenPort`.
- All ports are in `[1, 65535]`.
3. **`ConfigReconciler.cs`** — subscribes via constructor-injected `IOptionsMonitor<MbproxyOptions>.OnChange`. On change:
- Snapshot the new options.
- Run `ReloadValidator.Validate`. On failure: log `mbproxy.config.reload.rejected` with the error list, bump `service.config.reloadRejectedCount`, and leave all running state untouched.
- Compute `ReloadPlan` against the current snapshot.
- Apply the plan in order:
  1. Stop supervisors in `ToRemove` (concurrently).
  2. Stop+restart supervisors in `ToRestart` (concurrently).
  3. Build new `PerPlcContext` for each `ToReseat` entry and call `supervisor.ReplaceContextAsync(newCtx)`.
  4. Build supervisors for `ToAdd`, start them.
- On success: log `mbproxy.config.reload.applied` with summary (`PlcsAdded`, `PlcsRemoved`, `PlcsReseated`, `TagListDelta`). Record `lastReloadUtc` and bump `reloadCount` on a service-wide counter (consumed by phase 07).
- On any step throwing: best-effort log the partial-apply state at Error, then continue. The host stays up. (The validator should have caught most failure modes; a runtime failure here is a true bug.)
4. **`ProxyWorker.cs`** updates — register the reconciler with the host and wire startup to use it for the initial snapshot.
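
The name-keyed diff behind `ReloadPlan.Compute` can be sketched as follows. This is an illustration, not the real implementation: `PlcSpec`, `PlanSketch` and `PlanDiff` are hypothetical simplified types, `Tags` is a stand-in for a PLC's effective BCD tag list, and ordinal (case-sensitive) name comparison is an assumption.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, simplified stand-in for PlcOptions.
public sealed record PlcSpec(string Name, string Host, int ListenPort, string Tags);

// Hypothetical plan that carries PLC names only.
public sealed record PlanSketch(
    IReadOnlyList<string> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<string> ToRestart,
    IReadOnlyList<string> ToReseat);

public static class PlanDiff
{
    public static PlanSketch Compute(IEnumerable<PlcSpec> current, IEnumerable<PlcSpec> next)
    {
        // Identity is the PLC name, never the mutable listen port.
        var before = current.ToDictionary(p => p.Name, StringComparer.Ordinal);
        var after = next.ToDictionary(p => p.Name, StringComparer.Ordinal);

        var toAdd = after.Keys.Where(n => !before.ContainsKey(n))
                              .OrderBy(n => n, StringComparer.Ordinal).ToList();
        var toRemove = before.Keys.Where(n => !after.ContainsKey(n))
                                  .OrderBy(n => n, StringComparer.Ordinal).ToList();
        var toRestart = new List<string>();
        var toReseat = new List<string>();

        foreach (var (name, now) in after.OrderBy(kv => kv.Key, StringComparer.Ordinal))
        {
            if (!before.TryGetValue(name, out var was)) continue; // already in toAdd

            if (was.Host != now.Host || was.ListenPort != now.ListenPort)
                toRestart.Add(name);  // a restart rebuilds the context, so no separate reseat
            else if (was.Tags != now.Tags)
                toReseat.Add(name);   // tag map changed, listener untouched
        }

        return new PlanSketch(toAdd, toRemove, toRestart, toReseat);
    }
}
```

Keeping the function pure (no logging, no I/O) is what makes the `Compute_*` unit tests cheap: each one builds two option snapshots and asserts on the four lists.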
## Public surface declared in this phase
```csharp
namespace Mbproxy.Configuration;
internal sealed class ConfigReconciler : IDisposable {
    public ConfigReconciler(IOptionsMonitor<MbproxyOptions> monitor, /* dependencies */);
    public Task ApplyAsync(MbproxyOptions next, CancellationToken ct); // exposed for tests
    public void Dispose();
}

public sealed record ReloadPlan(
    IReadOnlyList<PlcOptions> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<(string Name, PlcOptions New)> ToRestart,
    IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat,
    ConnectionOptions Connection) {
    public static ReloadPlan Compute(MbproxyOptions current, MbproxyOptions next);
}

internal static class ReloadValidator {
    public static bool Validate(MbproxyOptions next, out IReadOnlyList<string> errors);
}
```
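
To illustrate the `bool` + `out errors` shape of `Validate`, here is a minimal sketch of the cross-PLC checks (names, ports, admin-port collision). It deliberately omits the `BcdTagMapBuilder.Build` check, and `PlcPorts` and `ReloadChecks` are hypothetical stand-ins for the real option types.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical, simplified stand-in for the parts of PlcOptions the checks need.
public sealed record PlcPorts(string Name, int ListenPort);

public static class ReloadChecks
{
    public static bool Validate(int adminPort, IReadOnlyList<PlcPorts> plcs,
                                out IReadOnlyList<string> errors)
    {
        var found = new List<string>();
        var names = new HashSet<string>(StringComparer.Ordinal);
        var ports = new HashSet<int>();

        if (adminPort < 1 || adminPort > 65535)
            found.Add($"AdminPort {adminPort} is outside [1, 65535].");

        foreach (var plc in plcs)
        {
            if (string.IsNullOrWhiteSpace(plc.Name))
                found.Add("A PLC has an empty name.");
            else if (!names.Add(plc.Name))
                found.Add($"Duplicate PLC name '{plc.Name}'.");

            if (plc.ListenPort < 1 || plc.ListenPort > 65535)
                found.Add($"ListenPort {plc.ListenPort} is outside [1, 65535].");
            else if (!ports.Add(plc.ListenPort))
                found.Add($"Duplicate ListenPort {plc.ListenPort}.");
        }

        if (ports.Contains(adminPort))
            found.Add($"AdminPort {adminPort} collides with a PLC ListenPort.");

        // Report every problem at once rather than stopping at the first.
        errors = found;
        return found.Count == 0;
    }
}
```

Collecting all errors in one pass means a single `mbproxy.config.reload.rejected` event can list everything wrong with the file, so the operator does not have to fix and re-save once per mistake.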
## Tests required
### Unit (`Category = Unit`)
`ReloadValidatorTests` (≥ 6 tests):
1. `Validate_DuplicatePlcName_Fails`
2. `Validate_DuplicateListenPort_Fails`
3. `Validate_AdminPortCollidesWith_PlcListenPort_Fails`
4. `Validate_PerPlc_BcdMapBuildError_Fails`
5. `Validate_PortOutOfRange_Fails`
6. `Validate_HappyPath_Passes`
`ReloadPlanTests` (≥ 5 tests):
1. `Compute_AddOnePlc_OnlyToAddPopulated`
2. `Compute_RemoveOnePlc_OnlyToRemovePopulated`
3. `Compute_ChangePort_GoesToToRestart_NotToReseat`
4. `Compute_ChangePerPlcTagOverride_GoesToToReseat`
5. `Compute_ChangeGlobalTagList_AllPlcsReseat_NoRestart`
`ConfigReconcilerTests` (≥ 4 tests, using a fake `IOptionsMonitor` + fake supervisor factory):
1. `Apply_HappyPath_StartsAndStopsSupervisors_PerPlan`
2. `Apply_ValidationFails_NoMutationOccurs_AndLogsRejected`
3. `Apply_ReseatTagMap_DoesNotRestartSupervisor`
4. `Apply_ConcurrentReloads_Are_Serialised` — two rapid changes get processed in order, no interleaving.
### E2E (`Category = E2E`)
`HotReloadE2ETests` (≥ 4 tests, using a real `Host.CreateApplicationBuilder` + temp appsettings.json file):
1. `E2E_AddPlcAtRuntime_NewListenerBinds_AndIsReachable` — start the host with one PLC, write a new appsettings adding a second PLC pointing at the simulator on a fresh listen port, drive NModbus against the new proxy port within 2 s.
2. `E2E_RemovePlcAtRuntime_ClosesUpstreamConnections` — start with two PLCs and a connected client, write appsettings removing one; client's socket closes within 1 s.
3. `E2E_ChangeGlobalBcdTagList_RewriteReflectsImmediately` — start with addr 1072 NOT in BCD list, read raw 0x1234. Write appsettings adding it. Read again, get decoded 1234.
4. `E2E_InvalidReload_DoesNotMutateRunningState` — start happy, write a broken appsettings (duplicate ListenPort), assert the host keeps running with the OLD config and `mbproxy.config.reload.rejected` is logged.
## Phase gate
- [ ] Zero-warnings build.
- [ ] All phase 00-05 tests still green.
- [ ] All new unit tests green.
- [ ] All E2E hot-reload tests green when the simulator is available.
- [ ] `mbproxy.config.reload.applied` / `.rejected` events match the design's properties list.
- [ ] A misconfigured reload (duplicate ports) is rejected atomically: `E2E_InvalidReload_DoesNotMutateRunningState` (E2E test 4) asserts that no partial mutation occurs.
- [ ] The reconciler serializes concurrent `OnChange` notifications (`SemaphoreSlim` or equivalent) so two file saves in quick succession don't race.
- [ ] Counters `service.config.reloadCount` and `service.config.reloadRejectedCount` are bumped correctly.
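
The serialisation gate could look roughly like this. It is a sketch only: `SerializedApplier` is a hypothetical name, and the `string` snapshot stands in for `MbproxyOptions`.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical wrapper: at most one apply runs at a time.
public sealed class SerializedApplier : IDisposable
{
    private readonly SemaphoreSlim _gate = new(1, 1);
    private readonly Func<string, CancellationToken, Task> _apply;

    public SerializedApplier(Func<string, CancellationToken, Task> apply) => _apply = apply;

    public async Task ApplyAsync(string snapshot, CancellationToken ct)
    {
        // Later callers wait here until the running apply has finished.
        await _gate.WaitAsync(ct).ConfigureAwait(false);
        try
        {
            await _apply(snapshot, ct).ConfigureAwait(false);
        }
        finally
        {
            _gate.Release(); // always release, even if the apply throws
        }
    }

    public void Dispose() => _gate.Dispose();
}
```

`SemaphoreSlim` guarantees mutual exclusion but does not document a FIFO order for waiters. If strict in-order processing matters, feeding snapshots through a single-reader queue is one alternative.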
## Out of scope
- Watching for files OTHER than `appsettings.json` (env files, dotnet user-secrets, etc.). The default config source set established in phase 00 is the contract.
- Reloading Serilog log levels at runtime. Possible but not in this phase.
- A reload audit log file. The accept/reject events are sufficient.
- Online schema migrations (e.g., renaming a key in an older config to a new one). Reject-the-whole-thing is the simpler contract.
## Notes for the subagent
- `IOptionsMonitor.OnChange` can fire MULTIPLE times for a single file save on some platforms (text editors saving via rename-and-replace can trigger 2-3 events). Debounce inside the reconciler — a 250 ms quiescent window after the last `OnChange` before computing the plan. Document the choice in code.
- The reconciler must NOT block the `OnChange` callback thread for I/O (`StopAsync` etc.). Use `Channel<ReloadRequest>` or a `Task.Run`-style hand-off so the callback returns immediately.
- While a supervisor restart is in progress (e.g., port changed), there are two options: briefly reject further reloads and queue a "retry after the current apply finishes", or serialise everything through a single semaphore and accept that a backed-up reload queue applies every change eventually. Pick the simpler option (the semaphore) and document it.
- `BcdTagMapBuilder.Build` is the validator for tag-list well-formedness; do not duplicate that validation in `ReloadValidator`. The validator just calls `Build` and checks the `Errors` list.
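
The debounce and hand-off notes above can be combined in one small pump: the `OnChange` callback only posts to a channel, and a background loop waits for a quiet window before applying the most recent snapshot. This is a hedged sketch; `DebouncedReloadPump`, `Post`, `Complete` and `RunAsync` are hypothetical names, and the window (250 ms in the note) is passed in by the caller.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

// Hypothetical helper: coalesces bursts of change notifications into one apply.
public sealed class DebouncedReloadPump<T>
{
    private readonly Channel<T> _queue = Channel.CreateUnbounded<T>(
        new UnboundedChannelOptions { SingleReader = true });
    private readonly TimeSpan _quietWindow;
    private readonly Func<T, CancellationToken, Task> _apply;

    public DebouncedReloadPump(TimeSpan quietWindow, Func<T, CancellationToken, Task> apply)
    {
        _quietWindow = quietWindow;
        _apply = apply;
    }

    // Safe to call from the OnChange callback: it never blocks and does no I/O.
    public void Post(T snapshot) => _queue.Writer.TryWrite(snapshot);

    // Lets RunAsync drain what is queued and then return.
    public void Complete() => _queue.Writer.TryComplete();

    public async Task RunAsync(CancellationToken ct)
    {
        var reader = _queue.Reader;
        while (await reader.WaitToReadAsync(ct).ConfigureAwait(false))
        {
            if (!reader.TryRead(out var latest)) continue;

            // Keep replacing 'latest' until nothing new arrives for one quiet window.
            while (true)
            {
                using var quiet = CancellationTokenSource.CreateLinkedTokenSource(ct);
                quiet.CancelAfter(_quietWindow);
                try
                {
                    if (!await reader.WaitToReadAsync(quiet.Token).ConfigureAwait(false))
                        break; // writer completed, nothing more will arrive
                    while (reader.TryRead(out var newer)) latest = newer;
                }
                catch (OperationCanceledException) when (!ct.IsCancellationRequested)
                {
                    break; // quiet window elapsed without a new snapshot
                }
            }

            await _apply(latest, ct).ConfigureAwait(false);
        }
    }
}
```

Because a single loop both debounces and applies, applies are serialised here as a side effect. The note above still prefers an explicit semaphore, which protects `ApplyAsync` when tests call it directly.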