wwtools/mbproxy/docs/plan/06-hot-reload.md
Joseph Doherty 56eee3c563 mbproxy: initial commit through Phase 9 (TxId multiplexing)
Adds the mbproxy service end-to-end. Phases 00-08 implement the
production-ready single-listener / 1:1-backend transparent Modbus TCP
proxy with bidirectional BCD rewriting for the ~54-PLC DL205/DL260
fleet. Phase 9 replaces the connection layer with a single backend
socket per PLC plus MBAP TxId rewriting, lifting the H2-ECOM100's
4-concurrent-client cap as an operational ceiling.

Phase 9 additions of note:
- PlcMultiplexer + UpstreamPipe + TxIdAllocator + CorrelationMap
- InFlightRequest with IReadOnlyList<InterestedParty> (load-bearing
  for Phase 10 read coalescing — do not collapse to a single field)
- Per-request watchdog: surfaces Modbus exception 0x0B to upstream
  on BackendRequestTimeoutMs, defending against lost responses,
  dead-PLC paths, and pymodbus 3.13.0's concurrent-multiplexed-
  request bug (its ServerRequestHandler.last_pdu state race)
- Status DTO + HTML gain inFlight / maxInFlight / txIdWraps /
  disconnectCascades / queueDepth (Tier 1.6 in docs/kpi.md)

Tests: 263 unit + 38 E2E. Multiplexer correctness under truly
concurrent backend traffic is proved against a stub backend in
PlcMultiplexerTests; MultiplexerE2ETests paces requests so pymodbus
3.13's single-PDU framer stays in known-good mode.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 01:49:35 -04:00


Phase 06 — Configuration hot-reload

Subscribe to IOptionsMonitor<MbproxyOptions>.OnChange and reconcile the running supervisors + per-PLC tag maps + connection settings against the new config — without restarting the host.

Depends on: Phase 05 (supervisor lifecycle). Parallel-safe with: nothing (touches the widest cross-cut: supervisors + tag maps + counters + DI options).

Goal

Saving appsettings.json propagates changes according to the design's reconcile table:

| Change | Action |
|---|---|
| BcdTags.Global add/remove/width | Rebuild every PLC's BcdTagMap, swap atomically. Next PDU sees it. |
| Plcs[i].BcdTags.{Add,Remove} | Rebuild that PLC's BcdTagMap only. |
| New Plcs[i] | Create supervisor + context, start it. |
| Removed Plcs[i] | Stop supervisor, close all client connections to it. |
| Changed ListenPort / Host | Stop + start the supervisor (remove + add semantics). |
| Connection.Backend*TimeoutMs | Take effect on the next backend connect / request. |
| Invalid reload | Reject as a whole; keep current state; log mbproxy.config.reload.rejected. |

Validation runs FIRST. A reload that would produce duplicate ListenPort values, or a BcdTagMapBuilder.Build error for any PLC, is rejected atomically before any state mutates.
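
As an illustration of the validate-first rule, here is a minimal sketch of the cross-PLC checks. `PlcSketch`, `OptionsSketch`, and `ReloadValidatorSketch` are hypothetical, pared-down stand-ins for the real option types, and the sketch assumes the validator collects every error rather than stopping at the first. The `BcdTagMapBuilder.Build` check and the `AdminPort` range check are left out.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical, pared-down stand-ins for MbproxyOptions / PlcOptions.
public sealed record PlcSketch(string Name, int ListenPort);
public sealed record OptionsSketch(int AdminPort, IReadOnlyList<PlcSketch> Plcs);

public static class ReloadValidatorSketch
{
    // Returns true when no problems were found. Collects every problem
    // so that a rejected-reload log entry could list them all.
    public static bool Validate(OptionsSketch next, out IReadOnlyList<string> errors)
    {
        var found = new List<string>();

        foreach (var plc in next.Plcs)
        {
            // Assumption: whitespace-only names count as empty.
            if (string.IsNullOrWhiteSpace(plc.Name))
                found.Add("PLC name is empty");
            if (plc.ListenPort < 1 || plc.ListenPort > 65535)
                found.Add($"PLC '{plc.Name}': ListenPort {plc.ListenPort} is out of range");
            if (plc.ListenPort == next.AdminPort)
                found.Add($"PLC '{plc.Name}': ListenPort collides with AdminPort {next.AdminPort}");
        }

        foreach (var g in next.Plcs.GroupBy(p => p.Name, StringComparer.Ordinal).Where(g => g.Count() > 1))
            found.Add($"Duplicate PLC name '{g.Key}'");

        foreach (var g in next.Plcs.GroupBy(p => p.ListenPort).Where(g => g.Count() > 1))
            found.Add($"Duplicate ListenPort {g.Key}");

        errors = found;
        return found.Count == 0;
    }
}
```

Because the method only reads its input, a caller can run it before touching any supervisor and simply return when it reports failure.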

Outputs

src/Mbproxy/Configuration/ConfigReconciler.cs       # OnChange handler; orchestrates the apply
src/Mbproxy/Configuration/ReloadValidator.cs        # cross-PLC validation (duplicate ports, etc.)
src/Mbproxy/Configuration/ReloadPlan.cs             # immutable diff record between current and new

tests/Mbproxy.Tests/Configuration/ReloadValidatorTests.cs
tests/Mbproxy.Tests/Configuration/ReloadPlanTests.cs
tests/Mbproxy.Tests/Configuration/ConfigReconcilerTests.cs
tests/Mbproxy.Tests/Configuration/HotReloadE2ETests.cs    # real appsettings.json mutation, real host

Modifications:

  • src/Mbproxy/Proxy/ProxyWorker.cs — accept a ConfigReconciler and forward IOptionsMonitor.OnChange to it; on startup, also seed the reconciler with the initial snapshot.
  • src/Mbproxy/Proxy/Supervision/PlcListenerSupervisor.cs — expose a Task ReplaceContextAsync(PerPlcContext newCtx, CancellationToken ct) that atomically swaps the BCD tag map and counters without restarting the listener. Old in-flight connections finish on the old map; new connections use the new map. (Document the brief transition window in comments.)
  • Add mbproxy.config.reload.applied and mbproxy.config.reload.rejected [LoggerMessage] events.
  • src/Mbproxy/Options/MbproxyOptions.cs — wire IValidateOptions<MbproxyOptions> to call the schema-level validator only. Cross-PLC validation (duplicate ports, etc.) is handled by ReloadValidator because it requires inspecting multiple Plcs[i] together, which IValidateOptions doesn't naturally express.
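
One way the atomic swap behind ReplaceContextAsync could work is a single reference exchange on an immutable context object. The sketch below is an assumption about the mechanism, not the actual supervisor code; `ContextSketch` and `SupervisorSketch` are hypothetical names.

```csharp
using System.Threading;

// Hypothetical stand-in for PerPlcContext: an immutable bundle that is replaced whole.
public sealed record ContextSketch(string MapVersion);

public sealed class SupervisorSketch
{
    private ContextSketch _current;

    public SupervisorSketch(ContextSketch initial) => _current = initial;

    // Readers take a reference and keep using it; a later swap does not
    // change the object they already hold.
    public ContextSketch Capture() => Volatile.Read(ref _current);

    // Publishes the new context with one reference swap and returns the old one.
    public ContextSketch Replace(ContextSketch next) => Interlocked.Exchange(ref _current, next);
}
```

The transition window depends on where `Capture` is called. A reader that captures once per connection stays on the old map until that connection closes. A reader that captures per PDU picks up the new map on its next request.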

Tasks

  1. ReloadPlan.cs — immutable record describing the diff:
    public sealed record ReloadPlan(
      IReadOnlyList<PlcOptions> ToAdd,
      IReadOnlyList<string> ToRemove,           // PLC names
      IReadOnlyList<(string Name, PlcOptions New)> ToRestart,   // port or host changed
      IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat,  // tag map changed
      ConnectionOptions Connection);
    
    Computed by a pure function ReloadPlan.Compute(MbproxyOptions current, MbproxyOptions next); PLC identity is keyed on Name (NOT on ListenPort, which is mutable).
  2. ReloadValidator.cs — single static method Validate(MbproxyOptions next, out IReadOnlyList<string> errors):
    • PLC names are unique and non-empty.
    • ListenPort values are unique.
    • For each PLC, BcdTagMapBuilder.Build(global, perPlc).Errors is empty.
    • AdminPort doesn't collide with any Plcs[i].ListenPort.
    • All ports are in [1, 65535].
  3. ConfigReconciler.cs — subscribes via constructor-injected IOptionsMonitor<MbproxyOptions>.OnChange. On change:
    • Snapshot the new options.
    • Run ReloadValidator.Validate. On failure: log mbproxy.config.reload.rejected with the error list; do nothing else.
    • Compute ReloadPlan against the current snapshot.
    • Apply the plan in order:
      1. Stop supervisors in ToRemove (concurrently).
      2. Stop+restart supervisors in ToRestart (concurrently).
      3. Build new PerPlcContext for each ToReseat entry and call supervisor.ReplaceContextAsync(newCtx).
      4. Build supervisors for ToAdd, start them.
    • On success: log mbproxy.config.reload.applied with summary (PlcsAdded, PlcsRemoved, PlcsReseated, TagListDelta). Record lastReloadUtc and bump reloadCount on a service-wide counter (consumed by phase 07).
    • On any step throwing: best-effort log the partial-apply state at Error, then continue. The host stays up. (The validator should have caught most failure modes; a runtime failure here is a true bug.)
  4. ProxyWorker.cs updates — register the reconciler with the host and wire startup to use it for the initial snapshot.
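
The diff in task 1 can be sketched as a pure function over two snapshots keyed on Name. `PlcEntry` and `PlanSketch` are hypothetical simplifications: `TagSignature` stands in for the effective (global + per-PLC) tag list, and the sketch assumes a PLC whose port or host changed goes to ToRestart only, even if its tags changed too, since a restart rebuilds its context anyway.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for PlcOptions plus its effective tag list.
public sealed record PlcEntry(string Name, string Host, int ListenPort, string TagSignature);

public sealed record PlanSketch(
    IReadOnlyList<PlcEntry> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<PlcEntry> ToRestart,
    IReadOnlyList<PlcEntry> ToReseat)
{
    public static PlanSketch Compute(IReadOnlyList<PlcEntry> current, IReadOnlyList<PlcEntry> next)
    {
        // Identity is the PLC name; ports and hosts may change between snapshots.
        var old = current.ToDictionary(p => p.Name, StringComparer.Ordinal);
        var toAdd = new List<PlcEntry>();
        var toRestart = new List<PlcEntry>();
        var toReseat = new List<PlcEntry>();

        foreach (var plc in next)
        {
            if (!old.Remove(plc.Name, out var before))
                toAdd.Add(plc);                       // name not seen before
            else if (before.Host != plc.Host || before.ListenPort != plc.ListenPort)
                toRestart.Add(plc);                   // endpoint changed
            else if (before.TagSignature != plc.TagSignature)
                toReseat.Add(plc);                    // only the tag map changed
        }

        // Whatever is left in 'old' no longer appears in the new config.
        return new PlanSketch(toAdd, old.Keys.ToList(), toRestart, toReseat);
    }
}
```

Keeping the function free of side effects means the ReloadPlanTests cases can assert on the returned lists without any host or supervisor running.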

Public surface declared in this phase

namespace Mbproxy.Configuration;

internal sealed class ConfigReconciler : IDisposable {
    public ConfigReconciler(IOptionsMonitor<MbproxyOptions> monitor, /* dependencies */);
    public Task ApplyAsync(MbproxyOptions next, CancellationToken ct);   // exposed for tests
    public void Dispose();
}

public sealed record ReloadPlan(
    IReadOnlyList<PlcOptions> ToAdd,
    IReadOnlyList<string> ToRemove,
    IReadOnlyList<(string Name, PlcOptions New)> ToRestart,
    IReadOnlyList<(string Name, BcdTagMap NewMap)> ToReseat,
    ConnectionOptions Connection) {
    public static ReloadPlan Compute(MbproxyOptions current, MbproxyOptions next);
}

internal static class ReloadValidator {
    public static bool Validate(MbproxyOptions next, out IReadOnlyList<string> errors);
}

Tests required

Unit (Category = Unit)

ReloadValidatorTests (≥ 6 tests):

  1. Validate_DuplicatePlcName_Fails
  2. Validate_DuplicateListenPort_Fails
  3. Validate_AdminPortCollidesWith_PlcListenPort_Fails
  4. Validate_PerPlc_BcdMapBuildError_Fails
  5. Validate_PortOutOfRange_Fails
  6. Validate_HappyPath_Passes

ReloadPlanTests (≥ 5 tests):

  1. Compute_AddOnePlc_OnlyToAddPopulated
  2. Compute_RemoveOnePlc_OnlyToRemovePopulated
  3. Compute_ChangePort_GoesToToRestart_NotToReseat
  4. Compute_ChangePerPlcTagOverride_GoesToToReseat
  5. Compute_ChangeGlobalTagList_AllPlcsReseat_NoRestart

ConfigReconcilerTests (≥ 4 tests, using a fake IOptionsMonitor + fake supervisor factory):

  1. Apply_HappyPath_StartsAndStopsSupervisors_PerPlan
  2. Apply_ValidationFails_NoMutationOccurs_AndLogsRejected
  3. Apply_ReseatTagMap_DoesNotRestartSupervisor
  4. Apply_ConcurrentReloads_Are_Serialised — two rapid changes get processed in order, no interleaving.

E2E (Category = E2E)

HotReloadE2ETests (≥ 4 tests, using a real Host.CreateApplicationBuilder + temp appsettings.json file):

  1. E2E_AddPlcAtRuntime_NewListenerBinds_AndIsReachable — start the host with one PLC, write a new appsettings adding a second PLC pointing at the simulator on a fresh listen port, drive NModbus against the new proxy port within 2 s.
  2. E2E_RemovePlcAtRuntime_ClosesUpstreamConnections — start with two PLCs and a connected client, write appsettings removing one; client's socket closes within 1 s.
  3. E2E_ChangeGlobalBcdTagList_RewriteReflectsImmediately — start with addr 1072 NOT in BCD list, read raw 0x1234. Write appsettings adding it. Read again, get decoded 1234.
  4. E2E_InvalidReload_DoesNotMutateRunningState — start happy, write a broken appsettings (duplicate ListenPort), assert the host keeps running with the OLD config and mbproxy.config.reload.rejected is logged.

Phase gate

  • Zero-warnings build.
  • All phase 00-05 tests still green.
  • All new unit tests green.
  • All e2e hot-reload tests green when the simulator is available.
  • mbproxy.config.reload.applied / .rejected events match the design's properties list.
  • A misconfigured reload (duplicate ports) is rejected atomically; E2E test 4 (E2E_InvalidReload_DoesNotMutateRunningState) asserts that no partial mutation occurred.
  • The reconciler serializes concurrent OnChange notifications (SemaphoreSlim or equivalent) so two file saves in quick succession don't race.
  • Counters service.config.reloadCount and service.config.reloadRejectedCount are bumped correctly.
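
The serialisation gate could be as small as the sketch below, assuming a SemaphoreSlim with a single permit. `SerialisedApplier` is a hypothetical name, and the `string` snapshot stands in for MbproxyOptions.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class SerialisedApplier
{
    // One permit: at most one apply runs at a time.
    private readonly SemaphoreSlim _gate = new(1, 1);
    private readonly Func<string, CancellationToken, Task> _apply;

    public SerialisedApplier(Func<string, CancellationToken, Task> apply) => _apply = apply;

    public async Task ApplyAsync(string snapshot, CancellationToken ct)
    {
        await _gate.WaitAsync(ct);
        try
        {
            await _apply(snapshot, ct);
        }
        finally
        {
            _gate.Release(); // released even if the apply throws
        }
    }
}
```

SemaphoreSlim is not documented as granting waiters in FIFO order. If strict in-order processing is required, feeding the snapshots through a single-reader queue is a safer basis than relying on the semaphore alone.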

Out of scope

  • Watching for files OTHER than appsettings.json (env files, dotnet user-secrets, etc.). The default config source set established in phase 00 is the contract.
  • Reloading Serilog log levels at runtime. Possible but not in this phase.
  • A reload audit log file. The accept/reject events are sufficient.
  • Online schema migrations (e.g., renaming a key in an older config to a new one). Reject-the-whole-thing is the simpler contract.

Notes for the subagent

  • IOptionsMonitor.OnChange can fire MULTIPLE times for a single file save on some platforms (text editors saving via rename-and-replace can trigger 2-3 events). Debounce inside the reconciler — a 250 ms quiescent window after the last OnChange before computing the plan. Document the choice in code.
  • The reconciler must NOT block the OnChange callback thread for I/O (StopAsync etc.). Use Channel<ReloadRequest> or a Task.Run-style hand-off so the callback returns immediately.
  • When a supervisor restart is in progress (e.g., port changed), there are two options: briefly reject further reloads and queue a "retry after the current apply finishes", or serialise everything through a single semaphore and accept that a backed-up reload queue applies every change eventually. Pick the simpler option (the semaphore) and document it.
  • BcdTagMapBuilder.Build is the validator for tag-list well-formedness; do not duplicate that validation in ReloadValidator. The validator just calls Build and checks the Errors list.
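
The debounce and hand-off notes above can be combined in one consumer loop. This is a minimal sketch, assuming an unbounded `Channel<T>` with a single reader; `DebouncedReloadQueue` is a hypothetical name, and the `string` snapshot stands in for the ReloadRequest payload.

```csharp
using System;
using System.Threading;
using System.Threading.Channels;
using System.Threading.Tasks;

public sealed class DebouncedReloadQueue
{
    private readonly Channel<string> _channel = Channel.CreateUnbounded<string>(
        new UnboundedChannelOptions { SingleReader = true });

    // Safe to call from the OnChange callback: TryWrite on an unbounded
    // channel completes immediately and performs no I/O.
    public void Enqueue(string snapshot) => _channel.Writer.TryWrite(snapshot);

    public void Complete() => _channel.Writer.Complete();

    // Waits for a snapshot, keeps replacing it with newer ones until 'quiet'
    // passes with no further writes, then applies only the latest.
    public async Task RunAsync(Func<string, Task> apply, TimeSpan quiet, CancellationToken ct)
    {
        var reader = _channel.Reader;
        while (await reader.WaitToReadAsync(ct))
        {
            if (!reader.TryRead(out var latest)) continue;
            while (true)
            {
                using var window = CancellationTokenSource.CreateLinkedTokenSource(ct);
                window.CancelAfter(quiet);
                try
                {
                    if (!await reader.WaitToReadAsync(window.Token)) break; // writer completed
                    while (reader.TryRead(out var newer)) latest = newer;   // restart the window
                }
                catch (OperationCanceledException) when (!ct.IsCancellationRequested)
                {
                    break; // quiet window elapsed with no new snapshot
                }
            }
            await apply(latest);
        }
    }
}
```

With `quiet` set to 250 ms, a burst of two or three OnChange events from one file save collapses into a single apply. Because one loop consumes the channel, applies also run one at a time and in arrival order.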