wwtools/mbproxy/docs/Testing/Simulator.md
Joseph Doherty f49e27e316 mbproxy/docs: split deep docs into focused PascalCase files per StyleGuide
Adds 11 topic-focused docs under docs/{Architecture,Features,Operations,Reference,Testing}/
and links them from README.md's new "Detailed documentation" section. Existing
top-level docs (design.md, kpi.md, operations.md) remain as canonical landings.

Architecture/
  - Overview.md         (150 lines) — listener topology, request flow, per-PLC isolation
  - ConnectionModel.md  (247 lines) — TxId multiplexer, watchdog, disconnect cascade
  - ReadCoalescing.md   (243 lines) — in-flight FC03/04 dedup via InFlightByKeyMap
  - ResponseCache.md    (398 lines) — opt-in per-tag TTL cache + range-overlap invalidation

Features/
  - BcdRewriting.md     (252 lines) — codec, CDAB, FC scope, partial-overlap policy
  - HotReload.md        (189 lines) — IOptionsMonitor + per-change-kind reconcile rules

Operations/
  - Configuration.md    (422 lines) — every Mbproxy:* option + validation rules
  - StatusPage.md       (334 lines) — admin endpoint surface, every JSON field
  - Troubleshooting.md  (364 lines) — diagnosis playbook keyed to log events

Reference/
  - LogEvents.md        (499 lines) — 28 events across 7 categories, grep-verified

Testing/
  - Simulator.md        (235 lines) — pymodbus fixture, skip policy, 3.13 framer quirk

Each doc was written by a dedicated agent against the StyleGuide.md rules with
a per-doc phase gate (PascalCase filename, H1 Title Case, code-fence language
tags, Related Documentation section with >=3 relative links, real type names
verified against src/). Cross-references between docs use relative paths;
all 18 README->docs links and all sibling links resolve.

Known follow-up: docs/design.md lines 215-251 are stale on two log-event
property templates (config.reload.applied and config.reload.rejected) and
mention LogContext.PushProperty scoping that isn't actually used. Reference/
LogEvents.md is now the authoritative event catalog and source-of-truth.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-14 03:44:34 -04:00


# Simulator Harness
The pymodbus DL205 simulator stands in for real DL205/DL260 hardware in the E2E test suite. This document describes the launcher, the xUnit fixture, the skip policy, the per-test timeout discipline, and the pymodbus 3.13.0 framer quirk that the test strategy works around.
## Why a Simulator
`mbproxy` targets a fleet of AutomationDirect DL205/DL260 controllers that test machines do not have. The pymodbus profile at [`../../DL260/dl205.json`](../../DL260/dl205.json) already models the device-side quirks (BCD nibbles at known holding-register addresses, CDAB-ordered 32-bit values, C-relay/Y-output coil mappings) as concrete register seeds. The harness wraps that profile in an xUnit `IAsyncLifetime` fixture so every E2E test class opens against a fresh known-good DL-series target without manual setup.
The device-side rationale for each seed (why HR 1072 is `0x1234`, why FC03 caps at 128, etc.) lives in [`../../DL260/dl205.md`](../../DL260/dl205.md). The harness exists to make that profile addressable from xUnit tests; it does not duplicate the device documentation.
## Harness Layout
Three files form the harness contract:
| Path | Role |
|------|------|
| `tests/sim/run-dl205-sim.ps1` | PowerShell launcher. Provisions a Python venv under `tests/sim/.venv` on first run (`python -m venv` + `pip install pymodbus`) and execs `pymodbus.simulator` against `dl205.json` on the requested port. Idempotent — re-runs reuse the venv. |
| `tests/Mbproxy.Tests/Sim/DL205SimulatorFixture.cs` | `IAsyncLifetime` fixture that picks a free port, spawns the launcher, polls for TCP readiness, and tears the process tree down on dispose. |
| `tests/Mbproxy.Tests/Sim/DL205SimulatorCollection.cs` | `[CollectionDefinition(nameof(DL205SimulatorCollection))]` that exposes the fixture as an xUnit `ICollectionFixture<DL205SimulatorFixture>`. |
The launcher is a PowerShell 7+ script; the fixture invokes it via `pwsh -NoProfile -File <script> -Port <picked>`. The script's exit codes (1 = venv failure, 2 = pymodbus launch failure, 3 = profile missing) surface in the fixture's skip reason, alongside the captured stdout/stderr tail, for diagnosis.
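The exit-code contract can be mirrored on the C# side as a small lookup. This is a minimal sketch rather than the fixture's code: `DescribeLauncherExit` is a hypothetical helper name, and only the three codes and their meanings come from the text above.

```csharp
using System;

// Hypothetical helper: translates the launcher's documented exit codes
// (1 = venv, 2 = pymodbus launch, 3 = profile) into a short diagnosis.
// Any other code is reported as outside the documented contract.
static string DescribeLauncherExit(int exitCode) => exitCode switch
{
    1 => "venv provisioning failed",
    2 => "pymodbus launch failed",
    3 => "simulator profile missing",
    _ => $"exit code {exitCode} is not part of the documented contract",
};

Console.WriteLine(DescribeLauncherExit(2)); // prints "pymodbus launch failed"
```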
## Fixture Lifecycle
`DL205SimulatorFixture` is sealed and lives in `Mbproxy.Tests.Sim`. The lifecycle is bounded entirely by `InitializeAsync` and `DisposeAsync`.
### InitializeAsync
`InitializeAsync` runs five steps:
1. **Pick a free local port.** Bind a `TcpListener` on `IPAddress.Loopback:0`, capture the OS-assigned port into `Port`, and dispose the listener. The TOCTOU window between dispose and pymodbus binding is documented in the source and considered acceptable for tests; a port-steal would manifest as a connect failure inside the readiness poll, which then surfaces via `SkipReason`.
2. **Resolve the launcher script.** Walks upward from the test assembly directory (`tests/Mbproxy.Tests/bin/<config>/net10.0/`) looking for `tests/sim/run-dl205-sim.ps1`. If not found, `SkipReason` is set with a "could not locate" message that points at the expected layout.
3. **Verify `pwsh` is on `PATH`.** Spawns `pwsh -NoProfile -Command exit 0` with a 3 s budget. If `pwsh` is missing or returns non-zero, `SkipReason` is set.
4. **Spawn the simulator subprocess.** `Process.Start` invokes the launcher with the picked port; stdout and stderr are drained asynchronously into a 50-line ring buffer via `BeginOutputReadLine` / `BeginErrorReadLine` to avoid blocking the child on a full pipe.
5. **Poll for TCP readiness.** Repeatedly opens `TcpClient.ConnectAsync(Host, Port)` at 100 ms intervals until either the connect succeeds, the process exits early, or `ReadinessTimeout` elapses.
`ReadinessTimeout` is 120 s. The spec calls for "up to 10 s" of warm-run server startup; the fixture allows the longer budget because a cold run that has to `pip install pymodbus` can take 30 to 90 s depending on network speed. Warm runs (the common case) succeed in under 2 s. Cold-run provisioning happens inside the same launcher invocation, so its time adds to the readiness wait; giving it its own budget would require a separate pre-provision step.
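Steps 1 and 5 can be sketched in isolation. The code below is an illustration under stated assumptions, not the fixture source: `PickFreePort` and `WaitForTcpReadyAsync` are hypothetical names, and the sketch omits the early-process-exit check that the real readiness poll performs.

```csharp
using System;
using System.Net;
using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Step 1: let the OS assign a free loopback port, then release it.
// Another process could take the port before the simulator binds it (TOCTOU).
static int PickFreePort()
{
    var listener = new TcpListener(IPAddress.Loopback, 0);
    listener.Start();
    try
    {
        return ((IPEndPoint)listener.LocalEndpoint).Port;
    }
    finally
    {
        listener.Stop();
    }
}

// Step 5: retry a TCP connect every pollInterval until it succeeds or the
// timeout elapses. Returns true once the port accepts a connection.
static async Task<bool> WaitForTcpReadyAsync(
    string host, int port, TimeSpan timeout, TimeSpan pollInterval)
{
    using var cts = new CancellationTokenSource(timeout);
    while (!cts.IsCancellationRequested)
    {
        try
        {
            using var client = new TcpClient();
            await client.ConnectAsync(host, port, cts.Token);
            return true;
        }
        catch (SocketException)
        {
            // Nothing is listening yet; wait and retry.
        }
        catch (OperationCanceledException)
        {
            return false;
        }

        try
        {
            await Task.Delay(pollInterval, cts.Token);
        }
        catch (OperationCanceledException)
        {
            return false;
        }
    }
    return false;
}

int freePort = PickFreePort();
bool isReady = await WaitForTcpReadyAsync(
    "127.0.0.1", freePort, TimeSpan.FromMilliseconds(500), TimeSpan.FromMilliseconds(100));
Console.WriteLine($"port {freePort} ready: {isReady}");
```

With the fixture's values the call would use a 120 s timeout and a 100 ms interval; the short timeout here only keeps the demonstration quick.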
If readiness never arrives, the fixture distinguishes two failure modes in `SkipReason`:
- **The process exited prematurely.** Likely cause: Python not found, pymodbus not installed, profile path wrong. The exit code and the log tail are included in the skip reason.
- **The process is still running but never accepted a connection.** Likely cause: port stolen, pymodbus stuck in its own startup, firewall blocking loopback. The log tail is included verbatim.
Either way, the process tree is killed before the skip reason is returned, so no orphan pymodbus survives a failed fixture initialisation.
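The two failure modes reduce to one decision: did the process exit, or is it still running without listening? A minimal sketch follows. `BuildSkipReason` is a hypothetical helper, and the message prefixes are taken from the skip-reason table in this document; the exact formatting of the real messages is an assumption.

```csharp
using System;

// Hypothetical helper: exitCode is null while the process is still running.
// The log tail is appended verbatim in both cases.
static string BuildSkipReason(int? exitCode, int port, int timeoutSeconds, string logTail) =>
    exitCode is int code
        ? $"Simulator process exited prematurely (exit code {code}).{Environment.NewLine}{logTail}"
        : $"Simulator did not accept a TCP connection on port {port} within {timeoutSeconds} s.{Environment.NewLine}{logTail}";

Console.WriteLine(BuildSkipReason(exitCode: 3, port: 5020, timeoutSeconds: 120, logTail: "(log tail)"));
Console.WriteLine(BuildSkipReason(exitCode: null, port: 5020, timeoutSeconds: 120, logTail: "(log tail)"));
```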
### DisposeAsync
`DisposeAsync` calls `_process.Kill(entireProcessTree: true)` and then `WaitForExitAsync` with a 5 s cap. The kill is abrupt: .NET has no portable way to send a graceful `SIGINT` on Windows without P/Invoke into the console-attach APIs, so pymodbus's `atexit` handlers may be cut short. The trade-off is documented inline in the fixture and is acceptable for test cleanup because the simulator is stateless across runs.
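The kill-then-bounded-wait pattern can be sketched as below. `KillTreeAndWaitAsync` is a hypothetical helper name; the real `DisposeAsync` may differ in error handling, so treat this as an outline of the technique only.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Kills the process and its descendants, then waits up to waitCap for exit.
// Returns true if the process was observed to exit within the cap.
static async Task<bool> KillTreeAndWaitAsync(Process process, TimeSpan waitCap)
{
    try
    {
        if (!process.HasExited)
            process.Kill(entireProcessTree: true);
    }
    catch (InvalidOperationException)
    {
        // The process exited between the HasExited check and the kill.
    }

    using var cts = new CancellationTokenSource(waitCap);
    try
    {
        await process.WaitForExitAsync(cts.Token);
        return true;
    }
    catch (OperationCanceledException)
    {
        return false; // still running after the cap; the caller decides what to log
    }
}
```

A caller following the text above would pass `TimeSpan.FromSeconds(5)` as the cap.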
### Public Surface
The fixture exposes four members:
```csharp
public sealed class DL205SimulatorFixture : IAsyncLifetime
{
    public string Host { get; } = "127.0.0.1";
    public int Port { get; private set; }
    public string? SkipReason { get; private set; }
    public string LogTail => BuildLogTail();
}
```
`Host` is always loopback. `Port` is the OS-picked free port. `SkipReason` is non-null when the simulator could not start. `LogTail` returns the last 50 lines of captured stdout/stderr for diagnosis when a test fails.
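A bounded log tail of this kind can be kept with a queue that drops its oldest entry once it is full. The sketch below is one way to do it, not the fixture's implementation: `AppendLine` and this version of `BuildLogTail` are hypothetical, and the lock reflects the assumption that stdout and stderr callbacks may arrive on different threads.

```csharp
using System;
using System.Collections.Generic;

// Appends a line and trims the buffer to the most recent `capacity` lines.
static void AppendLine(Queue<string> buffer, int capacity, string? line)
{
    if (line is null) return; // Process output events signal end-of-stream with null
    lock (buffer)
    {
        buffer.Enqueue(line);
        while (buffer.Count > capacity)
            buffer.Dequeue();
    }
}

// Joins the retained lines, oldest first.
static string BuildLogTail(Queue<string> buffer)
{
    lock (buffer)
    {
        return string.Join(Environment.NewLine, buffer);
    }
}

var tail = new Queue<string>();
for (int i = 1; i <= 60; i++)
    AppendLine(tail, 50, $"line {i}");
Console.WriteLine(tail.Peek()); // prints "line 11"
```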
## Skip Policy
If Python is missing, `pwsh` is not on `PATH`, or pip provisioning fails, `InitializeAsync` populates `SkipReason` with a human-readable explanation and the fixture proceeds without a live simulator. Every E2E test starts with the same guard:
```csharp
if (_sim.SkipReason is not null)
    Assert.Skip(_sim.SkipReason);
```
The unit-test suite (any test without `[Trait("Category", "E2E")]`) runs without any Python at all. CI machines must have Python 3.10+ and PowerShell 7+; local developers running only unit tests need nothing extra. The phase-01 gate (see [`../plan/README.md`](../plan/README.md)) explicitly verifies that on a machine with Python and pymodbus installed, none of the smoke tests skip — a skip on a properly equipped CI machine is treated as an environment failure, not a test pass.
The skip reasons the fixture produces map cleanly onto the recovery action:
| Skip reason prefix | Cause | Recovery |
|--------------------|-------|----------|
| `Could not locate tests/sim/run-dl205-sim.ps1` | Test assembly is too deep, or the script was deleted | Restore the script; verify the upward search starts inside the repo |
| `pwsh (PowerShell 7+) is not available on PATH` | Windows PowerShell 5.1 is on PATH but not pwsh | Install PowerShell 7+ and ensure `pwsh` resolves |
| `Failed to spawn pwsh: <message>` | `Process.Start` itself failed | Inspect the inner message; usually a `PATH` or permissions issue |
| `Simulator process exited prematurely (exit code N)` | Launcher script returned non-zero (Python missing, pymodbus install failed, profile missing) | Read the log tail; exit codes 1/2/3 map to venv / launch / profile failures |
| `Simulator did not accept a TCP connection on port N within 120 s` | Process is alive but never bound the port | Read the log tail; usually a port-steal or a pymodbus internal hang |
## Adding an E2E Test
An E2E test class declares the collection, the category trait, takes the fixture in its constructor, and guards every test with the skip check. `SimulatorSmokeTests` in `tests/Mbproxy.Tests/Sim/` is the canonical minimal example:
```csharp
[Collection(nameof(DL205SimulatorCollection))]
[Trait("Category", "E2E")]
public sealed class SimulatorSmokeTests
{
    private readonly DL205SimulatorFixture _sim;

    public SimulatorSmokeTests(DL205SimulatorFixture sim) => _sim = sim;

    [Fact(Timeout = 5_000)]
    public async Task Simulator_FC03_ReturnsBCD_RawValueAtHR1072_0x1234()
    {
        if (_sim.SkipReason is not null)
            Assert.Skip(_sim.SkipReason);

        using var client = new TcpClient();
        await client.ConnectAsync(_sim.Host, _sim.Port,
            TestContext.Current.CancellationToken);
        var master = new ModbusFactory().CreateMaster(client);

        ushort[] regs = master.ReadHoldingRegisters(
            slaveAddress: 1, startAddress: 1072, numberOfPoints: 1);
        Assert.Equal(0x1234, regs[0]); // raw BCD nibbles, NOT binary 1234
    }
}
```
For a proxy-shaped test, configure an in-process host pointing at `_sim.Host:_sim.Port` as the backend PLC and drive `NModbus` against the proxy's listen port. `MultiplexerE2ETests` in `tests/Mbproxy.Tests/Proxy/Multiplexing/` is the working example with the in-process `Host.CreateApplicationBuilder()` setup.
## Per-Test Timeout Policy
`[Fact(Timeout = 5_000)]` is the default for every E2E test. Expand per-test only when the test genuinely needs longer — concurrent bursts above 100 ops, reload-propagation debounce, graceful-shutdown drain, Polly-paced backend reconnects. Add a one-line comment on the test explaining the reason whenever the timeout exceeds the 5 s default. `MultiplexerE2ETests.E2E_BackendDisconnect_DuringInflight_CascadesUpstream_AndRecovers` uses `[Fact(Timeout = 8_000)]` and documents the Polly backoff budget inline.
The reason a hard timeout matters: synchronous `NModbus` calls do not honor `TestContext.Current.CancellationToken`. Without `[Fact(Timeout=…)]`, a deadlock anywhere in the proxy hangs the test runner indefinitely. The hang-diagnosis pattern for when this nonetheless happens lives in [`../Operations/Troubleshooting.md`](../Operations/Troubleshooting.md).
The test-runner backstop is a process-level safety net:
```powershell
dotnet test --filter Category=E2E --blame-hang-timeout 2m
```
The `--blame-hang-timeout` is mandatory for E2E runs. It catches the rare case where an individual test's `Timeout` somehow does not fire — for example, an unmanaged thread blocking finalization.
## The Pymodbus 3.13.0 Framer Quirk
Pymodbus 3.13.0's `ServerRequestHandler` stores a single `last_pdu` field per TCP connection and schedules the deferred handler via `asyncio.call_soon`. If two MBAP frames arrive in the same recv buffer — which the multiplexer's shared backend connection produces under truly concurrent upstream reads — the later frame overwrites `last_pdu` before the first scheduled handler runs, and both responses then carry the later request's TxId. The real DL260 ECOM does not exhibit this bug; it echoes per-request TxIds correctly.
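The hazard is easier to see in a toy model. The sketch below is not pymodbus code and does not reproduce its internals; it only models the described pattern in C#: one shared slot per connection, with handlers that read the slot when they run rather than when they are scheduled. All names (`BuildReadRequest`, `SplitFrames`, `RunDeferredHandlers`) are hypothetical.

```csharp
using System;
using System.Buffers.Binary;
using System.Collections.Generic;

// Builds an FC03 request: 7-byte MBAP header followed by a 5-byte PDU.
static byte[] BuildReadRequest(ushort txId, byte unitId, ushort start, ushort count)
{
    var frame = new byte[12];
    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), txId);
    // Bytes 2..3 stay zero: the Modbus protocol id.
    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 6); // unit id + PDU
    frame[6] = unitId;
    frame[7] = 0x03;
    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(8, 2), start);
    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(10, 2), count);
    return frame;
}

// Splits one receive buffer into complete MBAP frames.
static List<byte[]> SplitFrames(byte[] buffer)
{
    var frames = new List<byte[]>();
    int offset = 0;
    while (buffer.Length - offset >= 7)
    {
        int length = BinaryPrimitives.ReadUInt16BigEndian(buffer.AsSpan(offset + 4, 2));
        int total = 6 + length; // the length field counts the unit id plus the PDU
        if (buffer.Length - offset < total) break; // incomplete trailing frame
        frames.Add(buffer.AsSpan(offset, total).ToArray());
        offset += total;
    }
    return frames;
}

// Schedules one deferred handler per frame, then runs them all afterwards.
// sharedSlot = true models the single per-connection field described above;
// sharedSlot = false models a backend that captures the TxId per request.
static List<ushort> RunDeferredHandlers(byte[] recvBuffer, bool sharedSlot)
{
    ushort lastTxId = 0;
    var deferred = new List<Func<ushort>>();
    foreach (byte[] frame in SplitFrames(recvBuffer))
    {
        ushort txId = BinaryPrimitives.ReadUInt16BigEndian(frame.AsSpan(0, 2));
        lastTxId = txId; // the later frame overwrites the earlier one
        if (sharedSlot) deferred.Add(() => lastTxId);
        else deferred.Add(() => txId);
    }

    var responseTxIds = new List<ushort>();
    foreach (var handler in deferred)
        responseTxIds.Add(handler());
    return responseTxIds;
}

var recv = new List<byte>();
recv.AddRange(BuildReadRequest(txId: 1, unitId: 1, start: 1072, count: 1));
recv.AddRange(BuildReadRequest(txId: 2, unitId: 1, start: 1080, count: 2));
Console.WriteLine(string.Join(", ", RunDeferredHandlers(recv.ToArray(), sharedSlot: true)));  // prints "2, 2"
Console.WriteLine(string.Join(", ", RunDeferredHandlers(recv.ToArray(), sharedSlot: false))); // prints "1, 2"
```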
This forces a three-part test strategy:
- **Multiplexer correctness under concurrent backend traffic is proven against a stub backend.** `PlcMultiplexerTests` drives the multiplexer with a stub that properly echoes per-request TxIds. That test class is the load-bearing coverage for TxId rewriting; the simulator does not contribute here.
- **Simulator-backed E2E tests pace requests** to keep pymodbus in known-good single-PDU mode. `MultiplexerE2ETests` enforces serialisation explicitly: each upstream client's request is issued only after the previous client's response has returned. The `<summary>` block at the top of `MultiplexerE2ETests.cs` documents the trade-off and points readers to `PlcMultiplexerTests` for the concurrency proof.
- **The per-request watchdog defends production.** Configurable via `Connection.BackendRequestTimeoutMs`, the multiplexer cancels any in-flight request whose response does not arrive within the budget and surfaces Modbus exception `0x0B` upstream. The `mbproxy.multiplex.request.timeout` log event (see [`../Reference/LogEvents.md`](../Reference/LogEvents.md)) is the operational signal. The same code path defends against any backend — pymodbus, a misbehaving ECOM, or a network middlebox — that mis-echoes or drops a TxId.
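The watchdog behaviour in the last bullet can be sketched as a bounded wait that falls back to a Modbus exception frame. This is an illustration, not the multiplexer's code: `BuildExceptionResponse` and `AwaitWithWatchdogAsync` are hypothetical names, and the sketch does not cancel or clean up the in-flight backend request the way the real multiplexer is described as doing.

```csharp
using System;
using System.Buffers.Binary;
using System.Threading.Tasks;

// Builds a Modbus TCP exception response: the function code with its high
// bit set, followed by the one-byte exception code.
static byte[] BuildExceptionResponse(ushort txId, byte unitId, byte functionCode, byte exceptionCode)
{
    var frame = new byte[9];
    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(0, 2), txId);
    // Bytes 2..3 stay zero: the Modbus protocol id.
    BinaryPrimitives.WriteUInt16BigEndian(frame.AsSpan(4, 2), 3); // unit id + 2 PDU bytes
    frame[6] = unitId;
    frame[7] = (byte)(functionCode | 0x80);
    frame[8] = exceptionCode;
    return frame;
}

// Waits for the backend response up to `budget`; on expiry, answers the
// upstream client with exception 0x0B instead.
static async Task<byte[]> AwaitWithWatchdogAsync(
    Task<byte[]> backendResponse, TimeSpan budget, ushort txId, byte unitId, byte functionCode)
{
    try
    {
        return await backendResponse.WaitAsync(budget);
    }
    catch (TimeoutException)
    {
        return BuildExceptionResponse(txId, unitId, functionCode, 0x0B);
    }
}

var neverCompletes = new TaskCompletionSource<byte[]>();
byte[] reply = await AwaitWithWatchdogAsync(
    neverCompletes.Task, TimeSpan.FromMilliseconds(50), txId: 7, unitId: 1, functionCode: 0x03);
Console.WriteLine(Convert.ToHexString(reply)); // prints "00070000000301830B"
```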
The connection-model rationale for why the multiplexer produces multi-frame recv buffers in the first place is in [`../Architecture/ConnectionModel.md`](../Architecture/ConnectionModel.md).
## Simulator Profile
`DL260/dl205.json` is the pymodbus server config. It seeds the registers the E2E tests assert against:
| Address | Width | Seeded value | Used to prove |
|---------|-------|--------------|---------------|
| HR 0 | uint16 | `0xCAFE` | Profile is loaded; register 0 is valid on DL205/DL260 |
| HR 200..209 | uint16 | scratch range, writable | FC06/FC16 round-trips for BCD-rewriter E2E tests |
| HR 1072 | uint16 | `0x1234` (raw BCD nibbles) | Single-register FC03 BCD decode through the proxy |
| HR 1080/1081 | uint16 pair | CDAB-ordered 32-bit BCD | 32-bit BCD decode across the word pair |
The full register map and the device-side rationale for each entry live in [`../../DL260/dl205.md`](../../DL260/dl205.md).
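For orientation, the arithmetic behind the BCD rows can be sketched as follows. These helpers are hypothetical and are not the proxy's codec. The CDAB decode assumes the first register of the pair carries the low four digits, and throwing on a non-decimal nibble is this sketch's choice rather than documented proxy behaviour.

```csharp
using System;

// Decodes four BCD nibbles into a decimal value, e.g. 0x1234 -> 1234.
static int DecodeBcd16(ushort raw)
{
    int value = 0;
    for (int shift = 12; shift >= 0; shift -= 4)
    {
        int digit = (raw >> shift) & 0xF;
        if (digit > 9)
            throw new FormatException($"0x{raw:X4} contains a non-decimal nibble");
        value = value * 10 + digit;
    }
    return value;
}

// Encodes 0..9999 into four BCD nibbles, e.g. 1234 -> 0x1234.
static ushort EncodeBcd16(int value)
{
    if (value < 0 || value > 9999)
        throw new ArgumentOutOfRangeException(nameof(value));
    int raw = 0;
    for (int shift = 0; shift < 16; shift += 4)
    {
        raw |= (value % 10) << shift;
        value /= 10;
    }
    return (ushort)raw;
}

// Assumed CDAB layout: the first register holds the low word, the second the high word.
static int DecodeBcd32Cdab(ushort first, ushort second) =>
    DecodeBcd16(second) * 10_000 + DecodeBcd16(first);

Console.WriteLine(DecodeBcd16(0x1234));             // prints "1234"
Console.WriteLine(EncodeBcd16(1234).ToString("X4")); // prints "1234"
Console.WriteLine(DecodeBcd32Cdab(0x5678, 0x1234)); // prints "12345678"
```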
Two profile-level settings are load-bearing for the harness:
- **`"shared blocks": true`** — matches the DL series memory model where holding registers and input registers share the same backing store. The proxy's tests assume this; switching it off would change which addresses appear via FC03 versus FC04.
- **`"type exception": false`** — controls whether pymodbus raises an exception when an address is read via a function code that does not match its declared type. The default `false` is the lax behaviour the tests rely on. Flipping it to `true` is an alternate-profile scenario, not a default-profile change.
The `write` block in the JSON controls which ranges accept FC06/FC16. Writes outside the listed ranges return Modbus exception 02 (illegal data address), which is itself a useful condition the proxy must forward correctly. `MultiplexerE2ETests.E2E_RewriterStillWorks_UnderMultiplexedThreeClients` uses addresses 200, 202, and 204 from the writable scratch range for exactly this reason — read-only addresses would return exception 02 on the write step and break the round-trip assertion.
## Alternate Profiles
The `MODBUS_SIM_PROFILE` environment variable selects an alternate profile stored alongside `dl205.json`. This is the seam for scenario-specific simulators: for example, a profile with `"type exception": true` to verify the proxy does not depend on the default lax pymodbus behaviour, or a profile that seeds a specific partial-overlap test case at a known address. The existing pattern is `DL260/DL205BcdQuirkTests.cs`, which already drives the simulator with profile-driven assertions. When a new scenario needs its own profile, drop the JSON alongside `dl205.json` and select it via the env var rather than swapping the default; the default profile is the contract for the smoke tests and `MultiplexerE2ETests` and should not be silently mutated.
## Running the Simulator Standalone
The launcher is usable outside xUnit for ad-hoc debugging:
```powershell
pwsh tests/sim/run-dl205-sim.ps1 -Port 5020
```
The script provisions the venv on first run and execs `pymodbus.simulator`. Output streams to the terminal; Ctrl-C exits cleanly because the pymodbus process is attached to the script's console group. Useful for poking at the profile with an external Modbus client (e.g. ModScan, mbpoll, NModbus from `dotnet fsi`) without running the test harness.
A typical debugging loop:
1. Launch the simulator standalone on a fixed port.
2. Point a manually built proxy host at it via `appsettings.json` with `Host=127.0.0.1, Port=5020`.
3. Drive the proxy from a Modbus client and inspect log events at `Verbose` to see the rewriter in action.
The standalone launcher uses the same script the fixture invokes, so behaviour is identical between the test harness and ad-hoc runs.
## End-to-End Test Shape
A full E2E test wires the simulator, an in-process proxy host, and an `NModbus` client into the same loopback stack:
```csharp
// 1. Simulator runs at _sim.Host:_sim.Port (fixture-managed).

// 2. Build an in-process proxy host pointing at the simulator as its PLC backend.
int proxyPort = PickFreePort();
var config = new Dictionary<string, string?>
{
    ["Mbproxy:AdminPort"] = "0",
    [$"Mbproxy:Plcs:0:Name"] = "TestPLC",
    [$"Mbproxy:Plcs:0:ListenPort"] = proxyPort.ToString(),
    [$"Mbproxy:Plcs:0:Host"] = _sim.Host,
    [$"Mbproxy:Plcs:0:Port"] = _sim.Port.ToString(),
    ["Mbproxy:BcdTags:Global:0:Address"] = "1072",
    ["Mbproxy:BcdTags:Global:0:Width"] = "16",
};
var host = BuildBcdHost(config);
await host.StartAsync(startCts.Token);

// 3. Drive NModbus against the proxy port.
using var client = new TcpClient();
await client.ConnectAsync("127.0.0.1", proxyPort,
    TestContext.Current.CancellationToken);
var master = new ModbusFactory().CreateMaster(client);

// 4a. Read: simulator returns raw 0x1234, proxy rewrites to binary 1234.
ushort[] regs = master.ReadHoldingRegisters(1, 1072, 1);
regs[0].ShouldBe((ushort)1234);

// 4b. Write (against a writable scratch address, e.g. HR 200):
//     client writes binary 1234, proxy re-encodes to 0x1234 BCD nibbles,
//     simulator stores the nibbles.
master.WriteSingleRegister(1, 200, 1234);
```
The read direction proves the proxy rewrote the response; the write direction proves the proxy rewrote the request. Exercising both directions against the same simulator instance is the smallest viable end-to-end signal that the BCD rewriter is correctly wired.
## Related Documentation
- [Connection Model](../Architecture/ConnectionModel.md) — why the multiplexer's shared backend connection produces the multi-frame condition that triggers pymodbus's framer quirk
- [Troubleshooting](../Operations/Troubleshooting.md) — hang-diagnosis pattern for tests that exceed their `[Fact(Timeout)]`
- [Log Events](../Reference/LogEvents.md) — `mbproxy.multiplex.request.timeout` is the production watchdog against TxId mis-echo
- [DL205/DL260 device quirks](../../DL260/dl205.md) — device-side rationale for every register the simulator profile seeds
- [Phase plan README](../plan/README.md) — Test discipline section that codifies the 5 000 ms default and the `--blame-hang-timeout` rule