Resolve Client.Dotnet-018..021: README + bench-read-bulk hardening

Client.Dotnet-018  README CLI examples for stream-alarms / acknowledge-alarm
                   replaced with parser-correct flags; new theory test
                   parses each documented README example through the CLI.
Client.Dotnet-019  BenchReadBulkAsync routes through new
                   RequireRegisterServerHandle helper that fails loudly when
                   the OK register reply has no typed payload.
Client.Dotnet-020  Bench steady-state catch is now
                   catch (Exception ex) when (ex is not OperationCanceledException)
                   so user-driven cancellation exits promptly.
Client.Dotnet-021  --timeout-ms now flows through ParseTimeoutMs which
                   rejects negatives with a clear error in both read-bulk
                   and bench-read-bulk.

All resolved at 2026-05-24; 67/67 .NET client tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-24 08:49:45 -04:00
parent 4d77279e7e
commit 712cb06442
4 changed files with 423 additions and 11 deletions
+13 -5
View File
@@ -7,7 +7,7 @@
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 4 |
| Open findings | 0 |
## Checklist coverage
@@ -390,7 +390,7 @@ Re-review pass at `42b0037`. Diff against `d692232` consists of four commits:
| Severity | Medium |
| Category | Documentation & comments |
| Location | `clients/dotnet/README.md:137-138` |
| Status | Open |
| Status | Resolved |
**Description:** The README example block for the two new alarm CLI subcommands shipped in commit `11cc671` shows:
@@ -412,6 +412,8 @@ mxgw-dotnet acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" -
A quick sanity check would be to drive each example through the test harness's `MxGatewayClientCli.RunAsync` shape and confirm exit 0 — copy/paste safety on the documented examples is the only realistic safeguard.
**Resolution:** 2026-05-24 — Confirmed against source: the README `dotnet run -- stream-alarms` example used `--session-id <id> --max-messages 1`, neither of which the CLI accepts (the production parser routes through `--filter-prefix` / `--max-events`), and `acknowledge-alarm` used `--session-id <id> --alarm-reference <ref>` against a session-less central alarm monitor that actually consumes `--reference`. Replaced both README example lines with the parser-correct shape: `stream-alarms --filter-prefix Area001 --max-events 1 --json` and `acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json`. Regression test `MxGatewayClientCliTests.RunAsync_ReadmeExamples_ForAlarmCommands_ParseSuccessfully` (xUnit `[Theory]` over `stream-alarms` and `acknowledge-alarm`) locates `clients/dotnet/README.md`, extracts the documented CLI command line for each subcommand, tokenizes it (preserving quoted segments), and drives it through the production `MxGatewayClientCli.RunAsync` against a `FakeCliClient`; the test asserts exit code 0 and that stderr contains neither "Unknown command" nor "Missing required option". Verified red against the original README text (both theory rows failed with exit code 1) and green after the README update, so any future documentation drift between README examples and the actual parser shape is caught at test time.
### Client.Dotnet-019
| Field | Value |
@@ -419,7 +421,7 @@ A quick sanity check would be to drive each example through the test harness's `
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745` |
| Status | Open |
| Status | Resolved |
**Description:** Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`, where a successful protocol+MX-status reply missing its typed `register` oneof case falls through to `ReturnValue.Int32Value` and silently yields `0` for the handle. The new `BenchReadBulkAsync` handler introduced in commit `b3ae200` reinstates exactly that pattern at line 745:
@@ -431,6 +433,8 @@ The bench then drives the rest of the run — `SubscribeBulk`, warmup `ReadBulk`
**Recommendation:** Replace the fallback with an explicit null-check on `registerReply.Register` that throws `MxGatewayException` with the missing-payload context (kind = `Register`, session id, correlation id) — the same shape Client.Dotnet-005 prescribes. If the upstream SDK helpers in `MxGatewaySession` are restored to throw, route the bench through `MxGatewaySession.RegisterAsync` instead so the CLI inherits the SDK's protection.
**Resolution:** 2026-05-24 — Confirmed against source: `BenchReadBulkAsync` at line 745 carried the silent fallback `registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value`, so a successful protocol+MX-status reply missing its typed `register` payload would yield ServerHandle=0 and the bench would drive the rest of SubscribeBulk / warmup / steady-state ReadBulk / UnsubscribeBulk against an invalid handle. Added a private `RequireRegisterServerHandle(MxCommandReply reply, string sessionId)` helper that throws a descriptive `MxGatewayException` naming the missing typed payload, the session id, and the correlation id, then replaced the fallback site with `int serverHandle = RequireRegisterServerHandle(registerReply, sessionId);`. The bench's outer `RunCoreAsync` catch then surfaces the throw as exit code 1 plus the descriptive message on stderr, so the failure mode is loud rather than a wall of zero-result stats. Regression test `MxGatewayClientCliTests.RunAsync_BenchReadBulk_WhenRegisterReplyMissingTypedPayload_FailsLoudly` enqueues an `Ok` MX command reply with no typed `Register` payload, runs `bench-read-bulk` against the fake, and asserts exit code 1, that stderr names the `Register` payload, and that no `bench-read-bulk` stats JSON was emitted on stdout. Verified red against the original fallback (an NRE bubbled up later in the run rather than the descriptive throw) and green after the helper landed.
### Client.Dotnet-020
| Field | Value |
@@ -438,7 +442,7 @@ The bench then drives the rest of the run — `SubscribeBulk`, warmup `ReadBulk`
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780` |
| Status | Open |
| Status | Resolved |
**Description:** `BenchReadBulkAsync`'s steady-state `while (DateTime.UtcNow < steadyDeadline)` loop wraps each `client.InvokeAsync(...)` in a bare `catch`:
@@ -466,6 +470,8 @@ The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE
**Recommendation:** Replace the bare `catch` with `catch (Exception) when (!cancellationToken.IsCancellationRequested)`, or split into `catch (OperationCanceledException) { throw; } catch (Exception) { failedCalls++; ... continue; }`. The first form is the smallest diff and matches the pattern used elsewhere in the CLI. Add a regression test that runs `bench-read-bulk` with a `--duration-seconds 10` budget against a fake that throws on every `InvokeAsync`, cancels the supplied token after 100 ms, and asserts the run exits in well under 10 s. The wider precedent — Client.Dotnet-016's `BenchStreamEventsAsync` cancellation hardening — should already cover the shape of the test fixture.
**Resolution:** 2026-05-24 — Confirmed against source: the steady-state loop wrapped `client.InvokeAsync` in a bare `catch { sw.Stop(); failedCalls++; latencyMillis.Add(...); continue; }` with no type filter, so an `OperationCanceledException` thrown from a cancelled token (Ctrl+C, parent CTS, or the wall-clock budget) was swallowed and the loop spun until `DateTime.UtcNow >= steadyDeadline`. Replaced the bare clause with `catch (Exception ex) when (ex is not OperationCanceledException)`, so OCE now propagates out of the bench and unwinds through the outer `RunCoreAsync` (which intentionally lets OCE escape). Inline comment names the finding and the reason the filter exists. Regression test `MxGatewayClientCliTests.RunAsync_BenchReadBulk_WhenSteadyStateLoopReceivesCancellation_ExitsPromptly` configures the fake CLI client so Register and SubscribeBulk succeed, the first three ReadBulk calls succeed (the loop enters the steady-state body), then every subsequent ReadBulk throws `OperationCanceledException`; the test runs `bench-read-bulk --duration-seconds 30` and asserts the call throws OCE and wall-clock elapsed time stays under 10 s. Verified red against the original bare `catch` (the test ran for the full 30 s without throwing) and green after the filter landed (the bench exited promptly with OCE).
### Client.Dotnet-021
| Field | Value |
@@ -473,7 +479,7 @@ The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715` |
| Status | Open |
| Status | Resolved |
**Description:** Both new bulk-read CLI handlers cast a signed `--timeout-ms` argument to `uint` without bounds checking:
@@ -499,3 +505,5 @@ uint timeoutMs = (uint)timeoutMsRaw;
```
A single shared helper (e.g. `ParseTimeoutMs(CliArguments, string, int)`) on `MxGatewayClientCli` would cover both call sites and remove the duplication.
**Resolution:** 2026-05-24 — Confirmed against source: both `ReadBulkAsync` (line 490) and `BenchReadBulkAsync` (line 715) cast `arguments.GetInt32("timeout-ms", ...)` straight to `uint`, so `--timeout-ms -1` silently wrapped to `0xFFFFFFFF` (~49.7 days). Added a single shared private helper `ParseTimeoutMs(CliArguments arguments, int defaultValue)` on `MxGatewayClientCli` that reads the int32, rejects negatives with a clear `ArgumentException` ("--timeout-ms must be a non-negative integer (use 0 for the gateway default)."), and returns the safe `(uint)`. Both call sites now route through the helper. Regression test `MxGatewayClientCliTests.RunAsync_TimeoutMs_NegativeValue_RejectsWithClearError` (xUnit `[Theory]` over `read-bulk` and `bench-read-bulk`) drives the CLI with `--timeout-ms -1` and asserts the exit code is non-zero, that stderr contains "timeout-ms", and that the "non-negative" guard text is present. Verified red against the original `(uint)arguments.GetInt32(...)` casts (the bench proceeded past the timeout parse and tripped a downstream "Queue empty" error rather than the descriptive guard message) and green after the helper landed.