code-reviews: re-review Client.Dotnet at 42b0037

Append 4 new findings (Client.Dotnet-018..021): README flags for the
new stream-alarms/acknowledge-alarm subcommands cite options that do
not exist on the CLI; BenchReadBulkAsync reinstates the silent
register-handle fallback and swallows OperationCanceledException;
both new --timeout-ms consumers cast int32 to uint without bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
Joseph Doherty
2026-05-24 08:28:40 -04:00
parent 7d66967122
commit b9ef09d26e
+166 -3
View File
@@ -5,9 +5,9 @@
| Module | `clients/dotnet` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 4 |
## Checklist coverage
@@ -336,3 +336,166 @@ The secondary `Grpc.Core.RpcException` catch on line 975 is also dead in this co
**Recommendation:** Wrap the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }`. The `when` clause ensures only the supplied cancellation token (which covers `--timeout`, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with `--timeout 1s` against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events.
**Resolution:** 2026-05-20 — Wrapped the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` so the CLI exits 0 and emits the aggregate `{ "events": [...] }` JSON when the supplied token cancels (the `--timeout`, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's `when` clause ensures non-token-driven cancellation still propagates. Added regression test `MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero` that yields two events, parks on the cancellation token via a new `FakeCliClient.StreamHangAfterEvents` hook, runs the CLI with `--timeout 1s --json --max-events 200`, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts.
#### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037`. Diff against `d692232` consists of four commits:
- `71d2c39` — ports the `batch` subcommand to all five client CLIs (the .NET
half adds `RunBatchAsync`, the `__MXGW_BATCH_EOR__` sentinel, the
`forceJsonErrors` plumbing on `RunCoreAsync`, and the two
`RunAsync_Batch_*` tests).
- `b3ae200` — ports the bulk read/write SDK methods + matching CLI
subcommands (the SDK side adds `WriteBulkAsync` / `Write2BulkAsync` /
`WriteSecuredBulkAsync` / `WriteSecured2BulkAsync` / `ReadBulkAsync` on
`MxGatewaySession`; the CLI side adds `read-bulk`, `write-bulk`,
`write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and the
cross-language stress harness `bench-read-bulk`).
- `11cc671` — ports `stream-alarms` and `acknowledge-alarm` to the CLI and
the SDK transport (`StreamAlarmsAsync` on `MxGatewayClient`,
`IMxGatewayClientTransport`, `GrpcMxGatewayClientTransport`,
`IMxGatewayCliClient`, `MxGatewayCliClientAdapter`, and
`FakeGatewayTransport`; two new `MxGatewayClientCliTests` covering
acknowledge + stream-alarms with `--max-events` and payload-case
distinction). The same commit wraps `StreamEventsAsync`'s `await foreach`
with `catch (OperationCanceledException) when
(cancellationToken.IsCancellationRequested)` (matching the Client.Dotnet-017
resolution shape) and mirrors the catch on the new `StreamAlarmsAsync`
CLI handler. `RunBatchAsync`'s outer catch is `catch (Exception
exception)` with no `when` guard — OCE thrown out of `RunCoreAsync` is
caught here and written as a JSON error so the batch process keeps
going, while the inner `RunCoreAsync` catch retains its `when
(exception is not OperationCanceledException)` guard so the non-batch
CLI behaviour is unchanged.
- `8738735` — updates each client's README with `stream-alarms` /
`acknowledge-alarm` usage and SDK examples.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found (this review): `BenchReadBulkAsync` repeats the silent `?? reply.ReturnValue.Int32Value` register-handle fallback at a brand-new call site (Client.Dotnet-019). Two new `(uint)` casts on user-supplied `--timeout-ms` values silently wrap negatives (Client.Dotnet-021). |
| 2 | mxaccessgw conventions | Issue found (this review): README CLI examples for `stream-alarms` and `acknowledge-alarm` use flag names that don't exist in the CLI, violating the docs-in-the-same-change rule (Client.Dotnet-018). |
| 3 | Concurrency & thread safety | No issues found in this diff — the new `RunBatchAsync` is single-threaded over its stdin loop; `StreamAlarmsAsync`'s gRPC transport reader follows the same shape as the existing `StreamEventsAsync` / `QueryActiveAlarmsAsync`. |
| 4 | Error handling & resilience | Issue found (this review): `BenchReadBulkAsync`'s steady-state loop has a bare `catch { ... continue; }` that swallows `OperationCanceledException`, so user-driven cancellation does not exit the bench until the wall-clock budget elapses (Client.Dotnet-020). The new `StreamAlarmsAsync` CLI handler mirrors the Client.Dotnet-017 OCE-graceful pattern correctly. |
| 5 | Security | No issues found in this diff — the bulk write/write-secured SDK methods inherit `EnsureProtocolSuccess`/`EnsureMxAccessSuccess` shapes; per-entry secured user ids stay on the request, and the redactor still covers CLI error paths. |
| 6 | Performance & resource management | No issues found in this diff (the bench tight-loop CPU burn under cancellation is captured under Client.Dotnet-020 against the error-handling category). |
| 7 | Design-document adherence | No issues found in this diff — the new `StreamAlarmsAsync` surface matches the `MxGateway:Alarms` design ("server-streaming feed of alarm-state-change messages keyed by the same monitor"), and `AcknowledgeAlarmAsync` remains session-less. |
| 8 | Code organization & conventions | No issues found in this diff — new CLI subcommands route through the existing dispatch table; `RunBatchAsync` follows the established private-static pattern. |
| 9 | Testing coverage | No new issues — `RunAsync_StreamAlarms_*`, `RunAsync_AcknowledgeAlarm_*`, and `RunAsync_Batch_*` give the new surface unit coverage. `bench-read-bulk` is the same stress-harness-not-SDK shape called out in the prior re-review and is not flagged here. |
| 10 | Documentation & comments | Issue found (this review): the README examples for the two new alarm CLI subcommands cite wrong flag names and a non-existent `--session-id` (Client.Dotnet-018). The new XML docs on `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` and on the bulk SDK methods are accurate and complete. |
### Client.Dotnet-018
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Documentation & comments |
| Location | `clients/dotnet/README.md:137-138` |
| Status | Open |
**Description:** The README example block for the two new alarm CLI subcommands shipped in commit `11cc671` shows:
```
mxgw-dotnet stream-alarms --session-id <id> --max-messages 1 --json
mxgw-dotnet acknowledge-alarm --session-id <id> --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json
```
None of these flags exist. The actual CLI consumes `--filter-prefix`, `--max-events`, `[--jsonl]` for `stream-alarms` and `--reference`, `--comment`, `--operator`, `--json` for `acknowledge-alarm` (see `WriteUsage` lines 891-892 and the `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` handlers in `MxGatewayClientCli.cs`). Neither command takes a `--session-id` — both go through the gateway's session-less central alarm monitor by design (the same property called out in the new XML doc on `MxGatewayClient.StreamAlarmsAsync` lines 226-237). A user who copies the README example will get a `--session-id` "unknown option" failure on both commands and an "Argument 'reference' is required" failure on `acknowledge-alarm` (since `--alarm-reference` is silently treated as `--alarm-reference` rather than `--reference`).
This is exactly the docs-in-the-same-change rule from `CLAUDE.md` ("Update docs in the same change as the source") — the README change in `8738735` documents the wrong surface for the source change in `11cc671`. The built-in `WriteUsage` text is correct; only the README walkthrough is wrong.
**Recommendation:** Replace the two example lines with the actual flag names:
```
mxgw-dotnet stream-alarms --filter-prefix Area001 --max-events 1 --json
mxgw-dotnet acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json
```
A quick sanity check would be to drive each example through the test harness's `MxGatewayClientCli.RunAsync` shape and confirm exit 0 — copy/paste safety on the documented examples is the only realistic safeguard.
### Client.Dotnet-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745` |
| Status | Open |
**Description:** Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`, where a successful protocol+MX-status reply missing its typed `register` oneof case falls through to `ReturnValue.Int32Value` and silently yields `0` for the handle. The new `BenchReadBulkAsync` handler introduced in commit `b3ae200` reinstates exactly that pattern at line 745:
```csharp
int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;
```
The bench then drives the rest of the run — `SubscribeBulk`, warmup `ReadBulk`, steady-state `ReadBulk`, `UnsubscribeBulk` — against a zero `ServerHandle`. The worker will reject every command, the bench will report zero successful calls + a wall of failures + a misleading `cachedReadResults = 0` summary, and there is no obvious diagnostic for "the gateway forgot to set the typed register payload"; it looks like a server-side outage. Note that Client.Dotnet-010's recorded resolution claimed a CLI-level `RequireRegisterServerHandle` / `RequireAddItemItemHandle` helper was introduced, but neither helper currently exists in the codebase, and the SDK-level `MxGatewaySession.RegisterAsync` / `AddItemAsync` / `AddItem2Async` at lines 82, 124, 175 still carry the same fallback — so the underlying problem behind Client.Dotnet-005 / 010 is unresolved at HEAD and this is a fresh net-new instance.
**Recommendation:** Replace the fallback with an explicit null-check on `registerReply.Register` that throws `MxGatewayException` with the missing-payload context (kind = `Register`, session id, correlation id) — the same shape Client.Dotnet-005 prescribes. If the upstream SDK helpers in `MxGatewaySession` are restored to throw, route the bench through `MxGatewaySession.RegisterAsync` instead so the CLI inherits the SDK's protection.
### Client.Dotnet-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780` |
| Status | Open |
**Description:** `BenchReadBulkAsync`'s steady-state `while (DateTime.UtcNow < steadyDeadline)` loop wraps each `client.InvokeAsync(...)` in a bare `catch`:
```csharp
try
{
reply = await client.InvokeAsync(
CreateCommandRequest(sessionId, readBulkMxCommand),
cancellationToken)
.ConfigureAwait(false);
sw.Stop();
}
catch
{
sw.Stop();
failedCalls++;
latencyMillis.Add(sw.Elapsed.TotalMilliseconds);
continue;
}
```
The `catch` has no type filter and no `when (!cancellationToken.IsCancellationRequested)` clause, so `OperationCanceledException` flowing from a cancelled `cancellationToken` (Ctrl+C, parent CTS, or the wall-clock budget) is swallowed identically to a real RPC failure. The loop continues to spin until `DateTime.UtcNow` reaches `steadyDeadline`, with each iteration immediately throwing OCE from the next `InvokeAsync` call — a tight CPU-burn until the configured `--duration-seconds` elapses, producing a per-iteration `latencyMillis.Add(sw.Elapsed.TotalMilliseconds)` flood that also skews the final p99/max numbers down.
The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE escapes through the `finally` block — that path is correct. The steady-state loop should follow the same shape: either rethrow OCE explicitly, or break out of the loop on `cancellationToken.IsCancellationRequested`. Note that Client.Dotnet-015's resolution removed the outer `CancelAfter(timeout)` for `bench-read-bulk` / `bench-stream-events`, so the most common cancellation path today is interactive Ctrl+C — which today produces a 30-second hang plus skewed stats instead of a prompt exit.
**Recommendation:** Replace the bare `catch` with `catch (Exception) when (!cancellationToken.IsCancellationRequested)`, or split into `catch (OperationCanceledException) { throw; } catch (Exception) { failedCalls++; ... continue; }`. The first form is the smallest diff and matches the pattern used elsewhere in the CLI. Add a regression test that runs `bench-read-bulk` with a `--duration-seconds 10` budget against a fake that throws on every `InvokeAsync`, cancels the supplied token after 100 ms, and asserts the run exits in well under 10 s. The wider precedent — Client.Dotnet-016's `BenchStreamEventsAsync` cancellation hardening — should already cover the shape of the test fixture.
### Client.Dotnet-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715` |
| Status | Open |
**Description:** Both new bulk-read CLI handlers cast a signed `--timeout-ms` argument to `uint` without bounds checking:
```csharp
// ReadBulkAsync (line 487)
TimeoutMs = (uint)arguments.GetInt32("timeout-ms", 0),
// BenchReadBulkAsync (line 715)
uint timeoutMs = (uint)arguments.GetInt32("timeout-ms", 1500);
```
`arguments.GetInt32(...)` returns `Int32.Parse(...)`, so a caller supplying `--timeout-ms=-1` (an easy mistake for "no timeout" or copy-paste from another tool that uses `-1` for unbounded) silently wraps to `0xFFFFFFFF` = ~49.7 days as the worker-side `ReadBulk` timeout. The gateway forwards that as the per-tag `timeout_ms` on `ReadBulkCommand`, and a misconfigured invocation parks one worker thread per pending tag for hours before MXAccess gives up — exactly the kind of slow-failure mode the protocol's `uint32 timeout_ms` field was meant to prevent. The `ReadBulkAsync` SDK overload on `MxGatewaySession` (line 663) already guards correctly: `timeout <= TimeSpan.Zero ? 0u : (uint)Math.Min(timeout.TotalMilliseconds, uint.MaxValue)`; the CLI handlers should match.
**Recommendation:** Wrap the cast in a guard, e.g.:
```csharp
int timeoutMsRaw = arguments.GetInt32("timeout-ms", 0);
if (timeoutMsRaw < 0)
{
throw new ArgumentException("--timeout-ms must be a non-negative integer (use 0 for the gateway default).");
}
uint timeoutMs = (uint)timeoutMsRaw;
```
A single shared helper (e.g. `ParseTimeoutMs(CliArguments, string, int)`) on `MxGatewayClientCli` would cover both call sites and remove the duplication.