diff --git a/code-reviews/Client.Dotnet/findings.md b/code-reviews/Client.Dotnet/findings.md index bb40652..bac9b6a 100644 --- a/code-reviews/Client.Dotnet/findings.md +++ b/code-reviews/Client.Dotnet/findings.md @@ -5,9 +5,9 @@ | Module | `clients/dotnet` | | Reviewer | Claude Code | | Review date | 2026-05-24 | -| Commit reviewed | `d692232` | -| Status | Reviewed | -| Open findings | 0 | +| Commit reviewed | `42b0037` | +| Status | Re-reviewed | +| Open findings | 4 | ## Checklist coverage @@ -336,3 +336,166 @@ The secondary `Grpc.Core.RpcException` catch on line 975 is also dead in this co **Recommendation:** Wrap the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }`. The `when` clause ensures only the supplied cancellation token (which covers `--timeout`, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with `--timeout 1s` against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events. **Resolution:** 2026-05-20 — Wrapped the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` so the CLI exits 0 and emits the aggregate `{ "events": [...] }` JSON when the supplied token cancels (the `--timeout`, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's `when` clause ensures non-token-driven cancellation still propagates. Added regression test `MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero` that yields two events, parks on the cancellation token via a new `FakeCliClient.StreamHangAfterEvents` hook, runs the CLI with `--timeout 1s --json --max-events 200`, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts. + +#### 2026-05-24 re-review (commit 42b0037) + +Re-review pass at `42b0037`. Diff against `d692232` consists of four commits: + +- `71d2c39` — ports the `batch` subcommand to all five client CLIs (the .NET + half adds `RunBatchAsync`, the `__MXGW_BATCH_EOR__` sentinel, the + `forceJsonErrors` plumbing on `RunCoreAsync`, and the two + `RunAsync_Batch_*` tests). +- `b3ae200` — ports the bulk read/write SDK methods + matching CLI + subcommands (the SDK side adds `WriteBulkAsync` / `Write2BulkAsync` / + `WriteSecuredBulkAsync` / `WriteSecured2BulkAsync` / `ReadBulkAsync` on + `MxGatewaySession`; the CLI side adds `read-bulk`, `write-bulk`, + `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and the + cross-language stress harness `bench-read-bulk`). +- `11cc671` — ports `stream-alarms` and `acknowledge-alarm` to the CLI and + the SDK transport (`StreamAlarmsAsync` on `MxGatewayClient`, + `IMxGatewayClientTransport`, `GrpcMxGatewayClientTransport`, + `IMxGatewayCliClient`, `MxGatewayCliClientAdapter`, and + `FakeGatewayTransport`; two new `MxGatewayClientCliTests` covering + acknowledge + stream-alarms with `--max-events` and payload-case + distinction). The same commit wraps `StreamEventsAsync`'s `await foreach` + with `catch (OperationCanceledException) when + (cancellationToken.IsCancellationRequested)` (matching the Client.Dotnet-017 + resolution shape) and mirrors the catch on the new `StreamAlarmsAsync` + CLI handler. `RunBatchAsync`'s outer catch is `catch (Exception + exception)` with no `when` guard — OCE thrown out of `RunCoreAsync` is + caught here and written as a JSON error so the batch process keeps + going, while the inner `RunCoreAsync` catch retains its `when + (exception is not OperationCanceledException)` guard so the non-batch + CLI behaviour is unchanged. +- `8738735` — updates each client's README with `stream-alarms` / + `acknowledge-alarm` usage and SDK examples. + +| # | Category | Result | +|---|---|---| +| 1 | Correctness & logic bugs | Issue found (this review): `BenchReadBulkAsync` repeats the silent `?? reply.ReturnValue.Int32Value` register-handle fallback at a brand-new call site (Client.Dotnet-019). Two new `(uint)` casts on user-supplied `--timeout-ms` values silently wrap negatives (Client.Dotnet-021). | +| 2 | mxaccessgw conventions | Issue found (this review): README CLI examples for `stream-alarms` and `acknowledge-alarm` use flag names that don't exist in the CLI, violating the docs-in-the-same-change rule (Client.Dotnet-018). | +| 3 | Concurrency & thread safety | No issues found in this diff — the new `RunBatchAsync` is single-threaded over its stdin loop; `StreamAlarmsAsync`'s gRPC transport reader follows the same shape as the existing `StreamEventsAsync` / `QueryActiveAlarmsAsync`. | +| 4 | Error handling & resilience | Issue found (this review): `BenchReadBulkAsync`'s steady-state loop has a bare `catch { ... continue; }` that swallows `OperationCanceledException`, so user-driven cancellation does not exit the bench until the wall-clock budget elapses (Client.Dotnet-020). The new `StreamAlarmsAsync` CLI handler mirrors the Client.Dotnet-017 OCE-graceful pattern correctly. | +| 5 | Security | No issues found in this diff — the bulk write/write-secured SDK methods inherit `EnsureProtocolSuccess`/`EnsureMxAccessSuccess` shapes; per-entry secured user ids stay on the request, and the redactor still covers CLI error paths. | +| 6 | Performance & resource management | No issues found in this diff (the bench tight-loop CPU burn under cancellation is captured under Client.Dotnet-020 against the error-handling category). | +| 7 | Design-document adherence | No issues found in this diff — the new `StreamAlarmsAsync` surface matches the `MxGateway:Alarms` design ("server-streaming feed of alarm-state-change messages keyed by the same monitor"), and `AcknowledgeAlarmAsync` remains session-less. | +| 8 | Code organization & conventions | No issues found in this diff — new CLI subcommands route through the existing dispatch table; `RunBatchAsync` follows the established private-static pattern. | +| 9 | Testing coverage | No new issues — `RunAsync_StreamAlarms_*`, `RunAsync_AcknowledgeAlarm_*`, and `RunAsync_Batch_*` give the new surface unit coverage. `bench-read-bulk` is the same stress-harness-not-SDK shape called out in the prior re-review and is not flagged here. | +| 10 | Documentation & comments | Issue found (this review): the README examples for the two new alarm CLI subcommands cite wrong flag names and a non-existent `--session-id` (Client.Dotnet-018). The new XML docs on `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` and on the bulk SDK methods are accurate and complete. | + +### Client.Dotnet-018 + +| Field | Value | +|---|---| +| Severity | Medium | +| Category | Documentation & comments | +| Location | `clients/dotnet/README.md:137-138` | +| Status | Open | + +**Description:** The README example block for the two new alarm CLI subcommands shipped in commit `11cc671` shows: + +``` +mxgw-dotnet stream-alarms --session-id --max-messages 1 --json +mxgw-dotnet acknowledge-alarm --session-id --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json +``` + +None of these flags exist. The actual CLI consumes `--filter-prefix`, `--max-events`, `[--jsonl]` for `stream-alarms` and `--reference`, `--comment`, `--operator`, `--json` for `acknowledge-alarm` (see `WriteUsage` lines 891-892 and the `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` handlers in `MxGatewayClientCli.cs`). Neither command takes a `--session-id` — both go through the gateway's session-less central alarm monitor by design (the same property called out in the new XML doc on `MxGatewayClient.StreamAlarmsAsync` lines 226-237). A user who copies the README example will get a `--session-id` "unknown option" failure on both commands and an "Argument 'reference' is required" failure on `acknowledge-alarm` (since `--alarm-reference` is silently treated as `--alarm-reference` rather than `--reference`). + +This is exactly the docs-in-the-same-change rule from `CLAUDE.md` ("Update docs in the same change as the source") — the README change in `8738735` documents the wrong surface for the source change in `11cc671`. The built-in `WriteUsage` text is correct; only the README walkthrough is wrong. + +**Recommendation:** Replace the two example lines with the actual flag names: + +``` +mxgw-dotnet stream-alarms --filter-prefix Area001 --max-events 1 --json +mxgw-dotnet acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json +``` + +A quick sanity check would be to drive each example through the test harness's `MxGatewayClientCli.RunAsync` shape and confirm exit 0 — copy/paste safety on the documented examples is the only realistic safeguard. + +### Client.Dotnet-019 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745` | +| Status | Open | + +**Description:** Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`, where a successful protocol+MX-status reply missing its typed `register` oneof case falls through to `ReturnValue.Int32Value` and silently yields `0` for the handle. The new `BenchReadBulkAsync` handler introduced in commit `b3ae200` reinstates exactly that pattern at line 745: + +```csharp +int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value; +``` + +The bench then drives the rest of the run — `SubscribeBulk`, warmup `ReadBulk`, steady-state `ReadBulk`, `UnsubscribeBulk` — against a zero `ServerHandle`. The worker will reject every command, the bench will report zero successful calls + a wall of failures + a misleading `cachedReadResults = 0` summary, and there is no obvious diagnostic for "the gateway forgot to set the typed register payload"; it looks like a server-side outage. Note that Client.Dotnet-010's recorded resolution claimed a CLI-level `RequireRegisterServerHandle` / `RequireAddItemItemHandle` helper was introduced, but neither helper currently exists in the codebase, and the SDK-level `MxGatewaySession.RegisterAsync` / `AddItemAsync` / `AddItem2Async` at lines 82, 124, 175 still carry the same fallback — so the underlying problem behind Client.Dotnet-005 / 010 is unresolved at HEAD and this is a fresh net-new instance. + +**Recommendation:** Replace the fallback with an explicit null-check on `registerReply.Register` that throws `MxGatewayException` with the missing-payload context (kind = `Register`, session id, correlation id) — the same shape Client.Dotnet-005 prescribes. If the upstream SDK helpers in `MxGatewaySession` are restored to throw, route the bench through `MxGatewaySession.RegisterAsync` instead so the CLI inherits the SDK's protection. + +### Client.Dotnet-020 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Error handling & resilience | +| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780` | +| Status | Open | + +**Description:** `BenchReadBulkAsync`'s steady-state `while (DateTime.UtcNow < steadyDeadline)` loop wraps each `client.InvokeAsync(...)` in a bare `catch`: + +```csharp +try +{ + reply = await client.InvokeAsync( + CreateCommandRequest(sessionId, readBulkMxCommand), + cancellationToken) + .ConfigureAwait(false); + sw.Stop(); +} +catch +{ + sw.Stop(); + failedCalls++; + latencyMillis.Add(sw.Elapsed.TotalMilliseconds); + continue; +} +``` + +The `catch` has no type filter and no `when (!cancellationToken.IsCancellationRequested)` clause, so `OperationCanceledException` flowing from a cancelled `cancellationToken` (Ctrl+C, parent CTS, or the wall-clock budget) is swallowed identically to a real RPC failure. The loop continues to spin until `DateTime.UtcNow` reaches `steadyDeadline`, with each iteration immediately throwing OCE from the next `InvokeAsync` call — a tight CPU-burn until the configured `--duration-seconds` elapses, producing a per-iteration `latencyMillis.Add(sw.Elapsed.TotalMilliseconds)` flood that also skews the final p99/max numbers down. + +The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE escapes through the `finally` block — that path is correct. The steady-state loop should follow the same shape: either rethrow OCE explicitly, or break out of the loop on `cancellationToken.IsCancellationRequested`. Note that Client.Dotnet-015's resolution removed the outer `CancelAfter(timeout)` for `bench-read-bulk` / `bench-stream-events`, so the most common cancellation path today is interactive Ctrl+C — which today produces a 30-second hang plus skewed stats instead of a prompt exit. + +**Recommendation:** Replace the bare `catch` with `catch (Exception) when (!cancellationToken.IsCancellationRequested)`, or split into `catch (OperationCanceledException) { throw; } catch (Exception) { failedCalls++; ... continue; }`. The first form is the smallest diff and matches the pattern used elsewhere in the CLI. Add a regression test that runs `bench-read-bulk` with a `--duration-seconds 10` budget against a fake that throws on every `InvokeAsync`, cancels the supplied token after 100 ms, and asserts the run exits in well under 10 s. The wider precedent — Client.Dotnet-016's `BenchStreamEventsAsync` cancellation hardening — should already cover the shape of the test fixture. + +### Client.Dotnet-021 + +| Field | Value | +|---|---| +| Severity | Low | +| Category | Correctness & logic bugs | +| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715` | +| Status | Open | + +**Description:** Both new bulk-read CLI handlers cast a signed `--timeout-ms` argument to `uint` without bounds checking: + +```csharp +// ReadBulkAsync (line 487) +TimeoutMs = (uint)arguments.GetInt32("timeout-ms", 0), + +// BenchReadBulkAsync (line 715) +uint timeoutMs = (uint)arguments.GetInt32("timeout-ms", 1500); +``` + +`arguments.GetInt32(...)` returns `Int32.Parse(...)`, so a caller supplying `--timeout-ms=-1` (an easy mistake for "no timeout" or copy-paste from another tool that uses `-1` for unbounded) silently wraps to `0xFFFFFFFF` = ~49.7 days as the worker-side `ReadBulk` timeout. The gateway forwards that as the per-tag `timeout_ms` on `ReadBulkCommand`, and a misconfigured invocation parks one worker thread per pending tag for hours before MXAccess gives up — exactly the kind of slow-failure mode the protocol's `uint32 timeout_ms` field was meant to prevent. The `ReadBulkAsync` SDK overload on `MxGatewaySession` (line 663) already guards correctly: `timeout <= TimeSpan.Zero ? 0u : (uint)Math.Min(timeout.TotalMilliseconds, uint.MaxValue)`; the CLI handlers should match. + +**Recommendation:** Wrap the cast in a guard, e.g.: + +```csharp +int timeoutMsRaw = arguments.GetInt32("timeout-ms", 0); +if (timeoutMsRaw < 0) +{ + throw new ArgumentException("--timeout-ms must be a non-negative integer (use 0 for the gateway default)."); +} +uint timeoutMs = (uint)timeoutMsRaw; +``` + +A single shared helper (e.g. `ParseTimeoutMs(CliArguments, string, int)`) on `MxGatewayClientCli` would cover both call sites and remove the duplication.