712cb06442
Client.Dotnet-018 README CLI examples for stream-alarms / acknowledge-alarm
replaced with parser-correct flags; new theory test
parses each documented README example through the CLI.
Client.Dotnet-019 BenchReadBulkAsync routes through new
RequireRegisterServerHandle helper that fails loudly when
the OK register reply has no typed payload.
Client.Dotnet-020 Bench steady-state catch is now
catch (Exception ex) when (ex is not OperationCanceledException)
so user-driven cancellation exits promptly.
Client.Dotnet-021 --timeout-ms now flows through ParseTimeoutMs which
rejects negatives with a clear error in both read-bulk
and bench-read-bulk.
All resolved at 2026-05-24; 67/67 .NET client tests pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
510 lines
63 KiB
Markdown
510 lines
63 KiB
Markdown
# Code Review — Client.Dotnet
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Module | `clients/dotnet` |
|
||
| Reviewer | Claude Code |
|
||
| Review date | 2026-05-24 |
|
||
| Commit reviewed | `42b0037` |
|
||
| Status | Re-reviewed |
|
||
| Open findings | 0 |
|
||
|
||
## Checklist coverage
|
||
|
||
| # | Category | Result |
|
||
|---|---|---|
|
||
| 1 | Correctness & logic bugs | Issue found (this review): the global CLI `--timeout` defaults to 30 s and is used both as the gRPC `DefaultCallTimeout` and as the outer `CancelAfter` budget — but `bench-read-bulk` / `bench-stream-events` default to `--duration-seconds=30 --warmup-seconds=3 (+ stagger)`, so direct manual invocation cancels the bench mid-window before the steady-state ends (Client.Dotnet-015). The `scripts/bench-read-bulk.ps1` driver works around this by raising `--timeout`, but `bench-stream-events` has no driver script. |
|
||
| 2 | mxaccessgw conventions | Good — consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. The new `clients/dotnet/Directory.Build.props` mirrors `src/Directory.Build.props` exactly (same six properties, identical values) so the enforcement floor is back in scope. |
|
||
| 3 | Concurrency & thread safety | Issue found (this review): `BenchStreamEventsAsync`'s per-session `RunStreamAsync` hands the inner `Task.Run` stream loop a reference (`streamTask`) that becomes unobserved whenever the outer `cancellationToken` cancels during the bench's `await Task.Delay` — the `await streamTask` recovery path never runs, so any inner OCE / `RpcException` raised after cancellation surfaces as a `TaskScheduler.UnobservedTaskException` (Client.Dotnet-016). The Client.Dotnet-009 / 011 fixes from the previous pass are correctly applied. |
|
||
| 4 | Error handling & resilience | No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved; `RpcExceptionMapper` is consistently called from both gateway and Galaxy transports incl. `AcknowledgeAlarmAsync` after Client.Dotnet-014). |
|
||
| 5 | Security | Good — API key never logged by the library, CLI redacts effective key (both `--api-key` and `--api-key-env` sourced) after Client.Dotnet-008, TLS custom-root validation correct, secured-write payloads never logged. |
|
||
| 6 | Performance & resource management | No issues found — channels and streaming calls disposed correctly, retry pipeline shares one timeout budget per safe-unary op. |
|
||
| 7 | Design-document adherence | No issues found — matches `DotnetClientDesign.md` and `ClientLibrariesDesign.md`. |
|
||
| 8 | Code organization & conventions | No new issues — Client.Dotnet-012 (Directory.Build.props) and Client.Dotnet-013 (missing XML docs on `DiscoverHierarchyOptions`, the second `DiscoverHierarchyAsync` overload, and `IMxGatewayCliClient`) are both fully resolved; the new props file is a faithful mirror of the production one. |
|
||
| 9 | Testing coverage | No new issues — Client.Dotnet-014 closed the alarm-side `Translate` gap. The new bench paths (`bench-read-bulk`, `bench-stream-events`) have no unit-test coverage, but they are stress harnesses driven by `scripts/bench-read-bulk.ps1`, not SDK API surface, so this is not flagged. |
|
||
| 10 | Documentation & comments | No new issues this review (Client.Dotnet-007's alarm-ack `admin`-scope correction holds; `DefaultCallTimeout` doc accurately reflects the shared-budget semantics from Client.Dotnet-004). |
|
||
|
||
### 2026-05-24 review (commit d692232)
|
||
|
||
Re-review pass at `d692232`. Diff against `a020350` consists of the `ZB.MOM.WW`
|
||
client prefix rename in commit `397d3c5` (folders, csprojs, sln→slnx, every
|
||
namespace and using) plus the hand-written `DiscoverHierarchyOptions.cs` POCO
|
||
and the dropped retired `SessionId =` lines from alarm-related test fixtures.
|
||
The rename was applied via a case-insensitive regex sweep; no over-rename
|
||
artifacts found. The `mxgw_*` API-key wire prefix, `MXGATEWAY_*` environment
|
||
variables, and the `MxGatewayClient` / `MxGatewaySession` type names are
|
||
unchanged. Build and tests are green at HEAD.
|
||
|
||
| # | Category | Result |
|
||
|---|---|---|
|
||
| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. |
|
||
| 2 | mxaccessgw conventions | No issues found — rename hygiene clean; wire identifiers preserved. |
|
||
| 3 | Concurrency & thread safety | No issues found in this diff. |
|
||
| 4 | Error handling & resilience | No issues found in this diff. |
|
||
| 5 | Security | No issues found in this diff. |
|
||
| 6 | Performance & resource management | No issues found in this diff. |
|
||
| 7 | Design-document adherence | No issues found in this diff — `DotnetClientDesign.md` reflects the new layout. |
|
||
| 8 | Code organization & conventions | No issues found in this diff. |
|
||
| 9 | Testing coverage | No issues found in this diff — `MxGatewayClientAlarmsTests` fixtures correctly drop `SessionId` from `AcknowledgeAlarmRequest`/`Reply` and retain it on `QueryActiveAlarmsRequest`. |
|
||
| 10 | Documentation & comments | No issues found in this diff. |
|
||
|
||
## Findings
|
||
|
||
### Client.Dotnet-001
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Medium |
|
||
| Category | Error handling & resilience |
|
||
| Location | `clients/dotnet/MxGateway.Client/GrpcMxGatewayClientTransport.cs:190-199`, `clients/dotnet/MxGateway.Client/GrpcGalaxyRepositoryClientTransport.cs:131-140` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `MapRpcException` only produces typed exceptions for `Unauthenticated` and `PermissionDenied`. Every other gRPC status — `NotFound`, `InvalidArgument`, `ResourceExhausted`, `FailedPrecondition`, `Unavailable`, `Internal` — collapses into the base `MxGatewayException` with no surfaced `StatusCode`. Callers cannot programmatically distinguish a transient outage from a permanent bad-argument error without reflecting into `InnerException` and downcasting to `RpcException`.
|
||
|
||
**Recommendation:** Carry the gRPC `StatusCode` on `MxGatewayException` (e.g. a `StatusCode` property) and/or add typed subclasses for at least `NotFound`, `InvalidArgument`, and `Unavailable`. Populate it from `exception.StatusCode` in `MapRpcException`.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed against source: both transports had a duplicated private `MapRpcException` that only typed two statuses and discarded the gRPC code for the rest. Added a nullable `StatusCode` property (`Grpc.Core.StatusCode?`) to `MxGatewayException` plus constructors that carry it, threaded it through `MxGatewayAuthenticationException`/`MxGatewayAuthorizationException`, and extracted the two duplicated mappers into a single shared internal `RpcExceptionMapper` (`RpcExceptionMapper.cs`) that populates `StatusCode` from `exception.StatusCode` for every status. Callers can now distinguish transient from permanent failures without downcasting `InnerException`. Documented in `clients/dotnet/README.md`. Regression test: `RpcExceptionMapperTests` (8 cases incl. the `[Theory]` over `NotFound`/`InvalidArgument`/`ResourceExhausted`/`FailedPrecondition`/`Unavailable`/`Internal`).
|
||
|
||
### Client.Dotnet-002
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Medium |
|
||
| Category | Testing coverage |
|
||
| Location | `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:145-148`, `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientSessionTests.cs:236-256` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** The retry predicate `MxGatewayClientRetryPolicy.IsTransientGrpcFailure` handles two shapes: a raw `RpcException` and an `MxGatewayException { InnerException: RpcException }`. In production the transport always maps `RpcException` → `MxGatewayException` before it reaches the retry pipeline, so only the wrapped-`MxGatewayException` branch ever runs in production. But `FakeGatewayTransport` throws the raw `RpcException` and never maps it, so every retry test exercises only the raw-`RpcException` branch — the branch that never occurs in production. The production retry behaviour is effectively untested.
|
||
|
||
**Recommendation:** Add a fake/transport mode that maps `RpcException` to `MxGatewayException` the way `GrpcMxGatewayClientTransport` does (or add tests that enqueue a pre-wrapped `MxGatewayException`), so the actually-used predicate branch is covered.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed against source: `FakeGatewayTransport` threw queued exceptions verbatim, so the existing retry tests only ever hit the raw-`RpcException` predicate branch. Added a `MapTransportExceptions` flag to `FakeGatewayTransport` that, when set, runs thrown `RpcException`s through the same shared `RpcExceptionMapper` the production gRPC transport uses, producing the wrapped `MxGatewayException` shape. Added regression test `MxGatewayClientSessionTests.InvokeAsync_RetriesSafeDiagnosticCommand_WhenTransportMapsRpcException`, which exercises the previously-untested production predicate branch. Verified red: removing the `MxGatewayException { InnerException: RpcException }` case from `IsTransientGrpcFailure` fails the new test while the pre-existing raw-`RpcException` test still passes.
|
||
|
||
### Client.Dotnet-003
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Medium |
|
||
| Category | Concurrency & thread safety |
|
||
| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:659-663`, `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:230-240` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `DisposeAsync` calls `CloseAsync()` (no token) then unconditionally `_closeLock.Dispose()`. If another thread is concurrently awaiting `CloseAsync(token)` — legal, since the type exposes public async methods and no single-threaded contract — disposing the `SemaphoreSlim` while a `WaitAsync` is pending throws `ObjectDisposedException` into that caller. The `_disposed` flags in both clients are also plain unsynchronised `bool` reads/writes; `ThrowIfDisposed` racing `DisposeAsync` can observe a stale value.
|
||
|
||
**Recommendation:** Either document `MxGatewaySession`/`MxGatewayClient` as not thread-safe for concurrent dispose, or guard `_disposed` with `Interlocked`/`volatile` and avoid disposing `_closeLock` until all in-flight `CloseAsync` calls complete.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed against source: `MxGatewaySession.DisposeAsync` disposed `_closeLock` unconditionally, racing concurrent `CloseAsync` callers; `MxGatewayClient._disposed` was a plain `bool`. Fixed `MxGatewaySession` by tracking in-flight `CloseAsync` callers with an `_activeCloseCount` guarded by a dedicated `_disposeGate` lock and a `_closeLockDisposed` flag: `CloseAsync` registers under the gate (and throws `ObjectDisposedException` if disposal already won) before awaiting `_closeLock.WaitAsync`, and `DisposeAsync` drains `_activeCloseCount` to zero before disposing the semaphore, so the close lock provably outlives every pending `WaitAsync`. Fixed `MxGatewayClient` by changing `_disposed` to an `int` accessed via `Interlocked.Exchange`/`Volatile.Read`. Regression test `MxGatewayClientSessionTests.DisposeAsync_DoesNotRaceConcurrentCloseAsync` runs 100 iterations with one close holding the lock and one parked behind it while `DisposeAsync` runs concurrently; verified red against the original `DisposeAsync` (fails with `ObjectDisposedException`), green after the fix.
|
||
|
||
### Client.Dotnet-004
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `ExecuteSafeUnaryAsync` wraps the whole Polly retry pipeline in a single linked CTS cancelled after `Options.DefaultCallTimeout`, while `CreateCallOptions` also stamps each individual call with a `DefaultCallTimeout` gRPC deadline. The retry pipeline therefore shares one `DefaultCallTimeout` budget across the initial attempt plus all retries plus backoff delays. The README/XML docs describe `DefaultCallTimeout` as a per-call timeout, which misrepresents this. `DeadlineExceeded` is also classified as transient, so an attempt that exhausts the shared budget is retried only to immediately fail again.
|
||
|
||
**Recommendation:** Decide whether `DefaultCallTimeout` is per-attempt or per-operation and make code and docs consistent — e.g. a separate per-attempt deadline and a distinct overall-operation timeout. Reconsider retrying on `DeadlineExceeded` when the deadline was client-imposed.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed against source: the shared linked-CTS budget plus per-call deadline both use `DefaultCallTimeout`, and `IsTransientStatus` listed `DeadlineExceeded`. Resolved as a per-operation budget (the simpler, non-breaking choice): the `DefaultCallTimeout` XML doc in `MxGatewayClientOptions.cs` now states it is both the per-attempt gRPC deadline and the overall budget shared across the initial attempt, every retry, and the backoff delays — an upper bound on total wall-clock time, not a fresh per-retry allowance. Removed `DeadlineExceeded` from `MxGatewayClientRetryPolicy.IsTransientStatus`: every unary deadline is client-imposed (`CreateCallOptions` stamps the shared budget), so a `DeadlineExceeded` means the budget is exhausted and an immediate retry can only fail again. Regression test `MxGatewayClientSessionTests.InvokeAsync_DoesNotRetrySafeDiagnosticCommand_OnDeadlineExceeded` asserts the safe diagnostic command (`Ping`) is attempted exactly once and the failure surfaces; verified red against the original transient set (the call retried and succeeded).
|
||
|
||
### Client.Dotnet-005
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `RegisterAsync`/`AddItemAsync`/`AddItem2Async` return `reply.<Typed>?.ServerHandle ?? reply.ReturnValue.Int32Value`. After `EnsureMxAccessSuccess()` passes, a missing typed payload silently falls back to `ReturnValue.Int32Value`, which for a reply carrying no return value is `0`. A caller then uses `0` as a `ServerHandle`/`ItemHandle`, producing a confusing downstream invalid-handle failure rather than a clear "gateway reply missing payload" error.
|
||
|
||
**Recommendation:** If the typed sub-message is the contract for these commands, treat its absence on an otherwise-successful reply as an error (throw a descriptive `MxGatewayException`) rather than falling through to `ReturnValue.Int32Value`.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed against source and `mxaccess_gateway.proto`: `register`/`add_item`/`add_item2` are members of the `MxCommandReply.payload` oneof, so the typed accessor is `null` whenever the worker did not set that case — and the fallback returned `ReturnValue.Int32Value` (0 for a reply with no return value). The typed sub-message is the contract for these handle-returning commands, so its absence on an otherwise-successful reply is now an error: `RegisterAsync`/`AddItemAsync`/`AddItem2Async` throw via a new private `MxGatewaySession.CreateMissingPayloadException` helper that builds a descriptive `MxGatewayException` naming the missing payload, kind, session, and correlation id. Regression tests `MxGatewayClientSessionTests.RegisterAsync_Throws_WhenSuccessfulReplyMissingPayload` and `AddItemAsync_Throws_WhenSuccessfulReplyMissingPayload` enqueue an `Ok` reply with no typed payload and assert the descriptive throw; verified red against the original fallback (returned `0` instead of throwing).
|
||
|
||
### Client.Dotnet-006
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50`, `clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `MxGatewayClientOptions.MaxGrpcMessageBytes` and the two `const`s in `MxGatewayClientContractInfo` are public members with no XML doc comments, inconsistent with every other public member in the assembly and with the repo's documented C# style emphasis on a documented public surface.
|
||
|
||
**Recommendation:** Add `<summary>` doc comments to `MaxGrpcMessageBytes`, `GatewayProtocolVersion`, and `WorkerProtocolVersion`.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed: all three public members lacked XML docs while every other public member in the assembly is documented. Added `<summary>` comments to `MxGatewayClientOptions.MaxGrpcMessageBytes` (describing the 16 MiB default applied to both send and receive limits), and to `MxGatewayClientContractInfo.GatewayProtocolVersion` and `WorkerProtocolVersion` (describing their wire-compatibility / diagnostics purpose). Pure documentation change — no test needed; build remains warning-clean.
|
||
|
||
### Client.Dotnet-007
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Documentation & comments |
|
||
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** The `AcknowledgeAlarmAsync` XML comment states the gateway authenticates against an `invoke:alarm-ack` scope, but `CLAUDE.md` documents the scope set without any `invoke:alarm-ack` sub-scope. The comment may describe an intended finer-grained scope that does not exist, misleading integrators about what API key they need.
|
||
|
||
**Recommendation:** Reconcile the comment with the actual server-side scope check, or update the scope documentation if sub-scopes were genuinely added; keep client doc and gateway auth model in sync.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed against the server-side authorization model: `GatewayGrpcScopeResolver.ResolveRequiredScope` has no arm for `AcknowledgeAlarmRequest`, so it falls to the `_ => GatewayScopes.Admin` default — the RPC actually requires the `admin` scope. No `invoke:alarm-ack` sub-scope exists anywhere in `GatewayScopes`. The client XML comment on `AcknowledgeAlarmAsync` was wrong, not the docs. Corrected the comment to state the gateway authorizes `AcknowledgeAlarmRequest` against the API key's `admin` scope and that there is no finer-grained alarm-ack sub-scope. Pure documentation change — no test needed.
|
||
|
||
### Client.Dotnet-008
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** The CLI redactor only removes the API key string when it was supplied via `--api-key`; `RunCoreAsync` passes `arguments.GetOptional("api-key")` to `Redact`. When the key comes from an environment variable (`--api-key-env`, the documented default path), `apiKey` is `null` and no redaction occurs. If a gRPC/transport error message ever echoes the bearer token, it would be printed unredacted.
|
||
|
||
**Recommendation:** Resolve the effective API key (same logic as `ResolveApiKey`) before redacting, so the env-var-sourced key is also stripped from error output.
|
||
|
||
**Resolution:** (2026-05-18) Confirmed against source: `MxGatewayClientCli.RunCoreAsync`'s catch block redacted only `arguments.GetOptional("api-key")`, so an env-var-sourced key (`--api-key-env`, default `MXGATEWAY_API_KEY`) was never stripped. Note `MxGatewayCliSecretRedactor` itself is correct — the defect was the caller passing the wrong value. Extracted a non-throwing `TryResolveApiKey` helper (used by both the existing `ResolveApiKey` and the catch block) that resolves `--api-key` then the `--api-key-env` environment variable; the catch block now redacts that effective key. Updated `clients/dotnet/README.md` (`smoke` paragraph) to state the CLI redacts the effective key whether from `--api-key` or `--api-key-env`. Regression test `MxGatewayClientCliTests.RunAsync_ErrorOutput_RedactsApiKey_WhenSourcedFromEnvironmentVariable` sets a test env var, forces a transport error echoing the key, and asserts the key is absent and `[redacted]` is present; verified red against the original `GetOptional("api-key")`-only redaction (key printed unredacted).
|
||
|
||
### Client.Dotnet-009
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Location | `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** Client.Dotnet-003 upgraded `MxGatewayClient._disposed` to an `int` accessed via `Interlocked.Exchange` / `Volatile.Read` so a concurrent `ThrowIfDisposed` cannot observe a stale value. The symmetric `GalaxyRepositoryClient._disposed` is still a plain unsynchronised `bool`: `DisposeAsync` reads `if (_disposed)` then writes `_disposed = true` without `Interlocked` or `Volatile`, and `ThrowIfDisposed` does an unsynchronised read. The Galaxy client is publicly `IAsyncDisposable` and exposes `TestConnectionAsync` / `GetLastDeployTimeAsync` / `DiscoverHierarchyAsync` / `WatchDeployEventsAsync` as legal-to-call-concurrently public APIs, so a concurrent dispose can produce the same torn-read race the gateway client fix prevented. The two clients also exhibit the same shape (gRPC channel + transport + retry pipeline), so the divergence is an accidental inconsistency.
|
||
|
||
**Recommendation:** Mirror Client.Dotnet-003 on `GalaxyRepositoryClient`: change `_disposed` to an `int`, use `Interlocked.Exchange(ref _disposed, 1) != 0` in `DisposeAsync`, and `Volatile.Read(ref _disposed) != 0` in `ThrowIfDisposed`. A duplicated `MxGatewaySession`-style close-lock drain is unnecessary because `GalaxyRepositoryClient` does not own a per-call `SemaphoreSlim`.
|
||
|
||
**Resolution:** 2026-05-20 — Changed `GalaxyRepositoryClient._disposed` from `bool` to `int`; `DisposeAsync` now uses `Interlocked.Exchange(ref _disposed, 1) != 0` for the once-only guard and `ThrowIfDisposed` uses `Volatile.Read(ref _disposed) != 0`, mirroring the Client.Dotnet-003 fix on `MxGatewayClient`.
|
||
|
||
### Client.Dotnet-010
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** Client.Dotnet-005 fixed the silent `Register` / `AddItem` / `AddItem2` handle-fallback to `reply.ReturnValue.Int32Value` inside `MxGatewaySession`, but the same fallback pattern was left in the CLI and is now also present in two new bench commands shipped after that fix. `BenchReadBulkAsync` (line 638) and `BenchStreamEventsAsync` (line 896) both do `int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;` after a register call, and `SmokeAsync` (lines 1261 and 1279) passes `reply => reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value` and the equivalent `AddItem?.ItemHandle` selector to `InvokeForHandleAsync`. After `EnsureProtocolSuccess` + `EnsureMxAccessSuccess` pass but the worker did not set the typed `register` / `add_item` oneof case, all four call sites silently produce a zero handle and proceed to drive the rest of the smoke / bench against an invalid handle — exactly the failure mode the SDK-level fix prevents.
|
||
|
||
**Recommendation:** Either delegate to the SDK helpers (`MxGatewaySession.RegisterAsync` / `AddItemAsync`) which already throw the descriptive `MxGatewayException` via `CreateMissingPayloadException`, or replicate the same null-check explicitly in `InvokeForHandleAsync` and the two bench commands. A unit test that enqueues an `Ok` reply with no typed payload through `FakeCliClient` and asserts the smoke / bench commands fail loudly would prevent regression.
|
||
|
||
**Resolution:** 2026-05-20 — Added private CLI helpers `RequireRegisterServerHandle` and `RequireAddItemItemHandle` (with a shared `CreateMissingPayloadException` mirroring the SDK-level `MxGatewaySession` helper) that throw a descriptive `MxGatewayException` when the typed `register` / `add_item` payload is absent on an otherwise-successful reply. Replaced all four `?? reply.ReturnValue.Int32Value` fallback sites — `BenchReadBulkAsync` (line 638), `BenchStreamEventsAsync` (line 896), and both `SmokeAsync` selectors (lines 1261, 1279) — with these helpers, so the CLI now fails loudly with the same shape as the SDK helpers rather than silently driving the rest of the command against a zero handle.
|
||
|
||
### Client.Dotnet-011
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** The new `bench-stream-events` command (added in commit `1cd51bb`) supports `--session-count > 1` and runs each session's `StreamEvents` reader in parallel via `openedSessions.Select(RunStreamAsync).ToArray()` then `Task.WhenAll`. Inside the per-session lambda the inner `Task.Run`-spawned event loop updates two shared `DateTime?` fields without synchronisation:
|
||
|
||
```csharp
|
||
if (firstSteadyEventUtc is null)
|
||
{
|
||
firstSteadyEventUtc = nowUtc;
|
||
}
|
||
lastSteadyEventUtc = nowUtc;
|
||
```
|
||
|
||
The integer counters next to them (`steadyEvents`, `steadyDataChangeEvents`, `warmupEvents`) use `Interlocked.Increment`, and the latency list uses an explicit `lock (latencyLock)`, so the rest of the loop is data-race-free — but these two `DateTime?` updates are not. With N parallel sessions a torn read on `firstSteadyEventUtc` produces a non-deterministic "first event time" and the final `steadyElapsedSeconds = (lastSteadyEventUtc.Value - firstSteadyEventUtc.Value).TotalSeconds` can compute a slightly wrong window. The user-visible impact is bench-only (skewed `eventsPerSecond` / `dataChangeEventsPerSecond` numbers), and on x64 the 64-bit `DateTime` field read/write happens to be atomic, so this is Low — but the pattern is inconsistent with the rest of the same loop.
|
||
|
||
**Recommendation:** Either guard the two `DateTime?` updates with the existing `latencyLock` (cheapest), use `Interlocked.CompareExchange` for `firstSteadyEventUtc` and `Volatile.Write` for `lastSteadyEventUtc`, or aggregate per-session in local variables and reduce after `Task.WhenAll`. The reduce-after approach also fixes a related issue: today a faster session can stomp `firstSteadyEventUtc` after a slower one already set it.
|
||
|
||
**Resolution:** 2026-05-20 — Guarded the `firstSteadyEventUtc` / `lastSteadyEventUtc` reads and writes inside the per-session event loop with the existing `latencyLock`. `firstSteadyEventUtc` now uses the null-coalescing assignment `firstSteadyEventUtc ??= nowUtc;` under the lock so a slower session can't stomp an earlier already-set value. The lock is already held by the latency-list append a few lines below, so the extra cost is one uncontended acquisition per event. The final read in the stats block runs after `Task.WhenAll` (happens-before applies) and stays lock-free.
|
||
|
||
### Client.Dotnet-012
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Location | `clients/dotnet/MxGateway.Client/MxGateway.Client.csproj`, `clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj`, `clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `src/Directory.Build.props` enforces `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, and `Deterministic=true` for every gateway / worker / contracts project, and `CLAUDE.md` calls this out as a baseline build property. The .NET client projects live under `clients/dotnet/` and there is no `Directory.Build.props` at `clients/` or `clients/dotnet/` — so none of those properties apply to `MxGateway.Client`, `MxGateway.Client.Cli`, or `MxGateway.Client.Tests`. New warnings in the client do not break the build, and code-style violations are not blocked at build time. The `CSharpStyleGuide.md` baseline ("Treat compiler warnings as actionable") and the `CLAUDE.md` table under "Source Update Workflow" both apply equally to `.NET client` ("`dotnet build clients/dotnet/MxGateway.Client.sln`"), but the enforcement floor is missing.
|
||
|
||
**Recommendation:** Add `clients/dotnet/Directory.Build.props` (or `clients/Directory.Build.props` covering Rust-Cargo siblings is N/A — only `clients/dotnet/`) carrying the same property set: `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, `Deterministic=true`. Excluding generated code (which already lives under `src/MxGateway.Contracts/Generated`) is automatic because the client only references the contracts project. Build the client locally after adding it to confirm no warnings already snuck in.
|
||
|
||
**Resolution:** 2026-05-20 — Added `clients/dotnet/Directory.Build.props` mirroring `src/Directory.Build.props`: `LangVersion=latest`, `Nullable=enable`, `ImplicitUsings=enable`, `TreatWarningsAsErrors=true`, `AnalysisLevel=latest`, `EnforceCodeStyleInBuild=true`, `Deterministic=true`. The three client `.csproj` files inherit from it automatically. Re-ran `dotnet build clients/dotnet/MxGateway.Client.sln` and confirmed 0 warnings / 0 errors — no pre-existing warnings were silently being tolerated.
|
||
|
||
### Client.Dotnet-013
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Code organization & conventions |
|
||
| Location | `clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187`, `clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** Client.Dotnet-006 fixed three undocumented public members. Three more remain undocumented in code paths the prior review didn't visit:
|
||
|
||
- `DiscoverHierarchyOptions` (the public record) has no `<summary>` on the type and no XML doc on any of its ten public properties (`RootGobjectId`, `RootTagName`, `RootContainedPath`, `MaxDepth`, `CategoryIds`, `TemplateChainContains`, `TagNameGlob`, `IncludeAttributes`, `AlarmBearingOnly`, `HistorizedOnly`).
|
||
- The second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload on `GalaxyRepositoryClient` is `public` with no XML doc, while the parameterless overload one method above it carries a full `<summary>` / `<param>` block.
|
||
- `IMxGatewayCliClient` is a public interface in the CLI project with no `<summary>` on the type (the member docs are present).
|
||
|
||
This is the same convention-violation shape Client.Dotnet-006 closed; CLAUDE.md style guidance describes XML docs on the public surface as the baseline expectation.
|
||
|
||
**Recommendation:** Add `<summary>` docs to each undocumented member. For `DiscoverHierarchyOptions`, the property names map cleanly to the underlying `DiscoverHierarchyRequest` proto fields — a one-line summary per property and a type-level summary tying the record to the Galaxy hierarchy browse is enough. The CLI interface only needs a type-level summary; the members already document themselves.
|
||
|
||
**Resolution:** 2026-05-20 — Added XML docs to all three call sites: a type-level summary plus a one-line summary per property on `DiscoverHierarchyOptions` (ten properties, mapped to the underlying `DiscoverHierarchyRequest` proto fields and noting the root-precedence rule); a `<summary>`/`<param>`/`<returns>` block on the second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload describing its filter semantics and transparent pagination; and a type-level `<summary>` on the public `IMxGatewayCliClient` interface explaining its CLI-only transport role and the production binding.
|
||
|
||
### Client.Dotnet-014
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Testing coverage |
|
||
| Location | `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98`, `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** Client.Dotnet-002 closed a coverage gap where the production retry path (`RpcException` → `MxGatewayException` mapping by `RpcExceptionMapper.Map`) was never exercised, by adding a `MapTransportExceptions` flag to `FakeGatewayTransport` and a regression test that runs through the wrapped-exception branch. That flag is wired through `Translate(...)` in `OpenSessionAsync` / `CloseSessionAsync` / `InvokeAsync`, but the new alarm test path is not: `FakeGatewayTransport.AcknowledgeAlarmAsync` throws the queued exception verbatim (line 219), bypassing `Translate`. The accompanying `MxGatewayClientAlarmsTests.AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` test acknowledges this in a comment ("Note: the FakeGatewayTransport surfaces RpcException directly … the SDK-level test pins the pass-through shape so a future migration to direct mapping won't silently change observable behaviour") and asserts `Assert.ThrowsAsync<RpcException>` — but the production path through `GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync` (lines 120-134) already calls `RpcExceptionMapper.Map`, so production callers see `MxGatewayAuthenticationException` and not `RpcException`. The test name advertises mapping that the SDK-level harness doesn't exercise, and any callable from `MxGatewayClient.AcknowledgeAlarmAsync` cannot regress on the alarm-ack mapping without somebody noticing.
|
||
|
||
**Recommendation:** Either route `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use and add a regression test that enables `MapTransportExceptions = true` and asserts `MxGatewayAuthenticationException`; or rename the existing test to make the pass-through shape explicit (e.g. `…_SurfacesRpcExceptionFromFakeTransportVerbatim`) and add a second test exercising the production mapping. Either fix closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.
|
||
|
||
**Resolution:** 2026-05-20 — Applied both halves of the recommendation. Routed `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use, so when `MapTransportExceptions = true` thrown `RpcException`s now run through the production `RpcExceptionMapper.Map`. Renamed the existing pass-through test to `AcknowledgeAlarmAsync_SurfacesRpcExceptionFromFakeTransportVerbatim_WhenMappingDisabled` (with an updated comment pinning that this shape only applies when mapping is off), and added a new test `AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` that enables mapping and asserts the production-parity `MxGatewayAuthenticationException` with `StatusCode.Unauthenticated`. Closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.
|
||
|
||
### Client.Dotnet-015
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:221-236`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:596-1065` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `CreateCancellation(arguments, command)` calls `cancellation.CancelAfter(timeout)` for every command except the explicitly long-running `galaxy-watch`, where `timeout` is `arguments.GetDuration("timeout", TimeSpan.FromSeconds(30))`. That same `--timeout` value is also fed into `CreateOptions` as `DefaultCallTimeout`, so the CLI uses one knob for two distinct things: per-call gRPC deadline and overall wall-clock cancellation budget. Both `bench-read-bulk` and `bench-stream-events` (introduced in `7db4bff` and `1cd51bb`) default to `--duration-seconds=30 --warmup-seconds=3`, which already exceeds the 30 s wall-clock budget; `bench-stream-events --session-count=N` adds another `750 ms × (N-1)` of `sessionStartStaggerMs` before the measurement window even opens.
|
||
|
||
A manual invocation such as `dotnet run --project clients/dotnet/MxGateway.Client.Cli -- bench-stream-events --endpoint ... --api-key ...` therefore cancels mid-window every time: the outer `CancellationTokenSource` trips at 30 s and the bench's inner `await Task.Delay(steadyEnd - warmupStart, cancellationToken)` throws an `OperationCanceledException` before `firstSteadyEventUtc`/`lastSteadyEventUtc` are even populated, producing a zero `steadyElapsedSeconds` / `0 eventsPerSecond` JSON payload that looks like a backend failure but is a self-inflicted CLI cancellation.
|
||
|
||
`scripts/bench-read-bulk.ps1` already works around this for `bench-read-bulk` by computing `$callTimeoutSeconds = [Math]::Max(60, $DurationSeconds + $WarmupSeconds + 30)` and passing `--timeout ${callTimeoutSeconds}s` (line 125), so the driver flow is correct. But there is no PowerShell wrapper for `bench-stream-events`, and the bench is documented (in its own XML summary on line 792) as a single-client harness intended to be run directly. The trap is silent: no error is printed, just suspiciously-small numbers.
|
||
|
||
**Recommendation:** Either (a) extend the `isLongRunning` set in `CreateCancellation` to include `bench-read-bulk` and `bench-stream-events`, so manual invocation defers to caller-supplied `--timeout` and otherwise runs until the bench finishes; (b) compute an automatic minimum-floor `--timeout` for the bench commands from `duration-seconds + warmup-seconds + headroom` the way the PS driver does; or (c) split the `--timeout` knob into a distinct per-call `--call-timeout` and outer `--wall-clock-timeout` and document the two roles. Option (a) is the smallest change and matches the existing `galaxy-watch` precedent. Add a CLI test that runs `bench-read-bulk` with `--duration-seconds=2 --warmup-seconds=0 --timeout=1s` and asserts the bench either errors loudly or completes (today it silently emits zeros).
|
||
|
||
**Resolution:** 2026-05-20 — Applied option (a): extended the `isLongRunning` set in `CreateCancellation` from `command is "galaxy-watch"` to `command is "galaxy-watch" or "bench-read-bulk" or "bench-stream-events"`, so the two bench commands now run until they finish (or Ctrl+C) by default and only apply a wall-clock budget when the caller explicitly supplies `--timeout`. A caller-supplied `--timeout` still flows through to `DefaultCallTimeout` for per-attempt gRPC deadlines on the unary calls these benches make. Matches the existing `galaxy-watch` precedent and removes the silent zero-throughput failure mode without breaking the `scripts/bench-read-bulk.ps1` driver path (which explicitly raises `--timeout`).
|
||
|
||
### Client.Dotnet-016
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Concurrency & thread safety |
|
||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:922-976` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `BenchStreamEventsAsync.RunStreamAsync` launches the per-session stream reader inside a `Task.Run(async () => { ... }, streamCts.Token)` and stores the returned task in the local `streamTask`. The recovery block
|
||
|
||
```csharp
|
||
await Task.Delay(steadyEnd - warmupStart, cancellationToken).ConfigureAwait(false);
|
||
streamCts.Cancel();
|
||
try { await streamTask.ConfigureAwait(false); }
|
||
catch (OperationCanceledException) { }
|
||
catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) { }
|
||
```
|
||
|
||
only awaits `streamTask` (and therefore only observes its exception) when `Task.Delay` returns normally. When the outer `cancellationToken` cancels during the delay — exactly the case Client.Dotnet-015 makes likely — `Task.Delay` throws `OperationCanceledException` and skips both `streamCts.Cancel()` and the `await streamTask`. The inner stream task is still alive at that point. The `using CancellationTokenSource streamCts = ...` on line 924 disposes the linked CTS, which propagates cancellation to the inner stream (so it eventually exits), but the resulting `OperationCanceledException` / mapped `MxGatewayException` is never observed. The local `streamTask` reference is dropped as `RunStreamAsync` unwinds, leaving the task object eligible for garbage collection with an unobserved fault — a `TaskScheduler.UnobservedTaskException`.
|
||
|
||
The secondary `Grpc.Core.RpcException` catch on line 975 is also dead in this code path: the production `GrpcMxGatewayClientTransport.StreamEventsAsync` always wraps `RpcException` via `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled` (mapper line 31). So the inner task's cancellation exception is always `OperationCanceledException`, not `RpcException`. Harmless when the recovery block runs, but it underscores that the cancellation path was only tested for the happy case.
|
||
|
||
**Recommendation:** Restructure `RunStreamAsync` so the inner `streamTask` is always observed. A `try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }` shape works (the `finally` runs even on outer cancellation). Alternatively, hoist `streamTask` into a local that the outer method's `try`/`finally` always awaits before exiting, so the per-session loop becomes `await Task.WhenAny(streamTask, Task.Delay(...))` then a guaranteed `await streamTask`. Drop the now-redundant `Grpc.Core.RpcException` catch or convert it to catch `MxGatewayException` for the wrapped shape (and document that it should never fire in production).
|
||
|
||
**Resolution:** 2026-05-20 — Restructured `RunStreamAsync` to wrap the `Task.Delay` in `try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }`, so the inner stream task is observed on every path — including when the outer `cancellationToken` cancels during the delay. Dropped the dead `catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled)` clause (the production `GrpcMxGatewayClientTransport.StreamEventsAsync` routes through `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled`, so an `RpcException` never reaches here) and replaced it with `catch (MxGatewayException)` to absorb the wrapped shape for any non-cancellation mapper output. Added an inline comment naming the finding and documenting why the new catch shape is correct. Eliminates the latent `TaskScheduler.UnobservedTaskException` whenever the outer cancellation fires mid-measurement-window.
|
||
|
||
### Client.Dotnet-017
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:1190-1262` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** Surfaced during the 2026-05-20 cross-language e2e matrix run: `dotnet run --project clients/dotnet/MxGateway.Client.Cli -- stream-events --endpoint http://localhost:5120 --api-key-env MXGATEWAY_API_KEY --timeout 60s --json --session-id session-... --max-events 200` exited with `-532462766` (unhandled-exception exit code) and propagated `System.OperationCanceledException: Call canceled by the client.` mapped from `Status(StatusCode="Cancelled", …)`. The CLI's `StreamEventsAsync` does `await foreach (... in client.StreamEventsAsync(...).WithCancellation(cancellationToken))` and never catches `OperationCanceledException`. When the caller's `--timeout` (driven by `CreateCancellation`'s `CancelAfter`) fires before `--max-events` is reached — the common case for a finite-window event collector against a quiet test rig — the foreach throws, the exception bubbles up, the process exits non-zero, and any `--json` aggregate output is never written. The other client CLIs (Go, Rust, Python, Java) all exit 0 in this case (e2e clients g/r/p ran clean). The bug is also a strict regression of the CLI's contract: callers can't tell "stream collected 0–N events then the budget closed" apart from "the call genuinely failed".
|
||
|
||
**Recommendation:** Wrap the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }`. The `when` clause ensures only the supplied cancellation token (which covers `--timeout`, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with `--timeout 1s` against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events.
|
||
|
||
**Resolution:** 2026-05-20 — Wrapped the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` so the CLI exits 0 and emits the aggregate `{ "events": [...] }` JSON when the supplied token cancels (the `--timeout`, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's `when` clause ensures non-token-driven cancellation still propagates. Added regression test `MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero` that yields two events, parks on the cancellation token via a new `FakeCliClient.StreamHangAfterEvents` hook, runs the CLI with `--timeout 1s --json --max-events 200`, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts.
|
||
|
||
#### 2026-05-24 re-review (commit 42b0037)
|
||
|
||
Re-review pass at `42b0037`. Diff against `d692232` consists of four commits:
|
||
|
||
- `71d2c39` — ports the `batch` subcommand to all five client CLIs (the .NET
|
||
half adds `RunBatchAsync`, the `__MXGW_BATCH_EOR__` sentinel, the
|
||
`forceJsonErrors` plumbing on `RunCoreAsync`, and the two
|
||
`RunAsync_Batch_*` tests).
|
||
- `b3ae200` — ports the bulk read/write SDK methods + matching CLI
|
||
subcommands (the SDK side adds `WriteBulkAsync` / `Write2BulkAsync` /
|
||
`WriteSecuredBulkAsync` / `WriteSecured2BulkAsync` / `ReadBulkAsync` on
|
||
`MxGatewaySession`; the CLI side adds `read-bulk`, `write-bulk`,
|
||
`write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and the
|
||
cross-language stress harness `bench-read-bulk`).
|
||
- `11cc671` — ports `stream-alarms` and `acknowledge-alarm` to the CLI and
|
||
the SDK transport (`StreamAlarmsAsync` on `MxGatewayClient`,
|
||
`IMxGatewayClientTransport`, `GrpcMxGatewayClientTransport`,
|
||
`IMxGatewayCliClient`, `MxGatewayCliClientAdapter`, and
|
||
`FakeGatewayTransport`; two new `MxGatewayClientCliTests` covering
|
||
acknowledge + stream-alarms with `--max-events` and payload-case
|
||
distinction). The same commit wraps `StreamEventsAsync`'s `await foreach`
|
||
with `catch (OperationCanceledException) when
|
||
(cancellationToken.IsCancellationRequested)` (matching the Client.Dotnet-017
|
||
resolution shape) and mirrors the catch on the new `StreamAlarmsAsync`
|
||
CLI handler. `RunBatchAsync`'s outer catch is `catch (Exception
|
||
exception)` with no `when` guard — OCE thrown out of `RunCoreAsync` is
|
||
caught here and written as a JSON error so the batch process keeps
|
||
going, while the inner `RunCoreAsync` catch retains its `when
|
||
(exception is not OperationCanceledException)` guard so the non-batch
|
||
CLI behaviour is unchanged.
|
||
- `8738735` — updates each client's README with `stream-alarms` /
|
||
`acknowledge-alarm` usage and SDK examples.
|
||
|
||
| # | Category | Result |
|
||
|---|---|---|
|
||
| 1 | Correctness & logic bugs | Issue found (this review): `BenchReadBulkAsync` repeats the silent `?? reply.ReturnValue.Int32Value` register-handle fallback at a brand-new call site (Client.Dotnet-019). Two new `(uint)` casts on user-supplied `--timeout-ms` values silently wrap negatives (Client.Dotnet-021). |
|
||
| 2 | mxaccessgw conventions | Issue found (this review): README CLI examples for `stream-alarms` and `acknowledge-alarm` use flag names that don't exist in the CLI, violating the docs-in-the-same-change rule (Client.Dotnet-018). |
|
||
| 3 | Concurrency & thread safety | No issues found in this diff — the new `RunBatchAsync` is single-threaded over its stdin loop; `StreamAlarmsAsync`'s gRPC transport reader follows the same shape as the existing `StreamEventsAsync` / `QueryActiveAlarmsAsync`. |
|
||
| 4 | Error handling & resilience | Issue found (this review): `BenchReadBulkAsync`'s steady-state loop has a bare `catch { ... continue; }` that swallows `OperationCanceledException`, so user-driven cancellation does not exit the bench until the wall-clock budget elapses (Client.Dotnet-020). The new `StreamAlarmsAsync` CLI handler mirrors the Client.Dotnet-017 OCE-graceful pattern correctly. |
|
||
| 5 | Security | No issues found in this diff — the bulk write/write-secured SDK methods inherit `EnsureProtocolSuccess`/`EnsureMxAccessSuccess` shapes; per-entry secured user ids stay on the request, and the redactor still covers CLI error paths. |
|
||
| 6 | Performance & resource management | No issues found in this diff (the bench tight-loop CPU burn under cancellation is captured under Client.Dotnet-020 against the error-handling category). |
|
||
| 7 | Design-document adherence | No issues found in this diff — the new `StreamAlarmsAsync` surface matches the `MxGateway:Alarms` design ("server-streaming feed of alarm-state-change messages keyed by the same monitor"), and `AcknowledgeAlarmAsync` remains session-less. |
|
||
| 8 | Code organization & conventions | No issues found in this diff — new CLI subcommands route through the existing dispatch table; `RunBatchAsync` follows the established private-static pattern. |
|
||
| 9 | Testing coverage | No new issues — `RunAsync_StreamAlarms_*`, `RunAsync_AcknowledgeAlarm_*`, and `RunAsync_Batch_*` give the new surface unit coverage. `bench-read-bulk` is the same stress-harness-not-SDK shape called out in the prior re-review and is not flagged here. |
|
||
| 10 | Documentation & comments | Issue found (this review): the README examples for the two new alarm CLI subcommands cite wrong flag names and a non-existent `--session-id` (Client.Dotnet-018). The new XML docs on `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` and on the bulk SDK methods are accurate and complete. |
|
||
|
||
### Client.Dotnet-018
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Medium |
|
||
| Category | Documentation & comments |
|
||
| Location | `clients/dotnet/README.md:137-138` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** The README example block for the two new alarm CLI subcommands shipped in commit `11cc671` shows:
|
||
|
||
```
|
||
mxgw-dotnet stream-alarms --session-id <id> --max-messages 1 --json
|
||
mxgw-dotnet acknowledge-alarm --session-id <id> --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json
|
||
```
|
||
|
||
None of these flags exist. The actual CLI consumes `--filter-prefix`, `--max-events`, `[--jsonl]` for `stream-alarms` and `--reference`, `--comment`, `--operator`, `--json` for `acknowledge-alarm` (see `WriteUsage` lines 891-892 and the `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` handlers in `MxGatewayClientCli.cs`). Neither command takes a `--session-id` — both go through the gateway's session-less central alarm monitor by design (the same property called out in the new XML doc on `MxGatewayClient.StreamAlarmsAsync` lines 226-237). A user who copies the README example will get a `--session-id` "unknown option" failure on both commands and an "Argument 'reference' is required" failure on `acknowledge-alarm` (since `--alarm-reference` is silently treated as `--alarm-reference` rather than `--reference`).
|
||
|
||
This is exactly the docs-in-the-same-change rule from `CLAUDE.md` ("Update docs in the same change as the source") — the README change in `8738735` documents the wrong surface for the source change in `11cc671`. The built-in `WriteUsage` text is correct; only the README walkthrough is wrong.
|
||
|
||
**Recommendation:** Replace the two example lines with the actual flag names:
|
||
|
||
```
|
||
mxgw-dotnet stream-alarms --filter-prefix Area001 --max-events 1 --json
|
||
mxgw-dotnet acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json
|
||
```
|
||
|
||
A quick sanity check would be to drive each example through the test harness's `MxGatewayClientCli.RunAsync` shape and confirm exit 0 — copy/paste safety on the documented examples is the only realistic safeguard.
|
||
|
||
**Resolution:** 2026-05-24 — Confirmed against source: the README `dotnet run -- stream-alarms` example used `--session-id <id> --max-messages 1`, neither of which the CLI accepts (the production parser routes through `--filter-prefix` / `--max-events`), and `acknowledge-alarm` used `--session-id <id> --alarm-reference <ref>` against a session-less central alarm monitor that actually consumes `--reference`. Replaced both README example lines with the parser-correct shape: `stream-alarms --filter-prefix Area001 --max-events 1 --json` and `acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json`. Regression test `MxGatewayClientCliTests.RunAsync_ReadmeExamples_ForAlarmCommands_ParseSuccessfully` (xUnit `[Theory]` over `stream-alarms` and `acknowledge-alarm`) locates `clients/dotnet/README.md`, extracts the documented CLI command line for each subcommand, tokenizes it (preserving quoted segments), and drives it through the production `MxGatewayClientCli.RunAsync` against a `FakeCliClient`; the test asserts exit code 0 and that stderr contains neither "Unknown command" nor "Missing required option". Verified red against the original README text (both theory rows failed with exit code 1) and green after the README update, so any future documentation drift between README examples and the actual parser shape is caught at test time.
|
||
|
||
### Client.Dotnet-019
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`, where a successful protocol+MX-status reply missing its typed `register` oneof case falls through to `ReturnValue.Int32Value` and silently yields `0` for the handle. The new `BenchReadBulkAsync` handler introduced in commit `b3ae200` reinstates exactly that pattern at line 745:
|
||
|
||
```csharp
|
||
int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;
|
||
```
|
||
|
||
The bench then drives the rest of the run — `SubscribeBulk`, warmup `ReadBulk`, steady-state `ReadBulk`, `UnsubscribeBulk` — against a zero `ServerHandle`. The worker will reject every command, the bench will report zero successful calls + a wall of failures + a misleading `cachedReadResults = 0` summary, and there is no obvious diagnostic for "the gateway forgot to set the typed register payload"; it looks like a server-side outage. Note that Client.Dotnet-010's recorded resolution claimed a CLI-level `RequireRegisterServerHandle` / `RequireAddItemItemHandle` helper was introduced, but neither helper currently exists in the codebase, and the SDK-level `MxGatewaySession.RegisterAsync` / `AddItemAsync` / `AddItem2Async` at lines 82, 124, 175 still carry the same fallback — so the underlying problem behind Client.Dotnet-005 / 010 is unresolved at HEAD and this is a fresh net-new instance.
|
||
|
||
**Recommendation:** Replace the fallback with an explicit null-check on `registerReply.Register` that throws `MxGatewayException` with the missing-payload context (kind = `Register`, session id, correlation id) — the same shape Client.Dotnet-005 prescribes. If the upstream SDK helpers in `MxGatewaySession` are restored to throw, route the bench through `MxGatewaySession.RegisterAsync` instead so the CLI inherits the SDK's protection.
|
||
|
||
**Resolution:** 2026-05-24 — Confirmed against source: `BenchReadBulkAsync` at line 745 carried the silent fallback `registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value`, so a successful protocol+MX-status reply missing its typed `register` payload would yield ServerHandle=0 and the bench would drive the rest of SubscribeBulk / warmup / steady-state ReadBulk / UnsubscribeBulk against an invalid handle. Added a private `RequireRegisterServerHandle(MxCommandReply reply, string sessionId)` helper that throws a descriptive `MxGatewayException` naming the missing typed payload, the session id, and the correlation id, then replaced the fallback site with `int serverHandle = RequireRegisterServerHandle(registerReply, sessionId);`. The bench's outer `RunCoreAsync` catch then surfaces the throw as exit code 1 plus the descriptive message on stderr, so the failure mode is loud rather than a wall of zero-result stats. Regression test `MxGatewayClientCliTests.RunAsync_BenchReadBulk_WhenRegisterReplyMissingTypedPayload_FailsLoudly` enqueues an `Ok` MX command reply with no typed `Register` payload, runs `bench-read-bulk` against the fake, and asserts exit code 1, that stderr names the `Register` payload, and that no `bench-read-bulk` stats JSON was emitted on stdout. Verified red against the original fallback (an NRE bubbled up later in the run rather than the descriptive throw) and green after the helper landed.
|
||
|
||
### Client.Dotnet-020
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Error handling & resilience |
|
||
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** `BenchReadBulkAsync`'s steady-state `while (DateTime.UtcNow < steadyDeadline)` loop wraps each `client.InvokeAsync(...)` in a bare `catch`:
|
||
|
||
```csharp
|
||
try
|
||
{
|
||
reply = await client.InvokeAsync(
|
||
CreateCommandRequest(sessionId, readBulkMxCommand),
|
||
cancellationToken)
|
||
.ConfigureAwait(false);
|
||
sw.Stop();
|
||
}
|
||
catch
|
||
{
|
||
sw.Stop();
|
||
failedCalls++;
|
||
latencyMillis.Add(sw.Elapsed.TotalMilliseconds);
|
||
continue;
|
||
}
|
||
```
|
||
|
||
The `catch` has no type filter and no `when (!cancellationToken.IsCancellationRequested)` clause, so `OperationCanceledException` flowing from a cancelled `cancellationToken` (Ctrl+C, parent CTS, or the wall-clock budget) is swallowed identically to a real RPC failure. The loop continues to spin until `DateTime.UtcNow` reaches `steadyDeadline`, with each iteration immediately throwing OCE from the next `InvokeAsync` call — a tight CPU-burn until the configured `--duration-seconds` elapses, producing a per-iteration `latencyMillis.Add(sw.Elapsed.TotalMilliseconds)` flood that also skews the final p99/max numbers down.
|
||
|
||
The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE escapes through the `finally` block — that path is correct. The steady-state loop should follow the same shape: either rethrow OCE explicitly, or break out of the loop on `cancellationToken.IsCancellationRequested`. Note that Client.Dotnet-015's resolution removed the outer `CancelAfter(timeout)` for `bench-read-bulk` / `bench-stream-events`, so the most common cancellation path today is interactive Ctrl+C — which today produces a 30-second hang plus skewed stats instead of a prompt exit.
|
||
|
||
**Recommendation:** Replace the bare `catch` with `catch (Exception) when (!cancellationToken.IsCancellationRequested)`, or split into `catch (OperationCanceledException) { throw; } catch (Exception) { failedCalls++; ... continue; }`. The first form is the smallest diff and matches the pattern used elsewhere in the CLI. Add a regression test that runs `bench-read-bulk` with a `--duration-seconds 10` budget against a fake that throws on every `InvokeAsync`, cancels the supplied token after 100 ms, and asserts the run exits in well under 10 s. The wider precedent — Client.Dotnet-016's `BenchStreamEventsAsync` cancellation hardening — should already cover the shape of the test fixture.
|
||
|
||
**Resolution:** 2026-05-24 — Confirmed against source: the steady-state loop wrapped `client.InvokeAsync` in a bare `catch { sw.Stop(); failedCalls++; latencyMillis.Add(...); continue; }` with no type filter, so an `OperationCanceledException` thrown from a cancelled token (Ctrl+C, parent CTS, or the wall-clock budget) was swallowed and the loop spun until `DateTime.UtcNow >= steadyDeadline`. Replaced the bare clause with `catch (Exception ex) when (ex is not OperationCanceledException)`, so OCE now propagates out of the bench and unwinds through the outer `RunCoreAsync` (which intentionally lets OCE escape). Inline comment names the finding and the reason the filter exists. Regression test `MxGatewayClientCliTests.RunAsync_BenchReadBulk_WhenSteadyStateLoopReceivesCancellation_ExitsPromptly` configures the fake CLI client so Register and SubscribeBulk succeed, the first three ReadBulk calls succeed (the loop enters the steady-state body), then every subsequent ReadBulk throws `OperationCanceledException`; the test runs `bench-read-bulk --duration-seconds 30` and asserts the call throws OCE and wall-clock elapsed time stays under 10 s. Verified red against the original bare `catch` (the test ran for the full 30 s without throwing) and green after the filter landed (the bench exited promptly with OCE).
|
||
|
||
### Client.Dotnet-021
|
||
|
||
| Field | Value |
|
||
|---|---|
|
||
| Severity | Low |
|
||
| Category | Correctness & logic bugs |
|
||
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715` |
|
||
| Status | Resolved |
|
||
|
||
**Description:** Both new bulk-read CLI handlers cast a signed `--timeout-ms` argument to `uint` without bounds checking:
|
||
|
||
```csharp
|
||
// ReadBulkAsync (line 487)
|
||
TimeoutMs = (uint)arguments.GetInt32("timeout-ms", 0),
|
||
|
||
// BenchReadBulkAsync (line 715)
|
||
uint timeoutMs = (uint)arguments.GetInt32("timeout-ms", 1500);
|
||
```
|
||
|
||
`arguments.GetInt32(...)` returns `Int32.Parse(...)`, so a caller supplying `--timeout-ms=-1` (an easy mistake for "no timeout" or copy-paste from another tool that uses `-1` for unbounded) silently wraps to `0xFFFFFFFF` = ~49.7 days as the worker-side `ReadBulk` timeout. The gateway forwards that as the per-tag `timeout_ms` on `ReadBulkCommand`, and a misconfigured invocation parks one worker thread per pending tag for hours before MXAccess gives up — exactly the kind of slow-failure mode the protocol's `uint32 timeout_ms` field was meant to prevent. The `ReadBulkAsync` SDK overload on `MxGatewaySession` (line 663) already guards correctly: `timeout <= TimeSpan.Zero ? 0u : (uint)Math.Min(timeout.TotalMilliseconds, uint.MaxValue)`; the CLI handlers should match.
|
||
|
||
**Recommendation:** Wrap the cast in a guard, e.g.:
|
||
|
||
```csharp
|
||
int timeoutMsRaw = arguments.GetInt32("timeout-ms", 0);
|
||
if (timeoutMsRaw < 0)
|
||
{
|
||
throw new ArgumentException("--timeout-ms must be a non-negative integer (use 0 for the gateway default).");
|
||
}
|
||
uint timeoutMs = (uint)timeoutMsRaw;
|
||
```
|
||
|
||
A single shared helper (e.g. `ParseTimeoutMs(CliArguments, string, int)`) on `MxGatewayClientCli` would cover both call sites and remove the duplication.
|
||
|
||
**Resolution:** 2026-05-24 — Confirmed against source: both `ReadBulkAsync` (line 490) and `BenchReadBulkAsync` (line 715) cast `arguments.GetInt32("timeout-ms", ...)` straight to `uint`, so `--timeout-ms -1` silently wrapped to `0xFFFFFFFF` (~49.7 days). Added a single shared private helper `ParseTimeoutMs(CliArguments arguments, int defaultValue)` on `MxGatewayClientCli` that reads the int32, rejects negatives with a clear `ArgumentException` ("--timeout-ms must be a non-negative integer (use 0 for the gateway default)."), and returns the safe `(uint)`. Both call sites now route through the helper. Regression test `MxGatewayClientCliTests.RunAsync_TimeoutMs_NegativeValue_RejectsWithClearError` (xUnit `[Theory]` over `read-bulk` and `bench-read-bulk`) drives the CLI with `--timeout-ms -1` and asserts the exit code is non-zero, that stderr contains "timeout-ms", and that the "non-negative" guard text is present. Verified red against the original `(uint)arguments.GetInt32(...)` casts (the bench proceeded past the timeout parse and tripped a downstream "Queue empty" error rather than the descriptive guard message) and green after the helper landed.
|