# Code Review — Client.Dotnet

| Field | Value |
|---|---|
| Module | `clients/dotnet` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 4 |

## Checklist coverage

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found (this review): the global CLI `--timeout` defaults to 30 s and is used both as the gRPC `DefaultCallTimeout` and as the outer `CancelAfter` budget, but `bench-read-bulk` / `bench-stream-events` default to `--duration-seconds=30 --warmup-seconds=3` (plus stagger). A direct manual invocation therefore needs at least 33 s and is cancelled mid-window, before the steady-state phase ends (Client.Dotnet-015). The `scripts/bench-read-bulk.ps1` driver works around this by raising `--timeout`, but `bench-stream-events` has no driver script. |
| 2 | mxaccessgw conventions | Good: consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. The new `clients/dotnet/Directory.Build.props` mirrors `src/Directory.Build.props` exactly (same property set, identical values), so the enforcement floor is back in scope. |
| 3 | Concurrency & thread safety | Issue found (this review): `BenchStreamEventsAsync`'s per-session `RunStreamAsync` stores the inner `Task.Run` stream loop in a reference (`streamTask`) that becomes unobserved whenever the outer `cancellationToken` cancels during the bench's `await Task.Delay`. The `await streamTask` recovery path never runs, so any inner `OperationCanceledException` / `RpcException` raised after cancellation surfaces as a `TaskScheduler.UnobservedTaskException` (Client.Dotnet-016). The Client.Dotnet-009 / 011 fixes from the previous pass are correctly applied. |
| 4 | Error handling & resilience | No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved; `RpcExceptionMapper` is consistently called from both gateway and Galaxy transports, including `AcknowledgeAlarmAsync` after Client.Dotnet-014). |
| 5 | Security | Good: API key never logged by the library, CLI redacts the effective key (both `--api-key` and `--api-key-env` sourced) after Client.Dotnet-008, TLS custom-root validation correct, secured-write payloads never logged. |
| 6 | Performance & resource management | No issues found: channels and streaming calls disposed correctly, retry pipeline shares one timeout budget per safe-unary op. |
| 7 | Design-document adherence | No issues found: matches `DotnetClientDesign.md` and `ClientLibrariesDesign.md`. |
| 8 | Code organization & conventions | No new issues: Client.Dotnet-012 (Directory.Build.props) and Client.Dotnet-013 (missing XML docs on `DiscoverHierarchyOptions`, the second `DiscoverHierarchyAsync` overload, and `IMxGatewayCliClient`) are both fully resolved; the new props file is a faithful mirror of the production one. |
| 9 | Testing coverage | No new issues: Client.Dotnet-014 closed the alarm-side `Translate` gap. The new bench paths (`bench-read-bulk`, `bench-stream-events`) have no unit-test coverage, but they are stress harnesses driven by `scripts/bench-read-bulk.ps1`, not SDK API surface, so this is not flagged. |
| 10 | Documentation & comments | No new issues this review (Client.Dotnet-007's alarm-ack `admin`-scope correction holds; the `DefaultCallTimeout` doc accurately reflects the shared-budget semantics from Client.Dotnet-004). |

### 2026-05-24 review (commit d692232)

Re-review pass at `d692232`. Diff against `a020350` consists of the `ZB.MOM.WW`
client prefix rename in commit `397d3c5` (folders, csprojs, sln→slnx, every
namespace and using) plus the hand-written `DiscoverHierarchyOptions.cs` POCO
and the removal of the retired `SessionId =` lines from alarm-related test fixtures.
The rename was applied via a case-insensitive regex sweep; no over-rename
artifacts found. The `mxgw_*` API-key wire prefix, `MXGATEWAY_*` environment
variables, and the `MxGatewayClient` / `MxGatewaySession` type names are
unchanged. Build and tests are green at HEAD.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. |
| 2 | mxaccessgw conventions | No issues found: rename hygiene clean; wire identifiers preserved. |
| 3 | Concurrency & thread safety | No issues found in this diff. |
| 4 | Error handling & resilience | No issues found in this diff. |
| 5 | Security | No issues found in this diff. |
| 6 | Performance & resource management | No issues found in this diff. |
| 7 | Design-document adherence | No issues found in this diff: `DotnetClientDesign.md` reflects the new layout. |
| 8 | Code organization & conventions | No issues found in this diff. |
| 9 | Testing coverage | No issues found in this diff: `MxGatewayClientAlarmsTests` fixtures correctly drop `SessionId` from `AcknowledgeAlarmRequest`/`Reply` and retain it on `QueryActiveAlarmsRequest`. |
| 10 | Documentation & comments | No issues found in this diff. |

## Findings

### Client.Dotnet-001

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `clients/dotnet/MxGateway.Client/GrpcMxGatewayClientTransport.cs:190-199`, `clients/dotnet/MxGateway.Client/GrpcGalaxyRepositoryClientTransport.cs:131-140` |
| Status | Resolved |

**Description:** `MapRpcException` only produces typed exceptions for `Unauthenticated` and `PermissionDenied`. Every other gRPC status (`NotFound`, `InvalidArgument`, `ResourceExhausted`, `FailedPrecondition`, `Unavailable`, `Internal`) collapses into the base `MxGatewayException` with no surfaced `StatusCode`. Callers cannot programmatically distinguish a transient outage from a permanent bad-argument error without reflecting into `InnerException` and downcasting to `RpcException`.

**Recommendation:** Carry the gRPC `StatusCode` on `MxGatewayException` (e.g. a `StatusCode` property) and/or add typed subclasses for at least `NotFound`, `InvalidArgument`, and `Unavailable`. Populate it from `exception.StatusCode` in `MapRpcException`.

**Resolution:** (2026-05-18) Confirmed against source: both transports had a duplicated private `MapRpcException` that only typed two statuses and discarded the gRPC code for the rest. Added a nullable `StatusCode` property (`Grpc.Core.StatusCode?`) to `MxGatewayException` plus constructors that carry it, threaded it through `MxGatewayAuthenticationException`/`MxGatewayAuthorizationException`, and extracted the two duplicated mappers into a single shared internal `RpcExceptionMapper` (`RpcExceptionMapper.cs`) that populates `StatusCode` from `exception.StatusCode` for every status. Callers can now distinguish transient from permanent failures without downcasting `InnerException`. Documented in `clients/dotnet/README.md`. Regression test: `RpcExceptionMapperTests` (8 cases, including the `[Theory]` over `NotFound`/`InvalidArgument`/`ResourceExhausted`/`FailedPrecondition`/`Unavailable`/`Internal`).

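
The shape of the Client.Dotnet-001 fix can be sketched as follows. This is an illustration only: `SketchGatewayException` and `RpcMappingSketch` are hypothetical stand-ins rather than the SDK's `MxGatewayException` and `RpcExceptionMapper`, and the choice of which statuses count as transient is an assumption. Only `RpcException`, `Status`, and `StatusCode` are the real `Grpc.Core` types.

```csharp
using System;
using Grpc.Core;

// Hypothetical stand-in for MxGatewayException; the real constructors may differ.
public class SketchGatewayException : Exception
{
    public SketchGatewayException(string message, StatusCode? statusCode, Exception inner)
        : base(message, inner) => StatusCode = statusCode;

    // Nullable so that non-gRPC failures can leave it unset.
    public StatusCode? StatusCode { get; }
}

public static class RpcMappingSketch
{
    // Carry the gRPC status forward for every status, not only the typed ones.
    public static SketchGatewayException Map(RpcException exception) =>
        new(exception.Status.Detail, exception.StatusCode, exception);

    // Caller-side triage without downcasting InnerException.
    // Assumption: which statuses are "transient" is chosen here for illustration.
    public static bool LooksTransient(SketchGatewayException exception) =>
        exception.StatusCode is StatusCode.Unavailable or StatusCode.ResourceExhausted;
}
```

A caller can then branch on `LooksTransient(mapped)` (or on `mapped.StatusCode` directly) instead of inspecting `InnerException`.
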
### Client.Dotnet-002

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:145-148`, `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientSessionTests.cs:236-256` |
| Status | Resolved |

**Description:** The retry predicate `MxGatewayClientRetryPolicy.IsTransientGrpcFailure` handles two shapes: a raw `RpcException` and an `MxGatewayException { InnerException: RpcException }`. In production the transport always maps `RpcException` → `MxGatewayException` before it reaches the retry pipeline, so only the wrapped-`MxGatewayException` branch ever runs there. But `FakeGatewayTransport` throws the raw `RpcException` and never maps it, so every retry test exercises only the raw-`RpcException` branch, the one that never occurs in production. The production retry behaviour is effectively untested.

**Recommendation:** Add a fake/transport mode that maps `RpcException` to `MxGatewayException` the way `GrpcMxGatewayClientTransport` does (or add tests that enqueue a pre-wrapped `MxGatewayException`), so the predicate branch that production actually uses is covered.

**Resolution:** (2026-05-18) Confirmed against source: `FakeGatewayTransport` threw queued exceptions verbatim, so the existing retry tests only ever hit the raw-`RpcException` predicate branch. Added a `MapTransportExceptions` flag to `FakeGatewayTransport` that, when set, runs thrown `RpcException`s through the same shared `RpcExceptionMapper` the production gRPC transport uses, producing the wrapped `MxGatewayException` shape. Added regression test `MxGatewayClientSessionTests.InvokeAsync_RetriesSafeDiagnosticCommand_WhenTransportMapsRpcException`, which exercises the previously-untested production predicate branch. Verified red: removing the `MxGatewayException { InnerException: RpcException }` case from `IsTransientGrpcFailure` fails the new test while the pre-existing raw-`RpcException` test still passes.

### Client.Dotnet-003

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:659-663`, `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:230-240` |
| Status | Resolved |

**Description:** `DisposeAsync` calls `CloseAsync()` (no token) and then unconditionally calls `_closeLock.Dispose()`. If another thread is concurrently awaiting `CloseAsync(token)` (legal, since the type exposes public async methods and declares no single-threaded contract), disposing the `SemaphoreSlim` while a `WaitAsync` is pending throws `ObjectDisposedException` into that caller. The `_disposed` flags in both clients are also plain unsynchronised `bool` reads/writes; `ThrowIfDisposed` racing `DisposeAsync` can observe a stale value.

**Recommendation:** Either document `MxGatewaySession`/`MxGatewayClient` as not thread-safe for concurrent dispose, or guard `_disposed` with `Interlocked`/`volatile` and avoid disposing `_closeLock` until all in-flight `CloseAsync` calls complete.

**Resolution:** (2026-05-18) Confirmed against source: `MxGatewaySession.DisposeAsync` disposed `_closeLock` unconditionally, racing concurrent `CloseAsync` callers; `MxGatewayClient._disposed` was a plain `bool`. Fixed `MxGatewaySession` by tracking in-flight `CloseAsync` callers with an `_activeCloseCount` guarded by a dedicated `_disposeGate` lock and a `_closeLockDisposed` flag: `CloseAsync` registers under the gate (and throws `ObjectDisposedException` if disposal already won) before awaiting `_closeLock.WaitAsync`, and `DisposeAsync` drains `_activeCloseCount` to zero before disposing the semaphore, so the close lock provably outlives every pending `WaitAsync`. Fixed `MxGatewayClient` by changing `_disposed` to an `int` accessed via `Interlocked.Exchange`/`Volatile.Read`. Regression test `MxGatewayClientSessionTests.DisposeAsync_DoesNotRaceConcurrentCloseAsync` runs 100 iterations with one close holding the lock and one parked behind it while `DisposeAsync` runs concurrently; verified red against the original `DisposeAsync` (fails with `ObjectDisposedException`), green after the fix.

### Client.Dotnet-004

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403` |
| Status | Resolved |

**Description:** `ExecuteSafeUnaryAsync` wraps the whole Polly retry pipeline in a single linked CTS cancelled after `Options.DefaultCallTimeout`, while `CreateCallOptions` also stamps each individual call with a `DefaultCallTimeout` gRPC deadline. The retry pipeline therefore shares one `DefaultCallTimeout` budget across the initial attempt plus all retries plus backoff delays. The README/XML docs describe `DefaultCallTimeout` as a per-call timeout, which misrepresents this. `DeadlineExceeded` is also classified as transient, so an attempt that exhausts the shared budget is retried only to immediately fail again.

**Recommendation:** Decide whether `DefaultCallTimeout` is per-attempt or per-operation and make code and docs consistent, e.g. a separate per-attempt deadline and a distinct overall-operation timeout. Reconsider retrying on `DeadlineExceeded` when the deadline was client-imposed.

**Resolution:** (2026-05-18) Confirmed against source: the shared linked-CTS budget plus per-call deadline both use `DefaultCallTimeout`, and `IsTransientStatus` listed `DeadlineExceeded`. Resolved as a per-operation budget (the simpler, non-breaking choice): the `DefaultCallTimeout` XML doc in `MxGatewayClientOptions.cs` now states it is both the per-attempt gRPC deadline and the overall budget shared across the initial attempt, every retry, and the backoff delays. It is an upper bound on total wall-clock time, not a fresh per-retry allowance. Removed `DeadlineExceeded` from `MxGatewayClientRetryPolicy.IsTransientStatus`: every unary deadline is client-imposed (`CreateCallOptions` stamps the shared budget), so a `DeadlineExceeded` means the budget is exhausted and an immediate retry can only fail again. Regression test `MxGatewayClientSessionTests.InvokeAsync_DoesNotRetrySafeDiagnosticCommand_OnDeadlineExceeded` asserts the safe diagnostic command (`Ping`) is attempted exactly once and the failure surfaces; verified red against the original transient set (the call retried and succeeded).

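
The shared-budget semantics described in Client.Dotnet-004 can be illustrated with a minimal sketch that uses only the BCL. It is not the SDK's implementation (which uses a Polly pipeline); `SharedBudgetSketch`, `TransientSketchException`, and the parameter names are hypothetical. The point it shows is that the clock starts once, so retries and backoff delays all draw from the same budget.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical marker for a retryable failure; the SDK classifies gRPC statuses instead.
public sealed class TransientSketchException : Exception { }

public static class SharedBudgetSketch
{
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> attempt,
        TimeSpan budget,
        TimeSpan backoff,
        int maxAttempts,
        CancellationToken callerToken = default)
    {
        using var linked = CancellationTokenSource.CreateLinkedTokenSource(callerToken);
        linked.CancelAfter(budget); // started once; never reset between retries

        for (var attemptNumber = 1; ; attemptNumber++)
        {
            try
            {
                return await attempt(linked.Token).ConfigureAwait(false);
            }
            catch (TransientSketchException) when (attemptNumber < maxAttempts)
            {
                // The backoff delay consumes the same budget as the attempts do.
                await Task.Delay(backoff, linked.Token).ConfigureAwait(false);
            }
        }
    }
}
```

With a 100 ms budget and a 60 ms backoff, an operation that keeps failing is cancelled during a backoff long before `maxAttempts` is reached, which is why retrying after the budget is exhausted cannot help.
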
### Client.Dotnet-005

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175` |
| Status | Resolved |

**Description:** `RegisterAsync`/`AddItemAsync`/`AddItem2Async` return `reply.<Typed>?.ServerHandle ?? reply.ReturnValue.Int32Value`. After `EnsureMxAccessSuccess()` passes, a missing typed payload silently falls back to `ReturnValue.Int32Value`, which for a reply carrying no return value is `0`. A caller then uses `0` as a `ServerHandle`/`ItemHandle`, producing a confusing downstream invalid-handle failure rather than a clear "gateway reply missing payload" error.

**Recommendation:** If the typed sub-message is the contract for these commands, treat its absence on an otherwise-successful reply as an error (throw a descriptive `MxGatewayException`) rather than falling through to `ReturnValue.Int32Value`.

**Resolution:** (2026-05-18) Confirmed against source and `mxaccess_gateway.proto`: `register`/`add_item`/`add_item2` are members of the `MxCommandReply.payload` oneof, so the typed accessor is `null` whenever the worker did not set that case, and the fallback then returned `ReturnValue.Int32Value` (0 for a reply with no return value). The typed sub-message is the contract for these handle-returning commands, so its absence on an otherwise-successful reply is now an error: `RegisterAsync`/`AddItemAsync`/`AddItem2Async` throw via a new private `MxGatewaySession.CreateMissingPayloadException` helper that builds a descriptive `MxGatewayException` naming the missing payload, kind, session, and correlation id. Regression tests `MxGatewayClientSessionTests.RegisterAsync_Throws_WhenSuccessfulReplyMissingPayload` and `AddItemAsync_Throws_WhenSuccessfulReplyMissingPayload` enqueue an `Ok` reply with no typed payload and assert the descriptive throw; verified red against the original fallback (returned `0` instead of throwing).

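
A minimal sketch of the "fail loudly on a missing payload" rule from Client.Dotnet-005. The reply types below are hypothetical simplifications (the real ones are generated from `mxaccess_gateway.proto`), and `InvalidOperationException` stands in for the SDK's `MxGatewayException`.

```csharp
using System;

// Hypothetical, simplified reply shapes for illustration only.
public sealed record RegisterPayloadSketch(int ServerHandle);
public sealed record CommandReplySketch(RegisterPayloadSketch? Register, int ReturnValueInt32);

public static class HandleExtractionSketch
{
    // Old shape (the bug): a missing payload silently yields ReturnValueInt32, i.e. 0.
    public static int FallbackServerHandle(CommandReplySketch reply) =>
        reply.Register?.ServerHandle ?? reply.ReturnValueInt32;

    // New shape: a successful reply without the typed payload is treated as an error.
    public static int RequireServerHandle(CommandReplySketch reply, string sessionId)
    {
        if (reply.Register is null)
        {
            throw new InvalidOperationException(
                $"Gateway reply for session '{sessionId}' is missing the 'register' payload.");
        }

        return reply.Register.ServerHandle;
    }
}
```
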
### Client.Dotnet-006

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50`, `clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14` |
| Status | Resolved |

**Description:** `MxGatewayClientOptions.MaxGrpcMessageBytes` and the two `const`s in `MxGatewayClientContractInfo` are public members with no XML doc comments, inconsistent with every other public member in the assembly and with the repo's C# style guidance, which emphasises a documented public surface.

**Recommendation:** Add `<summary>` doc comments to `MaxGrpcMessageBytes`, `GatewayProtocolVersion`, and `WorkerProtocolVersion`.

**Resolution:** (2026-05-18) Confirmed: all three public members lacked XML docs while every other public member in the assembly is documented. Added `<summary>` comments to `MxGatewayClientOptions.MaxGrpcMessageBytes` (describing the 16 MiB default applied to both send and receive limits), and to `MxGatewayClientContractInfo.GatewayProtocolVersion` and `WorkerProtocolVersion` (describing their wire-compatibility / diagnostics purpose). Pure documentation change; no test needed, and the build remains warning-clean.

### Client.Dotnet-007

| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192` |
| Status | Resolved |

**Description:** The `AcknowledgeAlarmAsync` XML comment states the gateway authenticates against an `invoke:alarm-ack` scope, but `CLAUDE.md` documents the scope set without any `invoke:alarm-ack` sub-scope. The comment may describe an intended finer-grained scope that does not exist, misleading integrators about what API key they need.

**Recommendation:** Reconcile the comment with the actual server-side scope check, or update the scope documentation if sub-scopes were genuinely added; keep client doc and gateway auth model in sync.

**Resolution:** (2026-05-18) Confirmed against the server-side authorization model: `GatewayGrpcScopeResolver.ResolveRequiredScope` has no arm for `AcknowledgeAlarmRequest`, so it falls to the `_ => GatewayScopes.Admin` default, meaning the RPC actually requires the `admin` scope. No `invoke:alarm-ack` sub-scope exists anywhere in `GatewayScopes`. The client XML comment on `AcknowledgeAlarmAsync` was wrong, not the docs. Corrected the comment to state the gateway authorizes `AcknowledgeAlarmRequest` against the API key's `admin` scope and that there is no finer-grained alarm-ack sub-scope. Pure documentation change; no test needed.

### Client.Dotnet-008

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17` |
| Status | Resolved |

**Description:** The CLI redactor only removes the API key string when it was supplied via `--api-key`; `RunCoreAsync` passes `arguments.GetOptional("api-key")` to `Redact`. When the key comes from an environment variable (`--api-key-env`, the documented default path), `apiKey` is `null` and no redaction occurs. If a gRPC/transport error message ever echoes the bearer token, it would be printed unredacted.

**Recommendation:** Resolve the effective API key (same logic as `ResolveApiKey`) before redacting, so the env-var-sourced key is also stripped from error output.

**Resolution:** (2026-05-18) Confirmed against source: `MxGatewayClientCli.RunCoreAsync`'s catch block redacted only `arguments.GetOptional("api-key")`, so an env-var-sourced key (`--api-key-env`, default `MXGATEWAY_API_KEY`) was never stripped. Note that `MxGatewayCliSecretRedactor` itself is correct; the defect was the caller passing the wrong value. Extracted a non-throwing `TryResolveApiKey` helper (used by both the existing `ResolveApiKey` and the catch block) that resolves `--api-key` then the `--api-key-env` environment variable; the catch block now redacts that effective key. Updated `clients/dotnet/README.md` (`smoke` paragraph) to state the CLI redacts the effective key whether from `--api-key` or `--api-key-env`. Regression test `MxGatewayClientCliTests.RunAsync_ErrorOutput_RedactsApiKey_WhenSourcedFromEnvironmentVariable` sets a test env var, forces a transport error echoing the key, and asserts the key is absent and `[redacted]` is present; verified red against the original `GetOptional("api-key")`-only redaction (key printed unredacted).

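
The "resolve the effective key, then redact" order from Client.Dotnet-008 can be sketched as below. The class and method signatures are hypothetical and do not reproduce the CLI's actual `TryResolveApiKey` or `MxGatewayCliSecretRedactor`; only the `[redacted]` marker and the resolution order (explicit value first, then the named environment variable) come from the finding.

```csharp
using System;

public static class RedactionSketch
{
    // Non-throwing resolution: explicit key first, then the named environment variable.
    public static string? TryResolveApiKey(string? explicitKey, string? envVarName)
    {
        if (!string.IsNullOrEmpty(explicitKey))
        {
            return explicitKey;
        }

        return string.IsNullOrEmpty(envVarName)
            ? null
            : Environment.GetEnvironmentVariable(envVarName);
    }

    // Strip the effective key from any message before it is printed.
    public static string Redact(string message, string? apiKey) =>
        string.IsNullOrEmpty(apiKey)
            ? message
            : message.Replace(apiKey, "[redacted]", StringComparison.Ordinal);
}
```

Passing only the `--api-key` value to `Redact` reproduces the original defect: with an environment-sourced key the second argument is `null` and the message is returned unchanged.
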
### Client.Dotnet-009

| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448` |
| Status | Resolved |

**Description:** Client.Dotnet-003 upgraded `MxGatewayClient._disposed` to an `int` accessed via `Interlocked.Exchange` / `Volatile.Read` so a concurrent `ThrowIfDisposed` cannot observe a stale value. The symmetric `GalaxyRepositoryClient._disposed` is still a plain unsynchronised `bool`: `DisposeAsync` reads `if (_disposed)` then writes `_disposed = true` without `Interlocked` or `Volatile`, and `ThrowIfDisposed` does an unsynchronised read. The Galaxy client is publicly `IAsyncDisposable` and exposes `TestConnectionAsync` / `GetLastDeployTimeAsync` / `DiscoverHierarchyAsync` / `WatchDeployEventsAsync` as public APIs that are legal to call concurrently, so a concurrent dispose can produce the same stale-read race the gateway client fix prevented. The two clients also share the same shape (gRPC channel + transport + retry pipeline), so the divergence is an accidental inconsistency.

**Recommendation:** Mirror Client.Dotnet-003 on `GalaxyRepositoryClient`: change `_disposed` to an `int`, use `Interlocked.Exchange(ref _disposed, 1) != 0` in `DisposeAsync`, and `Volatile.Read(ref _disposed) != 0` in `ThrowIfDisposed`. A duplicated `MxGatewaySession`-style close-lock drain is unnecessary because `GalaxyRepositoryClient` does not own a per-call `SemaphoreSlim`.

**Resolution:** (2026-05-20) Changed `GalaxyRepositoryClient._disposed` from `bool` to `int`; `DisposeAsync` now uses `Interlocked.Exchange(ref _disposed, 1) != 0` for the once-only guard and `ThrowIfDisposed` uses `Volatile.Read(ref _disposed) != 0`, mirroring the Client.Dotnet-003 fix on `MxGatewayClient`.

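
The once-only dispose guard described in Client.Dotnet-003 and Client.Dotnet-009 looks roughly like the sketch below. `DisposeGuardSketch`, `Use`, and `ReleaseCount` are hypothetical names for illustration; the `Interlocked.Exchange` / `Volatile.Read` expressions are the ones quoted in the findings.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public sealed class DisposeGuardSketch : IAsyncDisposable
{
    private int _disposed; // 0 = live, 1 = disposed
    private int _releaseCount;

    // How many times the resources were actually released (should never exceed 1).
    public int ReleaseCount => Volatile.Read(ref _releaseCount);

    public ValueTask DisposeAsync()
    {
        // Exactly one caller observes the 0 -> 1 transition and performs the release.
        if (Interlocked.Exchange(ref _disposed, 1) != 0)
        {
            return default;
        }

        Interlocked.Increment(ref _releaseCount); // stands in for disposing channel/transport
        return default;
    }

    public void Use()
    {
        // Volatile read so a dispose on another thread is not missed via a stale value.
        if (Volatile.Read(ref _disposed) != 0)
        {
            throw new ObjectDisposedException(nameof(DisposeGuardSketch));
        }
    }
}
```
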
### Client.Dotnet-010

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279` |
| Status | Resolved |

**Description:** Client.Dotnet-005 fixed the silent `Register` / `AddItem` / `AddItem2` handle-fallback to `reply.ReturnValue.Int32Value` inside `MxGatewaySession`, but the same fallback pattern was left in the CLI and is now also present in two new bench commands shipped after that fix. `BenchReadBulkAsync` (line 638) and `BenchStreamEventsAsync` (line 896) both do `int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;` after a register call, and `SmokeAsync` (lines 1261 and 1279) passes `reply => reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value` and the equivalent `AddItem?.ItemHandle` selector to `InvokeForHandleAsync`. When `EnsureProtocolSuccess` + `EnsureMxAccessSuccess` pass but the worker did not set the typed `register` / `add_item` oneof case, all four call sites silently produce a zero handle and proceed to drive the rest of the smoke / bench against an invalid handle, which is exactly the failure mode the SDK-level fix prevents.

**Recommendation:** Either delegate to the SDK helpers (`MxGatewaySession.RegisterAsync` / `AddItemAsync`), which already throw the descriptive `MxGatewayException` via `CreateMissingPayloadException`, or replicate the same null-check explicitly in `InvokeForHandleAsync` and the two bench commands. A unit test that enqueues an `Ok` reply with no typed payload through `FakeCliClient` and asserts the smoke / bench commands fail loudly would prevent regression.

**Resolution:** (2026-05-20) Added private CLI helpers `RequireRegisterServerHandle` and `RequireAddItemItemHandle` (with a shared `CreateMissingPayloadException` mirroring the SDK-level `MxGatewaySession` helper) that throw a descriptive `MxGatewayException` when the typed `register` / `add_item` payload is absent on an otherwise-successful reply. Replaced all four `?? reply.ReturnValue.Int32Value` fallback sites (`BenchReadBulkAsync` at line 638, `BenchStreamEventsAsync` at line 896, and both `SmokeAsync` selectors at lines 1261 and 1279) with these helpers, so the CLI now fails loudly with the same shape as the SDK helpers rather than silently driving the rest of the command against a zero handle.

### Client.Dotnet-011

| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015` |
| Status | Resolved |

**Description:** The new `bench-stream-events` command (added in commit `1cd51bb`) supports `--session-count > 1` and runs each session's `StreamEvents` reader in parallel via `openedSessions.Select(RunStreamAsync).ToArray()` then `Task.WhenAll`. Inside the per-session lambda the inner `Task.Run`-spawned event loop updates two shared `DateTime?` fields without synchronisation:

```csharp
if (firstSteadyEventUtc is null)
{
    firstSteadyEventUtc = nowUtc;
}
lastSteadyEventUtc = nowUtc;
```

The integer counters next to them (`steadyEvents`, `steadyDataChangeEvents`, `warmupEvents`) use `Interlocked.Increment`, and the latency list uses an explicit `lock (latencyLock)`, so the rest of the loop is data-race-free, but these two `DateTime?` updates are not. With N parallel sessions a torn read on `firstSteadyEventUtc` produces a non-deterministic "first event time", and the final `steadyElapsedSeconds = (lastSteadyEventUtc.Value - firstSteadyEventUtc.Value).TotalSeconds` can compute a slightly wrong window. The user-visible impact is bench-only (skewed `eventsPerSecond` / `dataChangeEventsPerSecond` numbers), and on x64 a 64-bit `DateTime` read/write happens to be atomic (although the `DateTime?` wrapper adds a separate `HasValue` flag, so the pair is not guaranteed to be), so this is Low. The pattern is nonetheless inconsistent with the rest of the same loop.

**Recommendation:** Either guard the two `DateTime?` updates with the existing `latencyLock` (cheapest), use `Interlocked.CompareExchange` for `firstSteadyEventUtc` and `Volatile.Write` for `lastSteadyEventUtc`, or aggregate per-session in local variables and reduce after `Task.WhenAll`. The reduce-after approach also fixes a related issue: today a faster session can stomp `firstSteadyEventUtc` after a slower one already set it.

**Resolution:** (2026-05-20) Guarded the `firstSteadyEventUtc` / `lastSteadyEventUtc` reads and writes inside the per-session event loop with the existing `latencyLock`. `firstSteadyEventUtc` now uses the null-coalescing assignment `firstSteadyEventUtc ??= nowUtc;` under the lock so a slower session can't stomp an earlier already-set value. The lock is already taken by the latency-list append a few lines below, so the extra cost is one uncontended acquisition per event. The final read in the stats block runs after `Task.WhenAll` (happens-before applies) and stays lock-free.

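
A self-contained sketch of the lock-guarded window tracking chosen in the Client.Dotnet-011 resolution. `SteadyWindowSketch` and its members are hypothetical; the CLI keeps these values as captured locals guarded by `latencyLock` rather than in a class. Only the `??=` under a lock is taken from the resolution text.

```csharp
using System;

public sealed class SteadyWindowSketch
{
    private readonly object _gate = new();
    private DateTime? _firstSteadyEventUtc;
    private DateTime? _lastSteadyEventUtc;

    // Called from every session's event loop, possibly in parallel.
    public void Record(DateTime nowUtc)
    {
        lock (_gate)
        {
            _firstSteadyEventUtc ??= nowUtc; // first writer wins; later sessions cannot stomp it
            _lastSteadyEventUtc = nowUtc;
        }
    }

    // Length of the observed steady-state window, or 0 when no event was recorded.
    public double ElapsedSeconds()
    {
        lock (_gate)
        {
            return _firstSteadyEventUtc is { } first && _lastSteadyEventUtc is { } last
                ? (last - first).TotalSeconds
                : 0;
        }
    }
}
```
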
### Client.Dotnet-012

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/dotnet/MxGateway.Client/MxGateway.Client.csproj`, `clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj`, `clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj` |
| Status | Resolved |

**Description:** `src/Directory.Build.props` enforces `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, and `Deterministic=true` for every gateway / worker / contracts project, and `CLAUDE.md` calls this out as a baseline build property. The .NET client projects live under `clients/dotnet/`, and there is no `Directory.Build.props` at `clients/` or `clients/dotnet/`, so none of those properties apply to `MxGateway.Client`, `MxGateway.Client.Cli`, or `MxGateway.Client.Tests`. New warnings in the client do not break the build, and code-style violations are not blocked at build time. The `CSharpStyleGuide.md` baseline ("Treat compiler warnings as actionable") and the `CLAUDE.md` table under "Source Update Workflow" both apply equally to the .NET client (`dotnet build clients/dotnet/MxGateway.Client.sln`), but the enforcement floor is missing.

**Recommendation:** Add `clients/dotnet/Directory.Build.props` carrying the same property set: `TreatWarningsAsErrors=true`, `EnforceCodeStyleInBuild=true`, `AnalysisLevel=latest`, `Deterministic=true`. A broader `clients/Directory.Build.props` is not applicable, since the Rust/Cargo siblings do not use MSBuild; scope the file to `clients/dotnet/` only. Excluding generated code (which already lives under `src/MxGateway.Contracts/Generated`) is automatic because the client only references the contracts project. Build the client locally after adding the file to confirm no warnings have already crept in.

**Resolution:** (2026-05-20) Added `clients/dotnet/Directory.Build.props` mirroring `src/Directory.Build.props`: `LangVersion=latest`, `Nullable=enable`, `ImplicitUsings=enable`, `TreatWarningsAsErrors=true`, `AnalysisLevel=latest`, `EnforceCodeStyleInBuild=true`, `Deterministic=true`. The three client `.csproj` files inherit from it automatically. Re-ran `dotnet build clients/dotnet/MxGateway.Client.sln` and confirmed 0 warnings / 0 errors, so no pre-existing warnings were silently being tolerated.

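
For reference, a `Directory.Build.props` carrying the seven properties listed in the Client.Dotnet-012 resolution would look roughly like this. It is reconstructed from that list only; the actual file's ordering, comments, and any additional content are not known from this review.

```xml
<Project>
  <PropertyGroup>
    <!-- Values as listed in the Client.Dotnet-012 resolution -->
    <LangVersion>latest</LangVersion>
    <Nullable>enable</Nullable>
    <ImplicitUsings>enable</ImplicitUsings>
    <TreatWarningsAsErrors>true</TreatWarningsAsErrors>
    <AnalysisLevel>latest</AnalysisLevel>
    <EnforceCodeStyleInBuild>true</EnforceCodeStyleInBuild>
    <Deterministic>true</Deterministic>
  </PropertyGroup>
</Project>
```

MSBuild picks up the nearest `Directory.Build.props` above each project directory, which is why the three client `.csproj` files need no changes to inherit these settings.
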
### Client.Dotnet-013

| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187`, `clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6` |
| Status | Resolved |

**Description:** Client.Dotnet-006 fixed three undocumented public members. Three more remain undocumented in code paths the prior review didn't visit:

- `DiscoverHierarchyOptions` (the public record) has no `<summary>` on the type and no XML doc on any of its ten public properties (`RootGobjectId`, `RootTagName`, `RootContainedPath`, `MaxDepth`, `CategoryIds`, `TemplateChainContains`, `TagNameGlob`, `IncludeAttributes`, `AlarmBearingOnly`, `HistorizedOnly`).
- The second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload on `GalaxyRepositoryClient` is `public` with no XML doc, while the parameterless overload one method above it carries a full `<summary>` / `<param>` block.
- `IMxGatewayCliClient` is a public interface in the CLI project with no `<summary>` on the type (the member docs are present).

This is the same convention-violation shape Client.Dotnet-006 closed; the `CLAUDE.md` style guidance describes XML docs on the public surface as the baseline expectation.

**Recommendation:** Add `<summary>` docs to each undocumented member. For `DiscoverHierarchyOptions`, the property names map cleanly to the underlying `DiscoverHierarchyRequest` proto fields, so a one-line summary per property and a type-level summary tying the record to the Galaxy hierarchy browse is enough. The CLI interface only needs a type-level summary; the members already document themselves.

**Resolution:** (2026-05-20) Added XML docs to all three items: a type-level summary plus a one-line summary per property on `DiscoverHierarchyOptions` (ten properties, mapped to the underlying `DiscoverHierarchyRequest` proto fields and noting the root-precedence rule); a `<summary>`/`<param>`/`<returns>` block on the second `DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken)` overload describing its filter semantics and transparent pagination; and a type-level `<summary>` on the public `IMxGatewayCliClient` interface explaining its CLI-only transport role and the production binding.

### Client.Dotnet-014
|
||
|
||
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98`, `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231` |
| Status | Resolved |

**Description:** Client.Dotnet-002 closed a coverage gap where the production retry path (`RpcException` → `MxGatewayException` mapping by `RpcExceptionMapper.Map`) was never exercised, by adding a `MapTransportExceptions` flag to `FakeGatewayTransport` and a regression test that runs through the wrapped-exception branch. That flag is wired through `Translate(...)` in `OpenSessionAsync` / `CloseSessionAsync` / `InvokeAsync`, but the new alarm test path is not: `FakeGatewayTransport.AcknowledgeAlarmAsync` throws the queued exception verbatim (line 219), bypassing `Translate`. The accompanying `MxGatewayClientAlarmsTests.AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` test acknowledges this in a comment ("Note: the FakeGatewayTransport surfaces RpcException directly … the SDK-level test pins the pass-through shape so a future migration to direct mapping won't silently change observable behaviour") and asserts `Assert.ThrowsAsync<RpcException>` — but the production path through `GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync` (lines 120-134) already calls `RpcExceptionMapper.Map`, so production callers see `MxGatewayAuthenticationException` and not `RpcException`. The test name advertises mapping that the SDK-level harness doesn't exercise, so the alarm-ack mapping reachable from `MxGatewayClient.AcknowledgeAlarmAsync` can regress without any test noticing.

**Recommendation:** Either route `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use and add a regression test that enables `MapTransportExceptions = true` and asserts `MxGatewayAuthenticationException`; or rename the existing test to make the pass-through shape explicit (e.g. `…_SurfacesRpcExceptionFromFakeTransportVerbatim`) and add a second test exercising the production mapping. Either fix closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.

**Resolution:** 2026-05-20 — Applied both halves of the recommendation. Routed `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use, so when `MapTransportExceptions = true` thrown `RpcException`s now run through the production `RpcExceptionMapper.Map`. Renamed the existing pass-through test to `AcknowledgeAlarmAsync_SurfacesRpcExceptionFromFakeTransportVerbatim_WhenMappingDisabled` (with an updated comment pinning that this shape only applies when mapping is off), and added a new test `AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` that enables mapping and asserts the production-parity `MxGatewayAuthenticationException` with `StatusCode.Unauthenticated`. Closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.

### Client.Dotnet-015

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:221-236`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:596-1065` |
| Status | Resolved |

**Description:** `CreateCancellation(arguments, command)` calls `cancellation.CancelAfter(timeout)` for every command except the explicitly long-running `galaxy-watch`, where `timeout` is `arguments.GetDuration("timeout", TimeSpan.FromSeconds(30))`. That same `--timeout` value is also fed into `CreateOptions` as `DefaultCallTimeout`, so the CLI uses one knob for two distinct things: per-call gRPC deadline and overall wall-clock cancellation budget. Both `bench-read-bulk` and `bench-stream-events` (introduced in `7db4bff` and `1cd51bb`) default to `--duration-seconds=30 --warmup-seconds=3`, which already exceeds the 30 s wall-clock budget; `bench-stream-events --session-count=N` adds another `750 ms × (N-1)` of `sessionStartStaggerMs` before the measurement window even opens.

A manual invocation such as `dotnet run --project clients/dotnet/MxGateway.Client.Cli -- bench-stream-events --endpoint ... --api-key ...` therefore cancels mid-window every time: the outer `CancellationTokenSource` trips at 30 s and the bench's inner `await Task.Delay(steadyEnd - warmupStart, cancellationToken)` throws an `OperationCanceledException` before `firstSteadyEventUtc`/`lastSteadyEventUtc` are even populated, producing a zero `steadyElapsedSeconds` / `0 eventsPerSecond` JSON payload that looks like a backend failure but is a self-inflicted CLI cancellation.

`scripts/bench-read-bulk.ps1` already works around this for `bench-read-bulk` by computing `$callTimeoutSeconds = [Math]::Max(60, $DurationSeconds + $WarmupSeconds + 30)` and passing `--timeout ${callTimeoutSeconds}s` (line 125), so the driver flow is correct. But there is no PowerShell wrapper for `bench-stream-events`, and the bench is documented (in its own XML summary on line 792) as a single-client harness intended to be run directly. The trap is silent: no error is printed, just suspiciously small numbers.

**Recommendation:** Either (a) extend the `isLongRunning` set in `CreateCancellation` to include `bench-read-bulk` and `bench-stream-events`, so manual invocation defers to caller-supplied `--timeout` and otherwise runs until the bench finishes; (b) compute an automatic minimum-floor `--timeout` for the bench commands from `duration-seconds + warmup-seconds + headroom` the way the PS driver does; or (c) split the `--timeout` knob into a distinct per-call `--call-timeout` and outer `--wall-clock-timeout` and document the two roles. Option (a) is the smallest change and matches the existing `galaxy-watch` precedent. Add a CLI test that runs `bench-read-bulk` with `--duration-seconds=2 --warmup-seconds=0 --timeout=1s` and asserts the bench either errors loudly or completes (today it silently emits zeros).

**Resolution:** 2026-05-20 — Applied option (a): extended the `isLongRunning` set in `CreateCancellation` from `command is "galaxy-watch"` to `command is "galaxy-watch" or "bench-read-bulk" or "bench-stream-events"`, so the two bench commands now run until they finish (or Ctrl+C) by default and only apply a wall-clock budget when the caller explicitly supplies `--timeout`. A caller-supplied `--timeout` still flows through to `DefaultCallTimeout` for per-attempt gRPC deadlines on the unary calls these benches make. Matches the existing `galaxy-watch` precedent and removes the silent zero-throughput failure mode without breaking the `scripts/bench-read-bulk.ps1` driver path (which explicitly raises `--timeout`).
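The budget policy described in this resolution can be sketched in isolation. `ResolveWallClockBudget` is a hypothetical helper, not a function in the CLI (the real logic lives in `CreateCancellation`); it returns the budget instead of arming a `CancellationTokenSource` so the decision is easy to inspect. It assumes the 30 s default quoted above and, as a simplification, treats all three long-running commands alike when `--timeout` is supplied.

```csharp
using System;

// Hypothetical helper (not part of the CLI): returns the wall-clock budget to arm,
// or null when the command should run until it finishes or the user presses Ctrl+C.
static TimeSpan? ResolveWallClockBudget(string command, TimeSpan? explicitTimeout)
{
    // Same pattern the resolution quotes for the isLongRunning check.
    bool isLongRunning = command is "galaxy-watch" or "bench-read-bulk" or "bench-stream-events";

    // A caller-supplied --timeout is always honoured (assumption for galaxy-watch).
    if (explicitTimeout is not null)
    {
        return explicitTimeout;
    }

    // Without --timeout, only short-lived commands get the 30 s default budget.
    return isLongRunning ? (TimeSpan?)null : TimeSpan.FromSeconds(30);
}

Console.WriteLine(ResolveWallClockBudget("bench-stream-events", null) is null);
Console.WriteLine(ResolveWallClockBudget("read-bulk", null));
Console.WriteLine(ResolveWallClockBudget("bench-read-bulk", TimeSpan.FromSeconds(90)));
```

Returning `null` for "no budget" keeps the per-call deadline (`DefaultCallTimeout`) and the outer wall-clock budget visibly separate, which is the distinction the finding turns on.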

### Client.Dotnet-016

| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:922-976` |
| Status | Resolved |

**Description:** `BenchStreamEventsAsync.RunStreamAsync` launches the per-session stream reader inside a `Task.Run(async () => { ... }, streamCts.Token)` and stores the returned task in the local `streamTask`. The recovery block

```csharp
await Task.Delay(steadyEnd - warmupStart, cancellationToken).ConfigureAwait(false);
streamCts.Cancel();
try { await streamTask.ConfigureAwait(false); }
catch (OperationCanceledException) { }
catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) { }
```

only awaits `streamTask` (and therefore only observes its exception) when `Task.Delay` returns normally. When the outer `cancellationToken` cancels during the delay — exactly the case Client.Dotnet-015 makes likely — `Task.Delay` throws `OperationCanceledException` and skips both `streamCts.Cancel()` and the `await streamTask`. The inner stream task is still alive at that point. The `using CancellationTokenSource streamCts = ...` on line 924 is a linked CTS, so the outer cancellation propagates to the inner stream (disposal on unwind does not itself cancel anything) and the stream eventually exits, but the resulting `OperationCanceledException` / mapped `MxGatewayException` is never observed. The local `streamTask` reference is dropped as `RunStreamAsync` unwinds, leaving the task object eligible for garbage collection with an unobserved fault — a `TaskScheduler.UnobservedTaskException`.

The secondary `Grpc.Core.RpcException` catch on line 975 is also dead in this code path: the production `GrpcMxGatewayClientTransport.StreamEventsAsync` always wraps `RpcException` via `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled` (mapper line 31). So the inner task's cancellation exception is always `OperationCanceledException`, not `RpcException`. Harmless when the recovery block runs, but it underscores that the cancellation path was only tested for the happy case.

**Recommendation:** Restructure `RunStreamAsync` so the inner `streamTask` is always observed. A `try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }` shape works (the `finally` runs even on outer cancellation). Alternatively, hoist `streamTask` into a local that the outer method's `try`/`finally` always awaits before exiting, so the per-session loop becomes `await Task.WhenAny(streamTask, Task.Delay(...))` then a guaranteed `await streamTask`. Drop the now-redundant `Grpc.Core.RpcException` catch or convert it to catch `MxGatewayException` for the wrapped shape (and document that it should never fire in production).

**Resolution:** 2026-05-20 — Restructured `RunStreamAsync` to wrap the `Task.Delay` in `try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }`, so the inner stream task is observed on every path — including when the outer `cancellationToken` cancels during the delay. Dropped the dead `catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled)` clause (the production `GrpcMxGatewayClientTransport.StreamEventsAsync` routes through `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled`, so an `RpcException` never reaches here) and replaced it with `catch (MxGatewayException)` to absorb the wrapped shape for any non-cancellation mapper output. Added an inline comment naming the finding and documenting why the new catch shape is correct. Eliminates the latent `TaskScheduler.UnobservedTaskException` whenever the outer cancellation fires mid-measurement-window.
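A minimal, self-contained sketch of the `try`/`finally` shape follows. `RunWindowAsync` and `FakeStreamLoopAsync` are hypothetical stand-ins for `RunStreamAsync` and the per-session stream reader; the sketch assumes the stream CTS is linked to the outer token, and it omits the `catch (MxGatewayException)` clause because that type is not available outside the SDK.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical stand-in for RunStreamAsync: the inner task is awaited on every exit path.
static async Task RunWindowAsync(
    Func<CancellationToken, Task> streamLoop,
    TimeSpan window,
    CancellationToken cancellationToken)
{
    using CancellationTokenSource streamCts =
        CancellationTokenSource.CreateLinkedTokenSource(cancellationToken);
    Task streamTask = Task.Run(() => streamLoop(streamCts.Token));

    try
    {
        await Task.Delay(window, cancellationToken).ConfigureAwait(false);
    }
    finally
    {
        // Runs on normal completion and when the outer token cancels the delay.
        streamCts.Cancel();
        try { await streamTask.ConfigureAwait(false); }
        catch (OperationCanceledException) { /* expected shutdown path */ }
        // The real fix also absorbs MxGatewayException here (omitted in this sketch).
    }
}

bool streamExited = false;

// Fake stream reader: parks until its token is cancelled, then records that it exited.
async Task FakeStreamLoopAsync(CancellationToken token)
{
    try { await Task.Delay(Timeout.Infinite, token); }
    finally { streamExited = true; }
}

using var outer = new CancellationTokenSource(TimeSpan.FromMilliseconds(50));
try
{
    await RunWindowAsync(FakeStreamLoopAsync, TimeSpan.FromSeconds(5), outer.Token);
}
catch (OperationCanceledException)
{
    // The outer cancellation still surfaces to the caller after the inner task is observed.
}
Console.WriteLine(streamExited);
```

The point of the shape is ordering: the inner task is awaited before the method unwinds, so its exception is always observed even though the caller still sees the outer `OperationCanceledException`.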

### Client.Dotnet-017

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:1190-1262` |
| Status | Resolved |

**Description:** Surfaced during the 2026-05-20 cross-language e2e matrix run: `dotnet run --project clients/dotnet/MxGateway.Client.Cli -- stream-events --endpoint http://localhost:5120 --api-key-env MXGATEWAY_API_KEY --timeout 60s --json --session-id session-... --max-events 200` exited with `-532462766` (unhandled-exception exit code) and propagated `System.OperationCanceledException: Call canceled by the client.` mapped from `Status(StatusCode="Cancelled", …)`. The CLI's `StreamEventsAsync` does `await foreach (... in client.StreamEventsAsync(...).WithCancellation(cancellationToken))` and never catches `OperationCanceledException`. When the caller's `--timeout` (driven by `CreateCancellation`'s `CancelAfter`) fires before `--max-events` is reached — the common case for a finite-window event collector against a quiet test rig — the foreach throws, the exception bubbles up, the process exits non-zero, and any `--json` aggregate output is never written. The other client CLIs (Go, Rust, Python, Java) all exit 0 in this case (e2e clients g/r/p ran clean). The bug is also a strict regression of the CLI's contract: callers can't tell "stream collected 0–N events then the budget closed" apart from "the call genuinely failed".

**Recommendation:** Wrap the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }`. The `when` clause ensures only the supplied cancellation token (which covers `--timeout`, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with `--timeout 1s` against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events.

**Resolution:** 2026-05-20 — Wrapped the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` so the CLI exits 0 and emits the aggregate `{ "events": [...] }` JSON when the supplied token cancels (the `--timeout`, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's `when` clause ensures non-token-driven cancellation still propagates. Added regression test `MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero` that yields two events, parks on the cancellation token via a new `FakeCliClient.StreamHangAfterEvents` hook, runs the CLI with `--timeout 1s --json --max-events 200`, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts.
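The graceful-cancellation pattern can be illustrated without the gateway. In the sketch below, `FakeStreamAsync` and `CollectAsync` are hypothetical names: the fake yields two items and then parks on its token (mirroring the fixture described in the regression test), and the collector absorbs only cancellation that came from the token it was given.

```csharp
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical fake stream: yields two events, then waits until the token is cancelled.
static async IAsyncEnumerable<int> FakeStreamAsync(
    [EnumeratorCancellation] CancellationToken cancellationToken = default)
{
    yield return 1;
    yield return 2;
    await Task.Delay(Timeout.Infinite, cancellationToken);
}

// Hypothetical collector: stops at maxEvents or when the supplied token cancels.
static async Task<List<int>> CollectAsync(
    IAsyncEnumerable<int> stream,
    int maxEvents,
    CancellationToken cancellationToken)
{
    var collected = new List<int>();
    try
    {
        await foreach (int item in stream.WithCancellation(cancellationToken))
        {
            collected.Add(item);
            if (collected.Count >= maxEvents)
            {
                break;
            }
        }
    }
    catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested)
    {
        // Budget closed or user cancelled: treat as graceful end of the window.
    }

    // Whatever arrived before cancellation is still returned to the caller.
    return collected;
}

using var timeout = new CancellationTokenSource(TimeSpan.FromMilliseconds(200));
List<int> events = await CollectAsync(FakeStreamAsync(), maxEvents: 200, timeout.Token);
Console.WriteLine(events.Count);
```

Because the filter checks the supplied token rather than catching every `OperationCanceledException`, a cancellation that originates elsewhere would still propagate, which is the property the recommendation asks for.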

#### 2026-05-24 re-review (commit 42b0037)

Re-review pass at `42b0037`. Diff against `d692232` consists of four commits:

- `71d2c39` — ports the `batch` subcommand to all five client CLIs (the .NET
  half adds `RunBatchAsync`, the `__MXGW_BATCH_EOR__` sentinel, the
  `forceJsonErrors` plumbing on `RunCoreAsync`, and the two
  `RunAsync_Batch_*` tests).
- `b3ae200` — ports the bulk read/write SDK methods + matching CLI
  subcommands (the SDK side adds `WriteBulkAsync` / `Write2BulkAsync` /
  `WriteSecuredBulkAsync` / `WriteSecured2BulkAsync` / `ReadBulkAsync` on
  `MxGatewaySession`; the CLI side adds `read-bulk`, `write-bulk`,
  `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and the
  cross-language stress harness `bench-read-bulk`).
- `11cc671` — ports `stream-alarms` and `acknowledge-alarm` to the CLI and
  the SDK transport (`StreamAlarmsAsync` on `MxGatewayClient`,
  `IMxGatewayClientTransport`, `GrpcMxGatewayClientTransport`,
  `IMxGatewayCliClient`, `MxGatewayCliClientAdapter`, and
  `FakeGatewayTransport`; two new `MxGatewayClientCliTests` covering
  acknowledge + stream-alarms with `--max-events` and payload-case
  distinction). The same commit wraps `StreamEventsAsync`'s `await foreach`
  with `catch (OperationCanceledException) when
  (cancellationToken.IsCancellationRequested)` (matching the Client.Dotnet-017
  resolution shape) and mirrors the catch on the new `StreamAlarmsAsync`
  CLI handler. `RunBatchAsync`'s outer catch is `catch (Exception
  exception)` with no `when` guard — OCE thrown out of `RunCoreAsync` is
  caught here and written as a JSON error so the batch process keeps
  going, while the inner `RunCoreAsync` catch retains its `when
  (exception is not OperationCanceledException)` guard so the non-batch
  CLI behaviour is unchanged.
- `8738735` — updates each client's README with `stream-alarms` /
  `acknowledge-alarm` usage and SDK examples.

| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found (this review): `BenchReadBulkAsync` repeats the silent `?? reply.ReturnValue.Int32Value` register-handle fallback at a brand-new call site (Client.Dotnet-019). Two new `(uint)` casts on user-supplied `--timeout-ms` values silently wrap negatives (Client.Dotnet-021). |
| 2 | mxaccessgw conventions | Issue found (this review): README CLI examples for `stream-alarms` and `acknowledge-alarm` use flag names that don't exist in the CLI, violating the docs-in-the-same-change rule (Client.Dotnet-018). |
| 3 | Concurrency & thread safety | No issues found in this diff — the new `RunBatchAsync` is single-threaded over its stdin loop; `StreamAlarmsAsync`'s gRPC transport reader follows the same shape as the existing `StreamEventsAsync` / `QueryActiveAlarmsAsync`. |
| 4 | Error handling & resilience | Issue found (this review): `BenchReadBulkAsync`'s steady-state loop has a bare `catch { ... continue; }` that swallows `OperationCanceledException`, so user-driven cancellation does not exit the bench until the wall-clock budget elapses (Client.Dotnet-020). The new `StreamAlarmsAsync` CLI handler mirrors the Client.Dotnet-017 OCE-graceful pattern correctly. |
| 5 | Security | No issues found in this diff — the bulk write/write-secured SDK methods inherit `EnsureProtocolSuccess`/`EnsureMxAccessSuccess` shapes; per-entry secured user ids stay on the request, and the redactor still covers CLI error paths. |
| 6 | Performance & resource management | No issues found in this diff (the bench tight-loop CPU burn under cancellation is captured under Client.Dotnet-020 against the error-handling category). |
| 7 | Design-document adherence | No issues found in this diff — the new `StreamAlarmsAsync` surface matches the `MxGateway:Alarms` design ("server-streaming feed of alarm-state-change messages keyed by the same monitor"), and `AcknowledgeAlarmAsync` remains session-less. |
| 8 | Code organization & conventions | No issues found in this diff — new CLI subcommands route through the existing dispatch table; `RunBatchAsync` follows the established private-static pattern. |
| 9 | Testing coverage | No new issues — `RunAsync_StreamAlarms_*`, `RunAsync_AcknowledgeAlarm_*`, and `RunAsync_Batch_*` give the new surface unit coverage. `bench-read-bulk` is the same stress-harness-not-SDK shape called out in the prior re-review and is not flagged here. |
| 10 | Documentation & comments | Issue found (this review): the README examples for the two new alarm CLI subcommands cite wrong flag names and a non-existent `--session-id` (Client.Dotnet-018). The new XML docs on `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` and on the bulk SDK methods are accurate and complete. |

### Client.Dotnet-018

| Field | Value |
|---|---|
| Severity | Medium |
| Category | Documentation & comments |
| Location | `clients/dotnet/README.md:137-138` |
| Status | Open |

**Description:** The README example block for the two new alarm CLI subcommands shipped in commit `11cc671` shows:

```
mxgw-dotnet stream-alarms --session-id <id> --max-messages 1 --json
mxgw-dotnet acknowledge-alarm --session-id <id> --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json
```

`--session-id`, `--max-messages`, and `--alarm-reference` do not exist on these commands. The actual CLI consumes `--filter-prefix`, `--max-events`, `[--jsonl]` for `stream-alarms` and `--reference`, `--comment`, `--operator`, `--json` for `acknowledge-alarm` (see `WriteUsage` lines 891-892 and the `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` handlers in `MxGatewayClientCli.cs`). Neither command takes a `--session-id` — both go through the gateway's session-less central alarm monitor by design (the same property called out in the new XML doc on `MxGatewayClient.StreamAlarmsAsync` lines 226-237). A user who copies the README example will get a `--session-id` "unknown option" failure on both commands and an "Argument 'reference' is required" failure on `acknowledge-alarm` (since `--alarm-reference` is not an alias for `--reference`, the required argument is never supplied).

This violates the docs-in-the-same-change rule from `CLAUDE.md` ("Update docs in the same change as the source") — the README change in `8738735` documents the wrong surface for the source change in `11cc671`. The built-in `WriteUsage` text is correct; only the README walkthrough is wrong.

**Recommendation:** Replace the two example lines with the actual flag names:

```
mxgw-dotnet stream-alarms --filter-prefix Area001 --max-events 1 --json
mxgw-dotnet acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json
```

A quick sanity check would be to drive each example through the test harness's `MxGatewayClientCli.RunAsync` shape and confirm exit 0 — copy/paste safety on the documented examples is the only realistic safeguard.

### Client.Dotnet-019

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745` |
| Status | Open |

**Description:** Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`, where a successful protocol+MX-status reply missing its typed `register` oneof case falls through to `ReturnValue.Int32Value` and silently yields `0` for the handle. The new `BenchReadBulkAsync` handler introduced in commit `b3ae200` reinstates exactly that pattern at line 745:

```csharp
int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;
```

The bench then drives the rest of the run — `SubscribeBulk`, warmup `ReadBulk`, steady-state `ReadBulk`, `UnsubscribeBulk` — against a zero `ServerHandle`. The worker will reject every command, the bench will report zero successful calls + a wall of failures + a misleading `cachedReadResults = 0` summary, and there is no obvious diagnostic for "the gateway forgot to set the typed register payload"; it looks like a server-side outage. Note that Client.Dotnet-010's recorded resolution claimed a CLI-level `RequireRegisterServerHandle` / `RequireAddItemItemHandle` helper was introduced, but neither helper currently exists in the codebase, and the SDK-level `MxGatewaySession.RegisterAsync` / `AddItemAsync` / `AddItem2Async` at lines 82, 124, 175 still carry the same fallback — so the underlying problem behind Client.Dotnet-005 / 010 is unresolved at HEAD and this is a fresh net-new instance.

**Recommendation:** Replace the fallback with an explicit null-check on `registerReply.Register` that throws `MxGatewayException` with the missing-payload context (kind = `Register`, session id, correlation id) — the same shape Client.Dotnet-005 prescribes. If the upstream SDK helpers in `MxGatewaySession` are restored to throw, route the bench through `MxGatewaySession.RegisterAsync` instead so the CLI inherits the SDK's protection.
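To make the difference concrete, the sketch below contrasts the silent fallback with an explicit check. Everything here is a simplification: the reply is modelled as a nullable `int` (standing in for `registerReply.Register?.ServerHandle`) plus a plain `int` (standing in for `ReturnValue.Int32Value`), both helper names are hypothetical, and `InvalidOperationException` stands in for `MxGatewayException`, which is not available outside the SDK.

```csharp
using System;

// Models the current pattern: a missing typed payload silently becomes the fallback value.
static int ResolveHandleWithFallback(int? typedHandle, int returnValueInt32) =>
    typedHandle ?? returnValueInt32;

// Hypothetical replacement: fail loudly with enough context to diagnose the reply.
static int RequireServerHandle(int? typedHandle, string sessionId, string correlationId)
{
    if (typedHandle is int handle)
    {
        return handle;
    }

    // Stand-in exception type; the recommendation calls for MxGatewayException.
    throw new InvalidOperationException(
        $"Register reply is missing its typed payload (session '{sessionId}', correlation '{correlationId}').");
}

// With the fallback, a reply without the typed payload yields handle 0 and the run continues.
Console.WriteLine(ResolveHandleWithFallback(null, 0));

// With the explicit check, a well-formed reply passes through unchanged...
Console.WriteLine(RequireServerHandle(42, "session-1", "corr-1"));

// ...and a malformed one stops the bench at the point of failure.
try
{
    RequireServerHandle(null, "session-1", "corr-1");
}
catch (InvalidOperationException exception)
{
    Console.WriteLine(exception.Message);
}
```

Failing at registration time keeps the diagnostic next to its cause, instead of surfacing later as a wall of rejected `ReadBulk` calls.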

### Client.Dotnet-020

| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780` |
| Status | Open |

**Description:** `BenchReadBulkAsync`'s steady-state `while (DateTime.UtcNow < steadyDeadline)` loop wraps each `client.InvokeAsync(...)` in a bare `catch`:

```csharp
try
{
    reply = await client.InvokeAsync(
        CreateCommandRequest(sessionId, readBulkMxCommand),
        cancellationToken)
        .ConfigureAwait(false);
    sw.Stop();
}
catch
{
    sw.Stop();
    failedCalls++;
    latencyMillis.Add(sw.Elapsed.TotalMilliseconds);
    continue;
}
```

The `catch` has no type filter and no `when (!cancellationToken.IsCancellationRequested)` clause, so `OperationCanceledException` flowing from a cancelled `cancellationToken` (Ctrl+C, parent CTS, or the wall-clock budget) is swallowed identically to a real RPC failure. The loop continues to spin until `DateTime.UtcNow` reaches `steadyDeadline`, with each iteration immediately throwing OCE from the next `InvokeAsync` call — a tight CPU-burn until the configured `--duration-seconds` elapses, producing a per-iteration `latencyMillis.Add(sw.Elapsed.TotalMilliseconds)` flood that also skews the final p99/max numbers down.

The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE escapes through the `finally` block — that path is correct. The steady-state loop should follow the same shape: either rethrow OCE explicitly, or break out of the loop on `cancellationToken.IsCancellationRequested`. Note that Client.Dotnet-015's resolution removed the outer `CancelAfter(timeout)` for `bench-read-bulk` / `bench-stream-events`, so the most common cancellation path today is interactive Ctrl+C — which produces a 30-second hang plus skewed stats instead of a prompt exit.

**Recommendation:** Replace the bare `catch` with `catch (Exception) when (!cancellationToken.IsCancellationRequested)`, or split into `catch (OperationCanceledException) { throw; } catch (Exception) { failedCalls++; ... continue; }`. The first form is the smallest diff and matches the pattern used elsewhere in the CLI. Add a regression test that runs `bench-read-bulk` with a `--duration-seconds 10` budget against a fake that throws on every `InvokeAsync`, cancels the supplied token after 100 ms, and asserts the run exits in well under 10 s. The wider precedent — Client.Dotnet-016's `BenchStreamEventsAsync` cancellation hardening — should already cover the shape of the test fixture.
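The first recommended form can be sketched as a standalone loop. `RunSteadyStateAsync` is a hypothetical reduction of the steady-state loop (no latency bookkeeping, and the `invoke` delegate stands in for `client.InvokeAsync`). The top-of-loop `ThrowIfCancellationRequested` is an addition not spelled out in the recommendation; it is included here on the assumption that a fake or transport might ignore the token.

```csharp
using System;
using System.Diagnostics;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical reduction of the steady-state loop: failures are counted,
// but cancellation of the supplied token ends the run immediately.
static async Task<(int Succeeded, int Failed)> RunSteadyStateAsync(
    Func<CancellationToken, Task> invoke,
    TimeSpan duration,
    CancellationToken cancellationToken)
{
    int succeeded = 0;
    int failed = 0;
    DateTime deadline = DateTime.UtcNow + duration;

    while (DateTime.UtcNow < deadline)
    {
        // Extra guard (assumption): exit even if invoke never observes the token.
        cancellationToken.ThrowIfCancellationRequested();
        try
        {
            await invoke(cancellationToken).ConfigureAwait(false);
            succeeded++;
        }
        catch (Exception) when (!cancellationToken.IsCancellationRequested)
        {
            // Only genuine call failures are counted; cancellation propagates.
            failed++;
        }
    }

    return (succeeded, failed);
}

using var cts = new CancellationTokenSource(TimeSpan.FromMilliseconds(100));
Stopwatch stopwatch = Stopwatch.StartNew();
bool cancelledPromptly = false;
try
{
    // Fake call: a short delay that honours the token, run against a 10 s budget.
    await RunSteadyStateAsync(token => Task.Delay(5, token), TimeSpan.FromSeconds(10), cts.Token);
}
catch (OperationCanceledException)
{
    cancelledPromptly = stopwatch.Elapsed < TimeSpan.FromSeconds(5);
}
Console.WriteLine(cancelledPromptly);
```

With the filter in place the loop no longer records cancelled attempts as failures, so both the hang and the skewed latency samples described above go away together.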

### Client.Dotnet-021

| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715` |
| Status | Open |

**Description:** Both new bulk-read CLI handlers cast a signed `--timeout-ms` argument to `uint` without bounds checking:

```csharp
// ReadBulkAsync (line 487)
TimeoutMs = (uint)arguments.GetInt32("timeout-ms", 0),

// BenchReadBulkAsync (line 715)
uint timeoutMs = (uint)arguments.GetInt32("timeout-ms", 1500);
```

`arguments.GetInt32(...)` returns `Int32.Parse(...)`, so a caller supplying `--timeout-ms=-1` (an easy mistake for "no timeout" or copy-paste from another tool that uses `-1` for unbounded) silently wraps to `0xFFFFFFFF` = ~49.7 days as the worker-side `ReadBulk` timeout. The gateway forwards that as the per-tag `timeout_ms` on `ReadBulkCommand`, and a misconfigured invocation parks one worker thread per pending tag for hours before MXAccess gives up — exactly the kind of slow-failure mode the protocol's `uint32 timeout_ms` field was meant to prevent. The `ReadBulkAsync` SDK overload on `MxGatewaySession` (line 663) already guards correctly: `timeout <= TimeSpan.Zero ? 0u : (uint)Math.Min(timeout.TotalMilliseconds, uint.MaxValue)`; the CLI handlers should match.

**Recommendation:** Wrap the cast in a guard, e.g.:

```csharp
int timeoutMsRaw = arguments.GetInt32("timeout-ms", 0);
if (timeoutMsRaw < 0)
{
    throw new ArgumentException("--timeout-ms must be a non-negative integer (use 0 for the gateway default).");
}
uint timeoutMs = (uint)timeoutMsRaw;
```

A single shared helper (e.g. `ParseTimeoutMs(CliArguments, string, int)`) on `MxGatewayClientCli` would cover both call sites and remove the duplication.
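The wrap-around itself is easy to reproduce in isolation. The sketch below shows the unchecked cast next to a guarded helper; this `ParseTimeoutMs` takes the already-parsed `int` rather than a `CliArguments` instance (that type is not available here), so its signature is an assumption and differs from the one suggested above.

```csharp
using System;

// Hypothetical shared helper: rejects negative values before the cast can wrap.
static uint ParseTimeoutMs(int rawValue)
{
    if (rawValue < 0)
    {
        throw new ArgumentException(
            "--timeout-ms must be a non-negative integer (use 0 for the gateway default).");
    }

    return (uint)rawValue;
}

// What the unguarded cast does with -1: the bit pattern is reinterpreted as uint.MaxValue.
int raw = -1;
uint wrapped = unchecked((uint)raw);
Console.WriteLine(wrapped == uint.MaxValue);

// The same value expressed as a duration, to show the scale of the accidental timeout.
Console.WriteLine(TimeSpan.FromMilliseconds((double)wrapped).TotalDays);

// The guarded helper passes valid values through and rejects the negative one.
Console.WriteLine(ParseTimeoutMs(1500));
try
{
    ParseTimeoutMs(raw);
}
catch (ArgumentException exception)
{
    Console.WriteLine(exception.Message);
}
```

Rejecting the value at argument-parsing time keeps the failure on the caller's terminal rather than on a parked worker thread.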