Files
mxaccessgw/code-reviews/Client.Dotnet/findings.md
T
Joseph Doherty 89043cb2b6 Resolve Client.Dotnet-004..008 code-review findings
Client.Dotnet-004: documented DefaultCallTimeout as both the per-attempt
deadline and the shared retry budget, and removed DeadlineExceeded from the
transient-retry set (a client-imposed deadline cannot be helped by retrying).

Client.Dotnet-005: RegisterAsync/AddItemAsync/AddItem2Async silently returned
0 when a successful reply lacked the typed payload. They now throw a
descriptive MxGatewayException.

Client.Dotnet-006: added XML docs to the previously undocumented public
members MaxGrpcMessageBytes, GatewayProtocolVersion, WorkerProtocolVersion.

Client.Dotnet-007: corrected the AcknowledgeAlarmAsync XML comment — the RPC
requires the admin scope, not a non-existent invoke:alarm-ack sub-scope.

Client.Dotnet-008: the CLI redactor missed env-var-sourced keys because the
caller passed only the --api-key option. Redaction now uses the same
resolver, stripping env-var keys too.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:27 -04:00

148 lines
16 KiB
Markdown

# Code Review — Client.Dotnet
| Field | Value |
|---|---|
| Module | `clients/dotnet` |
| Reviewer | Claude Code |
| Review date | 2026-05-18 |
| Commit reviewed | `3cc53a8` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Minor: handle-selector fallback `?? reply.ReturnValue.Int32Value` can mask a missing typed reply (Client.Dotnet-005); CLI redactor misses env-var keys (Client.Dotnet-008). |
| 2 | mxaccessgw conventions | Good — consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. |
| 3 | Concurrency & thread safety | Issue found: `_disposed` flags unsynchronized; `MxGatewaySession.DisposeAsync` can race a concurrent `CloseAsync` (Client.Dotnet-003). |
| 4 | Error handling & resilience | Issues found: gRPC-to-native mapping collapses non-auth statuses into one untyped exception (Client.Dotnet-001); shared retry/timeout budget (Client.Dotnet-004). |
| 5 | Security | Good — API key never logged by the library, CLI redacts keys, TLS custom-root validation correct. |
| 6 | Performance & resource management | No issues found — channels and streaming calls disposed correctly. |
| 7 | Design-document adherence | No issues found — matches `ClientLibrariesDesign.md`. |
| 8 | Code organization & conventions | Issue found: undocumented public members (Client.Dotnet-006). |
| 9 | Testing coverage | Issue found: the production retry path is never exercised (Client.Dotnet-002). |
| 10 | Documentation & comments | Issue found: doc misstates the unary timeout retry budget as per-call (Client.Dotnet-004, Client.Dotnet-007). |
## Findings
### Client.Dotnet-001
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | `clients/dotnet/MxGateway.Client/GrpcMxGatewayClientTransport.cs:190-199`, `clients/dotnet/MxGateway.Client/GrpcGalaxyRepositoryClientTransport.cs:131-140` |
| Status | Resolved |
**Description:** `MapRpcException` only produces typed exceptions for `Unauthenticated` and `PermissionDenied`. Every other gRPC status — `NotFound`, `InvalidArgument`, `ResourceExhausted`, `FailedPrecondition`, `Unavailable`, `Internal` — collapses into the base `MxGatewayException` with no surfaced `StatusCode`. Callers cannot programmatically distinguish a transient outage from a permanent bad-argument error without reflecting into `InnerException` and downcasting to `RpcException`.
**Recommendation:** Carry the gRPC `StatusCode` on `MxGatewayException` (e.g. a `StatusCode` property) and/or add typed subclasses for at least `NotFound`, `InvalidArgument`, and `Unavailable`. Populate it from `exception.StatusCode` in `MapRpcException`.
**Resolution:** (2026-05-18) Confirmed against source: both transports had a duplicated private `MapRpcException` that only typed two statuses and discarded the gRPC code for the rest. Added a nullable `StatusCode` property (`Grpc.Core.StatusCode?`) to `MxGatewayException` plus constructors that carry it, threaded it through `MxGatewayAuthenticationException`/`MxGatewayAuthorizationException`, and extracted the two duplicated mappers into a single shared internal `RpcExceptionMapper` (`RpcExceptionMapper.cs`) that populates `StatusCode` from `exception.StatusCode` for every status. Callers can now distinguish transient from permanent failures without downcasting `InnerException`. Documented in `clients/dotnet/README.md`. Regression test: `RpcExceptionMapperTests` (8 cases incl. the `[Theory]` over `NotFound`/`InvalidArgument`/`ResourceExhausted`/`FailedPrecondition`/`Unavailable`/`Internal`).
### Client.Dotnet-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:145-148`, `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientSessionTests.cs:236-256` |
| Status | Resolved |
**Description:** The retry predicate `MxGatewayClientRetryPolicy.IsTransientGrpcFailure` handles two shapes: a raw `RpcException` and an `MxGatewayException { InnerException: RpcException }`. In production the transport always maps `RpcException``MxGatewayException` before it reaches the retry pipeline, so only the wrapped-`MxGatewayException` branch ever runs in production. But `FakeGatewayTransport` throws the raw `RpcException` and never maps it, so every retry test exercises only the raw-`RpcException` branch — the branch that never occurs in production. The production retry behaviour is effectively untested.
**Recommendation:** Add a fake/transport mode that maps `RpcException` to `MxGatewayException` the way `GrpcMxGatewayClientTransport` does (or add tests that enqueue a pre-wrapped `MxGatewayException`), so the actually-used predicate branch is covered.
**Resolution:** (2026-05-18) Confirmed against source: `FakeGatewayTransport` threw queued exceptions verbatim, so the existing retry tests only ever hit the raw-`RpcException` predicate branch. Added a `MapTransportExceptions` flag to `FakeGatewayTransport` that, when set, runs thrown `RpcException`s through the same shared `RpcExceptionMapper` the production gRPC transport uses, producing the wrapped `MxGatewayException` shape. Added regression test `MxGatewayClientSessionTests.InvokeAsync_RetriesSafeDiagnosticCommand_WhenTransportMapsRpcException`, which exercises the previously-untested production predicate branch. Verified red: removing the `MxGatewayException { InnerException: RpcException }` case from `IsTransientGrpcFailure` fails the new test while the pre-existing raw-`RpcException` test still passes.
### Client.Dotnet-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:659-663`, `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:230-240` |
| Status | Resolved |
**Description:** `DisposeAsync` calls `CloseAsync()` (no token) then unconditionally `_closeLock.Dispose()`. If another thread is concurrently awaiting `CloseAsync(token)` — legal, since the type exposes public async methods and no single-threaded contract — disposing the `SemaphoreSlim` while a `WaitAsync` is pending throws `ObjectDisposedException` into that caller. The `_disposed` flags in both clients are also plain unsynchronised `bool` reads/writes; `ThrowIfDisposed` racing `DisposeAsync` can observe a stale value.
**Recommendation:** Either document `MxGatewaySession`/`MxGatewayClient` as not thread-safe for concurrent dispose, or guard `_disposed` with `Interlocked`/`volatile` and avoid disposing `_closeLock` until all in-flight `CloseAsync` calls complete.
**Resolution:** (2026-05-18) Confirmed against source: `MxGatewaySession.DisposeAsync` disposed `_closeLock` unconditionally, racing concurrent `CloseAsync` callers; `MxGatewayClient._disposed` was a plain `bool`. Fixed `MxGatewaySession` by tracking in-flight `CloseAsync` callers with an `_activeCloseCount` guarded by a dedicated `_disposeGate` lock and a `_closeLockDisposed` flag: `CloseAsync` registers under the gate (and throws `ObjectDisposedException` if disposal already won) before awaiting `_closeLock.WaitAsync`, and `DisposeAsync` drains `_activeCloseCount` to zero before disposing the semaphore, so the close lock provably outlives every pending `WaitAsync`. Fixed `MxGatewayClient` by changing `_disposed` to an `int` accessed via `Interlocked.Exchange`/`Volatile.Read`. Regression test `MxGatewayClientSessionTests.DisposeAsync_DoesNotRaceConcurrentCloseAsync` runs 100 iterations with one close holding the lock and one parked behind it while `DisposeAsync` runs concurrently; verified red against the original `DisposeAsync` (fails with `ObjectDisposedException`), green after the fix.
### Client.Dotnet-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403` |
| Status | Resolved |
**Description:** `ExecuteSafeUnaryAsync` wraps the whole Polly retry pipeline in a single linked CTS cancelled after `Options.DefaultCallTimeout`, while `CreateCallOptions` also stamps each individual call with a `DefaultCallTimeout` gRPC deadline. The retry pipeline therefore shares one `DefaultCallTimeout` budget across the initial attempt plus all retries plus backoff delays. The README/XML docs describe `DefaultCallTimeout` as a per-call timeout, which misrepresents this. `DeadlineExceeded` is also classified as transient, so an attempt that exhausts the shared budget is retried only to immediately fail again.
**Recommendation:** Decide whether `DefaultCallTimeout` is per-attempt or per-operation and make code and docs consistent — e.g. a separate per-attempt deadline and a distinct overall-operation timeout. Reconsider retrying on `DeadlineExceeded` when the deadline was client-imposed.
**Resolution:** (2026-05-18) Confirmed against source: the shared linked-CTS budget plus per-call deadline both use `DefaultCallTimeout`, and `IsTransientStatus` listed `DeadlineExceeded`. Resolved as a per-operation budget (the simpler, non-breaking choice): the `DefaultCallTimeout` XML doc in `MxGatewayClientOptions.cs` now states it is both the per-attempt gRPC deadline and the overall budget shared across the initial attempt, every retry, and the backoff delays — an upper bound on total wall-clock time, not a fresh per-retry allowance. Removed `DeadlineExceeded` from `MxGatewayClientRetryPolicy.IsTransientStatus`: every unary deadline is client-imposed (`CreateCallOptions` stamps the shared budget), so a `DeadlineExceeded` means the budget is exhausted and an immediate retry can only fail again. Regression test `MxGatewayClientSessionTests.InvokeAsync_DoesNotRetrySafeDiagnosticCommand_OnDeadlineExceeded` asserts the safe diagnostic command (`Ping`) is attempted exactly once and the failure surfaces; verified red against the original transient set (the call retried and succeeded).
### Client.Dotnet-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175` |
| Status | Resolved |
**Description:** `RegisterAsync`/`AddItemAsync`/`AddItem2Async` return `reply.<Typed>?.ServerHandle ?? reply.ReturnValue.Int32Value`. After `EnsureMxAccessSuccess()` passes, a missing typed payload silently falls back to `ReturnValue.Int32Value`, which for a reply carrying no return value is `0`. A caller then uses `0` as a `ServerHandle`/`ItemHandle`, producing a confusing downstream invalid-handle failure rather than a clear "gateway reply missing payload" error.
**Recommendation:** If the typed sub-message is the contract for these commands, treat its absence on an otherwise-successful reply as an error (throw a descriptive `MxGatewayException`) rather than falling through to `ReturnValue.Int32Value`.
**Resolution:** (2026-05-18) Confirmed against source and `mxaccess_gateway.proto`: `register`/`add_item`/`add_item2` are members of the `MxCommandReply.payload` oneof, so the typed accessor is `null` whenever the worker did not set that case — and the fallback returned `ReturnValue.Int32Value` (0 for a reply with no return value). The typed sub-message is the contract for these handle-returning commands, so its absence on an otherwise-successful reply is now an error: `RegisterAsync`/`AddItemAsync`/`AddItem2Async` throw via a new private `MxGatewaySession.CreateMissingPayloadException` helper that builds a descriptive `MxGatewayException` naming the missing payload, kind, session, and correlation id. Regression tests `MxGatewayClientSessionTests.RegisterAsync_Throws_WhenSuccessfulReplyMissingPayload` and `AddItemAsync_Throws_WhenSuccessfulReplyMissingPayload` enqueue an `Ok` reply with no typed payload and assert the descriptive throw; verified red against the original fallback (returned `0` instead of throwing).
### Client.Dotnet-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50`, `clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14` |
| Status | Resolved |
**Description:** `MxGatewayClientOptions.MaxGrpcMessageBytes` and the two `const`s in `MxGatewayClientContractInfo` are public members with no XML doc comments, inconsistent with every other public member in the assembly and with the repo's documented C# style emphasis on a documented public surface.
**Recommendation:** Add `<summary>` doc comments to `MaxGrpcMessageBytes`, `GatewayProtocolVersion`, and `WorkerProtocolVersion`.
**Resolution:** (2026-05-18) Confirmed: all three public members lacked XML docs while every other public member in the assembly is documented. Added `<summary>` comments to `MxGatewayClientOptions.MaxGrpcMessageBytes` (describing the 16 MiB default applied to both send and receive limits), and to `MxGatewayClientContractInfo.GatewayProtocolVersion` and `WorkerProtocolVersion` (describing their wire-compatibility / diagnostics purpose). Pure documentation change — no test needed; build remains warning-clean.
### Client.Dotnet-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192` |
| Status | Resolved |
**Description:** The `AcknowledgeAlarmAsync` XML comment states the gateway authenticates against an `invoke:alarm-ack` scope, but `CLAUDE.md` documents the scope set without any `invoke:alarm-ack` sub-scope. The comment may describe an intended finer-grained scope that does not exist, misleading integrators about what API key they need.
**Recommendation:** Reconcile the comment with the actual server-side scope check, or update the scope documentation if sub-scopes were genuinely added; keep client doc and gateway auth model in sync.
**Resolution:** (2026-05-18) Confirmed against the server-side authorization model: `GatewayGrpcScopeResolver.ResolveRequiredScope` has no arm for `AcknowledgeAlarmRequest`, so it falls to the `_ => GatewayScopes.Admin` default — the RPC actually requires the `admin` scope. No `invoke:alarm-ack` sub-scope exists anywhere in `GatewayScopes`. The client XML comment on `AcknowledgeAlarmAsync` was wrong, not the docs. Corrected the comment to state the gateway authorizes `AcknowledgeAlarmRequest` against the API key's `admin` scope and that there is no finer-grained alarm-ack sub-scope. Pure documentation change — no test needed.
### Client.Dotnet-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17` |
| Status | Resolved |
**Description:** The CLI redactor only removes the API key string when it was supplied via `--api-key`; `RunCoreAsync` passes `arguments.GetOptional("api-key")` to `Redact`. When the key comes from an environment variable (`--api-key-env`, the documented default path), `apiKey` is `null` and no redaction occurs. If a gRPC/transport error message ever echoes the bearer token, it would be printed unredacted.
**Recommendation:** Resolve the effective API key (same logic as `ResolveApiKey`) before redacting, so the env-var-sourced key is also stripped from error output.
**Resolution:** (2026-05-18) Confirmed against source: `MxGatewayClientCli.RunCoreAsync`'s catch block redacted only `arguments.GetOptional("api-key")`, so an env-var-sourced key (`--api-key-env`, default `MXGATEWAY_API_KEY`) was never stripped. Note `MxGatewayCliSecretRedactor` itself is correct — the defect was the caller passing the wrong value. Extracted a non-throwing `TryResolveApiKey` helper (used by both the existing `ResolveApiKey` and the catch block) that resolves `--api-key` then the `--api-key-env` environment variable; the catch block now redacts that effective key. Updated `clients/dotnet/README.md` (`smoke` paragraph) to state the CLI redacts the effective key whether from `--api-key` or `--api-key-env`. Regression test `MxGatewayClientCliTests.RunAsync_ErrorOutput_RedactsApiKey_WhenSourcedFromEnvironmentVariable` sets a test env var, forces a transport error echoing the key, and asserts the key is absent and `[redacted]` is present; verified red against the original `GetOptional("api-key")`-only redaction (key printed unredacted).