Files
mxaccessgw/code-reviews/Client.Dotnet/findings.md
T
Joseph Doherty d2d2e5f68f code-review 2026-05-24: re-review at d692232 across all 11 modules
Restores the `code-reviews/` tree (was unwritten on this working copy)
and re-reviews every module per `REVIEW-PROCESS.md` against HEAD
`d692232`. The diff in scope is the five commits since the last sweep:
`dc9c0c9` (ZB.MOM.WW gateway-side rename + slnx migrate),
`397d3c5` (client SDK rename + the missing alarm-RPC proto types and
the .NET DiscoverHierarchyOptions POCO), `27ed651` (role-based LDAP
auth + HubToken bearer, drop PathBase), `6594359` (sidebar layout +
three SignalR push hubs), and `d692232` (EventsHub publisher + doc
refresh).

Module status

| Module | Open | Total | Delta this pass |
|---|---|---|---|
| Server           | 8 | 43 | +6 |
| Contracts        | 2 | 17 | +2 |
| Tests            | 2 | 26 | +2 |
| IntegrationTests | 3 | 24 | +3 |
| Client.Java      | 5 | 31 | +5 |
| Client.Rust      | 1 | 21 | +1 |
| Worker           | 0 | 25 |  0 (rename-only diff, clean) |
| Worker.Tests     | 0 | 30 |  0 (rename-only diff, clean) |
| Client.Dotnet    | 0 | 17 |  0 (rename + alarm-fix diff, clean) |
| Client.Python    | 0 | 21 |  0 (rename + alarm-fix diff, clean) |
| Client.Go        | 0 | 21 |  0 (rename + alarm-fix diff, clean) |

Total new findings: 19. Severity breakdown: 1 Medium-security
(Server-038), 4 Medium-documentation/coverage, 14 Low.

New findings

  * Server-038 (Medium / Security) — EventsHub.SubscribeSession accepts
    any session id from any Viewer; no per-session ACL guards the
    EventsHub group fan-out.
  * Server-039 (Low / Error handling) — HubTokenService.Validate
    accepts a payload with null Name/NameIdentifier.
  * Server-040 (Low / Conventions) — MapGroupsToRoles undocumented
    full-vs-RDN lookup precedence.
  * Server-041 (Low / Design adherence) — EventStreamService calls
    IDashboardEventBroadcaster.Publish without a try/catch — fragile
    seam relying on the never-throw contract.
  * Server-042 (Low / Performance) — DashboardSnapshotPublisher tight
    retry loop with no backoff (vs AlarmsHubPublisher 5s delay).
  * Server-043 (Low / Documentation) — HubTokenService singleton
    sharing across login + hub-token validation undocumented.

  * Contracts-016 (Low / Conventions) — QueryActiveAlarmsRequest.session_id
    reserved-for-future-use ambiguity.
  * Contracts-017 (Low / Documentation) — rpc QueryActiveAlarms doc
    omits the alarm_filter_prefix filter description.

  * Tests-025 (Low / Conventions) — duplicate NullDashboardEventBroadcaster
    fakes in EventStreamServiceTests and GatewayEndToEndFakeWorkerSmokeTests.
  * Tests-026 (Medium / Testing coverage) — no test proves
    EventStreamService actually calls IDashboardEventBroadcaster.Publish.

  * IntegrationTests-022 (Low / Conventions) — ResolveRepositoryRoot
    silent fallback to Directory.GetCurrentDirectory().
  * IntegrationTests-023 (Low / Testing coverage) — DashboardLdapLiveTests
    success-path asserts ldap_group but not the Role claim.
  * IntegrationTests-024 (Low / Conventions) — inline
    NullDashboardEventBroadcaster fake duplicates Tests-side copies.

  * Client.Java-027 (Medium / Documentation) — README + JavaClientDesign
    Gradle task names still use the old short project names.
  * Client.Java-028 (Medium / Design adherence) — JavaClientDesign
    build-layout shows the old `com/dohertylan/mxgateway/` package paths.
  * Client.Java-029 (Low / Documentation) — README installDist path
    cites the wrong directory.
  * Client.Java-030 (Low / Testing coverage) — no Java test exercises
    the regenerated QueryActiveAlarmsRequest RPC.
  * Client.Java-031 (Low / Conventions) — README prose uses old short
    project names instead of canonical prefixed ones.

  * Client.Rust-021 (Low / Design adherence) — RustClientDesign.md
    "Crate layout" shows an aspirational nested `crates/zb-mom-ww-mxgateway-client/`
    that does not exist; actual layout is the flat top-level crate.

Two pre-existing pending findings (Server-031 lock-contention,
Server-032 bounded event channel) remain unchanged — neither was
touched by this wave of commits.

Process notes

- The `code-reviews/` tree was not in this working copy's git
  history (the local extract pre-dates the divergent branch that
  carried the reviews). Restored from `dd7ca16` via
  `git checkout dd7ca16 -- code-reviews/` before the re-review.
- Some "Resolved" entries in the restored findings.md reference
  fixes that landed on the divergent branch (the same one that
  carried the reviews) and are not present on the current main
  lineage. The re-review treats those statuses as historical;
  the new pass only files findings against HEAD's actual state.
- `python code-reviews/regen-readme.py --check` is green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 02:34:30 -04:00

44 KiB
Raw Blame History

Code Review — Client.Dotnet

Field Value
Module clients/dotnet
Reviewer Claude Code
Review date 2026-05-24
Commit reviewed d692232
Status Reviewed
Open findings 0

Checklist coverage

# Category Result
1 Correctness & logic bugs Issue found (this review): the global CLI --timeout defaults to 30 s and is used both as the gRPC DefaultCallTimeout and as the outer CancelAfter budget — but bench-read-bulk / bench-stream-events default to --duration-seconds=30 --warmup-seconds=3 (+ stagger), so direct manual invocation cancels the bench mid-window before the steady-state ends (Client.Dotnet-015). The scripts/bench-read-bulk.ps1 driver works around this by raising --timeout, but bench-stream-events has no driver script.
2 mxaccessgw conventions Good — consumes the shared contracts project, no forked proto, authorization: Bearer metadata correct, parity preserved via split EnsureProtocolSuccess/EnsureMxAccessSuccess. The new clients/dotnet/Directory.Build.props mirrors src/Directory.Build.props exactly (same six properties, identical values) so the enforcement floor is back in scope.
3 Concurrency & thread safety Issue found (this review): BenchStreamEventsAsync's per-session RunStreamAsync hands the inner Task.Run stream loop a reference (streamTask) that becomes unobserved whenever the outer cancellationToken cancels during the bench's await Task.Delay — the await streamTask recovery path never runs, so any inner OCE / RpcException raised after cancellation surfaces as a TaskScheduler.UnobservedTaskException (Client.Dotnet-016). The Client.Dotnet-009 / 011 fixes from the previous pass are correctly applied.
4 Error handling & resilience No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved; RpcExceptionMapper is consistently called from both gateway and Galaxy transports incl. AcknowledgeAlarmAsync after Client.Dotnet-014).
5 Security Good — API key never logged by the library, CLI redacts effective key (both --api-key and --api-key-env sourced) after Client.Dotnet-008, TLS custom-root validation correct, secured-write payloads never logged.
6 Performance & resource management No issues found — channels and streaming calls disposed correctly, retry pipeline shares one timeout budget per safe-unary op.
7 Design-document adherence No issues found — matches DotnetClientDesign.md and ClientLibrariesDesign.md.
8 Code organization & conventions No new issues — Client.Dotnet-012 (Directory.Build.props) and Client.Dotnet-013 (missing XML docs on DiscoverHierarchyOptions, the second DiscoverHierarchyAsync overload, and IMxGatewayCliClient) are both fully resolved; the new props file is a faithful mirror of the production one.
9 Testing coverage No new issues — Client.Dotnet-014 closed the alarm-side Translate gap. The new bench paths (bench-read-bulk, bench-stream-events) have no unit-test coverage, but they are stress harnesses driven by scripts/bench-read-bulk.ps1, not SDK API surface, so this is not flagged.
10 Documentation & comments No new issues this review (Client.Dotnet-007's alarm-ack admin-scope correction holds; DefaultCallTimeout doc accurately reflects the shared-budget semantics from Client.Dotnet-004).

2026-05-24 review (commit d692232)

Re-review pass at d692232. Diff against a020350 consists of the ZB.MOM.WW client prefix rename in commit 397d3c5 (folders, csprojs, sln→slnx, every namespace and using) plus the hand-written DiscoverHierarchyOptions.cs POCO and the dropped retired SessionId = lines from alarm-related test fixtures. The rename was applied via a case-insensitive regex sweep; no over-rename artifacts found. The mxgw_* API-key wire prefix, MXGATEWAY_* environment variables, and the MxGatewayClient / MxGatewaySession type names are unchanged. Build and tests are green at HEAD.

# Category Result
1 Correctness & logic bugs No issues found in the a020350..d692232 diff.
2 mxaccessgw conventions No issues found — rename hygiene clean; wire identifiers preserved.
3 Concurrency & thread safety No issues found in this diff.
4 Error handling & resilience No issues found in this diff.
5 Security No issues found in this diff.
6 Performance & resource management No issues found in this diff.
7 Design-document adherence No issues found in this diff — DotnetClientDesign.md reflects the new layout.
8 Code organization & conventions No issues found in this diff.
9 Testing coverage No issues found in this diff — MxGatewayClientAlarmsTests fixtures correctly drop SessionId from AcknowledgeAlarmRequest/Reply and retain it on QueryActiveAlarmsRequest.
10 Documentation & comments No issues found in this diff.

Findings

Client.Dotnet-001

Field Value
Severity Medium
Category Error handling & resilience
Location clients/dotnet/MxGateway.Client/GrpcMxGatewayClientTransport.cs:190-199, clients/dotnet/MxGateway.Client/GrpcGalaxyRepositoryClientTransport.cs:131-140
Status Resolved

Description: MapRpcException only produces typed exceptions for Unauthenticated and PermissionDenied. Every other gRPC status — NotFound, InvalidArgument, ResourceExhausted, FailedPrecondition, Unavailable, Internal — collapses into the base MxGatewayException with no surfaced StatusCode. Callers cannot programmatically distinguish a transient outage from a permanent bad-argument error without reflecting into InnerException and downcasting to RpcException.

Recommendation: Carry the gRPC StatusCode on MxGatewayException (e.g. a StatusCode property) and/or add typed subclasses for at least NotFound, InvalidArgument, and Unavailable. Populate it from exception.StatusCode in MapRpcException.

Resolution: (2026-05-18) Confirmed against source: both transports had a duplicated private MapRpcException that only typed two statuses and discarded the gRPC code for the rest. Added a nullable StatusCode property (Grpc.Core.StatusCode?) to MxGatewayException plus constructors that carry it, threaded it through MxGatewayAuthenticationException/MxGatewayAuthorizationException, and extracted the two duplicated mappers into a single shared internal RpcExceptionMapper (RpcExceptionMapper.cs) that populates StatusCode from exception.StatusCode for every status. Callers can now distinguish transient from permanent failures without downcasting InnerException. Documented in clients/dotnet/README.md. Regression test: RpcExceptionMapperTests (8 cases incl. the [Theory] over NotFound/InvalidArgument/ResourceExhausted/FailedPrecondition/Unavailable/Internal).

Client.Dotnet-002

Field Value
Severity Medium
Category Testing coverage
Location clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:145-148, clients/dotnet/MxGateway.Client.Tests/MxGatewayClientSessionTests.cs:236-256
Status Resolved

Description: The retry predicate MxGatewayClientRetryPolicy.IsTransientGrpcFailure handles two shapes: a raw RpcException and an MxGatewayException { InnerException: RpcException }. In production the transport always maps RpcExceptionMxGatewayException before it reaches the retry pipeline, so only the wrapped-MxGatewayException branch ever runs in production. But FakeGatewayTransport throws the raw RpcException and never maps it, so every retry test exercises only the raw-RpcException branch — the branch that never occurs in production. The production retry behaviour is effectively untested.

Recommendation: Add a fake/transport mode that maps RpcException to MxGatewayException the way GrpcMxGatewayClientTransport does (or add tests that enqueue a pre-wrapped MxGatewayException), so the actually-used predicate branch is covered.

Resolution: (2026-05-18) Confirmed against source: FakeGatewayTransport threw queued exceptions verbatim, so the existing retry tests only ever hit the raw-RpcException predicate branch. Added a MapTransportExceptions flag to FakeGatewayTransport that, when set, runs thrown RpcExceptions through the same shared RpcExceptionMapper the production gRPC transport uses, producing the wrapped MxGatewayException shape. Added regression test MxGatewayClientSessionTests.InvokeAsync_RetriesSafeDiagnosticCommand_WhenTransportMapsRpcException, which exercises the previously-untested production predicate branch. Verified red: removing the MxGatewayException { InnerException: RpcException } case from IsTransientGrpcFailure fails the new test while the pre-existing raw-RpcException test still passes.

Client.Dotnet-003

Field Value
Severity Medium
Category Concurrency & thread safety
Location clients/dotnet/MxGateway.Client/MxGatewaySession.cs:659-663, clients/dotnet/MxGateway.Client/MxGatewayClient.cs:230-240
Status Resolved

Description: DisposeAsync calls CloseAsync() (no token) then unconditionally _closeLock.Dispose(). If another thread is concurrently awaiting CloseAsync(token) — legal, since the type exposes public async methods and no single-threaded contract — disposing the SemaphoreSlim while a WaitAsync is pending throws ObjectDisposedException into that caller. The _disposed flags in both clients are also plain unsynchronised bool reads/writes; ThrowIfDisposed racing DisposeAsync can observe a stale value.

Recommendation: Either document MxGatewaySession/MxGatewayClient as not thread-safe for concurrent dispose, or guard _disposed with Interlocked/volatile and avoid disposing _closeLock until all in-flight CloseAsync calls complete.

Resolution: (2026-05-18) Confirmed against source: MxGatewaySession.DisposeAsync disposed _closeLock unconditionally, racing concurrent CloseAsync callers; MxGatewayClient._disposed was a plain bool. Fixed MxGatewaySession by tracking in-flight CloseAsync callers with an _activeCloseCount guarded by a dedicated _disposeGate lock and a _closeLockDisposed flag: CloseAsync registers under the gate (and throws ObjectDisposedException if disposal already won) before awaiting _closeLock.WaitAsync, and DisposeAsync drains _activeCloseCount to zero before disposing the semaphore, so the close lock provably outlives every pending WaitAsync. Fixed MxGatewayClient by changing _disposed to an int accessed via Interlocked.Exchange/Volatile.Read. Regression test MxGatewayClientSessionTests.DisposeAsync_DoesNotRaceConcurrentCloseAsync runs 100 iterations with one close holding the lock and one parked behind it while DisposeAsync runs concurrently; verified red against the original DisposeAsync (fails with ObjectDisposedException), green after the fix.

Client.Dotnet-004

Field Value
Severity Low
Category Error handling & resilience
Location clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294, clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403
Status Resolved

Description: ExecuteSafeUnaryAsync wraps the whole Polly retry pipeline in a single linked CTS cancelled after Options.DefaultCallTimeout, while CreateCallOptions also stamps each individual call with a DefaultCallTimeout gRPC deadline. The retry pipeline therefore shares one DefaultCallTimeout budget across the initial attempt plus all retries plus backoff delays. The README/XML docs describe DefaultCallTimeout as a per-call timeout, which misrepresents this. DeadlineExceeded is also classified as transient, so an attempt that exhausts the shared budget is retried only to immediately fail again.

Recommendation: Decide whether DefaultCallTimeout is per-attempt or per-operation and make code and docs consistent — e.g. a separate per-attempt deadline and a distinct overall-operation timeout. Reconsider retrying on DeadlineExceeded when the deadline was client-imposed.

Resolution: (2026-05-18) Confirmed against source: the shared linked-CTS budget plus per-call deadline both use DefaultCallTimeout, and IsTransientStatus listed DeadlineExceeded. Resolved as a per-operation budget (the simpler, non-breaking choice): the DefaultCallTimeout XML doc in MxGatewayClientOptions.cs now states it is both the per-attempt gRPC deadline and the overall budget shared across the initial attempt, every retry, and the backoff delays — an upper bound on total wall-clock time, not a fresh per-retry allowance. Removed DeadlineExceeded from MxGatewayClientRetryPolicy.IsTransientStatus: every unary deadline is client-imposed (CreateCallOptions stamps the shared budget), so a DeadlineExceeded means the budget is exhausted and an immediate retry can only fail again. Regression test MxGatewayClientSessionTests.InvokeAsync_DoesNotRetrySafeDiagnosticCommand_OnDeadlineExceeded asserts the safe diagnostic command (Ping) is attempted exactly once and the failure surfaces; verified red against the original transient set (the call retried and succeeded).

Client.Dotnet-005

Field Value
Severity Low
Category Correctness & logic bugs
Location clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175
Status Resolved

Description: RegisterAsync/AddItemAsync/AddItem2Async return reply.<Typed>?.ServerHandle ?? reply.ReturnValue.Int32Value. After EnsureMxAccessSuccess() passes, a missing typed payload silently falls back to ReturnValue.Int32Value, which for a reply carrying no return value is 0. A caller then uses 0 as a ServerHandle/ItemHandle, producing a confusing downstream invalid-handle failure rather than a clear "gateway reply missing payload" error.

Recommendation: If the typed sub-message is the contract for these commands, treat its absence on an otherwise-successful reply as an error (throw a descriptive MxGatewayException) rather than falling through to ReturnValue.Int32Value.

Resolution: (2026-05-18) Confirmed against source and mxaccess_gateway.proto: register/add_item/add_item2 are members of the MxCommandReply.payload oneof, so the typed accessor is null whenever the worker did not set that case — and the fallback returned ReturnValue.Int32Value (0 for a reply with no return value). The typed sub-message is the contract for these handle-returning commands, so its absence on an otherwise-successful reply is now an error: RegisterAsync/AddItemAsync/AddItem2Async throw via a new private MxGatewaySession.CreateMissingPayloadException helper that builds a descriptive MxGatewayException naming the missing payload, kind, session, and correlation id. Regression tests MxGatewayClientSessionTests.RegisterAsync_Throws_WhenSuccessfulReplyMissingPayload and AddItemAsync_Throws_WhenSuccessfulReplyMissingPayload enqueue an Ok reply with no typed payload and assert the descriptive throw; verified red against the original fallback (returned 0 instead of throwing).

Client.Dotnet-006

Field Value
Severity Low
Category Code organization & conventions
Location clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50, clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14
Status Resolved

Description: MxGatewayClientOptions.MaxGrpcMessageBytes and the two consts in MxGatewayClientContractInfo are public members with no XML doc comments, inconsistent with every other public member in the assembly and with the repo's documented C# style emphasis on a documented public surface.

Recommendation: Add <summary> doc comments to MaxGrpcMessageBytes, GatewayProtocolVersion, and WorkerProtocolVersion.

Resolution: (2026-05-18) Confirmed: all three public members lacked XML docs while every other public member in the assembly is documented. Added <summary> comments to MxGatewayClientOptions.MaxGrpcMessageBytes (describing the 16 MiB default applied to both send and receive limits), and to MxGatewayClientContractInfo.GatewayProtocolVersion and WorkerProtocolVersion (describing their wire-compatibility / diagnostics purpose). Pure documentation change — no test needed; build remains warning-clean.

Client.Dotnet-007

Field Value
Severity Low
Category Documentation & comments
Location clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192
Status Resolved

Description: The AcknowledgeAlarmAsync XML comment states the gateway authenticates against an invoke:alarm-ack scope, but CLAUDE.md documents the scope set without any invoke:alarm-ack sub-scope. The comment may describe an intended finer-grained scope that does not exist, misleading integrators about what API key they need.

Recommendation: Reconcile the comment with the actual server-side scope check, or update the scope documentation if sub-scopes were genuinely added; keep client doc and gateway auth model in sync.

Resolution: (2026-05-18) Confirmed against the server-side authorization model: GatewayGrpcScopeResolver.ResolveRequiredScope has no arm for AcknowledgeAlarmRequest, so it falls to the _ => GatewayScopes.Admin default — the RPC actually requires the admin scope. No invoke:alarm-ack sub-scope exists anywhere in GatewayScopes. The client XML comment on AcknowledgeAlarmAsync was wrong, not the docs. Corrected the comment to state the gateway authorizes AcknowledgeAlarmRequest against the API key's admin scope and that there is no finer-grained alarm-ack sub-scope. Pure documentation change — no test needed.

Client.Dotnet-008

Field Value
Severity Low
Category Correctness & logic bugs
Location clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17
Status Resolved

Description: The CLI redactor only removes the API key string when it was supplied via --api-key; RunCoreAsync passes arguments.GetOptional("api-key") to Redact. When the key comes from an environment variable (--api-key-env, the documented default path), apiKey is null and no redaction occurs. If a gRPC/transport error message ever echoes the bearer token, it would be printed unredacted.

Recommendation: Resolve the effective API key (same logic as ResolveApiKey) before redacting, so the env-var-sourced key is also stripped from error output.

Resolution: (2026-05-18) Confirmed against source: MxGatewayClientCli.RunCoreAsync's catch block redacted only arguments.GetOptional("api-key"), so an env-var-sourced key (--api-key-env, default MXGATEWAY_API_KEY) was never stripped. Note MxGatewayCliSecretRedactor itself is correct — the defect was the caller passing the wrong value. Extracted a non-throwing TryResolveApiKey helper (used by both the existing ResolveApiKey and the catch block) that resolves --api-key then the --api-key-env environment variable; the catch block now redacts that effective key. Updated clients/dotnet/README.md (smoke paragraph) to state the CLI redacts the effective key whether from --api-key or --api-key-env. Regression test MxGatewayClientCliTests.RunAsync_ErrorOutput_RedactsApiKey_WhenSourcedFromEnvironmentVariable sets a test env var, forces a transport error echoing the key, and asserts the key is absent and [redacted] is present; verified red against the original GetOptional("api-key")-only redaction (key printed unredacted).

Client.Dotnet-009

Field Value
Severity Low
Category Concurrency & thread safety
Location clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448
Status Resolved

Description: Client.Dotnet-003 upgraded MxGatewayClient._disposed to an int accessed via Interlocked.Exchange / Volatile.Read so a concurrent ThrowIfDisposed cannot observe a stale value. The symmetric GalaxyRepositoryClient._disposed is still a plain unsynchronised bool: DisposeAsync reads if (_disposed) then writes _disposed = true without Interlocked or Volatile, and ThrowIfDisposed does an unsynchronised read. The Galaxy client is publicly IAsyncDisposable and exposes TestConnectionAsync / GetLastDeployTimeAsync / DiscoverHierarchyAsync / WatchDeployEventsAsync as legal-to-call-concurrently public APIs, so a concurrent dispose can produce the same torn-read race the gateway client fix prevented. The two clients also exhibit the same shape (gRPC channel + transport + retry pipeline), so the divergence is an accidental inconsistency.

Recommendation: Mirror Client.Dotnet-003 on GalaxyRepositoryClient: change _disposed to an int, use Interlocked.Exchange(ref _disposed, 1) != 0 in DisposeAsync, and Volatile.Read(ref _disposed) != 0 in ThrowIfDisposed. A duplicated MxGatewaySession-style close-lock drain is unnecessary because GalaxyRepositoryClient does not own a per-call SemaphoreSlim.

Resolution: 2026-05-20 — Changed GalaxyRepositoryClient._disposed from bool to int; DisposeAsync now uses Interlocked.Exchange(ref _disposed, 1) != 0 for the once-only guard and ThrowIfDisposed uses Volatile.Read(ref _disposed) != 0, mirroring the Client.Dotnet-003 fix on MxGatewayClient.

Client.Dotnet-010

Field Value
Severity Low
Category Correctness & logic bugs
Location clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279
Status Resolved

Description: Client.Dotnet-005 fixed the silent Register / AddItem / AddItem2 handle-fallback to reply.ReturnValue.Int32Value inside MxGatewaySession, but the same fallback pattern was left in the CLI and is now also present in two new bench commands shipped after that fix. BenchReadBulkAsync (line 638) and BenchStreamEventsAsync (line 896) both do int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value; after a register call, and SmokeAsync (lines 1261 and 1279) passes reply => reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value and the equivalent AddItem?.ItemHandle selector to InvokeForHandleAsync. After EnsureProtocolSuccess + EnsureMxAccessSuccess pass but the worker did not set the typed register / add_item oneof case, all four call sites silently produce a zero handle and proceed to drive the rest of the smoke / bench against an invalid handle — exactly the failure mode the SDK-level fix prevents.

Recommendation: Either delegate to the SDK helpers (MxGatewaySession.RegisterAsync / AddItemAsync) which already throw the descriptive MxGatewayException via CreateMissingPayloadException, or replicate the same null-check explicitly in InvokeForHandleAsync and the two bench commands. A unit test that enqueues an Ok reply with no typed payload through FakeCliClient and asserts the smoke / bench commands fail loudly would prevent regression.

Resolution: 2026-05-20 — Added private CLI helpers RequireRegisterServerHandle and RequireAddItemItemHandle (with a shared CreateMissingPayloadException mirroring the SDK-level MxGatewaySession helper) that throw a descriptive MxGatewayException when the typed register / add_item payload is absent on an otherwise-successful reply. Replaced all four ?? reply.ReturnValue.Int32Value fallback sites — BenchReadBulkAsync (line 638), BenchStreamEventsAsync (line 896), and both SmokeAsync selectors (lines 1261, 1279) — with these helpers, so the CLI now fails loudly with the same shape as the SDK helpers rather than silently driving the rest of the command against a zero handle.

Client.Dotnet-011

Field Value
Severity Low
Category Concurrency & thread safety
Location clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015
Status Resolved

Description: The new bench-stream-events command (added in commit 1cd51bb) supports --session-count > 1 and runs each session's StreamEvents reader in parallel via openedSessions.Select(RunStreamAsync).ToArray() then Task.WhenAll. Inside the per-session lambda the inner Task.Run-spawned event loop updates two shared DateTime? fields without synchronisation:

if (firstSteadyEventUtc is null)
{
    firstSteadyEventUtc = nowUtc;
}
lastSteadyEventUtc = nowUtc;

The integer counters next to them (steadyEvents, steadyDataChangeEvents, warmupEvents) use Interlocked.Increment, and the latency list uses an explicit lock (latencyLock), so the rest of the loop is data-race-free — but these two DateTime? updates are not. With N parallel sessions a torn read on firstSteadyEventUtc produces a non-deterministic "first event time" and the final steadyElapsedSeconds = (lastSteadyEventUtc.Value - firstSteadyEventUtc.Value).TotalSeconds can compute a slightly wrong window. The user-visible impact is bench-only (skewed eventsPerSecond / dataChangeEventsPerSecond numbers), and on x64 the 64-bit DateTime field read/write happens to be atomic, so this is Low — but the pattern is inconsistent with the rest of the same loop.

Recommendation: Either guard the two DateTime? updates with the existing latencyLock (cheapest), use Interlocked.CompareExchange for firstSteadyEventUtc and Volatile.Write for lastSteadyEventUtc, or aggregate per-session in local variables and reduce after Task.WhenAll. The reduce-after approach also fixes a related issue: today a faster session can stomp firstSteadyEventUtc after a slower one already set it.

Resolution: 2026-05-20 — Guarded the firstSteadyEventUtc / lastSteadyEventUtc reads and writes inside the per-session event loop with the existing latencyLock. firstSteadyEventUtc now uses the null-coalescing assignment firstSteadyEventUtc ??= nowUtc; under the lock so a slower session can't stomp an earlier already-set value. The lock is already held by the latency-list append a few lines below, so the extra cost is one uncontended acquisition per event. The final read in the stats block runs after Task.WhenAll (happens-before applies) and stays lock-free.

Client.Dotnet-012

Field Value
Severity Low
Category Code organization & conventions
Location clients/dotnet/MxGateway.Client/MxGateway.Client.csproj, clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj, clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj
Status Resolved

Description: src/Directory.Build.props enforces TreatWarningsAsErrors=true, EnforceCodeStyleInBuild=true, AnalysisLevel=latest, and Deterministic=true for every gateway / worker / contracts project, and CLAUDE.md calls this out as a baseline build property. The .NET client projects live under clients/dotnet/ and there is no Directory.Build.props at clients/ or clients/dotnet/ — so none of those properties apply to MxGateway.Client, MxGateway.Client.Cli, or MxGateway.Client.Tests. New warnings in the client do not break the build, and code-style violations are not blocked at build time. The CSharpStyleGuide.md baseline ("Treat compiler warnings as actionable") and the CLAUDE.md table under "Source Update Workflow" both apply equally to .NET client ("dotnet build clients/dotnet/MxGateway.Client.sln"), but the enforcement floor is missing.

Recommendation: Add clients/dotnet/Directory.Build.props (or clients/Directory.Build.props covering Rust-Cargo siblings is N/A — only clients/dotnet/) carrying the same property set: TreatWarningsAsErrors=true, EnforceCodeStyleInBuild=true, AnalysisLevel=latest, Deterministic=true. Excluding generated code (which already lives under src/MxGateway.Contracts/Generated) is automatic because the client only references the contracts project. Build the client locally after adding it to confirm no warnings already snuck in.

Resolution: 2026-05-20 — Added clients/dotnet/Directory.Build.props mirroring src/Directory.Build.props: LangVersion=latest, Nullable=enable, ImplicitUsings=enable, TreatWarningsAsErrors=true, AnalysisLevel=latest, EnforceCodeStyleInBuild=true, Deterministic=true. The three client .csproj files inherit from it automatically. Re-ran dotnet build clients/dotnet/MxGateway.Client.sln and confirmed 0 warnings / 0 errors — no pre-existing warnings were silently being tolerated.

Client.Dotnet-013

Field Value
Severity Low
Category Code organization & conventions
Location clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24, clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187, clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6
Status Resolved

Description: Client.Dotnet-006 fixed three undocumented public members. Three more remain undocumented in code paths the prior review didn't visit:

  • DiscoverHierarchyOptions (the public record) has no <summary> on the type and no XML doc on any of its ten public properties (RootGobjectId, RootTagName, RootContainedPath, MaxDepth, CategoryIds, TemplateChainContains, TagNameGlob, IncludeAttributes, AlarmBearingOnly, HistorizedOnly).
  • The second DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken) overload on GalaxyRepositoryClient is public with no XML doc, while the parameterless overload one method above it carries a full <summary> / <param> block.
  • IMxGatewayCliClient is a public interface in the CLI project with no <summary> on the type (the member docs are present).

This is the same convention-violation shape Client.Dotnet-006 closed; CLAUDE.md style guidance describes XML docs on the public surface as the baseline expectation.

Recommendation: Add <summary> docs to each undocumented member. For DiscoverHierarchyOptions, the property names map cleanly to the underlying DiscoverHierarchyRequest proto fields — a one-line summary per property and a type-level summary tying the record to the Galaxy hierarchy browse is enough. The CLI interface only needs a type-level summary; the members already document themselves.

Resolution: 2026-05-20 — Added XML docs to all three call sites: a type-level summary plus a one-line summary per property on DiscoverHierarchyOptions (ten properties, mapped to the underlying DiscoverHierarchyRequest proto fields and noting the root-precedence rule); a <summary>/<param>/<returns> block on the second DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken) overload describing its filter semantics and transparent pagination; and a type-level <summary> on the public IMxGatewayCliClient interface explaining its CLI-only transport role and the production binding.

Client.Dotnet-014

Field Value
Severity Low
Category Testing coverage
Location clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98, clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231
Status Resolved

Description: Client.Dotnet-002 closed a coverage gap where the production retry path (RpcExceptionMxGatewayException mapping by RpcExceptionMapper.Map) was never exercised, by adding a MapTransportExceptions flag to FakeGatewayTransport and a regression test that runs through the wrapped-exception branch. That flag is wired through Translate(...) in OpenSessionAsync / CloseSessionAsync / InvokeAsync, but the new alarm test path is not: FakeGatewayTransport.AcknowledgeAlarmAsync throws the queued exception verbatim (line 219), bypassing Translate. The accompanying MxGatewayClientAlarmsTests.AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException test acknowledges this in a comment ("Note: the FakeGatewayTransport surfaces RpcException directly … the SDK-level test pins the pass-through shape so a future migration to direct mapping won't silently change observable behaviour") and asserts Assert.ThrowsAsync<RpcException> — but the production path through GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync (lines 120-134) already calls RpcExceptionMapper.Map, so production callers see MxGatewayAuthenticationException and not RpcException. The test name advertises mapping that the SDK-level harness doesn't exercise, and any callable from MxGatewayClient.AcknowledgeAlarmAsync cannot regress on the alarm-ack mapping without somebody noticing.

Recommendation: Either route FakeGatewayTransport.AcknowledgeAlarmAsync through the same Translate helper the other RPCs use and add a regression test that enables MapTransportExceptions = true and asserts MxGatewayAuthenticationException; or rename the existing test to make the pass-through shape explicit (e.g. …_SurfacesRpcExceptionFromFakeTransportVerbatim) and add a second test exercising the production mapping. Either fix closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for Invoke.

Resolution: 2026-05-20 — Applied both halves of the recommendation. Routed FakeGatewayTransport.AcknowledgeAlarmAsync through the same Translate helper the other RPCs use, so when MapTransportExceptions = true thrown RpcExceptions now run through the production RpcExceptionMapper.Map. Renamed the existing pass-through test to AcknowledgeAlarmAsync_SurfacesRpcExceptionFromFakeTransportVerbatim_WhenMappingDisabled (with an updated comment pinning that this shape only applies when mapping is off), and added a new test AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException that enables mapping and asserts the production-parity MxGatewayAuthenticationException with StatusCode.Unauthenticated. Closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for Invoke.

Client.Dotnet-015

Field Value
Severity Low
Category Correctness & logic bugs
Location clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:221-236, clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:596-1065
Status Resolved

Description: CreateCancellation(arguments, command) calls cancellation.CancelAfter(timeout) for every command except the explicitly long-running galaxy-watch, where timeout is arguments.GetDuration("timeout", TimeSpan.FromSeconds(30)). That same --timeout value is also fed into CreateOptions as DefaultCallTimeout, so the CLI uses one knob for two distinct things: per-call gRPC deadline and overall wall-clock cancellation budget. Both bench-read-bulk and bench-stream-events (introduced in 7db4bff and 1cd51bb) default to --duration-seconds=30 --warmup-seconds=3, which already exceeds the 30 s wall-clock budget; bench-stream-events --session-count=N adds another 750 ms × (N-1) of sessionStartStaggerMs before the measurement window even opens.

A manual invocation such as dotnet run --project clients/dotnet/MxGateway.Client.Cli -- bench-stream-events --endpoint ... --api-key ... therefore cancels mid-window every time: the outer CancellationTokenSource trips at 30 s and the bench's inner await Task.Delay(steadyEnd - warmupStart, cancellationToken) throws an OperationCanceledException before firstSteadyEventUtc/lastSteadyEventUtc are even populated, producing a zero steadyElapsedSeconds / 0 eventsPerSecond JSON payload that looks like a backend failure but is a self-inflicted CLI cancellation.

scripts/bench-read-bulk.ps1 already works around this for bench-read-bulk by computing $callTimeoutSeconds = [Math]::Max(60, $DurationSeconds + $WarmupSeconds + 30) and passing --timeout ${callTimeoutSeconds}s (line 125), so the driver flow is correct. But there is no PowerShell wrapper for bench-stream-events, and the bench is documented (in its own XML summary on line 792) as a single-client harness intended to be run directly. The trap is silent: no error is printed, just suspiciously-small numbers.

Recommendation: Either (a) extend the isLongRunning set in CreateCancellation to include bench-read-bulk and bench-stream-events, so manual invocation defers to caller-supplied --timeout and otherwise runs until the bench finishes; (b) compute an automatic minimum-floor --timeout for the bench commands from duration-seconds + warmup-seconds + headroom the way the PS driver does; or (c) split the --timeout knob into a distinct per-call --call-timeout and outer --wall-clock-timeout and document the two roles. Option (a) is the smallest change and matches the existing galaxy-watch precedent. Add a CLI test that runs bench-read-bulk with --duration-seconds=2 --warmup-seconds=0 --timeout=1s and asserts the bench either errors loudly or completes (today it silently emits zeros).

Resolution: 2026-05-20 — Applied option (a): extended the isLongRunning set in CreateCancellation from command is "galaxy-watch" to command is "galaxy-watch" or "bench-read-bulk" or "bench-stream-events", so the two bench commands now run until they finish (or Ctrl+C) by default and only apply a wall-clock budget when the caller explicitly supplies --timeout. A caller-supplied --timeout still flows through to DefaultCallTimeout for per-attempt gRPC deadlines on the unary calls these benches make. Matches the existing galaxy-watch precedent and removes the silent zero-throughput failure mode without breaking the scripts/bench-read-bulk.ps1 driver path (which explicitly raises --timeout).

Client.Dotnet-016

Field Value
Severity Low
Category Concurrency & thread safety
Location clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:922-976
Status Resolved

Description: BenchStreamEventsAsync.RunStreamAsync launches the per-session stream reader inside a Task.Run(async () => { ... }, streamCts.Token) and stores the returned task in the local streamTask. The recovery block

await Task.Delay(steadyEnd - warmupStart, cancellationToken).ConfigureAwait(false);
streamCts.Cancel();
try { await streamTask.ConfigureAwait(false); }
catch (OperationCanceledException) { }
catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) { }

only awaits streamTask (and therefore only observes its exception) when Task.Delay returns normally. When the outer cancellationToken cancels during the delay — exactly the case Client.Dotnet-015 makes likely — Task.Delay throws OperationCanceledException and skips both streamCts.Cancel() and the await streamTask. The inner stream task is still alive at that point. The using CancellationTokenSource streamCts = ... on line 924 disposes the linked CTS, which propagates cancellation to the inner stream (so it eventually exits), but the resulting OperationCanceledException / mapped MxGatewayException is never observed. The local streamTask reference is dropped as RunStreamAsync unwinds, leaving the task object eligible for garbage collection with an unobserved fault — a TaskScheduler.UnobservedTaskException.

The secondary Grpc.Core.RpcException catch on line 975 is also dead in this code path: the production GrpcMxGatewayClientTransport.StreamEventsAsync always wraps RpcException via RpcExceptionMapper.Map, which returns OperationCanceledException for StatusCode.Cancelled (mapper line 31). So the inner task's cancellation exception is always OperationCanceledException, not RpcException. Harmless when the recovery block runs, but it underscores that the cancellation path was only tested for the happy case.

Recommendation: Restructure RunStreamAsync so the inner streamTask is always observed. A try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} } shape works (the finally runs even on outer cancellation). Alternatively, hoist streamTask into a local that the outer method's try/finally always awaits before exiting, so the per-session loop becomes await Task.WhenAny(streamTask, Task.Delay(...)) then a guaranteed await streamTask. Drop the now-redundant Grpc.Core.RpcException catch or convert it to catch MxGatewayException for the wrapped shape (and document that it should never fire in production).

Resolution: 2026-05-20 — Restructured RunStreamAsync to wrap the Task.Delay in try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }, so the inner stream task is observed on every path — including when the outer cancellationToken cancels during the delay. Dropped the dead catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) clause (the production GrpcMxGatewayClientTransport.StreamEventsAsync routes through RpcExceptionMapper.Map, which returns OperationCanceledException for StatusCode.Cancelled, so an RpcException never reaches here) and replaced it with catch (MxGatewayException) to absorb the wrapped shape for any non-cancellation mapper output. Added an inline comment naming the finding and documenting why the new catch shape is correct. Eliminates the latent TaskScheduler.UnobservedTaskException whenever the outer cancellation fires mid-measurement-window.

Client.Dotnet-017

Field Value
Severity Low
Category Error handling & resilience
Location clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:1190-1262
Status Resolved

Description: Surfaced during the 2026-05-20 cross-language e2e matrix run: dotnet run --project clients/dotnet/MxGateway.Client.Cli -- stream-events --endpoint http://localhost:5120 --api-key-env MXGATEWAY_API_KEY --timeout 60s --json --session-id session-... --max-events 200 exited with -532462766 (unhandled-exception exit code) and propagated System.OperationCanceledException: Call canceled by the client. mapped from Status(StatusCode="Cancelled", …). The CLI's StreamEventsAsync does await foreach (... in client.StreamEventsAsync(...).WithCancellation(cancellationToken)) and never catches OperationCanceledException. When the caller's --timeout (driven by CreateCancellation's CancelAfter) fires before --max-events is reached — the common case for a finite-window event collector against a quiet test rig — the foreach throws, the exception bubbles up, the process exits non-zero, and any --json aggregate output is never written. The other client CLIs (Go, Rust, Python, Java) all exit 0 in this case (e2e clients g/r/p ran clean). The bug is also a strict regression of the CLI's contract: callers can't tell "stream collected 0N events then the budget closed" apart from "the call genuinely failed".

Recommendation: Wrap the await foreach in try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }. The when clause ensures only the supplied cancellation token (which covers --timeout, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with --timeout 1s against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events.

Resolution: 2026-05-20 — Wrapped the await foreach in try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { } so the CLI exits 0 and emits the aggregate { "events": [...] } JSON when the supplied token cancels (the --timeout, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's when clause ensures non-token-driven cancellation still propagates. Added regression test MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero that yields two events, parks on the cancellation token via a new FakeCliClient.StreamHangAfterEvents hook, runs the CLI with --timeout 1s --json --max-events 200, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts.