Files

T

Joseph Doherty 712cb06442 Resolve Client.Dotnet-018..021: README + bench-read-bulk hardening

Client.Dotnet-018  README CLI examples for stream-alarms / acknowledge-alarm
                   replaced with parser-correct flags; new theory test
                   parses each documented README example through the CLI.
Client.Dotnet-019  BenchReadBulkAsync routes through new
                   RequireRegisterServerHandle helper that fails loudly when
                   the OK register reply has no typed payload.
Client.Dotnet-020  Bench steady-state catch is now
                   catch (Exception ex) when (ex is not OperationCanceledException)
                   so user-driven cancellation exits promptly.
Client.Dotnet-021  --timeout-ms now flows through ParseTimeoutMs which
                   rejects negatives with a clear error in both read-bulk
                   and bench-read-bulk.

All resolved at 2026-05-24; 67/67 .NET client tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

2026-05-24 08:49:45 -04:00

63 KiB

Raw Blame History

Code Review — Client.Dotnet

Field	Value
Module	`clients/dotnet`
Reviewer	Claude Code
Review date	2026-05-24
Commit reviewed	`42b0037`
Status	Re-reviewed
Open findings	0

Checklist coverage

#	Category	Result
1	Correctness & logic bugs	Issue found (this review): the global CLI `--timeout` defaults to 30 s and is used both as the gRPC `DefaultCallTimeout` and as the outer `CancelAfter` budget — but `bench-read-bulk` / `bench-stream-events` default to `--duration-seconds=30 --warmup-seconds=3 (+ stagger)`, so direct manual invocation cancels the bench mid-window before the steady-state ends (Client.Dotnet-015). The `scripts/bench-read-bulk.ps1` driver works around this by raising `--timeout`, but `bench-stream-events` has no driver script.
2	mxaccessgw conventions	Good — consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. The new `clients/dotnet/Directory.Build.props` mirrors `src/Directory.Build.props` exactly (same six properties, identical values) so the enforcement floor is back in scope.
3	Concurrency & thread safety	Issue found (this review): `BenchStreamEventsAsync`'s per-session `RunStreamAsync` hands the inner `Task.Run` stream loop a reference (`streamTask`) that becomes unobserved whenever the outer `cancellationToken` cancels during the bench's `await Task.Delay` — the `await streamTask` recovery path never runs, so any inner OCE / `RpcException` raised after cancellation surfaces as a `TaskScheduler.UnobservedTaskException` (Client.Dotnet-016). The Client.Dotnet-009 / 011 fixes from the previous pass are correctly applied.
4	Error handling & resilience	No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved; `RpcExceptionMapper` is consistently called from both gateway and Galaxy transports incl. `AcknowledgeAlarmAsync` after Client.Dotnet-014).
5	Security	Good — API key never logged by the library, CLI redacts effective key (both `--api-key` and `--api-key-env` sourced) after Client.Dotnet-008, TLS custom-root validation correct, secured-write payloads never logged.
6	Performance & resource management	No issues found — channels and streaming calls disposed correctly, retry pipeline shares one timeout budget per safe-unary op.
7	Design-document adherence	No issues found — matches `DotnetClientDesign.md` and `ClientLibrariesDesign.md`.
8	Code organization & conventions	No new issues — Client.Dotnet-012 (Directory.Build.props) and Client.Dotnet-013 (missing XML docs on `DiscoverHierarchyOptions`, the second `DiscoverHierarchyAsync` overload, and `IMxGatewayCliClient`) are both fully resolved; the new props file is a faithful mirror of the production one.
9	Testing coverage	No new issues — Client.Dotnet-014 closed the alarm-side `Translate` gap. The new bench paths (`bench-read-bulk`, `bench-stream-events`) have no unit-test coverage, but they are stress harnesses driven by `scripts/bench-read-bulk.ps1`, not SDK API surface, so this is not flagged.
10	Documentation & comments	No new issues this review (Client.Dotnet-007's alarm-ack `admin`-scope correction holds; `DefaultCallTimeout` doc accurately reflects the shared-budget semantics from Client.Dotnet-004).

2026-05-24 review (commit `d692232`)

Re-review pass at d692232. Diff against a020350 consists of the ZB.MOM.WW client prefix rename in commit 397d3c5 (folders, csprojs, sln→slnx, every namespace and using) plus the hand-written DiscoverHierarchyOptions.cs POCO and the dropped retired SessionId = lines from alarm-related test fixtures. The rename was applied via a case-insensitive regex sweep; no over-rename artifacts found. The mxgw_* API-key wire prefix, MXGATEWAY_* environment variables, and the MxGatewayClient / MxGatewaySession type names are unchanged. Build and tests are green at HEAD.

#	Category	Result
1	Correctness & logic bugs	No issues found in the a020350..d692232 diff.
2	mxaccessgw conventions	No issues found — rename hygiene clean; wire identifiers preserved.
3	Concurrency & thread safety	No issues found in this diff.
4	Error handling & resilience	No issues found in this diff.
5	Security	No issues found in this diff.
6	Performance & resource management	No issues found in this diff.
7	Design-document adherence	No issues found in this diff — `DotnetClientDesign.md` reflects the new layout.
8	Code organization & conventions	No issues found in this diff.
9	Testing coverage	No issues found in this diff — `MxGatewayClientAlarmsTests` fixtures correctly drop `SessionId` from `AcknowledgeAlarmRequest`/`Reply` and retain it on `QueryActiveAlarmsRequest`.
10	Documentation & comments	No issues found in this diff.

Findings

Client.Dotnet-001

Field	Value
Severity	Medium
Category	Error handling & resilience
Location	`clients/dotnet/MxGateway.Client/GrpcMxGatewayClientTransport.cs:190-199`, `clients/dotnet/MxGateway.Client/GrpcGalaxyRepositoryClientTransport.cs:131-140`
Status	Resolved

Description: MapRpcException only produces typed exceptions for Unauthenticated and PermissionDenied. Every other gRPC status — NotFound, InvalidArgument, ResourceExhausted, FailedPrecondition, Unavailable, Internal — collapses into the base MxGatewayException with no surfaced StatusCode. Callers cannot programmatically distinguish a transient outage from a permanent bad-argument error without reflecting into InnerException and downcasting to RpcException.

Recommendation: Carry the gRPC StatusCode on MxGatewayException (e.g. a StatusCode property) and/or add typed subclasses for at least NotFound, InvalidArgument, and Unavailable. Populate it from exception.StatusCode in MapRpcException.

Resolution: (2026-05-18) Confirmed against source: both transports had a duplicated private MapRpcException that only typed two statuses and discarded the gRPC code for the rest. Added a nullable StatusCode property (Grpc.Core.StatusCode?) to MxGatewayException plus constructors that carry it, threaded it through MxGatewayAuthenticationException/MxGatewayAuthorizationException, and extracted the two duplicated mappers into a single shared internal RpcExceptionMapper (RpcExceptionMapper.cs) that populates StatusCode from exception.StatusCode for every status. Callers can now distinguish transient from permanent failures without downcasting InnerException. Documented in clients/dotnet/README.md. Regression test: RpcExceptionMapperTests (8 cases incl. the [Theory] over NotFound/InvalidArgument/ResourceExhausted/FailedPrecondition/Unavailable/Internal).

Client.Dotnet-002

Field	Value
Severity	Medium
Category	Testing coverage
Location	`clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:145-148`, `clients/dotnet/MxGateway.Client.Tests/MxGatewayClientSessionTests.cs:236-256`
Status	Resolved

Description: The retry predicate MxGatewayClientRetryPolicy.IsTransientGrpcFailure handles two shapes: a raw RpcException and an MxGatewayException { InnerException: RpcException }. In production the transport always maps RpcException → MxGatewayException before it reaches the retry pipeline, so only the wrapped-MxGatewayException branch ever runs in production. But FakeGatewayTransport throws the raw RpcException and never maps it, so every retry test exercises only the raw-RpcException branch — the branch that never occurs in production. The production retry behaviour is effectively untested.

Recommendation: Add a fake/transport mode that maps RpcException to MxGatewayException the way GrpcMxGatewayClientTransport does (or add tests that enqueue a pre-wrapped MxGatewayException), so the actually-used predicate branch is covered.

Resolution: (2026-05-18) Confirmed against source: FakeGatewayTransport threw queued exceptions verbatim, so the existing retry tests only ever hit the raw-RpcException predicate branch. Added a MapTransportExceptions flag to FakeGatewayTransport that, when set, runs thrown RpcExceptions through the same shared RpcExceptionMapper the production gRPC transport uses, producing the wrapped MxGatewayException shape. Added regression test MxGatewayClientSessionTests.InvokeAsync_RetriesSafeDiagnosticCommand_WhenTransportMapsRpcException, which exercises the previously-untested production predicate branch. Verified red: removing the MxGatewayException { InnerException: RpcException } case from IsTransientGrpcFailure fails the new test while the pre-existing raw-RpcException test still passes.

Client.Dotnet-003

Field	Value
Severity	Medium
Category	Concurrency & thread safety
Location	`clients/dotnet/MxGateway.Client/MxGatewaySession.cs:659-663`, `clients/dotnet/MxGateway.Client/MxGatewayClient.cs:230-240`
Status	Resolved

Description: DisposeAsync calls CloseAsync() (no token) then unconditionally _closeLock.Dispose(). If another thread is concurrently awaiting CloseAsync(token) — legal, since the type exposes public async methods and no single-threaded contract — disposing the SemaphoreSlim while a WaitAsync is pending throws ObjectDisposedException into that caller. The _disposed flags in both clients are also plain unsynchronised bool reads/writes; ThrowIfDisposed racing DisposeAsync can observe a stale value.

Recommendation: Either document MxGatewaySession/MxGatewayClient as not thread-safe for concurrent dispose, or guard _disposed with Interlocked/volatile and avoid disposing _closeLock until all in-flight CloseAsync calls complete.

Resolution: (2026-05-18) Confirmed against source: MxGatewaySession.DisposeAsync disposed _closeLock unconditionally, racing concurrent CloseAsync callers; MxGatewayClient._disposed was a plain bool. Fixed MxGatewaySession by tracking in-flight CloseAsync callers with an _activeCloseCount guarded by a dedicated _disposeGate lock and a _closeLockDisposed flag: CloseAsync registers under the gate (and throws ObjectDisposedException if disposal already won) before awaiting _closeLock.WaitAsync, and DisposeAsync drains _activeCloseCount to zero before disposing the semaphore, so the close lock provably outlives every pending WaitAsync. Fixed MxGatewayClient by changing _disposed to an int accessed via Interlocked.Exchange/Volatile.Read. Regression test MxGatewayClientSessionTests.DisposeAsync_DoesNotRaceConcurrentCloseAsync runs 100 iterations with one close holding the lock and one parked behind it while DisposeAsync runs concurrently; verified red against the original DisposeAsync (fails with ObjectDisposedException), green after the fix.

Client.Dotnet-004

Field	Value
Severity	Low
Category	Error handling & resilience
Location	`clients/dotnet/MxGateway.Client/MxGatewayClient.cs:283-294`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:392-403`
Status	Resolved

Description: ExecuteSafeUnaryAsync wraps the whole Polly retry pipeline in a single linked CTS cancelled after Options.DefaultCallTimeout, while CreateCallOptions also stamps each individual call with a DefaultCallTimeout gRPC deadline. The retry pipeline therefore shares one DefaultCallTimeout budget across the initial attempt plus all retries plus backoff delays. The README/XML docs describe DefaultCallTimeout as a per-call timeout, which misrepresents this. DeadlineExceeded is also classified as transient, so an attempt that exhausts the shared budget is retried only to immediately fail again.

Recommendation: Decide whether DefaultCallTimeout is per-attempt or per-operation and make code and docs consistent — e.g. a separate per-attempt deadline and a distinct overall-operation timeout. Reconsider retrying on DeadlineExceeded when the deadline was client-imposed.

Resolution: (2026-05-18) Confirmed against source: the shared linked-CTS budget plus per-call deadline both use DefaultCallTimeout, and IsTransientStatus listed DeadlineExceeded. Resolved as a per-operation budget (the simpler, non-breaking choice): the DefaultCallTimeout XML doc in MxGatewayClientOptions.cs now states it is both the per-attempt gRPC deadline and the overall budget shared across the initial attempt, every retry, and the backoff delays — an upper bound on total wall-clock time, not a fresh per-retry allowance. Removed DeadlineExceeded from MxGatewayClientRetryPolicy.IsTransientStatus: every unary deadline is client-imposed (CreateCallOptions stamps the shared budget), so a DeadlineExceeded means the budget is exhausted and an immediate retry can only fail again. Regression test MxGatewayClientSessionTests.InvokeAsync_DoesNotRetrySafeDiagnosticCommand_OnDeadlineExceeded asserts the safe diagnostic command (Ping) is attempted exactly once and the failure surfaces; verified red against the original transient set (the call retried and succeeded).

Client.Dotnet-005

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`clients/dotnet/MxGateway.Client/MxGatewaySession.cs:82,124,175`
Status	Resolved

Description: RegisterAsync/AddItemAsync/AddItem2Async return reply.<Typed>?.ServerHandle ?? reply.ReturnValue.Int32Value. After EnsureMxAccessSuccess() passes, a missing typed payload silently falls back to ReturnValue.Int32Value, which for a reply carrying no return value is 0. A caller then uses 0 as a ServerHandle/ItemHandle, producing a confusing downstream invalid-handle failure rather than a clear "gateway reply missing payload" error.

Recommendation: If the typed sub-message is the contract for these commands, treat its absence on an otherwise-successful reply as an error (throw a descriptive MxGatewayException) rather than falling through to ReturnValue.Int32Value.

Resolution: (2026-05-18) Confirmed against source and mxaccess_gateway.proto: register/add_item/add_item2 are members of the MxCommandReply.payload oneof, so the typed accessor is null whenever the worker did not set that case — and the fallback returned ReturnValue.Int32Value (0 for a reply with no return value). The typed sub-message is the contract for these handle-returning commands, so its absence on an otherwise-successful reply is now an error: RegisterAsync/AddItemAsync/AddItem2Async throw via a new private MxGatewaySession.CreateMissingPayloadException helper that builds a descriptive MxGatewayException naming the missing payload, kind, session, and correlation id. Regression tests MxGatewayClientSessionTests.RegisterAsync_Throws_WhenSuccessfulReplyMissingPayload and AddItemAsync_Throws_WhenSuccessfulReplyMissingPayload enqueue an Ok reply with no typed payload and assert the descriptive throw; verified red against the original fallback (returned 0 instead of throwing).

Client.Dotnet-006

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`clients/dotnet/MxGateway.Client/MxGatewayClientOptions.cs:50`, `clients/dotnet/MxGateway.Client/MxGatewayClientContractInfo.cs:10-14`
Status	Resolved

Description: MxGatewayClientOptions.MaxGrpcMessageBytes and the two consts in MxGatewayClientContractInfo are public members with no XML doc comments, inconsistent with every other public member in the assembly and with the repo's documented C# style emphasis on a documented public surface.

Recommendation: Add <summary> doc comments to MaxGrpcMessageBytes, GatewayProtocolVersion, and WorkerProtocolVersion.

Resolution: (2026-05-18) Confirmed: all three public members lacked XML docs while every other public member in the assembly is documented. Added <summary> comments to MxGatewayClientOptions.MaxGrpcMessageBytes (describing the 16 MiB default applied to both send and receive limits), and to MxGatewayClientContractInfo.GatewayProtocolVersion and WorkerProtocolVersion (describing their wire-compatibility / diagnostics purpose). Pure documentation change — no test needed; build remains warning-clean.

Client.Dotnet-007

Field	Value
Severity	Low
Category	Documentation & comments
Location	`clients/dotnet/MxGateway.Client/MxGatewayClient.cs:185-192`
Status	Resolved

Description: The AcknowledgeAlarmAsync XML comment states the gateway authenticates against an invoke:alarm-ack scope, but CLAUDE.md documents the scope set without any invoke:alarm-ack sub-scope. The comment may describe an intended finer-grained scope that does not exist, misleading integrators about what API key they need.

Recommendation: Reconcile the comment with the actual server-side scope check, or update the scope documentation if sub-scopes were genuinely added; keep client doc and gateway auth model in sync.

Resolution: (2026-05-18) Confirmed against the server-side authorization model: GatewayGrpcScopeResolver.ResolveRequiredScope has no arm for AcknowledgeAlarmRequest, so it falls to the _ => GatewayScopes.Admin default — the RPC actually requires the admin scope. No invoke:alarm-ack sub-scope exists anywhere in GatewayScopes. The client XML comment on AcknowledgeAlarmAsync was wrong, not the docs. Corrected the comment to state the gateway authorizes AcknowledgeAlarmRequest against the API key's admin scope and that there is no finer-grained alarm-ack sub-scope. Pure documentation change — no test needed.

Client.Dotnet-008

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`clients/dotnet/MxGateway.Client.Cli/MxGatewayCliSecretRedactor.cs:9-17`
Status	Resolved

Description: The CLI redactor only removes the API key string when it was supplied via --api-key; RunCoreAsync passes arguments.GetOptional("api-key") to Redact. When the key comes from an environment variable (--api-key-env, the documented default path), apiKey is null and no redaction occurs. If a gRPC/transport error message ever echoes the bearer token, it would be printed unredacted.

Recommendation: Resolve the effective API key (same logic as ResolveApiKey) before redacting, so the env-var-sourced key is also stripped from error output.

Resolution: (2026-05-18) Confirmed against source: MxGatewayClientCli.RunCoreAsync's catch block redacted only arguments.GetOptional("api-key"), so an env-var-sourced key (--api-key-env, default MXGATEWAY_API_KEY) was never stripped. Note MxGatewayCliSecretRedactor itself is correct — the defect was the caller passing the wrong value. Extracted a non-throwing TryResolveApiKey helper (used by both the existing ResolveApiKey and the catch block) that resolves --api-key then the --api-key-env environment variable; the catch block now redacts that effective key. Updated clients/dotnet/README.md (smoke paragraph) to state the CLI redacts the effective key whether from --api-key or --api-key-env. Regression test MxGatewayClientCliTests.RunAsync_ErrorOutput_RedactsApiKey_WhenSourcedFromEnvironmentVariable sets a test env var, forces a transport error echoing the key, and asserts the key is absent and [redacted] is present; verified red against the original GetOptional("api-key")-only redaction (key printed unredacted).

Client.Dotnet-009

Field	Value
Severity	Low
Category	Concurrency & thread safety
Location	`clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:26,339-348,445-448`
Status	Resolved

Description: Client.Dotnet-003 upgraded MxGatewayClient._disposed to an int accessed via Interlocked.Exchange / Volatile.Read so a concurrent ThrowIfDisposed cannot observe a stale value. The symmetric GalaxyRepositoryClient._disposed is still a plain unsynchronised bool: DisposeAsync reads if (_disposed) then writes _disposed = true without Interlocked or Volatile, and ThrowIfDisposed does an unsynchronised read. The Galaxy client is publicly IAsyncDisposable and exposes TestConnectionAsync / GetLastDeployTimeAsync / DiscoverHierarchyAsync / WatchDeployEventsAsync as legal-to-call-concurrently public APIs, so a concurrent dispose can produce the same torn-read race the gateway client fix prevented. The two clients also exhibit the same shape (gRPC channel + transport + retry pipeline), so the divergence is an accidental inconsistency.

Recommendation: Mirror Client.Dotnet-003 on GalaxyRepositoryClient: change _disposed to an int, use Interlocked.Exchange(ref _disposed, 1) != 0 in DisposeAsync, and Volatile.Read(ref _disposed) != 0 in ThrowIfDisposed. A duplicated MxGatewaySession-style close-lock drain is unnecessary because GalaxyRepositoryClient does not own a per-call SemaphoreSlim.

Resolution: 2026-05-20 — Changed GalaxyRepositoryClient._disposed from bool to int; DisposeAsync now uses Interlocked.Exchange(ref _disposed, 1) != 0 for the once-only guard and ThrowIfDisposed uses Volatile.Read(ref _disposed) != 0, mirroring the Client.Dotnet-003 fix on MxGatewayClient.

Client.Dotnet-010

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:638,896,1261,1279`
Status	Resolved

Description: Client.Dotnet-005 fixed the silent Register / AddItem / AddItem2 handle-fallback to reply.ReturnValue.Int32Value inside MxGatewaySession, but the same fallback pattern was left in the CLI and is now also present in two new bench commands shipped after that fix. BenchReadBulkAsync (line 638) and BenchStreamEventsAsync (line 896) both do int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value; after a register call, and SmokeAsync (lines 1261 and 1279) passes reply => reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value and the equivalent AddItem?.ItemHandle selector to InvokeForHandleAsync. After EnsureProtocolSuccess + EnsureMxAccessSuccess pass but the worker did not set the typed register / add_item oneof case, all four call sites silently produce a zero handle and proceed to drive the rest of the smoke / bench against an invalid handle — exactly the failure mode the SDK-level fix prevents.

Recommendation: Either delegate to the SDK helpers (MxGatewaySession.RegisterAsync / AddItemAsync) which already throw the descriptive MxGatewayException via CreateMissingPayloadException, or replicate the same null-check explicitly in InvokeForHandleAsync and the two bench commands. A unit test that enqueues an Ok reply with no typed payload through FakeCliClient and asserts the smoke / bench commands fail loudly would prevent regression.

Resolution: 2026-05-20 — Added private CLI helpers RequireRegisterServerHandle and RequireAddItemItemHandle (with a shared CreateMissingPayloadException mirroring the SDK-level MxGatewaySession helper) that throw a descriptive MxGatewayException when the typed register / add_item payload is absent on an otherwise-successful reply. Replaced all four ?? reply.ReturnValue.Int32Value fallback sites — BenchReadBulkAsync (line 638), BenchStreamEventsAsync (line 896), and both SmokeAsync selectors (lines 1261, 1279) — with these helpers, so the CLI now fails loudly with the same shape as the SDK helpers rather than silently driving the rest of the command against a zero handle.

Client.Dotnet-011

Field	Value
Severity	Low
Category	Concurrency & thread safety
Location	`clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:857-858,922-963,1014-1015`
Status	Resolved

Description: The new bench-stream-events command (added in commit 1cd51bb) supports --session-count > 1 and runs each session's StreamEvents reader in parallel via openedSessions.Select(RunStreamAsync).ToArray() then Task.WhenAll. Inside the per-session lambda the inner Task.Run-spawned event loop updates two shared DateTime? fields without synchronisation:

if (firstSteadyEventUtc is null)
{
    firstSteadyEventUtc = nowUtc;
}
lastSteadyEventUtc = nowUtc;

The integer counters next to them (steadyEvents, steadyDataChangeEvents, warmupEvents) use Interlocked.Increment, and the latency list uses an explicit lock (latencyLock), so the rest of the loop is data-race-free — but these two DateTime? updates are not. With N parallel sessions a torn read on firstSteadyEventUtc produces a non-deterministic "first event time" and the final steadyElapsedSeconds = (lastSteadyEventUtc.Value - firstSteadyEventUtc.Value).TotalSeconds can compute a slightly wrong window. The user-visible impact is bench-only (skewed eventsPerSecond / dataChangeEventsPerSecond numbers), and on x64 the 64-bit DateTime field read/write happens to be atomic, so this is Low — but the pattern is inconsistent with the rest of the same loop.

Recommendation: Either guard the two DateTime? updates with the existing latencyLock (cheapest), use Interlocked.CompareExchange for firstSteadyEventUtc and Volatile.Write for lastSteadyEventUtc, or aggregate per-session in local variables and reduce after Task.WhenAll. The reduce-after approach also fixes a related issue: today a faster session can stomp firstSteadyEventUtc after a slower one already set it.

Resolution: 2026-05-20 — Guarded the firstSteadyEventUtc / lastSteadyEventUtc reads and writes inside the per-session event loop with the existing latencyLock. firstSteadyEventUtc now uses the null-coalescing assignment firstSteadyEventUtc ??= nowUtc; under the lock so a slower session can't stomp an earlier already-set value. The lock is already held by the latency-list append a few lines below, so the extra cost is one uncontended acquisition per event. The final read in the stats block runs after Task.WhenAll (happens-before applies) and stays lock-free.

Client.Dotnet-012

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`clients/dotnet/MxGateway.Client/MxGateway.Client.csproj`, `clients/dotnet/MxGateway.Client.Cli/MxGateway.Client.Cli.csproj`, `clients/dotnet/MxGateway.Client.Tests/MxGateway.Client.Tests.csproj`
Status	Resolved

Description: src/Directory.Build.props enforces TreatWarningsAsErrors=true, EnforceCodeStyleInBuild=true, AnalysisLevel=latest, and Deterministic=true for every gateway / worker / contracts project, and CLAUDE.md calls this out as a baseline build property. The .NET client projects live under clients/dotnet/ and there is no Directory.Build.props at clients/ or clients/dotnet/ — so none of those properties apply to MxGateway.Client, MxGateway.Client.Cli, or MxGateway.Client.Tests. New warnings in the client do not break the build, and code-style violations are not blocked at build time. The CSharpStyleGuide.md baseline ("Treat compiler warnings as actionable") and the CLAUDE.md table under "Source Update Workflow" both apply equally to .NET client ("dotnet build clients/dotnet/MxGateway.Client.sln"), but the enforcement floor is missing.

Recommendation: Add clients/dotnet/Directory.Build.props (or clients/Directory.Build.props covering Rust-Cargo siblings is N/A — only clients/dotnet/) carrying the same property set: TreatWarningsAsErrors=true, EnforceCodeStyleInBuild=true, AnalysisLevel=latest, Deterministic=true. Excluding generated code (which already lives under src/MxGateway.Contracts/Generated) is automatic because the client only references the contracts project. Build the client locally after adding it to confirm no warnings already snuck in.

Resolution: 2026-05-20 — Added clients/dotnet/Directory.Build.props mirroring src/Directory.Build.props: LangVersion=latest, Nullable=enable, ImplicitUsings=enable, TreatWarningsAsErrors=true, AnalysisLevel=latest, EnforceCodeStyleInBuild=true, Deterministic=true. The three client .csproj files inherit from it automatically. Re-ran dotnet build clients/dotnet/MxGateway.Client.sln and confirmed 0 warnings / 0 errors — no pre-existing warnings were silently being tolerated.

Client.Dotnet-013

Field	Value
Severity	Low
Category	Code organization & conventions
Location	`clients/dotnet/MxGateway.Client/DiscoverHierarchyOptions.cs:3-24`, `clients/dotnet/MxGateway.Client/GalaxyRepositoryClient.cs:185-187`, `clients/dotnet/MxGateway.Client.Cli/IMxGatewayCliClient.cs:6`
Status	Resolved

Description: Client.Dotnet-006 fixed three undocumented public members. Three more remain undocumented in code paths the prior review didn't visit:

DiscoverHierarchyOptions (the public record) has no <summary> on the type and no XML doc on any of its ten public properties (RootGobjectId, RootTagName, RootContainedPath, MaxDepth, CategoryIds, TemplateChainContains, TagNameGlob, IncludeAttributes, AlarmBearingOnly, HistorizedOnly).
The second DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken) overload on GalaxyRepositoryClient is public with no XML doc, while the parameterless overload one method above it carries a full <summary> / <param> block.
IMxGatewayCliClient is a public interface in the CLI project with no <summary> on the type (the member docs are present).

This is the same convention-violation shape Client.Dotnet-006 closed; CLAUDE.md style guidance describes XML docs on the public surface as the baseline expectation.

Recommendation: Add <summary> docs to each undocumented member. For DiscoverHierarchyOptions, the property names map cleanly to the underlying DiscoverHierarchyRequest proto fields — a one-line summary per property and a type-level summary tying the record to the Galaxy hierarchy browse is enough. The CLI interface only needs a type-level summary; the members already document themselves.

Resolution: 2026-05-20 — Added XML docs to all three call sites: a type-level summary plus a one-line summary per property on DiscoverHierarchyOptions (ten properties, mapped to the underlying DiscoverHierarchyRequest proto fields and noting the root-precedence rule); a <summary>/<param>/<returns> block on the second DiscoverHierarchyAsync(DiscoverHierarchyOptions, CancellationToken) overload describing its filter semantics and transparent pagination; and a type-level <summary> on the public IMxGatewayCliClient interface explaining its CLI-only transport role and the production binding.

Client.Dotnet-014

Field	Value
Severity	Low
Category	Testing coverage
Location	`clients/dotnet/MxGateway.Client.Tests/MxGatewayClientAlarmsTests.cs:76-98`, `clients/dotnet/MxGateway.Client.Tests/FakeGatewayTransport.cs:212-231`
Status	Resolved

Description: Client.Dotnet-002 closed a coverage gap where the production retry path (RpcException → MxGatewayException mapping by RpcExceptionMapper.Map) was never exercised, by adding a MapTransportExceptions flag to FakeGatewayTransport and a regression test that runs through the wrapped-exception branch. That flag is wired through Translate(...) in OpenSessionAsync / CloseSessionAsync / InvokeAsync, but the new alarm test path is not: FakeGatewayTransport.AcknowledgeAlarmAsync throws the queued exception verbatim (line 219), bypassing Translate. The accompanying MxGatewayClientAlarmsTests.AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException test acknowledges this in a comment ("Note: the FakeGatewayTransport surfaces RpcException directly … the SDK-level test pins the pass-through shape so a future migration to direct mapping won't silently change observable behaviour") and asserts Assert.ThrowsAsync<RpcException> — but the production path through GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync (lines 120-134) already calls RpcExceptionMapper.Map, so production callers see MxGatewayAuthenticationException and not RpcException. The test name advertises mapping that the SDK-level harness doesn't exercise, and any callable from MxGatewayClient.AcknowledgeAlarmAsync cannot regress on the alarm-ack mapping without somebody noticing.

Recommendation: Either route FakeGatewayTransport.AcknowledgeAlarmAsync through the same Translate helper the other RPCs use and add a regression test that enables MapTransportExceptions = true and asserts MxGatewayAuthenticationException; or rename the existing test to make the pass-through shape explicit (e.g. …_SurfacesRpcExceptionFromFakeTransportVerbatim) and add a second test exercising the production mapping. Either fix closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for Invoke.

Resolution: 2026-05-20 — Applied both halves of the recommendation. Routed FakeGatewayTransport.AcknowledgeAlarmAsync through the same Translate helper the other RPCs use, so when MapTransportExceptions = true thrown RpcExceptions now run through the production RpcExceptionMapper.Map. Renamed the existing pass-through test to AcknowledgeAlarmAsync_SurfacesRpcExceptionFromFakeTransportVerbatim_WhenMappingDisabled (with an updated comment pinning that this shape only applies when mapping is off), and added a new test AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException that enables mapping and asserts the production-parity MxGatewayAuthenticationException with StatusCode.Unauthenticated. Closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for Invoke.

Client.Dotnet-015

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:221-236`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:596-1065`
Status	Resolved

Description: CreateCancellation(arguments, command) calls cancellation.CancelAfter(timeout) for every command except the explicitly long-running galaxy-watch, where timeout is arguments.GetDuration("timeout", TimeSpan.FromSeconds(30)). That same --timeout value is also fed into CreateOptions as DefaultCallTimeout, so the CLI uses one knob for two distinct things: per-call gRPC deadline and overall wall-clock cancellation budget. Both bench-read-bulk and bench-stream-events (introduced in 7db4bff and 1cd51bb) default to --duration-seconds=30 --warmup-seconds=3, which already exceeds the 30 s wall-clock budget; bench-stream-events --session-count=N adds another 750 ms × (N-1) of sessionStartStaggerMs before the measurement window even opens.

A manual invocation such as dotnet run --project clients/dotnet/MxGateway.Client.Cli -- bench-stream-events --endpoint ... --api-key ... therefore cancels mid-window every time: the outer CancellationTokenSource trips at 30 s and the bench's inner await Task.Delay(steadyEnd - warmupStart, cancellationToken) throws an OperationCanceledException before firstSteadyEventUtc/lastSteadyEventUtc are even populated, producing a zero steadyElapsedSeconds / 0 eventsPerSecond JSON payload that looks like a backend failure but is a self-inflicted CLI cancellation.

scripts/bench-read-bulk.ps1 already works around this for bench-read-bulk by computing $callTimeoutSeconds = [Math]::Max(60, $DurationSeconds + $WarmupSeconds + 30) and passing --timeout ${callTimeoutSeconds}s (line 125), so the driver flow is correct. But there is no PowerShell wrapper for bench-stream-events, and the bench is documented (in its own XML summary on line 792) as a single-client harness intended to be run directly. The trap is silent: no error is printed, just suspiciously-small numbers.

Recommendation: Either (a) extend the isLongRunning set in CreateCancellation to include bench-read-bulk and bench-stream-events, so manual invocation defers to caller-supplied --timeout and otherwise runs until the bench finishes; (b) compute an automatic minimum-floor --timeout for the bench commands from duration-seconds + warmup-seconds + headroom the way the PS driver does; or (c) split the --timeout knob into a distinct per-call --call-timeout and outer --wall-clock-timeout and document the two roles. Option (a) is the smallest change and matches the existing galaxy-watch precedent. Add a CLI test that runs bench-read-bulk with --duration-seconds=2 --warmup-seconds=0 --timeout=1s and asserts the bench either errors loudly or completes (today it silently emits zeros).

Resolution: 2026-05-20 — Applied option (a): extended the isLongRunning set in CreateCancellation from command is "galaxy-watch" to command is "galaxy-watch" or "bench-read-bulk" or "bench-stream-events", so the two bench commands now run until they finish (or Ctrl+C) by default and only apply a wall-clock budget when the caller explicitly supplies --timeout. A caller-supplied --timeout still flows through to DefaultCallTimeout for per-attempt gRPC deadlines on the unary calls these benches make. Matches the existing galaxy-watch precedent and removes the silent zero-throughput failure mode without breaking the scripts/bench-read-bulk.ps1 driver path (which explicitly raises --timeout).

Client.Dotnet-016

Field	Value
Severity	Low
Category	Concurrency & thread safety
Location	`clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:922-976`
Status	Resolved

Description: BenchStreamEventsAsync.RunStreamAsync launches the per-session stream reader inside a Task.Run(async () => { ... }, streamCts.Token) and stores the returned task in the local streamTask. The recovery block

await Task.Delay(steadyEnd - warmupStart, cancellationToken).ConfigureAwait(false);
streamCts.Cancel();
try { await streamTask.ConfigureAwait(false); }
catch (OperationCanceledException) { }
catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) { }

only awaits streamTask (and therefore only observes its exception) when Task.Delay returns normally. When the outer cancellationToken cancels during the delay — exactly the case Client.Dotnet-015 makes likely — Task.Delay throws OperationCanceledException and skips both streamCts.Cancel() and the await streamTask. The inner stream task is still alive at that point. The using CancellationTokenSource streamCts = ... on line 924 disposes the linked CTS, which propagates cancellation to the inner stream (so it eventually exits), but the resulting OperationCanceledException / mapped MxGatewayException is never observed. The local streamTask reference is dropped as RunStreamAsync unwinds, leaving the task object eligible for garbage collection with an unobserved fault — a TaskScheduler.UnobservedTaskException.

The secondary Grpc.Core.RpcException catch on line 975 is also dead in this code path: the production GrpcMxGatewayClientTransport.StreamEventsAsync always wraps RpcException via RpcExceptionMapper.Map, which returns OperationCanceledException for StatusCode.Cancelled (mapper line 31). So the inner task's cancellation exception is always OperationCanceledException, not RpcException. Harmless when the recovery block runs, but it underscores that the cancellation path was only tested for the happy case.

Recommendation: Restructure RunStreamAsync so the inner streamTask is always observed. A try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} } shape works (the finally runs even on outer cancellation). Alternatively, hoist streamTask into a local that the outer method's try/finally always awaits before exiting, so the per-session loop becomes await Task.WhenAny(streamTask, Task.Delay(...)) then a guaranteed await streamTask. Drop the now-redundant Grpc.Core.RpcException catch or convert it to catch MxGatewayException for the wrapped shape (and document that it should never fire in production).

Resolution: 2026-05-20 — Restructured RunStreamAsync to wrap the Task.Delay in try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }, so the inner stream task is observed on every path — including when the outer cancellationToken cancels during the delay. Dropped the dead catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) clause (the production GrpcMxGatewayClientTransport.StreamEventsAsync routes through RpcExceptionMapper.Map, which returns OperationCanceledException for StatusCode.Cancelled, so an RpcException never reaches here) and replaced it with catch (MxGatewayException) to absorb the wrapped shape for any non-cancellation mapper output. Added an inline comment naming the finding and documenting why the new catch shape is correct. Eliminates the latent TaskScheduler.UnobservedTaskException whenever the outer cancellation fires mid-measurement-window.

Client.Dotnet-017

Field	Value
Severity	Low
Category	Error handling & resilience
Location	`clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:1190-1262`
Status	Resolved

Description: Surfaced during the 2026-05-20 cross-language e2e matrix run: dotnet run --project clients/dotnet/MxGateway.Client.Cli -- stream-events --endpoint http://localhost:5120 --api-key-env MXGATEWAY_API_KEY --timeout 60s --json --session-id session-... --max-events 200 exited with -532462766 (unhandled-exception exit code) and propagated System.OperationCanceledException: Call canceled by the client. mapped from Status(StatusCode="Cancelled", …). The CLI's StreamEventsAsync does await foreach (... in client.StreamEventsAsync(...).WithCancellation(cancellationToken)) and never catches OperationCanceledException. When the caller's --timeout (driven by CreateCancellation's CancelAfter) fires before --max-events is reached — the common case for a finite-window event collector against a quiet test rig — the foreach throws, the exception bubbles up, the process exits non-zero, and any --json aggregate output is never written. The other client CLIs (Go, Rust, Python, Java) all exit 0 in this case (e2e clients g/r/p ran clean). The bug is also a strict regression of the CLI's contract: callers can't tell "stream collected 0–N events then the budget closed" apart from "the call genuinely failed".

Recommendation: Wrap the await foreach in try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }. The when clause ensures only the supplied cancellation token (which covers --timeout, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with --timeout 1s against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events.

Resolution: 2026-05-20 — Wrapped the await foreach in try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { } so the CLI exits 0 and emits the aggregate { "events": [...] } JSON when the supplied token cancels (the --timeout, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's when clause ensures non-token-driven cancellation still propagates. Added regression test MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero that yields two events, parks on the cancellation token via a new FakeCliClient.StreamHangAfterEvents hook, runs the CLI with --timeout 1s --json --max-events 200, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts.

2026-05-24 re-review (commit `42b0037`)

Re-review pass at 42b0037. Diff against d692232 consists of four commits:

71d2c39 — ports the batch subcommand to all five client CLIs (the .NET half adds RunBatchAsync, the __MXGW_BATCH_EOR__ sentinel, the forceJsonErrors plumbing on RunCoreAsync, and the two RunAsync_Batch_* tests).
b3ae200 — ports the bulk read/write SDK methods + matching CLI subcommands (the SDK side adds WriteBulkAsync / Write2BulkAsync / WriteSecuredBulkAsync / WriteSecured2BulkAsync / ReadBulkAsync on MxGatewaySession; the CLI side adds read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, and the cross-language stress harness bench-read-bulk).
11cc671 — ports stream-alarms and acknowledge-alarm to the CLI and the SDK transport (StreamAlarmsAsync on MxGatewayClient, IMxGatewayClientTransport, GrpcMxGatewayClientTransport, IMxGatewayCliClient, MxGatewayCliClientAdapter, and FakeGatewayTransport; two new MxGatewayClientCliTests covering acknowledge + stream-alarms with --max-events and payload-case distinction). The same commit wraps StreamEventsAsync's await foreach with catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) (matching the Client.Dotnet-017 resolution shape) and mirrors the catch on the new StreamAlarmsAsync CLI handler. RunBatchAsync's outer catch is catch (Exception exception) with no when guard — OCE thrown out of RunCoreAsync is caught here and written as a JSON error so the batch process keeps going, while the inner RunCoreAsync catch retains its when (exception is not OperationCanceledException) guard so the non-batch CLI behaviour is unchanged.
8738735 — updates each client's README with stream-alarms / acknowledge-alarm usage and SDK examples.

#	Category	Result
1	Correctness & logic bugs	Issue found (this review): `BenchReadBulkAsync` repeats the silent `?? reply.ReturnValue.Int32Value` register-handle fallback at a brand-new call site (Client.Dotnet-019). Two new `(uint)` casts on user-supplied `--timeout-ms` values silently wrap negatives (Client.Dotnet-021).
2	mxaccessgw conventions	Issue found (this review): README CLI examples for `stream-alarms` and `acknowledge-alarm` use flag names that don't exist in the CLI, violating the docs-in-the-same-change rule (Client.Dotnet-018).
3	Concurrency & thread safety	No issues found in this diff — the new `RunBatchAsync` is single-threaded over its stdin loop; `StreamAlarmsAsync`'s gRPC transport reader follows the same shape as the existing `StreamEventsAsync` / `QueryActiveAlarmsAsync`.
4	Error handling & resilience	Issue found (this review): `BenchReadBulkAsync`'s steady-state loop has a bare `catch { ... continue; }` that swallows `OperationCanceledException`, so user-driven cancellation does not exit the bench until the wall-clock budget elapses (Client.Dotnet-020). The new `StreamAlarmsAsync` CLI handler mirrors the Client.Dotnet-017 OCE-graceful pattern correctly.
5	Security	No issues found in this diff — the bulk write/write-secured SDK methods inherit `EnsureProtocolSuccess`/`EnsureMxAccessSuccess` shapes; per-entry secured user ids stay on the request, and the redactor still covers CLI error paths.
6	Performance & resource management	No issues found in this diff (the bench tight-loop CPU burn under cancellation is captured under Client.Dotnet-020 against the error-handling category).
7	Design-document adherence	No issues found in this diff — the new `StreamAlarmsAsync` surface matches the `MxGateway:Alarms` design ("server-streaming feed of alarm-state-change messages keyed by the same monitor"), and `AcknowledgeAlarmAsync` remains session-less.
8	Code organization & conventions	No issues found in this diff — new CLI subcommands route through the existing dispatch table; `RunBatchAsync` follows the established private-static pattern.
9	Testing coverage	No new issues — `RunAsync_StreamAlarms_`, `RunAsync_AcknowledgeAlarm_`, and `RunAsync_Batch_*` give the new surface unit coverage. `bench-read-bulk` is the same stress-harness-not-SDK shape called out in the prior re-review and is not flagged here.
10	Documentation & comments	Issue found (this review): the README examples for the two new alarm CLI subcommands cite wrong flag names and a non-existent `--session-id` (Client.Dotnet-018). The new XML docs on `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` and on the bulk SDK methods are accurate and complete.

Client.Dotnet-018

Field	Value
Severity	Medium
Category	Documentation & comments
Location	`clients/dotnet/README.md:137-138`
Status	Resolved

Description: The README example block for the two new alarm CLI subcommands shipped in commit 11cc671 shows:

mxgw-dotnet stream-alarms --session-id <id> --max-messages 1 --json
mxgw-dotnet acknowledge-alarm --session-id <id> --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json

None of these flags exist. The actual CLI consumes --filter-prefix, --max-events, [--jsonl] for stream-alarms and --reference, --comment, --operator, --json for acknowledge-alarm (see WriteUsage lines 891-892 and the StreamAlarmsAsync / AcknowledgeAlarmAsync handlers in MxGatewayClientCli.cs). Neither command takes a --session-id — both go through the gateway's session-less central alarm monitor by design (the same property called out in the new XML doc on MxGatewayClient.StreamAlarmsAsync lines 226-237). A user who copies the README example will get a --session-id "unknown option" failure on both commands and an "Argument 'reference' is required" failure on acknowledge-alarm (since --alarm-reference is silently treated as --alarm-reference rather than --reference).

This is exactly the docs-in-the-same-change rule from CLAUDE.md ("Update docs in the same change as the source") — the README change in 8738735 documents the wrong surface for the source change in 11cc671. The built-in WriteUsage text is correct; only the README walkthrough is wrong.

Recommendation: Replace the two example lines with the actual flag names:

mxgw-dotnet stream-alarms --filter-prefix Area001 --max-events 1 --json
mxgw-dotnet acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json

A quick sanity check would be to drive each example through the test harness's MxGatewayClientCli.RunAsync shape and confirm exit 0 — copy/paste safety on the documented examples is the only realistic safeguard.

Resolution: 2026-05-24 — Confirmed against source: the README dotnet run -- stream-alarms example used --session-id <id> --max-messages 1, neither of which the CLI accepts (the production parser routes through --filter-prefix / --max-events), and acknowledge-alarm used --session-id <id> --alarm-reference <ref> against a session-less central alarm monitor that actually consumes --reference. Replaced both README example lines with the parser-correct shape: stream-alarms --filter-prefix Area001 --max-events 1 --json and acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json. Regression test MxGatewayClientCliTests.RunAsync_ReadmeExamples_ForAlarmCommands_ParseSuccessfully (xUnit [Theory] over stream-alarms and acknowledge-alarm) locates clients/dotnet/README.md, extracts the documented CLI command line for each subcommand, tokenizes it (preserving quoted segments), and drives it through the production MxGatewayClientCli.RunAsync against a FakeCliClient; the test asserts exit code 0 and that stderr contains neither "Unknown command" nor "Missing required option". Verified red against the original README text (both theory rows failed with exit code 1) and green after the README update, so any future documentation drift between README examples and the actual parser shape is caught at test time.

Client.Dotnet-019

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745`
Status	Resolved

Description: Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value, where a successful protocol+MX-status reply missing its typed register oneof case falls through to ReturnValue.Int32Value and silently yields 0 for the handle. The new BenchReadBulkAsync handler introduced in commit b3ae200 reinstates exactly that pattern at line 745:

int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;

The bench then drives the rest of the run — SubscribeBulk, warmup ReadBulk, steady-state ReadBulk, UnsubscribeBulk — against a zero ServerHandle. The worker will reject every command, the bench will report zero successful calls + a wall of failures + a misleading cachedReadResults = 0 summary, and there is no obvious diagnostic for "the gateway forgot to set the typed register payload"; it looks like a server-side outage. Note that Client.Dotnet-010's recorded resolution claimed a CLI-level RequireRegisterServerHandle / RequireAddItemItemHandle helper was introduced, but neither helper currently exists in the codebase, and the SDK-level MxGatewaySession.RegisterAsync / AddItemAsync / AddItem2Async at lines 82, 124, 175 still carry the same fallback — so the underlying problem behind Client.Dotnet-005 / 010 is unresolved at HEAD and this is a fresh net-new instance.

Recommendation: Replace the fallback with an explicit null-check on registerReply.Register that throws MxGatewayException with the missing-payload context (kind = Register, session id, correlation id) — the same shape Client.Dotnet-005 prescribes. If the upstream SDK helpers in MxGatewaySession are restored to throw, route the bench through MxGatewaySession.RegisterAsync instead so the CLI inherits the SDK's protection.

Resolution: 2026-05-24 — Confirmed against source: BenchReadBulkAsync at line 745 carried the silent fallback registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value, so a successful protocol+MX-status reply missing its typed register payload would yield ServerHandle=0 and the bench would drive the rest of SubscribeBulk / warmup / steady-state ReadBulk / UnsubscribeBulk against an invalid handle. Added a private RequireRegisterServerHandle(MxCommandReply reply, string sessionId) helper that throws a descriptive MxGatewayException naming the missing typed payload, the session id, and the correlation id, then replaced the fallback site with int serverHandle = RequireRegisterServerHandle(registerReply, sessionId);. The bench's outer RunCoreAsync catch then surfaces the throw as exit code 1 plus the descriptive message on stderr, so the failure mode is loud rather than a wall of zero-result stats. Regression test MxGatewayClientCliTests.RunAsync_BenchReadBulk_WhenRegisterReplyMissingTypedPayload_FailsLoudly enqueues an Ok MX command reply with no typed Register payload, runs bench-read-bulk against the fake, and asserts exit code 1, that stderr names the Register payload, and that no bench-read-bulk stats JSON was emitted on stdout. Verified red against the original fallback (an NRE bubbled up later in the run rather than the descriptive throw) and green after the helper landed.

Client.Dotnet-020

Field	Value
Severity	Low
Category	Error handling & resilience
Location	`clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780`
Status	Resolved

Description: BenchReadBulkAsync's steady-state while (DateTime.UtcNow < steadyDeadline) loop wraps each client.InvokeAsync(...) in a bare catch:

try
{
    reply = await client.InvokeAsync(
            CreateCommandRequest(sessionId, readBulkMxCommand),
            cancellationToken)
        .ConfigureAwait(false);
    sw.Stop();
}
catch
{
    sw.Stop();
    failedCalls++;
    latencyMillis.Add(sw.Elapsed.TotalMilliseconds);
    continue;
}

The catch has no type filter and no when (!cancellationToken.IsCancellationRequested) clause, so OperationCanceledException flowing from a cancelled cancellationToken (Ctrl+C, parent CTS, or the wall-clock budget) is swallowed identically to a real RPC failure. The loop continues to spin until DateTime.UtcNow reaches steadyDeadline, with each iteration immediately throwing OCE from the next InvokeAsync call — a tight CPU-burn until the configured --duration-seconds elapses, producing a per-iteration latencyMillis.Add(sw.Elapsed.TotalMilliseconds) flood that also skews the final p99/max numbers down.

The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE escapes through the finally block — that path is correct. The steady-state loop should follow the same shape: either rethrow OCE explicitly, or break out of the loop on cancellationToken.IsCancellationRequested. Note that Client.Dotnet-015's resolution removed the outer CancelAfter(timeout) for bench-read-bulk / bench-stream-events, so the most common cancellation path today is interactive Ctrl+C — which today produces a 30-second hang plus skewed stats instead of a prompt exit.

Recommendation: Replace the bare catch with catch (Exception) when (!cancellationToken.IsCancellationRequested), or split into catch (OperationCanceledException) { throw; } catch (Exception) { failedCalls++; ... continue; }. The first form is the smallest diff and matches the pattern used elsewhere in the CLI. Add a regression test that runs bench-read-bulk with a --duration-seconds 10 budget against a fake that throws on every InvokeAsync, cancels the supplied token after 100 ms, and asserts the run exits in well under 10 s. The wider precedent — Client.Dotnet-016's BenchStreamEventsAsync cancellation hardening — should already cover the shape of the test fixture.

Resolution: 2026-05-24 — Confirmed against source: the steady-state loop wrapped client.InvokeAsync in a bare catch { sw.Stop(); failedCalls++; latencyMillis.Add(...); continue; } with no type filter, so an OperationCanceledException thrown from a cancelled token (Ctrl+C, parent CTS, or the wall-clock budget) was swallowed and the loop spun until DateTime.UtcNow >= steadyDeadline. Replaced the bare clause with catch (Exception ex) when (ex is not OperationCanceledException), so OCE now propagates out of the bench and unwinds through the outer RunCoreAsync (which intentionally lets OCE escape). Inline comment names the finding and the reason the filter exists. Regression test MxGatewayClientCliTests.RunAsync_BenchReadBulk_WhenSteadyStateLoopReceivesCancellation_ExitsPromptly configures the fake CLI client so Register and SubscribeBulk succeed, the first three ReadBulk calls succeed (the loop enters the steady-state body), then every subsequent ReadBulk throws OperationCanceledException; the test runs bench-read-bulk --duration-seconds 30 and asserts the call throws OCE and wall-clock elapsed time stays under 10 s. Verified red against the original bare catch (the test ran for the full 30 s without throwing) and green after the filter landed (the bench exited promptly with OCE).

Client.Dotnet-021

Field	Value
Severity	Low
Category	Correctness & logic bugs
Location	`clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715`
Status	Resolved

Description: Both new bulk-read CLI handlers cast a signed --timeout-ms argument to uint without bounds checking:

// ReadBulkAsync (line 487)
TimeoutMs = (uint)arguments.GetInt32("timeout-ms", 0),

// BenchReadBulkAsync (line 715)
uint timeoutMs = (uint)arguments.GetInt32("timeout-ms", 1500);

arguments.GetInt32(...) returns Int32.Parse(...), so a caller supplying --timeout-ms=-1 (an easy mistake for "no timeout" or copy-paste from another tool that uses -1 for unbounded) silently wraps to 0xFFFFFFFF = ~49.7 days as the worker-side ReadBulk timeout. The gateway forwards that as the per-tag timeout_ms on ReadBulkCommand, and a misconfigured invocation parks one worker thread per pending tag for hours before MXAccess gives up — exactly the kind of slow-failure mode the protocol's uint32 timeout_ms field was meant to prevent. The ReadBulkAsync SDK overload on MxGatewaySession (line 663) already guards correctly: timeout <= TimeSpan.Zero ? 0u : (uint)Math.Min(timeout.TotalMilliseconds, uint.MaxValue); the CLI handlers should match.

Recommendation: Wrap the cast in a guard, e.g.:

int timeoutMsRaw = arguments.GetInt32("timeout-ms", 0);
if (timeoutMsRaw < 0)
{
    throw new ArgumentException("--timeout-ms must be a non-negative integer (use 0 for the gateway default).");
}
uint timeoutMs = (uint)timeoutMsRaw;

A single shared helper (e.g. ParseTimeoutMs(CliArguments, string, int)) on MxGatewayClientCli would cover both call sites and remove the duplication.

Resolution: 2026-05-24 — Confirmed against source: both ReadBulkAsync (line 490) and BenchReadBulkAsync (line 715) cast arguments.GetInt32("timeout-ms", ...) straight to uint, so --timeout-ms -1 silently wrapped to 0xFFFFFFFF (~49.7 days). Added a single shared private helper ParseTimeoutMs(CliArguments arguments, int defaultValue) on MxGatewayClientCli that reads the int32, rejects negatives with a clear ArgumentException ("--timeout-ms must be a non-negative integer (use 0 for the gateway default)."), and returns the safe (uint). Both call sites now route through the helper. Regression test MxGatewayClientCliTests.RunAsync_TimeoutMs_NegativeValue_RejectsWithClearError (xUnit [Theory] over read-bulk and bench-read-bulk) drives the CLI with --timeout-ms -1 and asserts the exit code is non-zero, that stderr contains "timeout-ms", and that the "non-negative" guard text is present. Verified red against the original (uint)arguments.GetInt32(...) casts (the bench proceeded past the timeout parse and tripped a downstream "Queue empty" error rather than the descriptive guard message) and green after the helper landed.

63 KiB Raw Blame History Unescape Escape