Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.
High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
string (it must be a valid SPDX expression), so `pip wheel .` and
`pip install -e .` both fail before any source compiles. Tests
still pass because pytest bypasses the build backend via
`pythonpath`. Dropped the invalid license string, kept the
`License :: Other/Proprietary License` classifier, and added
`tests/test_packaging.py` so a future regression of the same shape
is caught in CI.
Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
on WorkerPipeSessionOptions bounds the in-flight-command watchdog
suppression so a truly stuck COM call still triggers StaHung
instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
cross-language bench comparison is apples-to-apples again;
`failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
serialisation pattern to DeployEventStream so close() arriving
after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
stability check after UnAdvise instead of strict equality against
the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
log sink the WriteSecured live test owns (worker stdout/stderr,
gateway logs, direct WriteLine) so the credential is proven
absent from the full output buffer, not just the diagnostic
message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
for the previously-uncovered Write2Bulk and WriteSecured2Bulk
arms of WriteBulkConstraintPlan.SetPayload.
Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
AlarmsOptions validated at startup (Server-026); Authorization.md
Constraint Enforcement snippet/prose enumerate the bulk write/read
family (Server-027); bulk-read-commands and bulk-write-commands
capability tokens added to OpenSession (Server-029);
NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
guard the poll path uses, at every command entry (Worker-024);
RunAsync null-checks the runtime-session factory result
(Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
CancelCommandReturnValue serialised under lock (Worker.Tests-027);
Probes namespace lifted to MxGateway.Worker.Tests.Probes
(Worker.Tests-029); cancel-envelope sequence numbers monotonised
(Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
(Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
test backed by a TaskCompletionSource fake (Tests-022); companion
FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
(Tests-023); constraint plan reply-count divergence pinned
(Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
end-to-end (IntegrationTests-018); abnormal-exit keyword set
tightened to pipe-disconnected/end-of-stream and the test now
asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
default 30s wall-clock budget doesn't kill them (015);
BenchStreamEventsAsync observes the inner stream task on every
exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
%w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
RFC3339Nano with fractional seconds (019); runStreamEvents installs
signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
cancellation contract Client.Java-015 established (022); stream-events
text path uses Long.toUnsignedString for worker_sequence (023);
bench-read-bulk no longer pollutes success-latency histogram with
failure durations (024); --shutdown-timeout CLI option propagates
through to ClientOptions (025); seven new MxGatewayCliTests cover
the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
wheel-build smoke test added under tests/test_packaging.py (020);
README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
document the AsRef<str> read_bulk genericism (019);
next_correlation_id re-exported at the crate root, with a
property-style doc contract and an explicit disclaimer that the
literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
IConstraintEnforcer mechanism instead of "tag-allowlist filter"
(014); BulkReadResult gains explicit per-arm payload-population
documentation for the success vs failure cases (015).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This commit is contained in:
@@ -5,7 +5,7 @@
|
||||
| Module | `clients/dotnet` |
|
||||
| Reviewer | Claude Code |
|
||||
| Review date | 2026-05-20 |
|
||||
| Commit reviewed | `1cd51bb` |
|
||||
| Commit reviewed | `a020350` |
|
||||
| Status | Reviewed |
|
||||
| Open findings | 0 |
|
||||
|
||||
@@ -13,16 +13,16 @@
|
||||
|
||||
| # | Category | Result |
|
||||
|---|---|---|
|
||||
| 1 | Correctness & logic bugs | Issue found (this review): the Client.Dotnet-005 fix did not reach the CLI — `BenchReadBulkAsync`, `BenchStreamEventsAsync`, and `SmokeAsync` still fall through to `reply.ReturnValue.Int32Value` for `Register` / `AddItem` handles (Client.Dotnet-010). |
|
||||
| 2 | mxaccessgw conventions | Good — consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. |
|
||||
| 3 | Concurrency & thread safety | Issues found (this review): `GalaxyRepositoryClient._disposed` is still a plain unsynchronized `bool` (Client.Dotnet-009) — the symmetric fix from Client.Dotnet-003 was applied only to `MxGatewayClient`; the new `bench-stream-events` CLI command races `firstSteadyEventUtc`/`lastSteadyEventUtc` across parallel sessions (Client.Dotnet-011). |
|
||||
| 4 | Error handling & resilience | No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved). |
|
||||
| 5 | Security | Good — API key never logged by the library, CLI redacts keys (incl. env-var-sourced), TLS custom-root validation correct, secured-write payloads never logged. |
|
||||
| 6 | Performance & resource management | No issues found — channels and streaming calls disposed correctly. |
|
||||
| 1 | Correctness & logic bugs | Issue found (this review): the global CLI `--timeout` defaults to 30 s and is used both as the gRPC `DefaultCallTimeout` and as the outer `CancelAfter` budget — but `bench-read-bulk` / `bench-stream-events` default to `--duration-seconds=30 --warmup-seconds=3 (+ stagger)`, so direct manual invocation cancels the bench mid-window before the steady-state ends (Client.Dotnet-015). The `scripts/bench-read-bulk.ps1` driver works around this by raising `--timeout`, but `bench-stream-events` has no driver script. |
|
||||
| 2 | mxaccessgw conventions | Good — consumes the shared contracts project, no forked proto, `authorization: Bearer` metadata correct, parity preserved via split `EnsureProtocolSuccess`/`EnsureMxAccessSuccess`. The new `clients/dotnet/Directory.Build.props` mirrors `src/Directory.Build.props` exactly (same six properties, identical values) so the enforcement floor is back in scope. |
|
||||
| 3 | Concurrency & thread safety | Issue found (this review): `BenchStreamEventsAsync`'s per-session `RunStreamAsync` hands the inner `Task.Run` stream loop a reference (`streamTask`) that becomes unobserved whenever the outer `cancellationToken` cancels during the bench's `await Task.Delay` — the `await streamTask` recovery path never runs, so any inner OCE / `RpcException` raised after cancellation surfaces as a `TaskScheduler.UnobservedTaskException` (Client.Dotnet-016). The Client.Dotnet-009 / 011 fixes from the previous pass are correctly applied. |
|
||||
| 4 | Error handling & resilience | No new issues found this review (Client.Dotnet-001 and Client.Dotnet-004 remain resolved; `RpcExceptionMapper` is consistently called from both gateway and Galaxy transports incl. `AcknowledgeAlarmAsync` after Client.Dotnet-014). |
|
||||
| 5 | Security | Good — API key never logged by the library, CLI redacts effective key (both `--api-key` and `--api-key-env` sourced) after Client.Dotnet-008, TLS custom-root validation correct, secured-write payloads never logged. |
|
||||
| 6 | Performance & resource management | No issues found — channels and streaming calls disposed correctly, retry pipeline shares one timeout budget per safe-unary op. |
|
||||
| 7 | Design-document adherence | No issues found — matches `DotnetClientDesign.md` and `ClientLibrariesDesign.md`. |
|
||||
| 8 | Code organization & conventions | Issues found (this review): the .NET client projects do not inherit `src/Directory.Build.props` so `TreatWarningsAsErrors` / `EnforceCodeStyleInBuild` / `AnalysisLevel=latest` are silently absent (Client.Dotnet-012); `DiscoverHierarchyOptions` and the `DiscoverHierarchyAsync(DiscoverHierarchyOptions, …)` overload have no XML docs (Client.Dotnet-013). |
|
||||
| 9 | Testing coverage | Issue found (this review): the SDK-level alarm tests pin the fake-transport raw-`RpcException` shape but never exercise the production gRPC-to-native mapping (`GrpcMxGatewayClientTransport.AcknowledgeAlarmAsync`) — the same gap Client.Dotnet-002 closed for `Invoke`, still open for alarms (Client.Dotnet-014). |
|
||||
| 10 | Documentation & comments | No new issues this review. |
|
||||
| 8 | Code organization & conventions | No new issues — Client.Dotnet-012 (Directory.Build.props) and Client.Dotnet-013 (missing XML docs on `DiscoverHierarchyOptions`, the second `DiscoverHierarchyAsync` overload, and `IMxGatewayCliClient`) are both fully resolved; the new props file is a faithful mirror of the production one. |
|
||||
| 9 | Testing coverage | No new issues — Client.Dotnet-014 closed the alarm-side `Translate` gap. The new bench paths (`bench-read-bulk`, `bench-stream-events`) have no unit-test coverage, but they are stress harnesses driven by `scripts/bench-read-bulk.ps1`, not SDK API surface, so this is not flagged. |
|
||||
| 10 | Documentation & comments | No new issues this review (Client.Dotnet-007's alarm-ack `admin`-scope correction holds; `DefaultCallTimeout` doc accurately reflects the shared-budget semantics from Client.Dotnet-004). |
|
||||
|
||||
## Findings
|
||||
|
||||
@@ -251,3 +251,49 @@ This is the same convention-violation shape Client.Dotnet-006 closed; CLAUDE.md
|
||||
**Recommendation:** Either route `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use and add a regression test that enables `MapTransportExceptions = true` and asserts `MxGatewayAuthenticationException`; or rename the existing test to make the pass-through shape explicit (e.g. `…_SurfacesRpcExceptionFromFakeTransportVerbatim`) and add a second test exercising the production mapping. Either fix closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.
|
||||
|
||||
**Resolution:** 2026-05-20 — Applied both halves of the recommendation. Routed `FakeGatewayTransport.AcknowledgeAlarmAsync` through the same `Translate` helper the other RPCs use, so when `MapTransportExceptions = true` thrown `RpcException`s now run through the production `RpcExceptionMapper.Map`. Renamed the existing pass-through test to `AcknowledgeAlarmAsync_SurfacesRpcExceptionFromFakeTransportVerbatim_WhenMappingDisabled` (with an updated comment pinning that this shape only applies when mapping is off), and added a new test `AcknowledgeAlarmAsync_MapsUnauthenticated_RpcException_ToTypedException` that enables mapping and asserts the production-parity `MxGatewayAuthenticationException` with `StatusCode.Unauthenticated`. Closes the alarm-side equivalent of the gap Client.Dotnet-002 closed for `Invoke`.
|
||||
|
||||
### Client.Dotnet-015
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:221-236`, `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:596-1065` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** `CreateCancellation(arguments, command)` calls `cancellation.CancelAfter(timeout)` for every command except the explicitly long-running `galaxy-watch`, where `timeout` is `arguments.GetDuration("timeout", TimeSpan.FromSeconds(30))`. That same `--timeout` value is also fed into `CreateOptions` as `DefaultCallTimeout`, so the CLI uses one knob for two distinct things: per-call gRPC deadline and overall wall-clock cancellation budget. Both `bench-read-bulk` and `bench-stream-events` (introduced in `7db4bff` and `1cd51bb`) default to `--duration-seconds=30 --warmup-seconds=3`, which already exceeds the 30 s wall-clock budget; `bench-stream-events --session-count=N` adds another `750 ms × (N-1)` of `sessionStartStaggerMs` before the measurement window even opens.
|
||||
|
||||
A manual invocation such as `dotnet run --project clients/dotnet/MxGateway.Client.Cli -- bench-stream-events --endpoint ... --api-key ...` therefore cancels mid-window every time: the outer `CancellationTokenSource` trips at 30 s and the bench's inner `await Task.Delay(steadyEnd - warmupStart, cancellationToken)` throws an `OperationCanceledException` before `firstSteadyEventUtc`/`lastSteadyEventUtc` are even populated, producing a zero `steadyElapsedSeconds` / `0 eventsPerSecond` JSON payload that looks like a backend failure but is a self-inflicted CLI cancellation.
|
||||
|
||||
`scripts/bench-read-bulk.ps1` already works around this for `bench-read-bulk` by computing `$callTimeoutSeconds = [Math]::Max(60, $DurationSeconds + $WarmupSeconds + 30)` and passing `--timeout ${callTimeoutSeconds}s` (line 125), so the driver flow is correct. But there is no PowerShell wrapper for `bench-stream-events`, and the bench is documented (in its own XML summary on line 792) as a single-client harness intended to be run directly. The trap is silent: no error is printed, just suspiciously-small numbers.
|
||||
|
||||
**Recommendation:** Either (a) extend the `isLongRunning` set in `CreateCancellation` to include `bench-read-bulk` and `bench-stream-events`, so manual invocation defers to caller-supplied `--timeout` and otherwise runs until the bench finishes; (b) compute an automatic minimum-floor `--timeout` for the bench commands from `duration-seconds + warmup-seconds + headroom` the way the PS driver does; or (c) split the `--timeout` knob into a distinct per-call `--call-timeout` and outer `--wall-clock-timeout` and document the two roles. Option (a) is the smallest change and matches the existing `galaxy-watch` precedent. Add a CLI test that runs `bench-read-bulk` with `--duration-seconds=2 --warmup-seconds=0 --timeout=1s` and asserts the bench either errors loudly or completes (today it silently emits zeros).
|
||||
|
||||
**Resolution:** 2026-05-20 — Applied option (a): extended the `isLongRunning` set in `CreateCancellation` from `command is "galaxy-watch"` to `command is "galaxy-watch" or "bench-read-bulk" or "bench-stream-events"`, so the two bench commands now run until they finish (or Ctrl+C) by default and only apply a wall-clock budget when the caller explicitly supplies `--timeout`. A caller-supplied `--timeout` still flows through to `DefaultCallTimeout` for per-attempt gRPC deadlines on the unary calls these benches make. Matches the existing `galaxy-watch` precedent and removes the silent zero-throughput failure mode without breaking the `scripts/bench-read-bulk.ps1` driver path (which explicitly raises `--timeout`).
|
||||
|
||||
### Client.Dotnet-016
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Concurrency & thread safety |
|
||||
| Location | `clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:922-976` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** `BenchStreamEventsAsync.RunStreamAsync` launches the per-session stream reader inside a `Task.Run(async () => { ... }, streamCts.Token)` and stores the returned task in the local `streamTask`. The recovery block
|
||||
|
||||
```csharp
|
||||
await Task.Delay(steadyEnd - warmupStart, cancellationToken).ConfigureAwait(false);
|
||||
streamCts.Cancel();
|
||||
try { await streamTask.ConfigureAwait(false); }
|
||||
catch (OperationCanceledException) { }
|
||||
catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled) { }
|
||||
```
|
||||
|
||||
only awaits `streamTask` (and therefore only observes its exception) when `Task.Delay` returns normally. When the outer `cancellationToken` cancels during the delay — exactly the case Client.Dotnet-015 makes likely — `Task.Delay` throws `OperationCanceledException` and skips both `streamCts.Cancel()` and the `await streamTask`. The inner stream task is still alive at that point. The `using CancellationTokenSource streamCts = ...` on line 924 disposes the linked CTS, which propagates cancellation to the inner stream (so it eventually exits), but the resulting `OperationCanceledException` / mapped `MxGatewayException` is never observed. The local `streamTask` reference is dropped as `RunStreamAsync` unwinds, leaving the task object eligible for garbage collection with an unobserved fault — a `TaskScheduler.UnobservedTaskException`.
|
||||
|
||||
The secondary `Grpc.Core.RpcException` catch on line 975 is also dead in this code path: the production `GrpcMxGatewayClientTransport.StreamEventsAsync` always wraps `RpcException` via `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled` (mapper line 31). So the inner task's cancellation exception is always `OperationCanceledException`, not `RpcException`. Harmless when the recovery block runs, but it underscores that the cancellation path was only tested for the happy case.
|
||||
|
||||
**Recommendation:** Restructure `RunStreamAsync` so the inner `streamTask` is always observed. A `try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }` shape works (the `finally` runs even on outer cancellation). Alternatively, hoist `streamTask` into a local that the outer method's `try`/`finally` always awaits before exiting, so the per-session loop becomes `await Task.WhenAny(streamTask, Task.Delay(...))` then a guaranteed `await streamTask`. Drop the now-redundant `Grpc.Core.RpcException` catch or convert it to catch `MxGatewayException` for the wrapped shape (and document that it should never fire in production).
|
||||
|
||||
**Resolution:** 2026-05-20 — Restructured `RunStreamAsync` to wrap the `Task.Delay` in `try { await Task.Delay(...) } finally { streamCts.Cancel(); try { await streamTask } catch (OperationCanceledException) {} catch (MxGatewayException) {} }`, so the inner stream task is observed on every path — including when the outer `cancellationToken` cancels during the delay. Dropped the dead `catch (Grpc.Core.RpcException ex) when (ex.StatusCode is Grpc.Core.StatusCode.Cancelled)` clause (the production `GrpcMxGatewayClientTransport.StreamEventsAsync` routes through `RpcExceptionMapper.Map`, which returns `OperationCanceledException` for `StatusCode.Cancelled`, so an `RpcException` never reaches here) and replaced it with `catch (MxGatewayException)` to absorb the wrapped shape for any non-cancellation mapper output. Added an inline comment naming the finding and documenting why the new catch shape is correct. Eliminates the latent `TaskScheduler.UnobservedTaskException` whenever the outer cancellation fires mid-measurement-window.
|
||||
|
||||
Reference in New Issue
Block a user