Files
mxaccessgw/code-reviews/Client.Rust/findings.md
T
Joseph Doherty 1aafd6bde4 Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings
Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:28:54 -04:00

39 KiB

Code Review — Client.Rust

Field Value
Module clients/rust
Reviewer Claude Code
Review date 2026-05-20
Commit reviewed a020350
Status Reviewed
Open findings 0

Checklist coverage

This re-review (a020350) covers the resolution work for Client.Rust-013 through 017 (scoped doc_lazy_continuation allow on generated submodules, pub next_correlation_id shared with the CLI, success/failure split in bench-read-bulk, eight new tests, design-doc resync). The pass spot-checked the items called out in the request: stability of the newly-pub next_correlation_id, the bench-read-bulk JSON shape vs the PowerShell driver, presence of unsafe, and the scope of #![allow(clippy::doc_lazy_continuation)]. cargo clippy --workspace --all-targets -- -D warnings and cargo test --workspace both pass on this commit.

# Category Result
1 Correctness & logic bugs No issues found — the five new MalformedReply paths and the read_bulk mismatched-payload branch each have a dedicated test; BenchReadBulkStats correctly partitions success vs failure latency.
2 mxaccessgw conventions No issues found — cargo clippy --workspace --all-targets -- -D warnings and cargo test --workspace both pass on this commit; the #![allow(clippy::doc_lazy_continuation)] allow is scoped narrowly to each generated v1 inner module so hand-written code is unaffected; CLI Ping/CloseSession now call session::next_correlation_id.
3 Concurrency & thread safety No issues found — CORRELATION_SEQUENCE is AtomicU64 with Relaxed, correct for monotonic id generation; no unsafe anywhere in src/ or crates/.
4 Error handling & resilience Issue found: the bench-read-bulk fix for Client.Rust-015 has fixed Rust's own histogram honestly but the change makes Rust's latencyMs semantically incompatible with the four other clients' latencyMs field that the cross-language PowerShell driver collates side-by-side (Client.Rust-018).
5 Security No issues found — API keys still redacted in Debug/Display, status messages scrubbed, first_failure records Error::Display (which already redacts mxgw_* tokens) so secure-write values cannot leak into the bench JSON.
6 Performance & resource management No issues found in the reviewed delta.
7 Design-document adherence Issue found: RustClientDesign.md Session signatures for the four bulk-write helpers and read_bulk do not match the actual implementation — the design lists trailing user_id / timestamp / current_user_id / verifier_user_id parameters and a Vec<ReadBulkResult> return that the impl does not have (all of those move per-entry into WriteBulkEntry etc.) (Client.Rust-019).
8 Code organization & conventions No new issues — BenchReadBulkStats is cleanly factored out and tested.
9 Testing coverage No new issues — the malformed-reply paths and unary Error::Unavailable are now covered, and the four bulk-write families each have round-trip smoke.
10 Documentation & comments Issue found: next_correlation_id is now pub and its doc comment commits the SDK to the literal "rust-client-{label}-{N}" string format, but neither the doc nor lib.rs re-exports it or declares any stability stance, leaving the public surface ambiguous (Client.Rust-020).

Findings

Client.Rust-001

Field Value
Severity High
Category mxaccessgw conventions
Location clients/rust/src/options.rs:98,143
Status Resolved

Description: with_max_grpc_message_bytes and max_grpc_message_bytes have no /// doc comments. The crate sets #![warn(missing_docs)] and CLAUDE.md mandates that cargo clippy --workspace --all-targets -- -D warnings pass. Under -D warnings these become hard errors, so clippy fails to compile the crate — breaking the documented build/test workflow for the module.

Recommendation: Add doc comments to both methods, e.g. /// Maximum encoded/decoded gRPC message size in bytes (default 16 MiB).

Resolution: Resolved in 0d8a28d (2026-05-18): doc comments added to both methods.

Client.Rust-002

Field Value
Severity High
Category mxaccessgw conventions
Location clients/rust/src/session.rs:522
Status Resolved

Description: The BulkReplyKind enum's variants (AddItemBulk, AdviseItemBulk, RemoveItemBulk, UnAdviseItemBulk, SubscribeBulk, UnsubscribeBulk) all share the Bulk suffix, tripping clippy::enum_variant_names. Under -D warnings this is a compile error, so cargo clippy --workspace --all-targets -- -D warnings fails — a violation of the CLAUDE.md requirement that clippy pass cleanly.

Recommendation: Rename the variants to drop the common suffix (e.g. AddItem, AdviseItem, …) or add a narrowly-scoped #[allow(clippy::enum_variant_names)] with a reason comment.

Resolution: Resolved in 0d8a28d (2026-05-18): variants renamed to AddItem/AdviseItem/RemoveItem/UnAdviseItem/Subscribe/Unsubscribe, which no longer share a common suffix.

Client.Rust-003

Field Value
Severity High
Category Correctness & logic bugs
Location clients/rust/crates/mxgw-cli/src/main.rs:1051
Status Resolved

Description: The unit test version_json_output_has_protocol_versions asserts value["gatewayProtocolVersion"] == 2, but GATEWAY_PROTOCOL_VERSION is 3 (version.rs:10), matching the authoritative server constant GatewayContractInfo.GatewayProtocolVersion = 3. The test fails, so cargo test --workspace (the documented test step) does not pass — the test was not updated when the protocol version was bumped.

Recommendation: Update the assertion to 3, or better, assert against GATEWAY_PROTOCOL_VERSION so it cannot drift again.

Resolution: Resolved in 0d8a28d (2026-05-18): the test now asserts against the GATEWAY_PROTOCOL_VERSION / WORKER_PROTOCOL_VERSION constants, so it cannot drift again.

Client.Rust-004

Field Value
Severity Low
Category Documentation & comments
Location clients/rust/src/version.rs:7
Status Resolved

Description: CLIENT_VERSION is "0.1.0-dev" and its doc comment claims "Mirrors Cargo.toml", but Cargo.toml declares version = "0.1.0" (no -dev suffix). The comment is misleading and the value is not actually kept in sync with the manifest.

Recommendation: Either set CLIENT_VERSION from the build via env!("CARGO_PKG_VERSION"), or correct the constant to "0.1.0" and drop the "Mirrors Cargo.toml" claim.

Resolution: Resolved in 0d8a28d (2026-05-18): CLIENT_VERSION is now env!("CARGO_PKG_VERSION"), taken from Cargo.toml at compile time so the two cannot drift.

Client.Rust-005

Field Value
Severity Medium
Category Correctness & logic bugs
Location clients/rust/src/session.rs:489-520
Status Resolved

Description: register_server_handle, add_item_handle, and add_item2_handle fall through to reply.return_value … .unwrap_or_default(), returning 0 when the reply carries neither the expected typed payload nor an Int32 return_value. Because Session::invoke has already confirmed protocol_status == Ok, a malformed-but-OK reply silently yields handle 0, which the caller then uses as a real handle against the worker.

Recommendation: Return Err(Error::ProtocolStatus { … }) (or a dedicated Error::MalformedReply) when an OK reply lacks an extractable handle, instead of defaulting to 0.

Resolution: Resolved in 0d8a28d (2026-05-18): the three handle extractors now return Result<i32, Error> and yield the new Error::MalformedReply when an OK reply carries no usable handle.

Client.Rust-006

Field Value
Severity Medium
Category Error handling & resilience
Location clients/rust/src/session.rs:531-555
Status Resolved

Description: bulk_results returns Vec::new() for any (payload, kind) combination that does not match the expected arm — including an OK reply carrying the wrong or no payload. A caller of subscribe_bulk/add_item_bulk then sees an empty result vector and cannot distinguish "zero items processed" from "gateway returned a shapeless reply".

Recommendation: Treat a missing/mismatched bulk payload on an OK reply as an error rather than an empty vector, or document the empty-vec fallback explicitly and log it.

Resolution: Resolved in 0d8a28d (2026-05-18): bulk_results now returns Result<Vec<SubscribeResult>, Error> and yields Error::MalformedReply on a mismatched or absent bulk payload.

Client.Rust-007

Field Value
Severity Low
Category Design-document adherence
Location clients/rust/RustClientDesign.md:14-55
Status Resolved

Description: RustClientDesign.md is stale relative to the implemented code. It documents a nested crates/mxgateway-client/ layout (the real crate root is clients/rust/ with a flat src/), and lists tracing among "Expected dependencies", but tracing appears in no Cargo.toml. CLAUDE.md requires docs to change with the source.

Recommendation: Update RustClientDesign.md to the actual flat layout and remove tracing from the dependency list (or add tracing if structured logging is genuinely intended).

Resolution: Resolved in 0d8a28d (2026-05-18): the "Crate Layout" section now shows the actual flat layout (mxgateway-client as the workspace-root crate, mxgw-cli as a member) and the unused tracing entry was removed from the dependency list.

Client.Rust-008

Field Value
Severity Low
Category Performance & resource management
Location clients/rust/src/value.rs:161-261
Status Resolved

Description: MxValueProjection::from_proto and MxArrayProjection::from_proto deep-clone every element out of the wire message while MxValue/MxArrayValue also retain the original raw message. Every MxValue therefore holds two copies of its payload, wasteful for large string arrays or raw blobs arriving on the event stream.

Recommendation: Compute the projection lazily on demand, or have the projection borrow from raw, so array/raw payloads are not duplicated for every wrapped value.

Resolution: Resolved in 0d8a28d (2026-05-18): MxValue and MxArrayValue no longer cache a projection field — projection() computes the typed view on demand from raw. A value built only to be sent over the wire now holds a single copy of its payload and pays no projection cost.

Client.Rust-009

Field Value
Severity Low
Category Testing coverage
Location clients/rust/tests/client_behavior.rs, clients/rust/src/galaxy.rs
Status Resolved

Description: Several critical paths are untested: TLS channel setup (with_plaintext(false) / CA-file loading), mid-stream tonic::Status fault propagation through EventStream/DeployEventStream (tests only send Ok items), and the bulk-size cap (ensure_bulk_size rejecting >1000 items).

Recommendation: Add tests that (a) feed an Err(Status) into the event/deploy streams and assert it surfaces as the mapped Error, (b) assert add_item_bulk with 1001 items returns Error::InvalidArgument, and (c) exercise the CA-file/InvalidEndpoint error path.

Resolution: Resolved in 0d8a28d (2026-05-18): added add_item_bulk_rejects_input_above_the_thousand_item_cap, event_stream_surfaces_a_mid_stream_status_fault (the fake gateway now optionally emits a mid-stream Status::unavailable), and connect_with_unreadable_ca_file_reports_invalid_endpoint.

Client.Rust-010

Field Value
Severity Low
Category Error handling & resilience
Location clients/rust/src/client.rs:255-268, clients/rust/src/galaxy.rs:204-216
Status Resolved

Description: The client applies only a per-call deadline via Request::set_timeout and has no retry, reconnect, or transient-vs-permanent classification. A transient Unavailable (e.g. a gateway restart) maps to the catch-all Error::Status and is indistinguishable from a permanent failure. This is an acceptable v1 stance but is undocumented.

Recommendation: Either add a documented Error::Unavailable variant classifying Code::Unavailable/Code::ResourceExhausted, or explicitly document in the README that the client performs no retries and that transient failures arrive as Error::Status.

Resolution: Resolved in 0d8a28d (2026-05-18): added the Error::Unavailable variant; From<tonic::Status> maps Code::Unavailable and Code::ResourceExhausted to it, so callers can classify transient failures without unwrapping the raw status.

Client.Rust-011

Field Value
Severity Low
Category mxaccessgw conventions
Location clients/rust/src/session.rs:469
Status Resolved

Description: command_request hard-codes client_correlation_id as format!("rust-client-{}", kind.as_str_name()). Every invocation of the same command kind on a session uses an identical correlation id, so the id cannot correlate a specific request/reply pair in gateway logs or among concurrent in-flight calls. MXAccess parity diagnostics rely on correlation ids being unique per call.

Recommendation: Append a per-call unique suffix (monotonic counter or UUID) to the correlation id, or expose a way for the caller to supply one.

Resolution: Resolved in 0d8a28d (2026-05-18): correlation ids are built by next_correlation_id, which appends a process-wide atomic sequence number; Session::close uses it too.

Client.Rust-012

Field Value
Severity High
Category mxaccessgw conventions
Location clients/rust/src/galaxy.rs:282
Status Resolved

Description: Found while verifying the fix for Client.Rust-001/002: cargo clippy --workspace --all-targets -- -D warnings reported a third violation the original review missed. The get_last_deploy_time test fake calls .clone() on a MutexGuard<Option<prost_types::Timestamp>>, and Option<Timestamp> is Copy (clippy::clone_on_copy). Under -D warnings this is a compile error, so clippy still did not pass after Client.Rust-001/002 alone.

Recommendation: Dereference instead of cloning: *self.state.last_deploy.lock().unwrap().

Resolution: Resolved in 0d8a28d (2026-05-18): replaced .clone() with a deref. cargo clippy --workspace --all-targets -- -D warnings now passes cleanly.

Client.Rust-013

Field Value
Severity High
Category mxaccessgw conventions
Location src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:414-424 (origin); clients/rust/src/generated.rs:11-31 (suppression site)
Status Resolved

Description: cargo clippy --workspace --all-targets -- -D warnings fails again on this commit, this time on a clippy::doc_lazy_continuation violation in generated code:

error: doc list item without indentation
   --> .../mxaccess_gateway.v1.rs:526:5
    |
526 | /// `timeout_ms == 0` uses the gateway-configured default (1000 ms).
    |     ^

The lint fires because the ReadBulkCommand proto comment (added with the bulk Read feature in commit 5e375f6) writes a bulleted list and then a trailing paragraph without the required blank line. prost-build forwards the proto comment verbatim into Rust doc comments, and the Rust client compiles those generated modules with crate-default lints. The crate already opts out of clippy::large_enum_variant in src/generated.rs for exactly this kind of generator-style problem, but doc_lazy_continuation is not on the allow-list, so the lint reaches -D warnings and breaks the documented cargo clippy --workspace --all-targets -- -D warnings invocation that CLAUDE.md mandates pass. The Rust client review was previously closed as clippy-clean (Client.Rust-001/002/012); this is the third clippy-clean regression caused by generated code in this module and warrants a more durable fix.

Recommendation: Add #![allow(clippy::doc_lazy_continuation)] to each generated submodule in clients/rust/src/generated.rs alongside clippy::large_enum_variant, so generated doc comments — which the client cannot edit — cannot break the -D warnings build. Independently, fix the upstream proto comment to insert a blank line before the trailing paragraph so the C# / Go / Python / Java generators do not carry the same flaky text.

Resolution: 2026-05-20 — Added #![allow(clippy::doc_lazy_continuation)] to each generated submodule in clients/rust/src/generated.rs next to the existing clippy::large_enum_variant allow, and reformatted the ReadBulkCommand proto comment in src/MxGateway.Contracts/Protos/mxaccess_gateway.proto to surround the bulleted list with blank lines so doc-comment generators in every language see a properly-terminated list. cargo clippy --workspace --all-targets -- -D warnings and cargo test --workspace now pass, and dotnet build src/MxGateway.Contracts/MxGateway.Contracts.csproj reports 0 warnings.

Client.Rust-014

Field Value
Severity Low
Category mxaccessgw conventions
Location clients/rust/crates/mxgw-cli/src/main.rs:450,497
Status Resolved

Description: Client.Rust-011 made Session build unique correlation ids per call, but the mxgw CLI's Ping and CloseSession subcommands still hard-code client_correlation_id: "rust-cli-ping".to_owned() and "rust-cli-close-session".to_owned(). Both go through client.invoke(…) / client.close_session_raw(…) rather than the Session helpers, so the library's id generator does not run. The CLI is the cross-language e2e driver — when the same machine runs concurrent CLI smokes, every ping/close-session request collides on the same correlation id in gateway logs, defeating the diagnostic value the library fix unlocked.

Recommendation: Either (a) expose session::next_correlation_id as a pub(crate) or library-level helper and have the CLI call it from Ping/CloseSession, or (b) replace these RPCs with the higher-level Session helpers (Session::close, and a thin Session::ping wrapper) so the CLI shares the library's correlation-id discipline by construction.

Resolution: 2026-05-20 — Promoted session::next_correlation_id from a module-private helper to a pub library-level function (it already lived in the pub mod session) and updated the mxgw CLI's Ping and CloseSession subcommands to call mxgateway_client::session::next_correlation_id("cli-ping") / next_correlation_id("cli-close-session") instead of the hard-coded "rust-cli-ping" / "rust-cli-close-session" strings. Concurrent CLI smokes now produce unique correlation ids per call — driven by the same process-wide CORRELATION_SEQUENCE AtomicU64 the library uses — so gateway logs can tell collisions apart again. cargo fmt, cargo build --workspace, cargo clippy --workspace --all-targets -- -D warnings, and cargo test --workspace all pass.

Client.Rust-015

Field Value
Severity Medium
Category Error handling & resilience
Location clients/rust/crates/mxgw-cli/src/main.rs:1053-1070
Status Resolved

Description: The new cross-language benchmark bench-read-bulk pushes the elapsed time of every read_bulk call into latencies_ms regardless of whether the call returned Ok or Err:

let outcome = session.read_bulk(server_handle, &tags, timeout_ms).await;
let elapsed_ms = call_start.elapsed().as_secs_f64() * 1000.0;
latencies_ms.push(elapsed_ms);
match outcome {
    Ok(results) => { successful_calls += 1;  }
    Err(_) => failed_calls += 1,
}

A failed read_bulk (transient Unavailable, deadline-exceeded mid-call, etc.) typically returns later than a successful one — it includes the full per-call timeout that the success path never waits for. The histogram therefore conflates "p99 cached-read latency" with "p99 of (cached-read + timed-out call)", and the JSON document the PowerShell driver collates publishes latencyMs.p99 / latencyMs.max that no longer represent successful-call latency. Worse, the failure category is silently dropped (Err(_) => failed_calls += 1) so a benchmark run that fails on every call still emits a coherent-looking JSON without ever surfacing why. This is misleading for a benchmark whose JSON shape is the cross-language comparison contract.

Recommendation: Only push elapsed time into latencies_ms on Ok, or split into two histograms (successLatencyMs and failureLatencyMs) and log the first failure's error string into the stats record so a partial-failure run is visible at the report layer.

Resolution: 2026-05-20 — Extracted the per-iteration accounting in bench-read-bulk into a BenchReadBulkStats helper with explicit record_success/record_failure methods. Successful read_bulk calls now flow into success_latencies_ms (driving the cross-language latencyMs.p99/max JSON contract), failures flow into a separate failure_latencies_ms histogram surfaced as failureLatencyMs, and the first failure's redacted error string is stashed as firstFailure so a partial-failure run is visible at the report layer instead of producing a coherent-looking JSON that hides every error. Added a unit test (bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram) that records two fast successes plus a deliberately slow failure and asserts the success histogram never sees the failure latency, plus a smaller smoke test for the zero-duration calls-per-second path.

Client.Rust-016

Field Value
Severity Medium
Category Testing coverage
Location clients/rust/tests/client_behavior.rs, clients/rust/src/session.rs:489-519,654-768
Status Resolved

Description: The fixes for Client.Rust-005 / 006 added five new Error::MalformedReply paths to session.rs (register_server_handle, add_item_handle, add_item2_handle, bulk_results, bulk_write_results) plus the inline branch in read_bulk. None of them are exercised by tests — every test in client_behavior.rs feeds the matching payload back to the client, so the malformed-reply branches are dead code from the test suite's perspective. The new bulk-write helpers (write_bulk, write2_bulk, write_secured_bulk, write_secured2_bulk) only have a single happy-path assertion via write_bulk, leaving the three other variants and every per-entry-failure shape untested. The bench-read-bulk flow has no test (the driver script is the only consumer). The Error::Unavailable variant from Client.Rust-010 is covered by event_stream_surfaces_a_mid_stream_status_fault, but the same variant on a unary Code::Unavailable is not.

Recommendation: Add three light tests against the existing FakeGateway:

  1. Have the fake reply to AddItem (or Register / AddItem2) with protocol_status = Ok and no payload, and assert the client surfaces Error::MalformedReply.
  2. Have the fake reply to WriteBulk with protocol_status = Ok and the wrong payload arm (e.g. an AddItemReply body), and assert Error::MalformedReply.
  3. Have the fake fail the unary Invoke with Status::unavailable(...) and assert Error::Unavailable.

Optionally add Write2Bulk / WriteSecuredBulk / WriteSecured2Bulk smoke assertions so all four bulk-write families have at least one round-trip test.

Resolution: 2026-05-20 — Added eight new integration tests in clients/rust/tests/client_behavior.rs. Each new Error::MalformedReply site is exercised via a test-only InvokeOverride injected into FakeState that lets a single test pin the fake gateway's Invoke handler to one of three malformed shapes (OK reply with no payload, OK reply with the wrong payload arm for read_bulk, OK reply with the wrong payload arm for the other bulk / bulk-write families): register_returns_malformed_reply_when_ok_reply_has_no_payload, add_item_returns_malformed_reply_when_ok_reply_has_no_payload, add_item2_returns_malformed_reply_when_ok_reply_has_no_payload, subscribe_bulk_returns_malformed_reply_on_mismatched_payload_arm, write_bulk_returns_malformed_reply_on_mismatched_payload_arm, and read_bulk_returns_malformed_reply_on_mismatched_payload_arm. The unary Error::Unavailable path is covered by unary_invoke_maps_status_unavailable_to_error_unavailable (the override returns Status::unavailable(...)). The remaining three bulk-write families gained round-trip smoke tests — write2_bulk_round_trips_through_the_fake_gateway, write_secured_bulk_round_trips_through_the_fake_gateway, write_secured2_bulk_round_trips_through_the_fake_gateway — extending the fake gateway's dispatcher with happy-path replies for Write2Bulk / WriteSecuredBulk / WriteSecured2Bulk. The bench-read-bulk flow gets a BenchReadBulkStats unit test in crates/mxgw-cli/src/main.rs (see Client.Rust-015) that asserts the latency-tracking change keeps failed-call durations out of latencyMs.

Client.Rust-017

Field Value
Severity Low
Category Design-document adherence
Location clients/rust/RustClientDesign.md:79-99,156-163
Status Resolved

Description: CLAUDE.md requires docs to change with the source. RustClientDesign.md was refreshed to fix the layout/tracing drift (Client.Rust-007), but the Session API surface in the design (Library API block, lines 79-99) still lists only the original six bulk helpers — add_item_bulk, advise_item_bulk, remove_item_bulk, un_advise_item_bulk, subscribe_bulk, unsubscribe_bulk — and is missing the five new bulk-write helpers and read_bulk (write_bulk, write2_bulk, write_secured_bulk, write_secured2_bulk, read_bulk) that landed in commits 5e375f6 / f220908 / 61644e6. The Error Handling block (lines 130-146) still enumerates Transport, Status, Authentication, Authorization, Session, Worker, Command, MxAccess, Timeout, Cancelled — but not MalformedReply, Unavailable, or InvalidEndpoint, all of which are now public variants of the crate's Error enum. The Test CLI block (lines 158-163) lists version / smoke / stream-events / write but is missing every new subcommand (read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, bench-read-bulk, galaxy watch).

Recommendation: Bring the design doc back in sync: extend the Session API code block to enumerate the bulk-write/read methods, expand the Error enum to match clients/rust/src/error.rs, and add the missing CLI subcommands. The README is already up to date, so this is design-doc-only churn.

Resolution: 2026-05-20 — Brought clients/rust/RustClientDesign.md back in sync with the implementation. The Session block now lists the five new bulk helpers (write_bulk, write2_bulk, write_secured_bulk, write_secured2_bulk, read_bulk) alongside the original six and notes that session::next_correlation_id is pub for raw-RPC consumers (the CLI). The Error enum block now matches clients/rust/src/error.rsInvalidEndpoint, InvalidArgument, Transport, Authentication, Authorization, Timeout, Cancelled, Unavailable, Status, Command, ProtocolStatus, MalformedReply — with a short paragraph explaining what Unavailable, MalformedReply, and InvalidEndpoint classify. The Test CLI block enumerates every subcommand the binary exposes today: version, ping, open-session, close-session, register, add-item, advise, subscribe-bulk, unsubscribe-bulk, read-bulk, write, write2, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, stream-events, bench-read-bulk, smoke, and the galaxy {test-connection,last-deploy-time,discover-hierarchy,watch} subtree.

Client.Rust-018

Field Value
Severity Medium
Category Error handling & resilience
Location clients/rust/crates/mxgw-cli/src/main.rs:1098-1170; scripts/bench-read-bulk.ps1:347-365; siblings: clients/go/cmd/mxgw-go/main.go:600-648, clients/python/src/mxgateway_cli/commands.py:614-662, clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:685-770, clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:855-940
Status Resolved

Description: Client.Rust-015's resolution split Rust's bench histogram so latencyMs records only successful read_bulk calls and a new failureLatencyMs field holds failed-call durations. The local logic is right, the unit test (bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram) is right, and the JSON shape stays additively compatible with scripts/bench-read-bulk.ps1 (the collator reads $s.latencyMs.p50/p95/p99/max/mean and these keys still exist on the Rust output). The problem is cross-language: the .NET, Go, Python, and Java bench implementations still push every call's elapsed time into a single latenciesMs / latencies_ms / latencyMillis array regardless of success or failure (e.g. clients/go/cmd/mxgw-go/main.go:611 appends before the success/failure branch; clients/python/src/mxgateway_cli/commands.py:624,626 appends in both except and the happy path; clients/dotnet/MxGateway.Client.Cli/MxGatewayClientCli.cs:701,705 adds in both catch and the OK path; clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:865,880 records in both branches). The PS driver's side-by-side comparison table (lines 348-360) pulls latencyMs.p50/p95/p99/max/mean from every client and prints them in one row, so a partial-failure run now shows Rust's p99 measured over successes only and the other four clients' p99 measured over (success + per-call timeout) — the numbers are silently no longer comparable. This re-introduces the original Client.Rust-015 problem at the cross-language layer that the fix was meant to remove.

Recommendation: Make the contract uniform. Either (a) revert Rust's latencyMs to the all-calls histogram for backwards/cross-language compatibility and keep failureLatencyMs as an additive Rust-only enrichment, or (b) push the same success-only / failure-separated split into the .NET, Go, Python, and Java bench commands so every language emits the honest pair (latencyMs = success, failureLatencyMs = failure, plus firstFailure) and update the PS driver's table column to make the success-only semantics explicit (p99 ok ms). Option (b) is the better long-term posture but it is a cross-client change; option (a) restores comparability immediately.

Resolution: 2026-05-20 — Took option (a) to restore cross-language comparability immediately. Reverted Rust's latencyMs to the all-calls histogram so it matches the .NET/Go/Python/Java bench shape that scripts/bench-read-bulk.ps1 collates side-by-side: BenchReadBulkStats::record_success and record_failure now both push elapsed time into a single latencies_ms vector, and record_failure additionally pushes into failure_latencies_ms and stashes the first failure's redacted error string in first_failure. The JSON output keeps failureLatencyMs and firstFailure as Rust-only additive enrichment so a partial-failure run is still visible at the report layer without breaking the side-by-side table. Renamed the unit test to bench_read_bulk_stats_tracks_all_calls_in_latency_and_failures_separately; it now asserts latencyMs.max == 1500.0 (the slow failure is included in the cross-language latencyMs contract) while failureLatencyMs.max == 1500.0 and firstFailure still surface the failure separately for diagnostics. Pushing the success-only / failure-separated split into the other four clients (option (b)) is the better long-term posture but is deliberately out of scope here.

Client.Rust-019

Field Value
Severity Low
Category Design-document adherence
Location clients/rust/RustClientDesign.md:96-100
Status Resolved

Description: Client.Rust-017 was closed by adding the new bulk-write/read entries to the design doc, but the signatures shown in the code block do not match the implementation. The doc declares:

pub async fn write_bulk(&self, server_handle: i32, entries: Vec<WriteBulkEntry>, user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn write2_bulk(&self, server_handle: i32, entries: Vec<Write2BulkEntry>, timestamp: prost_types::Timestamp, user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn write_secured_bulk(&self, server_handle: i32, entries: Vec<WriteSecuredBulkEntry>, current_user_id: i32, verifier_user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn write_secured2_bulk(&self, server_handle: i32, entries: Vec<WriteSecured2BulkEntry>, timestamp: prost_types::Timestamp, current_user_id: i32, verifier_user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn read_bulk(&self, server_handle: i32, tags: &[String], timeout_ms: u32) -> Result<Vec<ReadBulkResult>, Error>;

The actual implementations in clients/rust/src/session.rs:385-526 take only (server_handle, entries)user_id is per-entry on WriteBulkEntry/Write2BulkEntry, timestamp_value is per-entry on Write2BulkEntry/WriteSecured2BulkEntry, and current_user_id/verifier_user_id are per-entry on WriteSecured{,2}BulkEntry. The protobuf in src/MxGateway.Contracts/Protos/mxaccess_gateway.proto:364-416 confirms this — there is no top-level user_id on these commands. The doc also returns Vec<ReadBulkResult> but the generated type is BulkReadResult (the gateway's BulkReadReply carries repeated BulkReadResult), and the actual signature is read_bulk<S: AsRef<str>>(..., tag_addresses: &[S], ...) -> Vec<BulkReadResult> — generic over AsRef<str> so callers can pass either Vec<String> or [&str].

The drift is small but the design doc was the explicit subject of Client.Rust-017's resolution, so it warrants a follow-up. CLAUDE.md requires docs to change with the source.

Recommendation: Replace the five signatures in RustClientDesign.md:96-100 with the ones actually in session.rs:

pub async fn write_bulk(&self, server_handle: i32, entries: Vec<WriteBulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn write2_bulk(&self, server_handle: i32, entries: Vec<Write2BulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn write_secured_bulk(&self, server_handle: i32, entries: Vec<WriteSecuredBulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn write_secured2_bulk(&self, server_handle: i32, entries: Vec<WriteSecured2BulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
pub async fn read_bulk<S: AsRef<str>>(&self, server_handle: i32, tag_addresses: &[S], timeout_ms: u32) -> Result<Vec<BulkReadResult>, Error>;

and add a one-line note that the per-entry fields (user_id, timestamp_value, current_user_id, verifier_user_id) live on the entry structs themselves.

Resolution: 2026-05-20 — Replaced the five drifted signatures in RustClientDesign.md with the ones that actually live in clients/rust/src/session.rs: write_bulk / write2_bulk / write_secured_bulk / write_secured2_bulk take only (server_handle, entries), and read_bulk<S: AsRef<str>> takes a borrowed &[S] and returns Vec<BulkReadResult> (not Vec<ReadBulkResult>). Added a follow-up paragraph noting that the per-entry fields user_id / timestamp_value / current_user_id / verifier_user_id live on WriteBulkEntry / Write2BulkEntry / WriteSecuredBulkEntry / WriteSecured2BulkEntry themselves rather than as trailing positional arguments, matching the protobuf shapes in mxaccess_gateway.proto, and that read_bulk is generic over AsRef<str> so callers can pass &[String] or &[&str] without cloning at the call site.

Client.Rust-020

Field Value
Severity Low
Category Documentation & comments
Location clients/rust/src/session.rs:31-46; clients/rust/src/lib.rs:14-39
Status Resolved

Description: Client.Rust-014's resolution promoted next_correlation_id from a module-private helper to a pub function so the mxgw CLI's raw-RPC paths can share the library's correlation-id discipline. The doc comment commits the library to a literal string format — "rust-client-{label}-{N}" — that external code can now depend on. Two concerns:

  1. The function is not re-exported at the crate root in lib.rs (it only ships through the pub mod session namespace), so the in-tree caller writes the long mxgateway_client::session::next_correlation_id("cli-ping") path. Either re-export it via #[doc(inline)] pub use session::next_correlation_id; or leave it where it is and add a short note in the doc — but the current state straddles "public API" and "lib-internal helper" without saying which.

  2. The doc comment does not declare a stability stance (no #[doc(hidden)], no "experimental" note, no __priv naming). As written it promises the literal format "rust-client-{label}-{N}" to any out-of-tree consumer; a future change that renames the prefix (for example to drop the rust- after a multi-client reformat) would be a behavioural break. The RustClientDesign.md resolution of Client.Rust-017 ("session::next_correlation_id is pub") reads similarly — it does not say whether the format is stable.

The combination — pub, format-committing doc, no stability note, no crate-root re-export — leaves the public surface ambiguous. The same review category (Documentation & comments) is where Client.Rust-014's CLI-side fix is now visible, so this is the natural place to clean it up.

Recommendation: Pick one of:

  • Treat next_correlation_id as part of the SDK's public API. Re-export it from lib.rs (#[doc(inline)] pub use session::next_correlation_id;), rewrite the doc comment to not promise the literal "rust-client-{label}-{N}" format (just the property "monotonic, unique within a process, includes the supplied label"), and call that out in RustClientDesign.md.
  • Treat it as internal-only. Mark it #[doc(hidden)] pub and add a // Internal helper exposed for the in-tree mxgw CLI; not part of the public SDK contract. comment so out-of-tree consumers do not build against a format that the SDK is free to change.

The CLI integration in Client.Rust-014 works either way; this is solely about declaring intent so the SDK's public surface is unambiguous.

Resolution: 2026-05-20 — Took the "treat as public SDK API" branch. Re-exported next_correlation_id at the crate root in clients/rust/src/lib.rs (#[doc(inline)] pub use session::{next_correlation_id, Session};) so in-tree and external callers can write the short mxgateway_client::next_correlation_id(...) path. Updated the in-tree mxgw CLI (Ping and CloseSession subcommands) to call through the crate-root re-export instead of mxgateway_client::session::next_correlation_id. Rewrote the doc comment to drop the format promise: the returned id is now documented as an opaque token with three guaranteed properties (embeds the supplied label, unique within a process via an internal monotonic atomic sequence, carries no embedded secret beyond label), and the doc explicitly states that the textual format rust-client-{label}-{N} is not part of the public contract and that callers must not parse it. Cross-referenced the crate-root re-export from the function-level doc. Updated RustClientDesign.md to call out that next_correlation_id is part of the public SDK surface, re-exported at the crate root, and that its textual format is intentionally not part of the contract.