Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings

Second re-review pass at commit a020350 caught 48 new findings, including
one High-severity regression I introduced in the prior sweep. All 48 are
fixed here in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions caps how long an in-flight command can
  suppress the watchdog, so a truly stuck COM call still triggers
  StaHung instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.
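
The bounded suppression described under Worker-023 can be sketched as
follows. This is an illustrative Rust sketch only, not the worker's
actual code: `WatchdogPolicy`, `Verdict`, and `evaluate` are hypothetical
names, and the 15s grace value is inferred from the stated default
(75s = 5x HeartbeatGrace).

```rust
use std::time::Duration;

/// Hypothetical outcome of one watchdog evaluation.
#[derive(Debug, PartialEq)]
enum Verdict {
    /// A heartbeat arrived within the grace period.
    Healthy,
    /// The heartbeat is late, but a command is in flight and the ceiling
    /// has not been reached, so the watchdog stays quiet for now.
    Suppressed,
    /// Mirrors the StaHung outcome named in the commit message.
    StaHung,
}

/// Hypothetical policy; the field names echo the options named above.
struct WatchdogPolicy {
    heartbeat_grace: Duration,
    heartbeat_stuck_ceiling: Duration,
}

impl WatchdogPolicy {
    fn evaluate(&self, since_last_heartbeat: Duration, command_in_flight: bool) -> Verdict {
        if since_last_heartbeat <= self.heartbeat_grace {
            return Verdict::Healthy;
        }
        // Suppression applies only below the ceiling. Without this bound,
        // a command that never returns would silence the watchdog forever.
        if command_in_flight && since_last_heartbeat < self.heartbeat_stuck_ceiling {
            return Verdict::Suppressed;
        }
        Verdict::StaHung
    }
}
```

Under these assumed values, a command in flight 60s after the last
heartbeat yields `Suppressed`, while the same command at 80s yields
`StaHung`.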

Lows (41, highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc cleaned up and the missing
  scope-resolver and state-machine tests added (Server-023, Server-028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).
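
The "property-style" contract noted under Client.Rust-020 can be
illustrated with a check that never inspects the textual layout of an id.
This is a sketch under assumptions: the `next_correlation_id` below is a
hypothetical stand-in (the real implementation is not part of this
commit's diff), and uniqueness plus non-emptiness are assumed properties,
not a quotation of the documented contract.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter backing the hypothetical stand-in below.
static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Hypothetical stand-in for the crate's `next_correlation_id`.
/// The format produced here is arbitrary and deliberately untested.
fn next_correlation_id(prefix: &str) -> String {
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    format!("{prefix}-{n}")
}

/// Property-style check: every id is non-empty and distinct from all
/// earlier ones. Nothing here depends on separators, digits, or length.
fn ids_are_unique(prefix: &str, count: usize) -> bool {
    let mut seen = HashSet::with_capacity(count);
    (0..count).all(|_| {
        let id = next_correlation_id(prefix);
        !id.is_empty() && seen.insert(id)
    })
}
```

A test written this way keeps passing if the id format changes later,
which is the point of excluding the literal format from the contract.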

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 10:28:54 -04:00
parent a0203503a7
commit 1aafd6bde4
74 changed files with 3349 additions and 395 deletions
+27 -23
@@ -447,9 +447,7 @@ async fn run(cli: Cli) -> Result<(), Error> {
let client = connect(connection).await?;
let reply = client
.invoke(MxCommandRequest {
-client_correlation_id: mxgateway_client::session::next_correlation_id(
-"cli-ping",
-),
+client_correlation_id: mxgateway_client::next_correlation_id("cli-ping"),
command: Some(MxCommand {
kind: MxCommandKind::Ping as i32,
payload: Some(mxgateway_client::generated::mxaccess_gateway::v1::mx_command::Payload::Ping(
@@ -496,7 +494,7 @@ async fn run(cli: Cli) -> Result<(), Error> {
let reply = client
.close_session_raw(CloseSessionRequest {
session_id,
-client_correlation_id: mxgateway_client::session::next_correlation_id(
+client_correlation_id: mxgateway_client::next_correlation_id(
"cli-close-session",
),
})
@@ -1088,16 +1086,17 @@ async fn run_bench_read_bulk(
/// Per-iteration accounting for `bench-read-bulk`.
///
-/// Only successful `read_bulk` calls contribute to the success-latency
-/// histogram (`success_latencies_ms`). Failures are tracked separately in
-/// `failure_latencies_ms` and the first failure's redacted error string is
-/// stashed in `first_failure` so a partial-failure run is visible in the
-/// emitted JSON. This keeps the cross-language `latencyMs.p99`/`max`
-/// contract honest: it reports successful-call latency only and never
-/// folds in a per-call timeout from a failed RPC.
+/// Every `read_bulk` call's elapsed time contributes to the all-calls
+/// histogram (`latencies_ms`), matching the .NET/Go/Python/Java bench
+/// implementations whose `latencyMs` field is the cross-language comparison
+/// contract collated by `scripts/bench-read-bulk.ps1`. Failures additionally
+/// land in `failure_latencies_ms` and the first failure's redacted error
+/// string is stashed in `first_failure`, both surfaced through the JSON as
+/// Rust-only enrichment so a partial-failure run is still visible at the
+/// report layer without breaking the side-by-side comparison.
#[derive(Default)]
struct BenchReadBulkStats {
-success_latencies_ms: Vec<f64>,
+latencies_ms: Vec<f64>,
failure_latencies_ms: Vec<f64>,
total_read_results: u64,
cached_read_results: u64,
@@ -1112,7 +1111,7 @@ impl BenchReadBulkStats {
elapsed_ms: f64,
results: &[mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult],
) {
-self.success_latencies_ms.push(elapsed_ms);
+self.latencies_ms.push(elapsed_ms);
self.successful_calls += 1;
for result in results {
self.total_read_results += 1;
@@ -1123,6 +1122,7 @@ impl BenchReadBulkStats {
}
fn record_failure(&mut self, elapsed_ms: f64, error: &Error) {
+self.latencies_ms.push(elapsed_ms);
self.failure_latencies_ms.push(elapsed_ms);
self.failed_calls += 1;
if self.first_failure.is_none() {
@@ -1145,7 +1145,7 @@ impl BenchReadBulkStats {
fn to_json(&self, context: &BenchReadBulkContext<'_>) -> serde_json::Value {
let calls_per_second = self.calls_per_second(context.steady_elapsed);
-let success_summary = percentile_summary(&self.success_latencies_ms);
+let latency_summary = percentile_summary(&self.latencies_ms);
let failure_summary = percentile_summary(&self.failure_latencies_ms);
serde_json::json!({
"language": "rust",
@@ -1163,7 +1163,7 @@ impl BenchReadBulkStats {
"totalReadResults": self.total_read_results,
"cachedReadResults": self.cached_read_results,
"callsPerSecond": round_to(calls_per_second, 2),
-"latencyMs": success_summary,
+"latencyMs": latency_summary,
"failureLatencyMs": failure_summary,
"firstFailure": self.first_failure,
})
@@ -1737,7 +1737,7 @@ mod tests {
}
#[test]
-fn bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram() {
+fn bench_read_bulk_stats_tracks_all_calls_in_latency_and_failures_separately() {
use mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult;
use mxgateway_client::Error;
@@ -1753,8 +1753,10 @@ mod tests {
..BulkReadResult::default()
};
-// Two fast successes and one slow failure: the slow failure must
-// not pollute the success p99/max histogram.
+// Two fast successes and one slow failure: every call lands in the
+// all-calls histogram (the cross-language `latencyMs` contract) and
+// the failure additionally surfaces through `failureLatencyMs` plus
+// `firstFailure` as Rust-only enrichment.
stats.record_success(1.5, std::slice::from_ref(&cached));
stats.record_success(2.0, std::slice::from_ref(&uncached));
let failure = Error::MalformedReply {
@@ -1762,7 +1764,7 @@ mod tests {
};
stats.record_failure(1_500.0, &failure);
-assert_eq!(stats.success_latencies_ms, vec![1.5, 2.0]);
+assert_eq!(stats.latencies_ms, vec![1.5, 2.0, 1_500.0]);
assert_eq!(stats.failure_latencies_ms, vec![1_500.0]);
assert_eq!(stats.successful_calls, 2);
assert_eq!(stats.failed_calls, 1);
@@ -1786,10 +1788,12 @@ mod tests {
tags: &[],
};
let payload = stats.to_json(&context);
-// The success-latency histogram must never see the 1_500 ms failure.
-assert_eq!(payload["latencyMs"]["max"].as_f64().unwrap(), 2.0);
-assert!(payload["latencyMs"]["p99"].as_f64().unwrap() <= 2.0);
-// The failure-latency histogram must own it instead.
+// The all-calls histogram (cross-language `latencyMs` contract)
+// includes the failure latency so the side-by-side comparison with
+// .NET/Go/Python/Java stays apples-to-apples.
+assert_eq!(payload["latencyMs"]["max"].as_f64().unwrap(), 1_500.0);
+// The Rust-only `failureLatencyMs` enrichment surfaces failures
+// separately for partial-failure diagnostics.
assert_eq!(
payload["failureLatencyMs"]["max"].as_f64().unwrap(),
1_500.0