Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and
fixed them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document
  the mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the
  CLI FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression
  in generated tonic code by reformatting the ReadBulkCommand proto
  comment and scoping a #![allow(...)] to the generated submodules.
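
For illustration only, the scoping idea behind Client.Rust-013 can be sketched as below. The `generated` module and `describe_read_bulk` are hypothetical stand-ins, not the crate's real layout; the actual code pulls tonic output into its submodules. The point shown is that an inner `#![allow(...)]` placed at the top of a module silences the lint for that module only, so hand-written code elsewhere stays under the lint.

```rust
// Hypothetical stand-in for a module that holds generated code.
pub mod generated {
    // Inner attribute: covers every item inside this module and nothing else.
    #![allow(clippy::doc_lazy_continuation)]

    /// Summary of a hypothetical generated message.
    ///
    /// - first list item copied from a proto comment
    /// continuation line that is not indented under the bullet
    pub fn describe_read_bulk() -> &'static str {
        "ReadBulkCommand"
    }
}
```

Calling `generated::describe_read_bulk()` from outside the module works as usual; only the lint scope changes.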

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and
  make DisposeAsync race-safe against in-flight CloseAsync (-016);
  add constraint-enforcement test coverage for the bulk-plan path
  (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop
  can distinguish graceful shutdown from a real STA-affinity
  violation (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow
  ReadBulk no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the
  11 GatewaySession bulk methods (-013); replace the real TCP probe
  in GalaxyHierarchyCacheTests with an IGalaxyRepository fake
  (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write
  test and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and
  abnormal worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races
  beforeStart() (-014); return a CancellingCompletableFuture that
  actually forwards cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in
  the CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for
  the five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write
  command family (-009).
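
As a rough illustration of why Client.Rust-015 matters, the sketch below shows how a single slow failure dominates p99 when failures share a histogram with successes. `nearest_rank` and `split_latencies` are hypothetical helpers written for this note; the nearest-rank method is an assumption, and the real `percentile_summary` may compute percentiles differently.

```rust
/// Nearest-rank percentile over an unsorted sample; returns 0.0 when empty.
/// Assumption: this is one common definition, not necessarily the CLI's.
fn nearest_rank(sample: &[f64], pct: f64) -> f64 {
    if sample.is_empty() {
        return 0.0;
    }
    let mut sorted = sample.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // 1-based rank: ceil(pct / 100 * n), clamped into [1, n].
    let rank = ((pct / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Splits `(elapsed_ms, succeeded)` pairs into success and failure samples.
fn split_latencies(calls: &[(f64, bool)]) -> (Vec<f64>, Vec<f64>) {
    let mut ok = Vec::new();
    let mut failed = Vec::new();
    for &(elapsed_ms, succeeded) in calls {
        if succeeded {
            ok.push(elapsed_ms);
        } else {
            failed.push(elapsed_ms);
        }
    }
    (ok, failed)
}
```

With two successes at 1.5 ms and 2.0 ms and one failure at 1500 ms, the mixed sample has a p99 of 1500 ms, while the success-only sample has a p99 of 2.0 ms.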

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the
  duplicate dashboard @page routes; refresh IAlarmRpcDispatcher
  XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into
  three files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real
  TaskCompletionSource; switch the heartbeat-expires test to
  ManualTimeProvider; add InvariantCulture to the remaining
  DateTimeOffset.Parse sites; document GalaxyFilterInputSafetyTests
  in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default
  connection string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact
  so tests skip rather than pass when env vars are absent; rename
  probe [Fact]s to PascalCase; deterministic deadline test; new
  frame-protocol error tests; ComputeTransitions diff-coverage;
  relocate dev-rig probes to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport
  faults through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in
  writeUsage; drain galaxy-watch events on limit; switch io.EOF
  comparisons to errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout
  option; regex-based credential redaction; Long.toUnsignedString
  for uint64 sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close
  stop hard-coding correlation IDs; resync RustClientDesign.md
  with the current Session / Error surface and CLI subcommand set.
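
One way a helper like `next_correlation_id()` could work is sketched below. This is an assumption made for illustration: the commit only shows the call sites (`next_correlation_id("cli-ping")`), not the implementation, so the counter, the `NEXT_ID` static, and the `<prefix>-<n>` format here are all hypothetical.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical process-wide counter; not the crate's actual state.
static NEXT_ID: AtomicU64 = AtomicU64::new(1);

/// Returns `<prefix>-<n>`, where `n` is unique within this process.
pub fn next_correlation_id(prefix: &str) -> String {
    // Relaxed ordering is enough: only uniqueness of `n` matters here.
    let n = NEXT_ID.fetch_add(1, Ordering::Relaxed);
    format!("{prefix}-{n}")
}
```

Two consecutive calls with the same prefix yield different IDs, which is what lets the CLI stop sending a fixed string such as `rust-cli-ping` on every request.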

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 09:46:47 -04:00
parent 1cd51bbda3
commit a0203503a7
122 changed files with 8723 additions and 757 deletions
+199 -51
@@ -447,7 +447,9 @@ async fn run(cli: Cli) -> Result<(), Error> {
 let client = connect(connection).await?;
 let reply = client
     .invoke(MxCommandRequest {
-        client_correlation_id: "rust-cli-ping".to_owned(),
+        client_correlation_id: mxgateway_client::session::next_correlation_id(
+            "cli-ping",
+        ),
         command: Some(MxCommand {
             kind: MxCommandKind::Ping as i32,
             payload: Some(mxgateway_client::generated::mxaccess_gateway::v1::mx_command::Payload::Ping(
@@ -494,7 +496,9 @@ async fn run(cli: Cli) -> Result<(), Error> {
 let reply = client
     .close_session_raw(CloseSessionRequest {
         session_id,
-        client_correlation_id: "rust-cli-close-session".to_owned(),
+        client_correlation_id: mxgateway_client::session::next_correlation_id(
+            "cli-close-session",
+        ),
     })
     .await?;
 if json {
@@ -1034,19 +1038,13 @@ async fn run_bench_read_bulk(
         .map(|r| r.item_handle)
         .collect();

-    let warmup_deadline = std::time::Instant::now()
-        + std::time::Duration::from_secs(warmup_seconds);
+    let warmup_deadline =
+        std::time::Instant::now() + std::time::Duration::from_secs(warmup_seconds);
     while std::time::Instant::now() < warmup_deadline {
-        let _ = session
-            .read_bulk(server_handle, &tags, timeout_ms)
-            .await;
+        let _ = session.read_bulk(server_handle, &tags, timeout_ms).await;
     }

-    let mut latencies_ms: Vec<f64> = Vec::with_capacity(65_536);
-    let mut total_read_results: u64 = 0;
-    let mut cached_read_results: u64 = 0;
-    let mut successful_calls: u64 = 0;
-    let mut failed_calls: u64 = 0;
+    let mut stats = BenchReadBulkStats::default();

     let steady_start = std::time::Instant::now();
     let steady_deadline = steady_start + std::time::Duration::from_secs(duration_seconds);
@@ -1054,18 +1052,9 @@ async fn run_bench_read_bulk(
         let call_start = std::time::Instant::now();
         let outcome = session.read_bulk(server_handle, &tags, timeout_ms).await;
         let elapsed_ms = call_start.elapsed().as_secs_f64() * 1000.0;
-        latencies_ms.push(elapsed_ms);
         match outcome {
-            Ok(results) => {
-                successful_calls += 1;
-                for r in &results {
-                    total_read_results += 1;
-                    if r.was_cached {
-                        cached_read_results += 1;
-                    }
-                }
-            }
-            Err(_) => failed_calls += 1,
+            Ok(results) => stats.record_success(elapsed_ms, &results),
+            Err(error) => stats.record_failure(elapsed_ms, &error),
         }
     }
     let steady_elapsed = steady_start.elapsed();
@@ -1074,36 +1063,20 @@ async fn run_bench_read_bulk(
         let _ = session.unsubscribe_bulk(server_handle, item_handles).await;
     }
-    let total_calls = successful_calls + failed_calls;
-    let calls_per_second = if steady_elapsed.as_secs_f64() > 0.0 {
-        total_calls as f64 / steady_elapsed.as_secs_f64()
-    } else {
-        0.0
+    let context = BenchReadBulkContext {
+        endpoint: &endpoint,
+        client_name: &client_name,
+        bulk_size,
+        duration_seconds,
+        warmup_seconds,
+        steady_elapsed,
+        tags: &tags,
     };
-    let summary = percentile_summary(&latencies_ms);
-    let stats = serde_json::json!({
-        "language": "rust",
-        "command": "bench-read-bulk",
-        "endpoint": endpoint,
-        "clientName": client_name,
-        "bulkSize": bulk_size,
-        "durationSeconds": duration_seconds,
-        "warmupSeconds": warmup_seconds,
-        "durationMs": steady_elapsed.as_millis() as u64,
-        "tags": tags,
-        "totalCalls": total_calls,
-        "successfulCalls": successful_calls,
-        "failedCalls": failed_calls,
-        "totalReadResults": total_read_results,
-        "cachedReadResults": cached_read_results,
-        "callsPerSecond": round_to(calls_per_second, 2),
-        "latencyMs": summary,
-    });
+    let json_stats = stats.to_json(&context);
     if use_json {
-        println!("{}", stats);
+        println!("{}", json_stats);
     } else {
-        println!("{calls_per_second}");
+        println!("{}", stats.calls_per_second(steady_elapsed));
     }
     Ok::<(), Error>(())
 }
@@ -1113,6 +1086,102 @@ async fn run_bench_read_bulk(
     bench_outcome
 }

+/// Per-iteration accounting for `bench-read-bulk`.
+///
+/// Only successful `read_bulk` calls contribute to the success-latency
+/// histogram (`success_latencies_ms`). Failures are tracked separately in
+/// `failure_latencies_ms` and the first failure's redacted error string is
+/// stashed in `first_failure` so a partial-failure run is visible in the
+/// emitted JSON. This keeps the cross-language `latencyMs.p99`/`max`
+/// contract honest: it reports successful-call latency only and never
+/// folds in a per-call timeout from a failed RPC.
+#[derive(Default)]
+struct BenchReadBulkStats {
+    success_latencies_ms: Vec<f64>,
+    failure_latencies_ms: Vec<f64>,
+    total_read_results: u64,
+    cached_read_results: u64,
+    successful_calls: u64,
+    failed_calls: u64,
+    first_failure: Option<String>,
+}
+
+impl BenchReadBulkStats {
+    fn record_success(
+        &mut self,
+        elapsed_ms: f64,
+        results: &[mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult],
+    ) {
+        self.success_latencies_ms.push(elapsed_ms);
+        self.successful_calls += 1;
+        for result in results {
+            self.total_read_results += 1;
+            if result.was_cached {
+                self.cached_read_results += 1;
+            }
+        }
+    }
+
+    fn record_failure(&mut self, elapsed_ms: f64, error: &Error) {
+        self.failure_latencies_ms.push(elapsed_ms);
+        self.failed_calls += 1;
+        if self.first_failure.is_none() {
+            self.first_failure = Some(error.to_string());
+        }
+    }
+
+    fn total_calls(&self) -> u64 {
+        self.successful_calls + self.failed_calls
+    }
+
+    fn calls_per_second(&self, elapsed: std::time::Duration) -> f64 {
+        let seconds = elapsed.as_secs_f64();
+        if seconds > 0.0 {
+            self.total_calls() as f64 / seconds
+        } else {
+            0.0
+        }
+    }
+
+    fn to_json(&self, context: &BenchReadBulkContext<'_>) -> serde_json::Value {
+        let calls_per_second = self.calls_per_second(context.steady_elapsed);
+        let success_summary = percentile_summary(&self.success_latencies_ms);
+        let failure_summary = percentile_summary(&self.failure_latencies_ms);
+        serde_json::json!({
+            "language": "rust",
+            "command": "bench-read-bulk",
+            "endpoint": context.endpoint,
+            "clientName": context.client_name,
+            "bulkSize": context.bulk_size,
+            "durationSeconds": context.duration_seconds,
+            "warmupSeconds": context.warmup_seconds,
+            "durationMs": context.steady_elapsed.as_millis() as u64,
+            "tags": context.tags,
+            "totalCalls": self.total_calls(),
+            "successfulCalls": self.successful_calls,
+            "failedCalls": self.failed_calls,
+            "totalReadResults": self.total_read_results,
+            "cachedReadResults": self.cached_read_results,
+            "callsPerSecond": round_to(calls_per_second, 2),
+            "latencyMs": success_summary,
+            "failureLatencyMs": failure_summary,
+            "firstFailure": self.first_failure,
+        })
+    }
+}
+
+/// Static configuration for one `bench-read-bulk` run, packaged so the
+/// JSON serialiser can quote it back without taking eight positional args.
+struct BenchReadBulkContext<'a> {
+    endpoint: &'a str,
+    client_name: &'a str,
+    bulk_size: usize,
+    duration_seconds: u64,
+    warmup_seconds: u64,
+    steady_elapsed: std::time::Duration,
+    tags: &'a [String],
+}
+
 fn percentile_summary(sample: &[f64]) -> serde_json::Value {
     if sample.is_empty() {
         return serde_json::json!({ "p50": 0.0, "p95": 0.0, "p99": 0.0, "max": 0.0, "mean": 0.0 });
@@ -1294,7 +1363,13 @@ fn build_write_bulk_entries(
     item_handles: &[i32],
     value_type: CliValueType,
     values: &[String],
-) -> Result<Vec<(i32, mxgateway_client::generated::mxaccess_gateway::v1::MxValue)>, Error> {
+) -> Result<
+    Vec<(
+        i32,
+        mxgateway_client::generated::mxaccess_gateway::v1::MxValue,
+    )>,
+    Error,
+> {
     if item_handles.len() != values.len() {
         return Err(Error::InvalidArgument {
             name: "values".to_owned(),
@@ -1660,4 +1735,77 @@ mod tests {
         assert_eq!(frac.seconds, utc.seconds);
         assert_eq!(frac.nanos, 250_000_000);
     }
+
+    #[test]
+    fn bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram() {
+        use mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult;
+        use mxgateway_client::Error;
+        let mut stats = super::BenchReadBulkStats::default();
+        let cached = BulkReadResult {
+            was_cached: true,
+            was_successful: true,
+            ..BulkReadResult::default()
+        };
+        let uncached = BulkReadResult {
+            was_cached: false,
+            was_successful: true,
+            ..BulkReadResult::default()
+        };
+        // Two fast successes and one slow failure: the slow failure must
+        // not pollute the success p99/max histogram.
+        stats.record_success(1.5, std::slice::from_ref(&cached));
+        stats.record_success(2.0, std::slice::from_ref(&uncached));
+        let failure = Error::MalformedReply {
+            detail: "synthetic failure for the bench test".to_owned(),
+        };
+        stats.record_failure(1_500.0, &failure);
+        assert_eq!(stats.success_latencies_ms, vec![1.5, 2.0]);
+        assert_eq!(stats.failure_latencies_ms, vec![1_500.0]);
+        assert_eq!(stats.successful_calls, 2);
+        assert_eq!(stats.failed_calls, 1);
+        assert_eq!(stats.total_calls(), 3);
+        assert_eq!(stats.total_read_results, 2);
+        assert_eq!(stats.cached_read_results, 1);
+        assert!(stats
+            .first_failure
+            .as_deref()
+            .unwrap()
+            .contains("synthetic failure"));
+        let elapsed = std::time::Duration::from_secs(1);
+        let context = super::BenchReadBulkContext {
+            endpoint: "http://fake",
+            client_name: "client",
+            bulk_size: 2,
+            duration_seconds: 1,
+            warmup_seconds: 0,
+            steady_elapsed: elapsed,
+            tags: &[],
+        };
+        let payload = stats.to_json(&context);
+        // The success-latency histogram must never see the 1_500 ms failure.
+        assert_eq!(payload["latencyMs"]["max"].as_f64().unwrap(), 2.0);
+        assert!(payload["latencyMs"]["p99"].as_f64().unwrap() <= 2.0);
+        // The failure-latency histogram must own it instead.
+        assert_eq!(
+            payload["failureLatencyMs"]["max"].as_f64().unwrap(),
+            1_500.0
+        );
+        assert_eq!(payload["failedCalls"].as_u64().unwrap(), 1);
+        assert_eq!(payload["successfulCalls"].as_u64().unwrap(), 2);
+        assert!(payload["firstFailure"]
+            .as_str()
+            .unwrap()
+            .contains("synthetic failure"));
+    }
+
+    #[test]
+    fn bench_read_bulk_stats_calls_per_second_handles_zero_duration() {
+        let stats = super::BenchReadBulkStats::default();
+        assert_eq!(stats.calls_per_second(std::time::Duration::ZERO), 0.0);
+    }
 }