Code-review 2026-05-20 sweep: re-review at 1cd51bb, resolve 72 findings across all 11 modules

Re-reviewed every module/client against the 10-category checklist
(REVIEW-PROCESS.md) at commit 1cd51bb, filed 72 new findings, and fixed
them in three priority waves (3 High, 17 Medium, 52 Low).

Highs
- Server-017: enumerate AcknowledgeAlarm / QueryActiveAlarms in
  GatewayGrpcScopeResolver so non-admin keys can use them; document the
  mapping in docs/Authorization.md; add interceptor tests.
- Client.Java-013: add the five missing bulk-method stubs to the CLI
  FakeSession so the test module compiles on a clean tree.
- Client.Rust-013: fix the clippy::doc_lazy_continuation regression in
  generated tonic code by reformatting the ReadBulkCommand proto comment
  and scoping a #![allow(...)] to the generated submodules.

Mediums (highlights)
- Server: unify GatewaySession state-lock discipline (-015) and make
  DisposeAsync race-safe against in-flight CloseAsync (-016); add
  constraint-enforcement test coverage for the bulk-plan path (-021).
- Worker: introduce StaRuntimeShutdownException so RunAlarmPollLoop can
  distinguish graceful shutdown from a real STA-affinity violation
  (-016); have the watchdog skip StaHung while
  CurrentCommandCorrelationId is non-empty so a legitimate slow ReadBulk
  no longer self-faults (-017).
- Tests: add per-method round-trip + cancellation coverage for the 11
  GatewaySession bulk methods (-013); replace the real TCP probe in
  GalaxyHierarchyCacheTests with an IGalaxyRepository fake (-016).
- IntegrationTests: drive the StreamEvents writer in the live Write test
  and assert OnWriteComplete (-012); add live tests for
  Unadvise/RemoveItem/Unregister ordering, WriteSecured, and abnormal
  worker exit (-014).
- Worker.Tests: replace MxAccessSession reflection with an internal
  CreateForTesting factory (-016); cover WorkerCancel and
  unexpected-body envelope branches (-017).
- Client.Java: cancel MxEventStream when close() races beforeStart()
  (-014); return a CancellingCompletableFuture that actually forwards
  cancellation through .thenApply chains (-015).
- Client.Python: drop the silent localhost-plaintext downgrade in the
  CLI; require explicit --plaintext (-013).
- Client.Rust: stop bench-read-bulk from polluting success-latency
  histograms with failed-call durations (-015); add coverage for the
  five MalformedReply paths, the bulk-write helpers, the
  Error::Unavailable mapping, and the unary-fault path (-016).
- Contracts: extend docs/Contracts.md with the bulk read/write command
  family (-009).

Lows (highlights)
- Server: cap GalaxyGlobMatcher.RegexCache; align
  WorkerAlarmRpcDispatcher missing-session handling; drop the duplicate
  dashboard @page routes; refresh IAlarmRpcDispatcher XML doc.
- Worker: surface SetXmlAlarmQuery COM failures; remove dead
  subscriptionExpression / ExecutingCommand arms; preserve
  factory-supplied runtime sessions; split MxAlarmSnapshot.cs into three
  files.
- Tests: dispose the WebApplication in seven test classes; rebuild
  FakeWorkerProcess.WaitForExitAsync against a real TaskCompletion
  source; switch the heartbeat-expires test to ManualTimeProvider; add
  InvariantCulture to the remaining DateTimeOffset.Parse sites; document
  GalaxyFilterInputSafetyTests in GatewayTesting.md.
- IntegrationTests: comment fixes, RecordingServerStreamWriter
  IDisposable, class-level [Trait], single-source ZB default connection
  string.
- Worker.Tests: replace silent-return gating with LiveMxAccessFact so
  absent env vars SKIP not pass; PascalCase rename of probe [Fact]s;
  deterministic deadline test; new frame-protocol error tests;
  ComputeTransitions diff-coverage; relocate dev-rig probes to Probes/.
- Contracts: add round-trip coverage and per-field redaction /
  Galaxy-identifier comments to the protos.
- Client.Dotnet: introduce clients/dotnet/Directory.Build.props so
  TreatWarningsAsErrors / analysers apply; document
  DiscoverHierarchyOptions and IMxGatewayCliClient; require typed
  bulk-read handles in CLI; surface AcknowledgeAlarm transport faults
  through Translate().
- Client.Go: kill dead code in alarms_test / fakeGalaxyServer /
  runWriteBulkVariant; document the six new subcommands in writeUsage;
  drain galaxy-watch events on limit; switch io.EOF comparisons to
  errors.Is.
- Client.Java: shared shutdown helpers + new shutdownTimeout option;
  regex-based credential redaction; Long.toUnsignedString for uint64
  sequence; doc fixes.
- Client.Python: combine duplicate imports; add coverage for
  _percentile / bench-read-bulk / MAX_AGGREGATE_EVENTS /
  _api_key_from_env; populate pyproject metadata and ship py.typed.
- Client.Rust: expose next_correlation_id() so CLI ping/close stop
  hard-coding correlation IDs; resync RustClientDesign.md with the
  current Session / Error surface and CLI subcommand set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 09:46:47 -04:00
parent 1cd51bbda3
commit a0203503a7
122 changed files with 8723 additions and 757 deletions
@@ -447,7 +447,9 @@ async fn run(cli: Cli) -> Result<(), Error> {
            let client = connect(connection).await?;
            let reply = client
                .invoke(MxCommandRequest {
-                    client_correlation_id: "rust-cli-ping".to_owned(),
+                    client_correlation_id: mxgateway_client::session::next_correlation_id(
+                        "cli-ping",
+                    ),
                    command: Some(MxCommand {
                        kind: MxCommandKind::Ping as i32,
                        payload: Some(mxgateway_client::generated::mxaccess_gateway::v1::mx_command::Payload::Ping(
@@ -494,7 +496,9 @@ async fn run(cli: Cli) -> Result<(), Error> {
            let reply = client
                .close_session_raw(CloseSessionRequest {
                    session_id,
-                    client_correlation_id: "rust-cli-close-session".to_owned(),
+                    client_correlation_id: mxgateway_client::session::next_correlation_id(
+                        "cli-close-session",
+                    ),
                })
                .await?;
            if json {
@@ -1034,19 +1038,13 @@ async fn run_bench_read_bulk(
            .map(|r| r.item_handle)
            .collect();

-        let warmup_deadline = std::time::Instant::now()
-            + std::time::Duration::from_secs(warmup_seconds);
+        let warmup_deadline =
+            std::time::Instant::now() + std::time::Duration::from_secs(warmup_seconds);
        while std::time::Instant::now() < warmup_deadline {
-            let _ = session
-                .read_bulk(server_handle, &tags, timeout_ms)
-                .await;
+            let _ = session.read_bulk(server_handle, &tags, timeout_ms).await;
        }

-        let mut latencies_ms: Vec<f64> = Vec::with_capacity(65_536);
-        let mut total_read_results: u64 = 0;
-        let mut cached_read_results: u64 = 0;
-        let mut successful_calls: u64 = 0;
-        let mut failed_calls: u64 = 0;
+        let mut stats = BenchReadBulkStats::default();
        let steady_start = std::time::Instant::now();
        let steady_deadline = steady_start + std::time::Duration::from_secs(duration_seconds);

@@ -1054,18 +1052,9 @@ async fn run_bench_read_bulk(
            let call_start = std::time::Instant::now();
            let outcome = session.read_bulk(server_handle, &tags, timeout_ms).await;
            let elapsed_ms = call_start.elapsed().as_secs_f64() * 1000.0;
-            latencies_ms.push(elapsed_ms);
            match outcome {
-                Ok(results) => {
-                    successful_calls += 1;
-                    for r in &results {
-                        total_read_results += 1;
-                        if r.was_cached {
-                            cached_read_results += 1;
-                        }
-                    }
-                }
-                Err(_) => failed_calls += 1,
+                Ok(results) => stats.record_success(elapsed_ms, &results),
+                Err(error) => stats.record_failure(elapsed_ms, &error),
            }
        }
        let steady_elapsed = steady_start.elapsed();
@@ -1074,36 +1063,20 @@ async fn run_bench_read_bulk(
            let _ = session.unsubscribe_bulk(server_handle, item_handles).await;
        }

-        let total_calls = successful_calls + failed_calls;
-        let calls_per_second = if steady_elapsed.as_secs_f64() > 0.0 {
-            total_calls as f64 / steady_elapsed.as_secs_f64()
-        } else {
-            0.0
+        let context = BenchReadBulkContext {
+            endpoint: &endpoint,
+            client_name: &client_name,
+            bulk_size,
+            duration_seconds,
+            warmup_seconds,
+            steady_elapsed,
+            tags: &tags,
        };
-
-        let summary = percentile_summary(&latencies_ms);
-        let stats = serde_json::json!({
-            "language": "rust",
-            "command": "bench-read-bulk",
-            "endpoint": endpoint,
-            "clientName": client_name,
-            "bulkSize": bulk_size,
-            "durationSeconds": duration_seconds,
-            "warmupSeconds": warmup_seconds,
-            "durationMs": steady_elapsed.as_millis() as u64,
-            "tags": tags,
-            "totalCalls": total_calls,
-            "successfulCalls": successful_calls,
-            "failedCalls": failed_calls,
-            "totalReadResults": total_read_results,
-            "cachedReadResults": cached_read_results,
-            "callsPerSecond": round_to(calls_per_second, 2),
-            "latencyMs": summary,
-        });
+        let json_stats = stats.to_json(&context);
        if use_json {
-            println!("{}", stats);
+            println!("{}", json_stats);
        } else {
-            println!("{calls_per_second}");
+            println!("{}", stats.calls_per_second(steady_elapsed));
        }
        Ok::<(), Error>(())
    }
@@ -1113,6 +1086,102 @@ async fn run_bench_read_bulk(
    bench_outcome
 }

+/// Per-iteration accounting for `bench-read-bulk`.
+///
+/// Only successful `read_bulk` calls contribute to the success-latency
+/// histogram (`success_latencies_ms`). Failures are tracked separately in
+/// `failure_latencies_ms` and the first failure's redacted error string is
+/// stashed in `first_failure` so a partial-failure run is visible in the
+/// emitted JSON. This keeps the cross-language `latencyMs.p99`/`max`
+/// contract honest: it reports successful-call latency only and never
+/// folds in a per-call timeout from a failed RPC.
+#[derive(Default)]
+struct BenchReadBulkStats {
+    success_latencies_ms: Vec<f64>,
+    failure_latencies_ms: Vec<f64>,
+    total_read_results: u64,
+    cached_read_results: u64,
+    successful_calls: u64,
+    failed_calls: u64,
+    first_failure: Option<String>,
+}
+
+impl BenchReadBulkStats {
+    fn record_success(
+        &mut self,
+        elapsed_ms: f64,
+        results: &[mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult],
+    ) {
+        self.success_latencies_ms.push(elapsed_ms);
+        self.successful_calls += 1;
+        for result in results {
+            self.total_read_results += 1;
+            if result.was_cached {
+                self.cached_read_results += 1;
+            }
+        }
+    }
+
+    fn record_failure(&mut self, elapsed_ms: f64, error: &Error) {
+        self.failure_latencies_ms.push(elapsed_ms);
+        self.failed_calls += 1;
+        if self.first_failure.is_none() {
+            self.first_failure = Some(error.to_string());
+        }
+    }
+
+    fn total_calls(&self) -> u64 {
+        self.successful_calls + self.failed_calls
+    }
+
+    fn calls_per_second(&self, elapsed: std::time::Duration) -> f64 {
+        let seconds = elapsed.as_secs_f64();
+        if seconds > 0.0 {
+            self.total_calls() as f64 / seconds
+        } else {
+            0.0
+        }
+    }
+
+    fn to_json(&self, context: &BenchReadBulkContext<'_>) -> serde_json::Value {
+        let calls_per_second = self.calls_per_second(context.steady_elapsed);
+        let success_summary = percentile_summary(&self.success_latencies_ms);
+        let failure_summary = percentile_summary(&self.failure_latencies_ms);
+        serde_json::json!({
+            "language": "rust",
+            "command": "bench-read-bulk",
+            "endpoint": context.endpoint,
+            "clientName": context.client_name,
+            "bulkSize": context.bulk_size,
+            "durationSeconds": context.duration_seconds,
+            "warmupSeconds": context.warmup_seconds,
+            "durationMs": context.steady_elapsed.as_millis() as u64,
+            "tags": context.tags,
+            "totalCalls": self.total_calls(),
+            "successfulCalls": self.successful_calls,
+            "failedCalls": self.failed_calls,
+            "totalReadResults": self.total_read_results,
+            "cachedReadResults": self.cached_read_results,
+            "callsPerSecond": round_to(calls_per_second, 2),
+            "latencyMs": success_summary,
+            "failureLatencyMs": failure_summary,
+            "firstFailure": self.first_failure,
+        })
+    }
+}
+
+/// Static configuration for one `bench-read-bulk` run, packaged so the
+/// JSON serialiser can quote it back without taking eight positional args.
+struct BenchReadBulkContext<'a> {
+    endpoint: &'a str,
+    client_name: &'a str,
+    bulk_size: usize,
+    duration_seconds: u64,
+    warmup_seconds: u64,
+    steady_elapsed: std::time::Duration,
+    tags: &'a [String],
+}
+
 fn percentile_summary(sample: &[f64]) -> serde_json::Value {
    if sample.is_empty() {
        return serde_json::json!({ "p50": 0.0, "p95": 0.0, "p99": 0.0, "max": 0.0, "mean": 0.0 });
@@ -1294,7 +1363,13 @@ fn build_write_bulk_entries(
    item_handles: &[i32],
    value_type: CliValueType,
    values: &[String],
-) -> Result<Vec<(i32, mxgateway_client::generated::mxaccess_gateway::v1::MxValue)>, Error> {
+) -> Result<
+    Vec<(
+        i32,
+        mxgateway_client::generated::mxaccess_gateway::v1::MxValue,
+    )>,
+    Error,
+> {
    if item_handles.len() != values.len() {
        return Err(Error::InvalidArgument {
            name: "values".to_owned(),
@@ -1660,4 +1735,77 @@ mod tests {
        assert_eq!(frac.seconds, utc.seconds);
        assert_eq!(frac.nanos, 250_000_000);
    }
+
+    #[test]
+    fn bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram() {
+        use mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult;
+        use mxgateway_client::Error;
+
+        let mut stats = super::BenchReadBulkStats::default();
+        let cached = BulkReadResult {
+            was_cached: true,
+            was_successful: true,
+            ..BulkReadResult::default()
+        };
+        let uncached = BulkReadResult {
+            was_cached: false,
+            was_successful: true,
+            ..BulkReadResult::default()
+        };
+
+        // Two fast successes and one slow failure: the slow failure must
+        // not pollute the success p99/max histogram.
+        stats.record_success(1.5, std::slice::from_ref(&cached));
+        stats.record_success(2.0, std::slice::from_ref(&uncached));
+        let failure = Error::MalformedReply {
+            detail: "synthetic failure for the bench test".to_owned(),
+        };
+        stats.record_failure(1_500.0, &failure);
+
+        assert_eq!(stats.success_latencies_ms, vec![1.5, 2.0]);
+        assert_eq!(stats.failure_latencies_ms, vec![1_500.0]);
+        assert_eq!(stats.successful_calls, 2);
+        assert_eq!(stats.failed_calls, 1);
+        assert_eq!(stats.total_calls(), 3);
+        assert_eq!(stats.total_read_results, 2);
+        assert_eq!(stats.cached_read_results, 1);
+        assert!(stats
+            .first_failure
+            .as_deref()
+            .unwrap()
+            .contains("synthetic failure"));
+
+        let elapsed = std::time::Duration::from_secs(1);
+        let context = super::BenchReadBulkContext {
+            endpoint: "http://fake",
+            client_name: "client",
+            bulk_size: 2,
+            duration_seconds: 1,
+            warmup_seconds: 0,
+            steady_elapsed: elapsed,
+            tags: &[],
+        };
+        let payload = stats.to_json(&context);
+        // The success-latency histogram must never see the 1_500 ms failure.
+        assert_eq!(payload["latencyMs"]["max"].as_f64().unwrap(), 2.0);
+        assert!(payload["latencyMs"]["p99"].as_f64().unwrap() <= 2.0);
+        // The failure-latency histogram must own it instead.
+        assert_eq!(
+            payload["failureLatencyMs"]["max"].as_f64().unwrap(),
+            1_500.0
+        );
+        assert_eq!(payload["failedCalls"].as_u64().unwrap(), 1);
+        assert_eq!(payload["successfulCalls"].as_u64().unwrap(), 2);
+        assert!(payload["firstFailure"]
+            .as_str()
+            .unwrap()
+            .contains("synthetic failure"));
+    }
+
+    #[test]
+    fn bench_read_bulk_stats_calls_per_second_handles_zero_duration() {
+        let stats = super::BenchReadBulkStats::default();
+
+        assert_eq!(stats.calls_per_second(std::time::Duration::ZERO), 0.0);
+    }
 }
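
Reviewer note on Client.Rust-015: the standalone sketch below illustrates why the bench now keeps failed-call durations out of the success histogram. It is not the code in this commit. `nearest_rank` and `split_latencies` are hypothetical helpers written for this note, and the nearest-rank rule is an assumption, since the body of `percentile_summary` is not shown in the diff and may compute percentiles differently.

```rust
/// Nearest-rank percentile over an unsorted sample; returns 0.0 when empty.
/// Hypothetical helper: the real `percentile_summary` may use another rule.
fn nearest_rank(sample: &[f64], percentile: f64) -> f64 {
    if sample.is_empty() {
        return 0.0;
    }
    let mut sorted = sample.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    // Rank is 1-based; clamp so percentile 0 and 100 both stay in bounds.
    let rank = ((percentile / 100.0) * sorted.len() as f64).ceil() as usize;
    sorted[rank.clamp(1, sorted.len()) - 1]
}

/// Splits `(elapsed_ms, succeeded)` pairs into success and failure buckets,
/// mirroring the idea of separate success and failure latency vectors.
fn split_latencies(calls: &[(f64, bool)]) -> (Vec<f64>, Vec<f64>) {
    let mut ok = Vec::new();
    let mut failed = Vec::new();
    for &(ms, succeeded) in calls {
        if succeeded {
            ok.push(ms);
        } else {
            failed.push(ms);
        }
    }
    (ok, failed)
}

fn main() {
    // 99 fast successes plus one failed call that ran into a long timeout.
    let mut calls: Vec<(f64, bool)> = (0..99).map(|_| (2.0, true)).collect();
    calls.push((1_500.0, false));

    // Old behaviour: every call lands in one histogram.
    let mixed: Vec<f64> = calls.iter().map(|&(ms, _)| ms).collect();
    // New behaviour: successes and failures are summarised separately.
    let (ok, failed) = split_latencies(&calls);

    println!("mixed max   = {}", nearest_rank(&mixed, 100.0));
    println!("success max = {}", nearest_rank(&ok, 100.0));
    println!("failure max = {}", nearest_rank(&failed, 100.0));
}
```

With a single mixed histogram, one timed-out call dominates the reported maximum even though every successful read was fast. Splitting the buckets keeps the success figures comparable across runs while the failure duration stays visible in its own summary.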