Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings

Second re-review pass at commit a020350 caught 48 new findings, including one High-severity regression I introduced in the prior sweep, and fixed them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the string (it must be a valid SPDX expression), so `pip wheel .` and `pip install -e .` both fail before any source compiles. Tests still pass because pytest bypasses the build backend via `pythonpath`. Dropped the invalid license string, kept the `License :: Other/Proprietary License` classifier, and added `tests/test_packaging.py` so a future regression of the same shape is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace) on WorkerPipeSessionOptions bounds the in-flight-command watchdog suppression so a truly stuck COM call still triggers StaHung instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the cross-language bench comparison is apples-to-apples again; `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state serialisation pattern to DeployEventStream so close() arriving after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window stability check after UnAdvise instead of strict equality against the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every log sink the WriteSecured live test owns (worker stdout/stderr, gateway logs, direct WriteLine) so the credential is proven absent from the full output buffer, not just the diagnostic message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage for the previously-uncovered Write2Bulk and WriteSecured2Bulk arms of WriteBulkConstraintPlan.SetPayload.

Lows (41, highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024); GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025); AlarmsOptions validated at startup (Server-026); Authorization.md Constraint Enforcement snippet/prose enumerate the bulk write/read family (Server-027); bulk-read-commands and bulk-write-commands capability tokens added to OpenSession (Server-029); NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity guard the poll path uses, at every command entry (Worker-024); RunAsync null-checks the runtime-session factory result (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting rejects production sinks (Worker.Tests-026); FakeRuntimeSession's CancelCommandReturnValue serialised under lock (Worker.Tests-027); Probes namespace lifted to MxGateway.Worker.Tests.Probes (Worker.Tests-029); cancel-envelope sequence numbers monotonised (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes" section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation test backed by a TaskCompletionSource fake (Tests-022); companion FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal (Tests-023); constraint plan reply-count divergence pinned (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)] end-to-end (IntegrationTests-018); abnormal-exit keyword set tightened to pipe-disconnected/end-of-stream and the test now asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the default 30s wall-clock budget doesn't kill them (015); BenchStreamEventsAsync observes the inner stream task on every exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses RFC3339Nano with fractional seconds (019); runStreamEvents installs signal.NotifyContext like runGalaxyWatch (020); five new CLI-level table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual cancellation contract Client.Java-015 established (022); stream-events text path uses Long.toUnsignedString for worker_sequence (023); bench-read-bulk no longer pollutes success-latency histogram with failure durations (024); --shutdown-timeout CLI option propagates through to ClientOptions (025); seven new MxGatewayCliTests cover the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019); wheel-build smoke test added under tests/test_packaging.py (020); README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and document the AsRef<str> read_bulk genericism (019); next_correlation_id re-exported at the crate root, with a property-style doc contract and an explicit disclaimer that the literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual IConstraintEnforcer mechanism instead of "tag-allowlist filter" (014); BulkReadResult gains explicit per-arm payload-population documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-20 10:28:54 -04:00
parent a0203503a7
commit 1aafd6bde4
74 changed files with 3349 additions and 395 deletions
@@ -93,23 +93,38 @@ impl Session {
    pub async fn subscribe_bulk(&self, server_handle: i32, tag_addresses: Vec<String>) -> Result<Vec<SubscribeResult>, Error>;
    pub async fn unsubscribe_bulk(&self, server_handle: i32, item_handles: Vec<i32>) -> Result<Vec<SubscribeResult>, Error>;
    pub async fn write(&self, server_handle: i32, item_handle: i32, value: MxValue, user_id: i32) -> Result<(), Error>;
-    pub async fn write_bulk(&self, server_handle: i32, entries: Vec<WriteBulkEntry>, user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
-    pub async fn write2_bulk(&self, server_handle: i32, entries: Vec<Write2BulkEntry>, timestamp: prost_types::Timestamp, user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
-    pub async fn write_secured_bulk(&self, server_handle: i32, entries: Vec<WriteSecuredBulkEntry>, current_user_id: i32, verifier_user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
-    pub async fn write_secured2_bulk(&self, server_handle: i32, entries: Vec<WriteSecured2BulkEntry>, timestamp: prost_types::Timestamp, current_user_id: i32, verifier_user_id: i32) -> Result<Vec<BulkWriteResult>, Error>;
-    pub async fn read_bulk(&self, server_handle: i32, tags: &[String], timeout_ms: u32) -> Result<Vec<ReadBulkResult>, Error>;
+    pub async fn write_bulk(&self, server_handle: i32, entries: Vec<WriteBulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
+    pub async fn write2_bulk(&self, server_handle: i32, entries: Vec<Write2BulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
+    pub async fn write_secured_bulk(&self, server_handle: i32, entries: Vec<WriteSecuredBulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
+    pub async fn write_secured2_bulk(&self, server_handle: i32, entries: Vec<WriteSecured2BulkEntry>) -> Result<Vec<BulkWriteResult>, Error>;
+    pub async fn read_bulk<S: AsRef<str>>(&self, server_handle: i32, tag_addresses: &[S], timeout_ms: u32) -> Result<Vec<BulkReadResult>, Error>;
    pub async fn events(&self) -> Result<impl Stream<Item = Result<MxEvent, Error>>, Error>;
    pub async fn close(&self) -> Result<(), Error>;
 }
 ```

-The five bulk-write helpers (`write_bulk`, `write2_bulk`, `write_secured_bulk`,
+The four bulk-write helpers (`write_bulk`, `write2_bulk`, `write_secured_bulk`,
 `write_secured2_bulk`) and `read_bulk` mirror the worker's bulk command shapes
 in `mxaccess_gateway.proto` and use the same correlation-id discipline as the
-unary helpers — `session::next_correlation_id` is `pub` so that consumers
-constructing raw `MxCommandRequest`/`CloseSessionRequest` payloads outside
-the `Session` helpers (notably the `mxgw` test CLI's `ping` and
-`close-session` subcommands) share the same id generation.
+unary helpers — `next_correlation_id` is part of the public SDK surface,
+re-exported at the crate root (`mxgateway_client::next_correlation_id`), so
+that consumers constructing raw `MxCommandRequest`/`CloseSessionRequest`
+payloads outside the `Session` helpers (notably the `mxgw` test CLI's `ping`
+and `close-session` subcommands) share the same id generation. The returned
+id is documented as an opaque token with three guaranteed properties
+(embeds the caller's label, unique within a process, carries no secret);
+its textual format is intentionally *not* part of the contract.
+
+The per-entry fields that the matching MXAccess COM calls accept once per
+batch — `user_id` (`WriteBulkEntry`/`Write2BulkEntry`), `timestamp_value`
+(`Write2BulkEntry`/`WriteSecured2BulkEntry`), and `current_user_id` /
+`verifier_user_id` (`WriteSecuredBulkEntry`/`WriteSecured2BulkEntry`) — live
+on the entry structs themselves rather than as trailing positional arguments
+on the helper, matching the protobuf shapes in
+`mxaccess_gateway.proto` (`WriteBulkCommand` / `Write2BulkCommand` /
+`WriteSecuredBulkCommand` / `WriteSecured2BulkCommand`). `read_bulk` is
+generic over `AsRef<str>` so callers can pass `&[String]` or `&[&str]`
+without cloning at the call site.
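
The `AsRef<str>` bound described above can be illustrated with a minimal sketch. It is not the SDK's `read_bulk`: `collect_tag_addresses` and `example_call_sites` are hypothetical stand-ins, the tag names are invented, and the real method is async and talks to the gateway. The sketch only shows that one signature accepts both slice shapes.

```rust
/// Hypothetical stand-in for the step where `read_bulk` turns its
/// `tag_addresses` argument into owned strings for the request payload.
fn collect_tag_addresses<S: AsRef<str>>(tag_addresses: &[S]) -> Vec<String> {
    tag_addresses
        .iter()
        .map(|tag| tag.as_ref().to_owned())
        .collect()
}

/// Calls the helper once with `&[String]` and once with `&[&str]`.
/// The tag names are illustrative only.
fn example_call_sites() -> (Vec<String>, Vec<String>) {
    let owned: Vec<String> = vec!["Tank1.Level".to_string(), "Tank1.Temp".to_string()];
    let borrowed: [&str; 2] = ["Tank1.Level", "Tank1.Temp"];
    (
        collect_tag_addresses(owned.as_slice()),
        collect_tag_addresses(&borrowed[..]),
    )
}
```

With the previous `&[String]` signature, a caller holding string literals would have had to build a `Vec<String>` first; the generic bound removes that step at the call site.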

 ## Authentication

@@ -447,9 +447,7 @@ async fn run(cli: Cli) -> Result<(), Error> {
            let client = connect(connection).await?;
            let reply = client
                .invoke(MxCommandRequest {
-                    client_correlation_id: mxgateway_client::session::next_correlation_id(
-                        "cli-ping",
-                    ),
+                    client_correlation_id: mxgateway_client::next_correlation_id("cli-ping"),
                    command: Some(MxCommand {
                        kind: MxCommandKind::Ping as i32,
                        payload: Some(mxgateway_client::generated::mxaccess_gateway::v1::mx_command::Payload::Ping(
@@ -496,7 +494,7 @@ async fn run(cli: Cli) -> Result<(), Error> {
            let reply = client
                .close_session_raw(CloseSessionRequest {
                    session_id,
-                    client_correlation_id: mxgateway_client::session::next_correlation_id(
+                    client_correlation_id: mxgateway_client::next_correlation_id(
                        "cli-close-session",
                    ),
                })
@@ -1088,16 +1086,17 @@ async fn run_bench_read_bulk(

 /// Per-iteration accounting for `bench-read-bulk`.
 ///
-/// Only successful `read_bulk` calls contribute to the success-latency
-/// histogram (`success_latencies_ms`). Failures are tracked separately in
-/// `failure_latencies_ms` and the first failure's redacted error string is
-/// stashed in `first_failure` so a partial-failure run is visible in the
-/// emitted JSON. This keeps the cross-language `latencyMs.p99`/`max`
-/// contract honest: it reports successful-call latency only and never
-/// folds in a per-call timeout from a failed RPC.
+/// Every `read_bulk` call's elapsed time contributes to the all-calls
+/// histogram (`latencies_ms`), matching the .NET/Go/Python/Java bench
+/// implementations whose `latencyMs` field is the cross-language comparison
+/// contract collated by `scripts/bench-read-bulk.ps1`. Failures additionally
+/// land in `failure_latencies_ms` and the first failure's redacted error
+/// string is stashed in `first_failure`, both surfaced through the JSON as
+/// Rust-only enrichment so a partial-failure run is still visible at the
+/// report layer without breaking the side-by-side comparison.
 #[derive(Default)]
 struct BenchReadBulkStats {
-    success_latencies_ms: Vec<f64>,
+    latencies_ms: Vec<f64>,
    failure_latencies_ms: Vec<f64>,
    total_read_results: u64,
    cached_read_results: u64,
@@ -1112,7 +1111,7 @@ impl BenchReadBulkStats {
        elapsed_ms: f64,
        results: &[mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult],
    ) {
-        self.success_latencies_ms.push(elapsed_ms);
+        self.latencies_ms.push(elapsed_ms);
        self.successful_calls += 1;
        for result in results {
            self.total_read_results += 1;
@@ -1123,6 +1122,7 @@ impl BenchReadBulkStats {
    }

    fn record_failure(&mut self, elapsed_ms: f64, error: &Error) {
+        self.latencies_ms.push(elapsed_ms);
        self.failure_latencies_ms.push(elapsed_ms);
        self.failed_calls += 1;
        if self.first_failure.is_none() {
@@ -1145,7 +1145,7 @@ impl BenchReadBulkStats {

    fn to_json(&self, context: &BenchReadBulkContext<'_>) -> serde_json::Value {
        let calls_per_second = self.calls_per_second(context.steady_elapsed);
-        let success_summary = percentile_summary(&self.success_latencies_ms);
+        let latency_summary = percentile_summary(&self.latencies_ms);
        let failure_summary = percentile_summary(&self.failure_latencies_ms);
        serde_json::json!({
            "language": "rust",
@@ -1163,7 +1163,7 @@ impl BenchReadBulkStats {
            "totalReadResults": self.total_read_results,
            "cachedReadResults": self.cached_read_results,
            "callsPerSecond": round_to(calls_per_second, 2),
-            "latencyMs": success_summary,
+            "latencyMs": latency_summary,
            "failureLatencyMs": failure_summary,
            "firstFailure": self.first_failure,
        })
@@ -1737,7 +1737,7 @@ mod tests {
    }

    #[test]
-    fn bench_read_bulk_stats_keeps_failures_out_of_success_latency_histogram() {
+    fn bench_read_bulk_stats_tracks_all_calls_in_latency_and_failures_separately() {
        use mxgateway_client::generated::mxaccess_gateway::v1::BulkReadResult;
        use mxgateway_client::Error;

@@ -1753,8 +1753,10 @@ mod tests {
            ..BulkReadResult::default()
        };

-        // Two fast successes and one slow failure: the slow failure must
-        // not pollute the success p99/max histogram.
+        // Two fast successes and one slow failure: every call lands in the
+        // all-calls histogram (the cross-language `latencyMs` contract) and
+        // the failure additionally surfaces through `failureLatencyMs` plus
+        // `firstFailure` as Rust-only enrichment.
        stats.record_success(1.5, std::slice::from_ref(&cached));
        stats.record_success(2.0, std::slice::from_ref(&uncached));
        let failure = Error::MalformedReply {
@@ -1762,7 +1764,7 @@ mod tests {
        };
        stats.record_failure(1_500.0, &failure);

-        assert_eq!(stats.success_latencies_ms, vec![1.5, 2.0]);
+        assert_eq!(stats.latencies_ms, vec![1.5, 2.0, 1_500.0]);
        assert_eq!(stats.failure_latencies_ms, vec![1_500.0]);
        assert_eq!(stats.successful_calls, 2);
        assert_eq!(stats.failed_calls, 1);
@@ -1786,10 +1788,12 @@ mod tests {
            tags: &[],
        };
        let payload = stats.to_json(&context);
-        // The success-latency histogram must never see the 1_500 ms failure.
-        assert_eq!(payload["latencyMs"]["max"].as_f64().unwrap(), 2.0);
-        assert!(payload["latencyMs"]["p99"].as_f64().unwrap() <= 2.0);
-        // The failure-latency histogram must own it instead.
+        // The all-calls histogram (cross-language `latencyMs` contract)
+        // includes the failure latency so the side-by-side comparison with
+        // .NET/Go/Python/Java stays apples-to-apples.
+        assert_eq!(payload["latencyMs"]["max"].as_f64().unwrap(), 1_500.0);
+        // The Rust-only `failureLatencyMs` enrichment surfaces failures
+        // separately for partial-failure diagnostics.
        assert_eq!(
            payload["failureLatencyMs"]["max"].as_f64().unwrap(),
            1_500.0
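
The accounting that the hunks above restore can be sketched in isolation. This is a simplified, hypothetical mirror of `BenchReadBulkStats`: `LatencyStats`, `max_ms`, and `example_run` are illustrative names, and `max_ms` stands in for the `percentile_summary` helper whose body is not shown in this diff.

```rust
/// Simplified mirror of the stats struct: only the two latency vectors
/// and the call counters are modelled.
#[derive(Default)]
struct LatencyStats {
    latencies_ms: Vec<f64>,         // every call, success or failure
    failure_latencies_ms: Vec<f64>, // failed calls only
    successful_calls: u64,
    failed_calls: u64,
}

impl LatencyStats {
    fn record_success(&mut self, elapsed_ms: f64) {
        self.latencies_ms.push(elapsed_ms);
        self.successful_calls += 1;
    }

    fn record_failure(&mut self, elapsed_ms: f64) {
        // A failure is counted twice on purpose: once in the all-calls
        // histogram and once in the failure-only histogram.
        self.latencies_ms.push(elapsed_ms);
        self.failure_latencies_ms.push(elapsed_ms);
        self.failed_calls += 1;
    }
}

/// Largest sample, or `None` for an empty histogram.
fn max_ms(samples: &[f64]) -> Option<f64> {
    samples.iter().copied().reduce(f64::max)
}

/// Two fast successes and one slow failure, as in the unit test.
fn example_run() -> LatencyStats {
    let mut stats = LatencyStats::default();
    stats.record_success(1.5);
    stats.record_success(2.0);
    stats.record_failure(1_500.0);
    stats
}
```

Under this scheme the maximum of the all-calls histogram is the slow failure, which is what the updated `latencyMs` assertion checks.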
@@ -32,7 +32,7 @@ pub use galaxy::{DeployEventStream, GalaxyClient};
 #[doc(inline)]
 pub use options::ClientOptions;
 #[doc(inline)]
-pub use session::Session;
+pub use session::{next_correlation_id, Session};
 #[doc(inline)]
 pub use value::{MxArrayProjection, MxArrayValue, MxStatus, MxValue, MxValueProjection};
 #[doc(inline)]
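
A caller-side sketch of the `next_correlation_id` contract this change documents. `next_id` is a hypothetical stand-in that copies the format the doc comment lists as current, and `ids_honour_contract` is an illustrative helper. The check relies only on the documented properties (label embedded, unique within the process) and never parses the textual layout.

```rust
use std::collections::HashSet;
use std::sync::atomic::{AtomicU64, Ordering};

// Process-wide counter, analogous to the crate's internal sequence.
static SEQUENCE: AtomicU64 = AtomicU64::new(0);

/// Stand-in for `next_correlation_id`. The layout mirrors what the doc
/// comment describes as the current format, which callers must not rely on.
fn next_id(label: &str) -> String {
    let n = SEQUENCE.fetch_add(1, Ordering::Relaxed);
    format!("rust-client-{}-{}", label, n)
}

/// Generates `count` ids and checks the two properties a caller may
/// depend on: every id is distinct and every id contains the label.
fn ids_honour_contract(label: &str, count: usize) -> bool {
    let ids: Vec<String> = (0..count).map(|_| next_id(label)).collect();
    let unique: HashSet<&String> = ids.iter().collect();
    unique.len() == ids.len() && ids.iter().all(|id| id.contains(label))
}
```

A test written this way keeps passing if the format changes between releases, which is the point of treating the id as an opaque token.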
@@ -37,8 +37,20 @@ static CORRELATION_SEQUENCE: AtomicU64 = AtomicU64::new(0);
 /// Exposed so consumers that construct raw [`MxCommandRequest`] /
 /// [`CloseSessionRequest`] payloads outside the `Session` helpers — notably
 /// the `mxgw` test CLI — share the same correlation-id discipline as the
-/// library. The returned id is `rust-client-{label}-{N}` where `N` comes
-/// from a process-wide atomic sequence.
+/// library. Also re-exported at the crate root as
+/// [`mxgateway_client::next_correlation_id`](crate::next_correlation_id).
+///
+/// The returned id has the following guaranteed properties:
+///
+/// - it embeds the supplied `label` verbatim so log readers can pick out
+///   which call site emitted it;
+/// - it is unique within the lifetime of a single process (driven by an
+///   internal monotonically-increasing atomic sequence);
+/// - it carries no embedded secret or user-supplied payload beyond `label`.
+///
+/// The exact textual format (currently `rust-client-{label}-{N}`) is *not*
+/// part of the public contract and may change between releases — callers
+/// must not parse it. Treat the returned `String` as an opaque token.
 #[must_use]
 pub fn next_correlation_id(label: &str) -> String {
    let sequence = CORRELATION_SEQUENCE.fetch_add(1, Ordering::Relaxed);