fix(client/rust): handle provider_status arm (build break); real system-roots TLS; design doc (Client.Rust-030..032)
This commit is contained in:
@@ -4,8 +4,8 @@
|
||||
|---|---|
|
||||
| Module | `clients/rust` |
|
||||
| Reviewer | Claude Code |
|
||||
| Review date | 2026-05-24 |
|
||||
| Commit reviewed | `42b0037` |
|
||||
| Review date | 2026-06-15 |
|
||||
| Commit reviewed | `410acc9` |
|
||||
| Status | Re-reviewed |
|
||||
| Open findings | 0 |
|
||||
|
||||
@@ -96,6 +96,25 @@ under review does not address them.
|
||||
| 9 | Testing coverage | Issue found: zero tests cover `stream_alarms` on `GatewayClient`, the new bulk read/write SDK methods, or the `BenchReadBulk` flow; the fake gateway's `stream_alarms` impl drops the sender immediately (Client.Rust-024). |
|
||||
| 10 | Documentation & comments | Issue found: `.cargo/config.toml`'s comment promises "Release builds are unaffected" but the `link-arg=/STACK:8388608` setting is unconditional under `cfg(windows)` and only applies to the MSVC linker (Client.Rust-027). |
|
||||
|
||||
### 2026-06-15 re-review (commit 410acc9)
|
||||
|
||||
Re-review pass at `410acc9`. The diff against `42b0037` (`git diff 42b0037..HEAD -- clients/rust/`) covers: Cargo metadata + Gitea alternative-registry config (`Cargo.toml`, `.cargo/config.toml`, README install section); a `[registries.dohertj2-gitea]` index entry and `publish = ["dohertj2-gitea"]` with `mxgw-cli` set `publish = false`; the resolution work for Client.Rust-022..029 (malformed-reply `Result` plumbing, `next_correlation_id` re-export, clippy fixes, `read_bulk<S: AsRef<str>>`); a **new** Galaxy lazy-browse walker (`browse`, `browse_children_raw`, `browse_children_inner`, `BrowseChildrenOptions`, `LazyBrowseNode`) with six unit tests; a **new** TLS pin-only guard (`build_tls_config` + `ClientOptions::with_require_certificate_validation` + `--require-certificate-validation` CLI flag) with a new `tests/tls.rs`; and the alarm-provider-fallback proto surface (`AlarmFeedMessage.provider_status`, added contracts-side in `1d85db7`).
|
||||
|
||||
`cargo fmt --check` is clean. `cargo check -p zb-mom-ww-mxgateway-client`, `cargo test -p zb-mom-ww-mxgateway-client` (24 lib + integration, 4 proto-fixture, 4 tls — all pass), and the library half of the workspace are clean. **`cargo clippy --workspace --all-targets -- -D warnings` and `cargo check --workspace` both FAIL at HEAD** — not on a lint but on a hard `E0004` compile error: the `mxgw-cli` binary's two `match &message.payload` blocks (`crates/mxgw-cli/src/main.rs:1731,1757`) are non-exhaustive after the proto added `AlarmFeedMessage.payload::ProviderStatus` (Client.Rust-030). The library crate compiles and all its tests pass; the break is confined to the CLI binary. No committed registry tokens — `.cargo/config.toml` carries only the sparse index URL; the README documents the token living in `~/.cargo/credentials.toml`.
|
||||
|
||||
| # | Category | Result |
|
||||
|---|---|---|
|
||||
| 1 | Correctness & logic bugs | Issue found: `mxgw-cli` fails to compile at HEAD — non-exhaustive `AlarmFeedMessage.payload` match missing the new `ProviderStatus` arm (Client.Rust-030). The library `read_bulk`/galaxy-walker/TLS-guard logic is correct and tested. |
|
||||
| 2 | mxaccessgw conventions | Issue found: `cargo clippy --workspace --all-targets -- -D warnings` / `cargo check --workspace` do not pass — CLAUDE.md mandates they do (Client.Rust-030). The prior 029 clippy regressions are resolved; this is a new build break from the alarm-provider proto change. |
|
||||
| 3 | Concurrency & thread safety | No issues found — `LazyBrowseNode` shares state via `Arc<…AsyncMutex<…>>`; `expand()` holds the mutex across the `browse_children_inner` await so concurrent expanders serialize and the idempotency check is race-free. `CORRELATION_SEQUENCE` is still `AtomicU64`/`Relaxed`. No `unsafe`. |
|
||||
| 4 | Error handling & resilience | Issue found: the strict TLS path (`require_certificate_validation(true)` with no CA) builds a `ClientTlsConfig` with zero trust roots (no `tls-native-roots`/`tls-webpki-roots` feature, no `.with_*_roots()` call), so it cannot validate any certificate — contradicting the documented "verify against the system trust roots" behaviour (Client.Rust-031). The galaxy page-token loop has a correct repeated-token guard. |
|
||||
| 5 | Security | No issues found in the registry/secret surface — `.cargo/config.toml` holds only the sparse index URL, no token; README puts the Bearer token in `~/.cargo/credentials.toml` (uncommitted). (See Client.Rust-031 for the strict-TLS validation gap, classified under error handling.) |
|
||||
| 6 | Performance & resource management | No issues found — `read_bulk` is now borrow-based (`&[S]`), the bench loop reuses `tags_ref` (Client.Rust-026 resolved). The walker clones the `GalaxyClient` channel handle per node, which is the intended cheap `Channel` clone. |
|
||||
| 7 | Design-document adherence | Issue found: `RustClientDesign.md` is not updated for the new Galaxy lazy-browse SDK surface (`browse` / `browse_children_raw` / `LazyBrowseNode` / `BrowseChildrenOptions`); CLAUDE.md requires docs to change with the source (Client.Rust-032). The TLS pin-only section pre-dates this diff but repeats the inaccurate "system trust roots" claim (cross-referenced from Client.Rust-031). |
|
||||
| 8 | Code organization & conventions | No issues found — Cargo metadata (name/version/license/repository/keywords/categories) is well-formed; `publish = ["dohertj2-gitea"]` on the library and `publish = false` on `mxgw-cli` is the right split. `license = "Proprietary"` is non-SPDX but cargo accepts it and it is a deliberate closed-source marker. |
|
||||
| 9 | Testing coverage | No issues found in the new surface — the walker has six unit tests (roots, expand, idempotency, NotFound, multi-page, filter-forwarding) and TLS has four. Gap noted: `tls_with_require_certificate_validation_does_not_short_circuit` connects to a dead address, so it only asserts the guard does not fire and never exercises a real handshake — which is why the no-trust-roots defect in Client.Rust-031 is not caught by a test. |
|
||||
| 10 | Documentation & comments | Issue found: the `alarm_feed_message_summary` / `alarm_feed_message_to_json` doc comments still say "three `payload` oneof cases" (`main.rs:1729,1755`) although the proto now has four; folded into Client.Rust-030's fix. The TLS doc inaccuracy is Client.Rust-031. |
|
||||
|
||||
## Findings
|
||||
|
||||
### Client.Rust-001
|
||||
@@ -687,3 +706,59 @@ The third error (`BulkReplyKind` enum-variant-names) is also touched by the diff
|
||||
**Recommendation:** Re-apply Client.Rust-001 (add doc comments on `with_max_grpc_message_bytes` / `max_grpc_message_bytes` in `options.rs`), Client.Rust-002 (drop the `Bulk` suffix from `BulkReplyKind`'s variants so they become `AddItem` / `AdviseItem` / …, or add a narrowly-scoped `#[allow(clippy::enum_variant_names)]` with a reason comment), and Client.Rust-012 (replace `last_deploy.lock().unwrap().clone()` with `*last_deploy.lock().unwrap()` in `galaxy.rs:282`). Verify with `cargo clippy --workspace --all-targets -- -D warnings`. Consider adding a pre-commit / CI gate so the next reviewer never has to discover the regression by running clippy.
|
||||
|
||||
**Resolution:** 2026-05-24 — Re-applied all three resolutions. `clients/rust/src/options.rs` now has `///` doc comments on `with_max_grpc_message_bytes` and `max_grpc_message_bytes`. `clients/rust/src/galaxy.rs:282` uses `*self.state.last_deploy.lock().unwrap()` instead of `.clone()`. `clients/rust/src/session.rs`'s `BulkReplyKind` variants are renamed to `AddItem` / `AdviseItem` / `RemoveItem` / `UnAdviseItem` / `Subscribe` / `Unsubscribe` (no shared `Bulk` suffix), with the call sites in `add_item_bulk` / `advise_item_bulk` / `remove_item_bulk` / `un_advise_item_bulk` / `subscribe_bulk` / `unsubscribe_bulk` updated accordingly. The sibling `BulkWriteReplyKind` already had non-suffix-sharing variants (`Write` / `Write2` / `WriteSecured` / `WriteSecured2`) and required no rename. `cargo clippy --workspace --all-targets -- -D warnings` is clean at HEAD.
|
||||
|
||||
### Client.Rust-030
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | High |
|
||||
| Category | Correctness & logic bugs |
|
||||
| Location | `clients/rust/crates/mxgw-cli/src/main.rs:1731,1757` (origin: `src/ZB.MOM.WW.MxGateway.Contracts/Protos/mxaccess_gateway.proto:909-924`, added in commit `1d85db7`) |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** The `mxgw-cli` binary does not compile at HEAD `410acc9`. `cargo check --workspace`, `cargo clippy --workspace --all-targets -- -D warnings`, `cargo build --workspace`, and `cargo test --workspace` all fail with a hard `E0004` (non-exhaustive patterns), so the entire documented Rust build/test/clippy workflow that CLAUDE.md mandates is broken:
|
||||
|
||||
```
|
||||
error[E0004]: non-exhaustive patterns: `&Some(...alarm_feed_message::Payload::ProviderStatus(_))` not covered
|
||||
--> crates/mxgw-cli/src/main.rs:1731:11
|
||||
error[E0004]: non-exhaustive patterns: `&Some(...alarm_feed_message::Payload::ProviderStatus(_))` not covered
|
||||
--> crates/mxgw-cli/src/main.rs:1757:11
|
||||
```
|
||||
|
||||
The alarm-provider-fallback contract change (`1d85db7`, within the reviewed range) added a fourth `AlarmFeedMessage.payload` oneof arm — `AlarmProviderStatus provider_status = 4`. tonic-build regenerates the Rust enum with the new `ProviderStatus` variant, but `alarm_feed_message_summary` (`main.rs:1731`) and `alarm_feed_message_to_json` (`main.rs:1757`) each `match &message.payload` exhaustively over only `ActiveAlarm` / `SnapshotComplete` / `Transition` / `None` with no wildcard arm. Because they are exhaustive matches on a now-larger enum, the binary fails to compile rather than silently mishandling the new variant. The library crate (`zb-mom-ww-mxgateway-client`) itself compiles cleanly and all 32 of its tests pass; the break is confined to the CLI — but the CLI is the cross-language e2e matrix driver, so the whole `clients/rust` workspace is unbuildable and no Rust e2e smoke can run against the gateway at this commit. This is the alarm-surface gap the review request asked to check: the `ProviderStatus` payload is unhandled in the only place the Rust client renders the alarm feed.
|
||||
|
||||
**Recommendation:** Add a `Some(alarm_feed_message::Payload::ProviderStatus(status))` arm to both `alarm_feed_message_summary` and `alarm_feed_message_to_json` (render the provider-status fields — mode, degraded/provenance, reference — consistent with how the .NET/Go/Java/Python CLIs serialise it so the cross-language parity matcher recognises the payload). While there, update the two doc comments that still say "three `payload` oneof cases" (`main.rs:1729,1755`) to four. Verify with `cargo clippy --workspace --all-targets -- -D warnings` and `cargo test --workspace`. Consider a CI gate so a contract change that adds a oneof arm cannot leave the Rust CLI unbuildable again.
|
||||
|
||||
**Resolution:** 2026-06-15 — Root cause confirmed: the contract's new fourth `AlarmFeedMessage.payload` oneof arm (`AlarmProviderStatus provider_status`, proto fields `mode`/`degraded`/`reason`/`since`) left both `match &message.payload` blocks non-exhaustive (`E0004`). Added a `Some(alarm_feed_message::Payload::ProviderStatus(status))` arm to both `alarm_feed_message_summary` (`mode`/`degraded`/`reason` one-liner) and `alarm_feed_message_to_json` (a `providerStatus` object with `mode`/`degraded`/`reason`/`since`), added an `AlarmEnumName::provider_mode` enum-name helper consistent with the existing `condition_state`/`transition_kind` renderers, and updated the summary doc comment to "four payload oneof cases". No `_ => {}` wildcard. Test: `alarm_feed_provider_status_renders_in_summary_and_json` (in `crates/mxgw-cli/src/main.rs`). All four cargo commands now pass.
|
||||
|
||||
### Client.Rust-031
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Medium |
|
||||
| Category | Error handling & resilience |
|
||||
| Location | `clients/rust/src/options.rs:196-240` (`build_tls_config`); `clients/rust/Cargo.toml:40` (tonic features); docs: `clients/rust/src/options.rs:76-101`, `clients/rust/README.md` (TLS trust section), `clients/rust/crates/mxgw-cli/src/main.rs:429-431`, `clients/rust/RustClientDesign.md:202` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** The new strict-verification escape hatch does not do what it documents. `build_tls_config` only configures trust roots when a CA file is pinned: with `require_certificate_validation(true)` and no `ca_file`, it returns a bare `ClientTlsConfig::new()` and never calls `.with_native_roots()`, `.with_webpki_roots()`, or `.with_enabled_roots()`. The crate also enables only `tonic` feature `tls-ring` (`Cargo.toml:40`) — neither `tls-native-roots` nor `tls-webpki-roots` is on, so even if the code wanted to enable system roots the methods are feature-gated out. A `ClientTlsConfig` with zero trust anchors rejects every server certificate during the rustls handshake, so the strict path cannot connect to any TLS gateway — not even one whose certificate is genuinely chained to a system root. Yet `with_require_certificate_validation`'s doc comment (`options.rs:80-89`), the README "TLS trust (pin-only)" section, the `--require-certificate-validation` CLI flag help (`main.rs:429-431`), and `RustClientDesign.md:202` all tell the user this option will "verify against the system trust roots." The documented behaviour is unreachable; the only working TLS path is CA pinning (`with_ca_file`).
|
||||
|
||||
This is masked by the tests: `tls_with_require_certificate_validation_does_not_short_circuit` (`tests/tls.rs`) dials a dead address (`https://127.0.0.1:1`) and only asserts the no-CA guard error does *not* fire — it never reaches a handshake, so the absent-roots defect is invisible to the suite.
|
||||
|
||||
**Recommendation:** Either (a) make the strict path actually load system roots — add the `tls-native-roots` (and/or `tls-webpki-roots`) feature to the `tonic` dependency and call `tls = tls.with_native_roots()` (or `.with_enabled_roots()`) in the `require_certificate_validation == true && ca_file.is_none()` branch of `build_tls_config` — and add a test that pins a self-signed cert as a CA and asserts a system-root-only connection to that same server is *rejected* (proving roots are actually consulted); or (b) if loading system roots is intentionally out of scope for v1, correct every doc site (the `with_require_certificate_validation` doc comment, README, CLI flag help, and `RustClientDesign.md`) to state that the strict flag does not currently enable any trust roots and that CA pinning is the only supported TLS path. Option (a) is the better fix because the flag otherwise has no working effect.
|
||||
|
||||
**Resolution:** 2026-06-15 — Took option (a). Root cause confirmed: strict-on/no-CA returned a bare `ClientTlsConfig::new()` with zero trust anchors and the crate only enabled tonic `tls-ring`, so the documented "verify against the system trust roots" path could never validate any certificate. Added `tls-native-roots` to the `tonic` features in `Cargo.toml` and refactored `build_tls_config` to compute the trust posture via a new pure `tls_trust_decision` helper returning `TlsTrustDecision::{None,PinnedCa,SystemRoots,RejectNoCa}`; the `SystemRoots` branch now calls `ClientTlsConfig::with_native_roots()` so a cert chaining to an OS-trusted root validates. Corrected every doc site to state the strict flag verifies against OS roots (not a bare self-signed cert, which still needs `with_ca_file`): the `with_require_certificate_validation` doc comment and `build_tls_config` docs (`options.rs`), README "TLS trust" section, and `RustClientDesign.md` "Trust posture"; the CLI flag help was already accurate. TDD: added failing-first unit tests then the fix — `strict_without_ca_uses_system_roots`, `lenient_without_ca_is_rejected`, `pinned_ca_uses_pinned_trust`, `plaintext_needs_no_tls` (in `src/options.rs`). All four cargo commands pass.
|
||||
|
||||
### Client.Rust-032
|
||||
|
||||
| Field | Value |
|
||||
|---|---|
|
||||
| Severity | Low |
|
||||
| Category | Design-document adherence |
|
||||
| Location | `clients/rust/RustClientDesign.md`; surface in `clients/rust/src/galaxy.rs:281-379` |
|
||||
| Status | Resolved |
|
||||
|
||||
**Description:** The diff under review adds substantial new public Galaxy SDK surface — `GalaxyClient::browse`, `GalaxyClient::browse_children_raw`, the `BrowseChildrenOptions` filter struct, and the `LazyBrowseNode` lazy walker (`object`, `has_children_hint`, `children`, `is_expanded`, `expand`) — none of which is described in `RustClientDesign.md`. The README was updated with a "Browsing lazily" / "High-level walker" section, but CLAUDE.md requires the design docs to change in the same change as the public API. A reader consulting the detailed design to understand the Galaxy client surface will not learn that lazy browsing, sibling pagination, the `child_has_children` hint, or the idempotent `expand` contract exist.
|
||||
|
||||
**Recommendation:** Add a "Lazy browse" subsection to the Galaxy section of `RustClientDesign.md` enumerating `browse`, `browse_children_raw`, `BrowseChildrenOptions` (its filter fields and AND semantics), and `LazyBrowseNode` (the `Arc`-shared clone semantics, the idempotent single-RPC `expand`, the `has_children_hint`, and the internal paged `BrowseChildren` loop with its repeated-page-token guard). Cross-reference `docs/GalaxyRepository.md#browsechildren` for the wire-level request/filter semantics the README already links.
|
||||
|
||||
**Resolution:** 2026-06-15 — Confirmed by inspection that `RustClientDesign.md` had no Galaxy library-API coverage at all. Added a new "Galaxy Repository" section documenting `browse`, `browse_children_raw`, the `BrowseChildrenOptions` filter struct (all six fields, AND combination semantics, `include_attributes` tri-state), and `LazyBrowseNode` (`Arc`-shared clone semantics, `has_children_hint`, the idempotent single-RPC `expand` under an async mutex with page size 500, and the repeated-page-token `Error::InvalidArgument` guard), cross-referencing `docs/GalaxyRepository.md#browsechildren`. Also noted the fourth alarm `provider_status` oneof case in the Alarms section while resolving Client.Rust-030. Doc-only change verified by inspection; design-doc anchor target confirmed present.
|
||||
|
||||
Reference in New Issue
Block a user