Compare commits

...

11 Commits

Author SHA1 Message Date
Joseph Doherty 6079c62709 code-reviews: regenerate index at 42b0037
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:56 -04:00
Joseph Doherty 37ef27e8ed code-reviews: bump Worker + Worker.Tests headers to 42b0037
No source changes since d692232 — header bumped to track the latest
reviewed commit, all prior findings remain closed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:56 -04:00
Joseph Doherty db2218f395 code-reviews: re-review Client.Java at 42b0037
Append 5 new findings (Client.Java-032..036): README flags for new
alarm subcommands do not exist; StreamAlarmsCommand bounded queue
silently drops alarms instead of fail-fast; BatchCommand whitespace
tokenisation shreds quoted args; no library-side stream_alarms test;
MxGatewayAlarmFeedSubscription is the fourth duplicate subscription
class.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:55 -04:00
Joseph Doherty bc28fee641 code-reviews: re-review Client.Python at 42b0037
Append 5 new findings (Client.Python-022..026): README flags for new
alarm subcommands do not exist; Client.Python-013 regression — the
silent localhost auto-plaintext branch is still present (the prior
Resolution did not survive the rename); production batch path uses
the click.testing.CliRunner helper; no behavioural tests for new SDK
+ CLI; bench cleanup swallows exceptions.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:55 -04:00
Joseph Doherty 15fceed536 code-reviews: re-review Client.Rust at 42b0037
Append 8 new findings (Client.Rust-022..029): MalformedReply path
absent from the new bulk SDK methods, hard-coded client_correlation_id
in new CLI commands, no tests for stream_alarms / bulk SDK / bench,
RustClientDesign.md silent on new surface, run_bench_read_bulk clones
tags inside the loop, .cargo/config.toml comment is wrong, run_batch
exits on blank line, and cargo clippy fails at HEAD with three
warnings the prior reviewer punted.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:41 -04:00
Joseph Doherty afa82e0989 code-reviews: re-review Client.Go at 42b0037
Append 6 new findings (Client.Go-022..027): regressions of Client.Go-015
(runWriteBulkVariant secured flag) and Client.Go-018 (bench loop
ignores ctx.Err()) reintroduced by the bulk port; no SDK tests for the
new WriteBulk/ReadBulk/StreamAlarms methods; bulk SDK accepts empty
slices; bufio.Scanner buffer + blank-line sentinel can abort the batch
session.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:41 -04:00
Joseph Doherty b9ef09d26e code-reviews: re-review Client.Dotnet at 42b0037
Append 4 new findings (Client.Dotnet-018..021): README flags for the
new stream-alarms/acknowledge-alarm subcommands cite options that do
not exist on the CLI; BenchReadBulkAsync reinstates the silent
register-handle fallback and swallows OperationCanceledException;
both new --timeout-ms consumers cast int32 to uint without bounds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:40 -04:00
Joseph Doherty 7d66967122 code-reviews: re-review IntegrationTests at 42b0037
Append 1 new finding (IntegrationTests-025): the
ResolveRepositoryRoot_NoMarkers test walks up from Path.GetTempPath()
through unisolated ancestors; a redirected TMP or co-located checkout
at C:\src silently breaks the assertion.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:26 -04:00
Joseph Doherty 2f8404d2ef code-reviews: re-review Contracts at 42b0037
No new findings — the only Contracts commit in window (bd1d1f1) is a
comment-only proto edit; field numbering remains additive with no
reuse, renumbering, or type narrowing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:25 -04:00
Joseph Doherty 2b92be02b9 code-reviews: re-review Tests at 42b0037
Append 5 new findings (Tests-027..031) covering the
StreamEvents_WhenEventIsWritten_RecordsSendDuration flake root cause
(shared MeterListener by meter name), missing kill-path coverage
(reason propagation + concurrent-kill double-count), asymmetric guard
coverage between Close and Kill, missing audit-failure-path coverage
for ApiKey Delete, and the DashboardSnapshotPublisher reconnect-window
timer sensitivity.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:15 -04:00
Joseph Doherty 056f0d8808 code-reviews: re-review Server at 42b0037
Append 7 new findings (Server-044..050) covering the destructive-action
wave: KillWorkerAsync metric/state leaks, ShutdownAsync kill-fallback
gauge leak, inconsistent ConfirmDialog cleanup across pages, missing
XML docs on the new DashboardSessionAdmin surface, and unhandled
RemoveSessionAsync exception paths.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-24 08:28:15 -04:00
12 changed files with 1401 additions and 42 deletions
+166 -3
View File
@@ -5,9 +5,9 @@
| Module | `clients/dotnet` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 4 |
## Checklist coverage
@@ -336,3 +336,166 @@ The secondary `Grpc.Core.RpcException` catch on line 975 is also dead in this co
**Recommendation:** Wrap the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { /* graceful */ }`. The `when` clause ensures only the supplied cancellation token (which covers `--timeout`, Ctrl+C, and parent-CTS cancellation — all three of which are graceful completion modes for a finite-window collector) gets absorbed; a server-side cancellation propagated through a different token still surfaces. Keep the existing aggregate-JSON emission below the catch so the events that arrived before the budget closed are still emitted. Add a regression test that drives the CLI with `--timeout 1s` against a fake that yields a couple of events then parks on the cancellation token; assert exit 0, no stderr, and the JSON output contains both yielded events.
**Resolution:** 2026-05-20 — Wrapped the `await foreach` in `try { ... } catch (OperationCanceledException) when (cancellationToken.IsCancellationRequested) { }` so the CLI exits 0 and emits the aggregate `{ "events": [...] }` JSON when the supplied token cancels (the `--timeout`, Ctrl+C, and parent-CTS paths all flow through that same token). The catch's `when` clause ensures non-token-driven cancellation still propagates. Added regression test `MxGatewayClientCliTests.RunAsync_StreamEvents_WhenTimeoutFiresAfterEvents_EmitsCollectedEventsAndExitsZero` that yields two events, parks on the cancellation token via a new `FakeCliClient.StreamHangAfterEvents` hook, runs the CLI with `--timeout 1s --json --max-events 200`, and asserts exit code 0, empty stderr, and both events present in the emitted aggregate JSON. Brings .NET stream-events behavior into parity with the Go, Rust, Python, and Java CLIs which all exit 0 on equivalent timeouts.
#### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037`. Diff against `d692232` consists of four commits:
- `71d2c39` — ports the `batch` subcommand to all five client CLIs (the .NET
half adds `RunBatchAsync`, the `__MXGW_BATCH_EOR__` sentinel, the
`forceJsonErrors` plumbing on `RunCoreAsync`, and the two
`RunAsync_Batch_*` tests).
- `b3ae200` — ports the bulk read/write SDK methods + matching CLI
subcommands (the SDK side adds `WriteBulkAsync` / `Write2BulkAsync` /
`WriteSecuredBulkAsync` / `WriteSecured2BulkAsync` / `ReadBulkAsync` on
`MxGatewaySession`; the CLI side adds `read-bulk`, `write-bulk`,
`write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, and the
cross-language stress harness `bench-read-bulk`).
- `11cc671` — ports `stream-alarms` and `acknowledge-alarm` to the CLI and
the SDK transport (`StreamAlarmsAsync` on `MxGatewayClient`,
`IMxGatewayClientTransport`, `GrpcMxGatewayClientTransport`,
`IMxGatewayCliClient`, `MxGatewayCliClientAdapter`, and
`FakeGatewayTransport`; two new `MxGatewayClientCliTests` covering
acknowledge + stream-alarms with `--max-events` and payload-case
distinction). The same commit wraps `StreamEventsAsync`'s `await foreach`
with `catch (OperationCanceledException) when
(cancellationToken.IsCancellationRequested)` (matching the Client.Dotnet-017
resolution shape) and mirrors the catch on the new `StreamAlarmsAsync`
CLI handler. `RunBatchAsync`'s outer catch is `catch (Exception
exception)` with no `when` guard — OCE thrown out of `RunCoreAsync` is
caught here and written as a JSON error so the batch process keeps
going, while the inner `RunCoreAsync` catch retains its `when
(exception is not OperationCanceledException)` guard so the non-batch
CLI behaviour is unchanged.
- `8738735` — updates each client's README with `stream-alarms` /
`acknowledge-alarm` usage and SDK examples.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found (this review): `BenchReadBulkAsync` repeats the silent `?? reply.ReturnValue.Int32Value` register-handle fallback at a brand-new call site (Client.Dotnet-019). Two new `(uint)` casts on user-supplied `--timeout-ms` values silently wrap negatives (Client.Dotnet-021). |
| 2 | mxaccessgw conventions | Issue found (this review): README CLI examples for `stream-alarms` and `acknowledge-alarm` use flag names that don't exist in the CLI, violating the docs-in-the-same-change rule (Client.Dotnet-018). |
| 3 | Concurrency & thread safety | No issues found in this diff — the new `RunBatchAsync` is single-threaded over its stdin loop; `StreamAlarmsAsync`'s gRPC transport reader follows the same shape as the existing `StreamEventsAsync` / `QueryActiveAlarmsAsync`. |
| 4 | Error handling & resilience | Issue found (this review): `BenchReadBulkAsync`'s steady-state loop has a bare `catch { ... continue; }` that swallows `OperationCanceledException`, so user-driven cancellation does not exit the bench until the wall-clock budget elapses (Client.Dotnet-020). The new `StreamAlarmsAsync` CLI handler mirrors the Client.Dotnet-017 OCE-graceful pattern correctly. |
| 5 | Security | No issues found in this diff — the bulk write/write-secured SDK methods inherit `EnsureProtocolSuccess`/`EnsureMxAccessSuccess` shapes; per-entry secured user ids stay on the request, and the redactor still covers CLI error paths. |
| 6 | Performance & resource management | No issues found in this diff (the bench tight-loop CPU burn under cancellation is captured under Client.Dotnet-020 against the error-handling category). |
| 7 | Design-document adherence | No issues found in this diff — the new `StreamAlarmsAsync` surface matches the `MxGateway:Alarms` design ("server-streaming feed of alarm-state-change messages keyed by the same monitor"), and `AcknowledgeAlarmAsync` remains session-less. |
| 8 | Code organization & conventions | No issues found in this diff — new CLI subcommands route through the existing dispatch table; `RunBatchAsync` follows the established private-static pattern. |
| 9 | Testing coverage | No new issues — `RunAsync_StreamAlarms_*`, `RunAsync_AcknowledgeAlarm_*`, and `RunAsync_Batch_*` give the new surface unit coverage. `bench-read-bulk` is the same stress-harness-not-SDK shape called out in the prior re-review and is not flagged here. |
| 10 | Documentation & comments | Issue found (this review): the README examples for the two new alarm CLI subcommands cite wrong flag names and a non-existent `--session-id` (Client.Dotnet-018). The new XML docs on `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` and on the bulk SDK methods are accurate and complete. |
### Client.Dotnet-018
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Documentation & comments |
| Location | `clients/dotnet/README.md:137-138` |
| Status | Open |
**Description:** The README example block for the two new alarm CLI subcommands shipped in commit `11cc671` shows:
```
mxgw-dotnet stream-alarms --session-id <id> --max-messages 1 --json
mxgw-dotnet acknowledge-alarm --session-id <id> --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json
```
None of these flags exist. The actual CLI consumes `--filter-prefix`, `--max-events`, `[--jsonl]` for `stream-alarms` and `--reference`, `--comment`, `--operator`, `--json` for `acknowledge-alarm` (see `WriteUsage` lines 891-892 and the `StreamAlarmsAsync` / `AcknowledgeAlarmAsync` handlers in `MxGatewayClientCli.cs`). Neither command takes a `--session-id` — both go through the gateway's session-less central alarm monitor by design (the same property called out in the new XML doc on `MxGatewayClient.StreamAlarmsAsync` lines 226-237). A user who copies the README example will get a `--session-id` "unknown option" failure on both commands and an "Argument 'reference' is required" failure on `acknowledge-alarm` (since `--alarm-reference` is silently treated as `--alarm-reference` rather than `--reference`).
This is exactly the docs-in-the-same-change rule from `CLAUDE.md` ("Update docs in the same change as the source") — the README change in `8738735` documents the wrong surface for the source change in `11cc671`. The built-in `WriteUsage` text is correct; only the README walkthrough is wrong.
**Recommendation:** Replace the two example lines with the actual flag names:
```
mxgw-dotnet stream-alarms --filter-prefix Area001 --max-events 1 --json
mxgw-dotnet acknowledge-alarm --reference "\\Galaxy\Area001.Pump001.PumpFault" --comment "ack from cli" --operator operator1 --json
```
A quick sanity check would be to drive each example through the test harness's `MxGatewayClientCli.RunAsync` shape and confirm exit 0 — copy/paste safety on the documented examples is the only realistic safeguard.
### Client.Dotnet-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745` |
| Status | Open |
**Description:** Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`, where a successful protocol+MX-status reply missing its typed `register` oneof case falls through to `ReturnValue.Int32Value` and silently yields `0` for the handle. The new `BenchReadBulkAsync` handler introduced in commit `b3ae200` reinstates exactly that pattern at line 745:
```csharp
int serverHandle = registerReply.Register?.ServerHandle ?? registerReply.ReturnValue.Int32Value;
```
The bench then drives the rest of the run — `SubscribeBulk`, warmup `ReadBulk`, steady-state `ReadBulk`, `UnsubscribeBulk` — against a zero `ServerHandle`. The worker will reject every command, the bench will report zero successful calls + a wall of failures + a misleading `cachedReadResults = 0` summary, and there is no obvious diagnostic for "the gateway forgot to set the typed register payload"; it looks like a server-side outage. Note that Client.Dotnet-010's recorded resolution claimed a CLI-level `RequireRegisterServerHandle` / `RequireAddItemItemHandle` helper was introduced, but neither helper currently exists in the codebase, and the SDK-level `MxGatewaySession.RegisterAsync` / `AddItemAsync` / `AddItem2Async` at lines 82, 124, 175 still carry the same fallback — so the underlying problem behind Client.Dotnet-005 / 010 is unresolved at HEAD and this is a fresh net-new instance.
**Recommendation:** Replace the fallback with an explicit null-check on `registerReply.Register` that throws `MxGatewayException` with the missing-payload context (kind = `Register`, session id, correlation id) — the same shape Client.Dotnet-005 prescribes. If the upstream SDK helpers in `MxGatewaySession` are restored to throw, route the bench through `MxGatewaySession.RegisterAsync` instead so the CLI inherits the SDK's protection.
### Client.Dotnet-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780` |
| Status | Open |
**Description:** `BenchReadBulkAsync`'s steady-state `while (DateTime.UtcNow < steadyDeadline)` loop wraps each `client.InvokeAsync(...)` in a bare `catch`:
```csharp
try
{
reply = await client.InvokeAsync(
CreateCommandRequest(sessionId, readBulkMxCommand),
cancellationToken)
.ConfigureAwait(false);
sw.Stop();
}
catch
{
sw.Stop();
failedCalls++;
latencyMillis.Add(sw.Elapsed.TotalMilliseconds);
continue;
}
```
The `catch` has no type filter and no `when (!cancellationToken.IsCancellationRequested)` clause, so `OperationCanceledException` flowing from a cancelled `cancellationToken` (Ctrl+C, parent CTS, or the wall-clock budget) is swallowed identically to a real RPC failure. The loop continues to spin until `DateTime.UtcNow` reaches `steadyDeadline`, with each iteration immediately throwing OCE from the next `InvokeAsync` call — a tight CPU-burn until the configured `--duration-seconds` elapses, producing a per-iteration `latencyMillis.Add(sw.Elapsed.TotalMilliseconds)` flood that also skews the final p99/max numbers down.
The warmup loop above (lines 774-780) has no catch at all, so a warmup-time OCE escapes through the `finally` block — that path is correct. The steady-state loop should follow the same shape: either rethrow OCE explicitly, or break out of the loop on `cancellationToken.IsCancellationRequested`. Note that Client.Dotnet-015's resolution removed the outer `CancelAfter(timeout)` for `bench-read-bulk` / `bench-stream-events`, so the most common cancellation path today is interactive Ctrl+C — which today produces a 30-second hang plus skewed stats instead of a prompt exit.
**Recommendation:** Replace the bare `catch` with `catch (Exception) when (!cancellationToken.IsCancellationRequested)`, or split into `catch (OperationCanceledException) { throw; } catch (Exception) { failedCalls++; ... continue; }`. The first form is the smallest diff and matches the pattern used elsewhere in the CLI. Add a regression test that runs `bench-read-bulk` with a `--duration-seconds 10` budget against a fake that throws on every `InvokeAsync`, cancels the supplied token after 100 ms, and asserts the run exits in well under 10 s. The wider precedent — Client.Dotnet-016's `BenchStreamEventsAsync` cancellation hardening — should already cover the shape of the test fixture.
### Client.Dotnet-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715` |
| Status | Open |
**Description:** Both new bulk-read CLI handlers cast a signed `--timeout-ms` argument to `uint` without bounds checking:
```csharp
// ReadBulkAsync (line 487)
TimeoutMs = (uint)arguments.GetInt32("timeout-ms", 0),
// BenchReadBulkAsync (line 715)
uint timeoutMs = (uint)arguments.GetInt32("timeout-ms", 1500);
```
`arguments.GetInt32(...)` returns `Int32.Parse(...)`, so a caller supplying `--timeout-ms=-1` (an easy mistake for "no timeout" or copy-paste from another tool that uses `-1` for unbounded) silently wraps to `0xFFFFFFFF` = ~49.7 days as the worker-side `ReadBulk` timeout. The gateway forwards that as the per-tag `timeout_ms` on `ReadBulkCommand`, and a misconfigured invocation parks one worker thread per pending tag for hours before MXAccess gives up — exactly the kind of slow-failure mode the protocol's `uint32 timeout_ms` field was meant to prevent. The `ReadBulkAsync` SDK overload on `MxGatewaySession` (line 663) already guards correctly: `timeout <= TimeSpan.Zero ? 0u : (uint)Math.Min(timeout.TotalMilliseconds, uint.MaxValue)`; the CLI handlers should match.
**Recommendation:** Wrap the cast in a guard, e.g.:
```csharp
int timeoutMsRaw = arguments.GetInt32("timeout-ms", 0);
if (timeoutMsRaw < 0)
{
throw new ArgumentException("--timeout-ms must be a non-negative integer (use 0 for the gateway default).");
}
uint timeoutMs = (uint)timeoutMsRaw;
```
A single shared helper (e.g. `ParseTimeoutMs(CliArguments, string, int)`) on `MxGatewayClientCli` would cover both call sites and remove the duplication.
+196 -3
View File
@@ -5,9 +5,9 @@
| Module | `clients/go` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 6 |
## Checklist coverage
@@ -51,6 +51,38 @@ and subpackage `mxgateway` intentionally unchanged per Go convention.
| 9 | Testing coverage | No issues found in this diff — `alarms_test.go` correctly drops retired `session_id` from `AcknowledgeAlarmRequest` and retains it on `QueryActiveAlarmsRequest`. |
| 10 | Documentation & comments | No issues found in this diff. |
### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037`. Four commits touched `clients/go/`:
`71d2c39` ported the long-lived `batch` subcommand; `8aaab82` added the bulk
SDK methods (`WriteBulk`/`Write2Bulk`/`WriteSecuredBulk`/`WriteSecured2Bulk`/`ReadBulk`)
and the six bulk/bench CLI subcommands (`read-bulk`, `write-bulk`, `write2-bulk`,
`write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`); `6f0d142`
ported `stream-alarms` and `acknowledge-alarm` (CLI + SDK type aliases +
`Client.StreamAlarms`); `8738735` updated the README to document the new SDK
methods. `gofmt -l .`, `go vet ./...`, `go build ./...`, and
`go test ./... -count=1` are clean at HEAD.
This re-review surfaces two regressions of previously-Resolved findings and
four genuinely-new issues introduced by the new bulk / batch / alarm surface.
Client.Go-022 and Client.Go-023 are recorded as new IDs (not status flips on
the original) because the relevant code was deleted and re-added in
`8aaab82`, so the prior Resolution remains historically accurate against
that earlier commit.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | New issue: bulk SDK methods accept a non-nil but empty entry/tag slice and send a no-op bulk command (Client.Go-025). |
| 2 | mxaccessgw conventions | No issues found in this diff. `gofmt -l .` and `go vet ./...` clean. |
| 3 | Concurrency & thread safety | New issue: `runBenchReadBulk` warm-up + steady-state loops once again ignore `ctx.Err()` — a regression of Client.Go-018 (Client.Go-023). |
| 4 | Error handling & resilience | New issue: `runBatch` uses a default 64 KiB `bufio.Scanner` buffer; one long bulk-args line aborts the whole batch session with no EOR sentinel (Client.Go-026). |
| 5 | Security | No issues found — `runAcknowledgeAlarm` doesn't log the `-comment` / `-operator` values; the bulk-write SDK doc comments still flag credential-sensitive values. |
| 6 | Performance & resource management | No issues found in this diff — `defer client.Close()` / `defer session.Close(...)` / `defer session.UnsubscribeBulk(...)` applied consistently in `runBenchReadBulk`. |
| 7 | Design-document adherence | No issues found in this diff. `Client.StreamAlarms` correctly skips `c.callContext` (streams must not carry the unary call timeout); `AcknowledgeAlarm` correctly uses it. |
| 8 | Code organization & conventions | New issues: `runWriteBulkVariant` re-introduces the unused `secured` parameter, the dead `_ = secured` line, and unconditionally-registered secured/non-secured flags — regresses Client.Go-015 (Client.Go-022). `runBatch` treats a blank line in the middle of a script as end-of-session (Client.Go-027). |
| 9 | Testing coverage | New issue: the five new bulk SDK methods and `Client.StreamAlarms` have no unit tests in `mxgateway/` (Client.Go-024). |
| 10 | Documentation & comments | No issues found in this diff. README documents the new `StreamAlarms`/`AcknowledgeAlarm` SDK calls; `Session.ReadBulk` documents the cached-vs-snapshot semantics and `timeout=0` default; `WriteSecuredBulk` flags credential sensitivity. |
## Findings
### Client.Go-001
@@ -432,3 +464,164 @@ But there is no test asserting that, e.g., `mxgw-go write-bulk -current-user-id
Each is a few lines and routes through the existing `runWithIO` entry point, so it does not need a bufconn fake.
**Resolution:** 2026-05-20 — Added CLI-level table-driven regression tests in `cmd/mxgw-go/main_test.go` routed through `runWithIO`, so they need no bufconn fake: `TestRunWriteBulkVariantGatesSecuredFlags` pins Client.Go-015 by asserting `write-bulk -current-user-id`, `write-bulk -verifier-user-id`, `write2-bulk -current-user-id`, `write-secured-bulk -user-id`, and `write-secured2-bulk -user-id` all surface `flag provided but not defined`; `TestRunReadBulkRejectsMissingArgs` pins the "session-id and items are required" check across no-flags / missing-items / missing-session-id; `TestRunBenchReadBulkRejectsNonPositiveBulkSize` and `TestRunBenchReadBulkRejectsNonPositiveDuration` pin the positivity checks; `TestRunWriteBulkVariantRejectsMismatchedHandlesAndValues` pins the explicit `item-handles count ... does not match values count ...` error. `go test ./...` passes.
### Client.Go-022
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Code organization & conventions |
| Location | `clients/go/cmd/mxgw-go/main.go:398-412,417-519` |
| Status | Open |
**Description:** Commit `8aaab82` ("Go client: port bulk read/write SDK methods + CLI subcommands") re-introduces every symptom that Client.Go-015 documented and was marked Resolved against an earlier commit:
- `runWriteBulkVariant(ctx, args, stdout, stderr, command, withTimestamp, secured bool)` accepts a `secured` parameter that is never used for routing — the switch at `main.go:478` keys on `command` (`"write-bulk"` / `"write2-bulk"` / `"write-secured-bulk"` / `"write-secured2-bulk"`). The function ends with the misleading `_ = secured // currently only used for routing above; reserved for future per-variant validation` at `main.go:517`, the exact line Client.Go-015 called out as both inaccurate (`secured` is not used for routing) and reserved-without-purpose.
- All four wrappers (`runWriteBulk`, `runWrite2Bulk`, `runWriteSecuredBulk`, `runWriteSecured2Bulk` at `main.go:398-412`) pass a `secured` argument that has no effect.
- `-current-user-id`, `-verifier-user-id`, and `-user-id` are all registered unconditionally for every variant (`main.go:427-429`), so `mxgw-go write-bulk -current-user-id 5 -item-handles 1 -values foo` silently no-ops the flag instead of failing with `flag provided but not defined`. The secured-only flag gating that Client.Go-015's resolution claimed is gone again.
Because the surrounding test file (`main_test.go`) lost the regression tests promised by Client.Go-021 between then and this commit, nothing caught the re-introduction.
**Recommendation:** Re-apply the Client.Go-015 fix on this re-added code. Drop the `secured` parameter and the `_ = secured` line (the `command` switch is the only routing key); derive the variant locally from `command`; register `-current-user-id` / `-verifier-user-id` only inside the secured branches and `-user-id` only inside Write/Write2 — so a wrong-variant flag fails with a clean `flag provided but not defined` usage error. Re-add the `TestRunWriteBulkVariantGatesSecuredFlags` table-test from the Client.Go-021 resolution so a future regression is caught by CI.
**Resolution:** Open.
### Client.Go-023
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `clients/go/cmd/mxgw-go/main.go:604-606,616-632` |
| Status | Open |
**Description:** `runBenchReadBulk`'s warm-up and steady-state loops are wall-clock-only again:
```go
warmupDeadline := time.Now().Add(time.Duration(*warmupSeconds) * time.Second)
timeout := time.Duration(*timeoutMs) * time.Millisecond
for time.Now().Before(warmupDeadline) {
_, _ = session.ReadBulk(ctx, serverHandle, tags, timeout)
}
...
for time.Now().Before(steadyDeadline) {
callStart := time.Now()
results, err := session.ReadBulk(ctx, serverHandle, tags, timeout)
...
}
```
Neither loop checks `ctx.Done()` / `ctx.Err()`. This is exactly the shape Client.Go-018 documented and was marked Resolved against an earlier commit — the `&& ctx.Err() == nil` guard the Resolution added has been removed by commit `8aaab82` (which re-added `runBenchReadBulk`). If the cross-language driver `scripts/bench-read-bulk.ps1` times out and kills the child early, or an operator Ctrl+Cs the bench, each `ReadBulk` call inside fails fast with `context.Canceled` but the steady-state loop counts those as `failedCalls++` and keeps spinning for the full `*durationSeconds` window. That inflates `failedCalls` and `latencyMs.max` in the JSON the PowerShell driver collates across all five clients, and wastes CPU. Note that `runBenchReadBulk` is also the only Go CLI command that does not register its own signal handler — `runStreamAlarms` (added in the same diff range) and `runGalaxyWatch` both do.
**Recommendation:** Re-apply the Client.Go-018 fix: change both loop conditions to `for time.Now().Before(warmupDeadline) && ctx.Err() == nil` (and the same on `steadyDeadline`). The cross-language bench JSON shape is unchanged — the truncated window is just reported faithfully via `durationMs` / `totalCalls`. Optionally add the `signal.NotifyContext` pattern used by `runStreamAlarms` and `runGalaxyWatch` so direct Ctrl+C on the bench also short-circuits cleanly.
**Resolution:** Open.
### Client.Go-024
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/go/mxgateway/session.go:395-525`, `clients/go/mxgateway/alarms.go:65-76` |
| Status | Open |
**Description:** The five new bulk SDK methods on `Session` and the new `Client.StreamAlarms` method have **no unit tests** in `clients/go/mxgateway/`:
- `Session.WriteBulk` (`session.go:395`)
- `Session.Write2Bulk` (`session.go:418`)
- `Session.WriteSecuredBulk` (`session.go:442`)
- `Session.WriteSecured2Bulk` (`session.go:465`)
- `Session.ReadBulk` (`session.go:495`) — in particular the `timeout > uint32(^uint32(0))` ms saturation branch at `session.go:504-509` is interesting (a `time.Duration` is `int64` ns; the conversion silently clamps to `MaxUint32` ms ≈ 49 days) and is entirely untested.
- `Client.StreamAlarms` (`alarms.go:65`) — no equivalent of `TestQueryActiveAlarmsPassesFilterPrefix` exists for `StreamAlarms`, even though they share the same fake-server pattern.
`mxgateway/client_session_test.go` and `mxgateway/alarms_test.go` cover the prior single-item and bulk methods (`TestWriteBulkBuildsOneBulkCommandAndReturnsPerEntryResults`, `TestReadBulkForwardsTimeoutAndUnpacksCachedFlag`, etc. were referenced by Client.Go-021 but `Grep` of the current tree confirms they no longer exist — the test file ends at commit `8aaab82` without any of them), so library-level happy paths and error paths for the new shape are entirely uncovered.
**Recommendation:** Add bufconn-fake-server tests in `mxgateway/client_session_test.go` and `mxgateway/alarms_test.go`:
- `TestWriteBulkBuildsOneBulkCommandAndReturnsPerEntryResults` — confirm the protobuf payload is `MX_COMMAND_KIND_WRITE_BULK` carrying every entry, and the reply's `BulkWriteResult` list flows back. Mirror for Write2Bulk, WriteSecuredBulk, WriteSecured2Bulk.
- `TestReadBulkForwardsTimeoutAndUnpacksCachedFlag` — pin `timeoutMs` arithmetic for `0`, a normal `time.Duration`, and the `> MaxUint32 ms` saturation branch; assert `WasCached` propagates from each `BulkReadResult`.
- `TestStreamAlarmsPassesFilterPrefix` — mirror `TestQueryActiveAlarmsPassesFilterPrefix`; assert request flows and stream Recv returns each fake `AlarmFeedMessage` (active-alarm snapshot, snapshot-complete sentinel, transition).
These plus `nil` / empty-slice rejection tests for each bulk method close out the new public surface.
**Resolution:** Open.
### Client.Go-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/go/mxgateway/session.go:395-485,495-525` |
| Status | Open |
**Description:** The five new bulk methods (`WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, `ReadBulk`) each guard with `if entries == nil { return error }` and an upper-bound `ensureBulkSize` check, but accept a non-nil empty slice (e.g. `[]*WriteBulkEntry{}` or `[]string{}`). The call then sends an `MX_COMMAND_KIND_WRITE_BULK` (or peer) command with zero entries across the gRPC wire to the gateway, which forwards to the worker for a no-op round trip. This is the same shape Client.Go-015 / Client.Go-021 were written against (the CLI now also accepts `mxgw-go write-bulk -item-handles , -values ,` which `parseInt32List` returns as empty without error). The pre-existing bulk methods (`AddItemBulk`, `AdviseItemBulk`, etc. at `session.go:253-343`) carry the identical pattern, so this is a long-standing convention — but it's still a real cost on the hot path. The Java / .NET / Rust / Python clients should be checked for parity if this is fixed.
**Recommendation:** Pick one model and apply it consistently to every bulk SDK method (new and pre-existing):
1. Reject empty slices early: `if len(entries) == 0 { return nil, errors.New("mxgateway: write bulk entries are required") }` — this is the spelling already used on the nil branch, and matches the GoStyleGuide's "Return errors instead of panicking" / "Wrap errors with useful context" voice.
2. Or document at the SDK doc-comment that an empty slice is a valid no-op (it produces `[]*BulkWriteResult{}` and one wasted gRPC round trip), and have the CLI reject empty `-item-handles` / `-values` before dialing.
Option 1 is cheaper for callers (one less round trip and one clearer error message) and removes the empty-list footgun for cross-language drivers that may pass empty arrays from PowerShell `,` splits.
**Resolution:** Open.
### Client.Go-026
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/go/cmd/mxgw-go/main.go:1196-1222` |
| Status | Open |
**Description:** `runBatch` reads command lines with a default `bufio.Scanner`:
```go
scanner := bufio.NewScanner(in)
for scanner.Scan() {
...
}
return scanner.Err()
```
The default `bufio.Scanner` token size is 64 KiB (`bufio.MaxScanTokenSize`). One long subcommand invocation — e.g. a `subscribe-bulk -items tag001,tag002,...,tag5000` line with several thousand tag references, or a `write-bulk` driven by a Galaxy-sized handle list — crosses 64 KiB and the scanner errors out with `bufio.ErrTooLong` mid-session. Two follow-on effects:
- The error is returned only at the outer `return scanner.Err()`, so the cross-language batch harness sees neither the EOR sentinel for the failing command nor any of the still-pending commands queued after it.
- The fix is one line: `scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)` (or larger), which lifts the cap on one-line tokens without buffering the whole input.
A second weakness: `strings.Fields(line)` splits on whitespace and does no quote-aware tokenisation, so a literal `-comment "with spaces"` becomes four args (`-comment`, `"with`, `spaces"`). This isn't a Go-specific bug — every language's batch tokeniser has this surface — but it means certain CLI flags that take free-text values (`acknowledge-alarm -comment`, `acknowledge-alarm -operator`) can't carry spaces in batch mode. Worth documenting if not fixed.
**Recommendation:** Call `scanner.Buffer(make([]byte, 0, 64*1024), 16*1024*1024)` immediately after `bufio.NewScanner` so a long bulk-args line doesn't abort the session. If `runBatch` is intended to support free-text flag values (the `acknowledge-alarm -comment` shape is the obvious case), swap `strings.Fields` for a quote-aware tokeniser (`mvdan.cc/sh/v3/syntax` or a small inline state machine matching the .NET/Rust harness shape). Otherwise add a one-line comment to `runBatch`'s doc-comment that batch-mode arguments must not contain whitespace.
**Resolution:** Open.
### Client.Go-027
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/go/cmd/mxgw-go/main.go:1195-1206` |
| Status | Open |
**Description:** `runBatch`'s doc-comment says the loop "never terminates on command error; only stdin EOF (or an empty line) ends the session", and the implementation matches:
```go
for scanner.Scan() {
line := scanner.Text()
if line == "" {
break
}
...
}
```
The cross-language batch harness sends `\n`-delimited command lines into the child's stdin and reads back EOR-framed results — there is no documented "empty line ends the session" framing in any other client (the .NET, Rust, Python, and Java drivers should be diffed if a fix lands). A stray blank line in a longer script (very easy in PowerShell here-strings, which prefer trailing newlines) terminates the entire Go batch session silently — the harness sees a clean exit and reads no EOR for the still-pending commands, which presents as "the next command never ran" with no error path.
The two cases the empty-line check seems to cover — (a) operator pressing Enter on an interactive `mxgw-go batch` prompt, and (b) `\n\n` framing between commands — would both be cleaner under a `continue` (skip blank lines, keep reading until real EOF) than a `break`. Option (a) isn't a real workflow (batch is a programmatic interface); option (b) is the actual bug.
**Recommendation:** Change `if line == "" { break }` to `if line == "" { continue }` (alongside the existing `len(args) == 0` continue, which is then redundant — keep one, drop the other for clarity). Update the `runBatch` doc-comment to read "only stdin EOF ends the session" and drop the "or an empty line" clause. If the interactive ergonomic is genuinely wanted, gate it on `isatty(stdin)` so the batch-from-pipe case isn't affected.
**Resolution:** Open.
+138 -3
View File
@@ -5,9 +5,9 @@
| Module | `clients/java` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 5 |
## Checklist coverage
@@ -52,6 +52,31 @@ documentation updates lag — see the doc-side findings below.
| 9 | Testing coverage | Cross-reference to Client.Java-030; the bulk-command gaps tracked under Client.Java-026 remain. |
| 10 | Documentation & comments | Issues found: Client.Java-027 (Gradle task names in README and JavaClientDesign still reference the old `:mxgateway-client:` and `:mxgateway-cli:` paths — every command in the README breaks if copy-pasted); Client.Java-029 (`README.md:209` cites `zb-mom-ww-mxgateway-cli/build/install/mxgateway-cli` but the actual install path contains a doubled directory `zb-mom-ww-mxgateway-cli/build/install/zb-mom-ww-mxgateway-cli/`). |
### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037`. Diff against `d692232` is five commits touching
`clients/java`: `10bd0c0` (Client.Java-027..031 fixes), `71d2c39` (port
`batch` subcommand), `f90bff0` (port bulk read/write SDK + CLI subcommands),
`8a0c59d` (port `stream-alarms` + `acknowledge-alarm` to the Java CLI plus a
new `MxGatewayClient.streamAlarms` SDK method returning
`MxGatewayAlarmFeedSubscription`), and `8738735` (README updates for the
alarm surface). `gradle build` from `clients/java` is green; new findings
focus on the alarm surface and the `batch` subcommand. Prior findings
Client.Java-001..031 are unchanged.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: `BatchCommand` tokenises stdin with `line.trim().split("\\s+")`, so any argument with embedded whitespace (e.g. `--comment "needs verification"`) is shredded mid-string (Client.Java-034); `StreamAlarmsCommand` uses `queue.offer(value)` on a bounded queue with no overflow detection, silently dropping alarm messages when the gateway pushes faster than the consumer drains (Client.Java-033). |
| 2 | mxaccessgw conventions | Cross-reference to Client.Java-033 — silently-dropping events violates `JavaStyleGuide.md` ("Do not reorder, coalesce, or drop events in client code") and the fail-fast event-backpressure contract. No new finding raised separately. |
| 3 | Concurrency & thread safety | No issues found in this diff. `MxGatewayAlarmFeedSubscription` correctly mirrors `MxGatewayEventSubscription`'s `beforeStart`-vs-`cancel` race fix (Client.Java-014). |
| 4 | Error handling & resilience | No issues found in this diff. |
| 5 | Security | No issues found in this diff. |
| 6 | Performance & resource management | No issues found in this diff. |
| 7 | Design-document adherence | No issues found in this diff. The new `streamAlarms` SDK surface matches `gateway.md`'s always-on alarm-monitor design (no worker session, multi-subscriber). |
| 8 | Code organization & conventions | Issue found: `MxGatewayAlarmFeedSubscription` is a structural copy of `MxGatewayEventSubscription` (same fields, same `beforeStart`/`cancel` shape) — the subscription-class family should share a common base or helper (Client.Java-036). |
| 9 | Testing coverage | Issue found: the new `MxGatewayClient.streamAlarms` SDK method has no library-side test in `zb-mom-ww-mxgateway-client/src/test/...` — only the CLI test exercises the path via a `FakeClient.streamAlarms` override that bypasses the production `subscription.wrap(observer)` glue (Client.Java-035). |
| 10 | Documentation & comments | Issue found: README (`clients/java/README.md:182-183`) documents the new `stream-alarms` and `acknowledge-alarm` commands with `--session-id <id>` (neither command has that option) and `acknowledge-alarm --alarm-reference …` (actual flag is `--reference`) — every documented invocation fails at picocli parse time (Client.Java-032). |
## Findings
### Client.Java-001
@@ -518,3 +543,113 @@ documentation updates lag — see the doc-side findings below.
**Recommendation:** Either (a) update the prose to the full prefixed names, or (b) clarify in a one-line note: "The subprojects are `zb-mom-ww-mxgateway-client` and `zb-mom-ww-mxgateway-cli`; this README refers to them by their short suffixes below for readability." (a) is more honest; (b) preserves readability at the cost of one extra concept.
**Resolution:** 2026-05-24 — Took option (a): updated the README layout-section prose in `clients/java/README.md:17,22,26` (the `mxgateway-client` generates… paragraph, the `mxgateway-client` exposes… paragraph, and the `mxgateway-cli` depends on `mxgateway-client`… paragraph) to use the full prefixed `zb-mom-ww-mxgateway-client` and `zb-mom-ww-mxgateway-cli` names, matching the layout block at lines 1314 and `settings.gradle`. Reflows the final paragraph slightly because the prefixed names push past the existing 80-column wrap; content is unchanged otherwise. Doc-only change.
### Client.Java-032
| Field | Value |
|---|---|
| Severity | High |
| Category | Documentation & comments |
| Location | `clients/java/README.md:182-183` |
| Status | Open |
**Description:** Commit `8738735` ("clients: document StreamAlarms + AcknowledgeAlarm in each README") added two new gradle invocations to the CLI Usage block:
```
gradle :zb-mom-ww-mxgateway-cli:run --args="stream-alarms --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --session-id <id> --limit 1 --json"
gradle :zb-mom-ww-mxgateway-cli:run --args="acknowledge-alarm --endpoint localhost:5000 --api-key-env MXGATEWAY_API_KEY --plaintext --session-id <id> --alarm-reference \"\\Galaxy\Area001.Pump001.PumpFault\" --json"
```
Both are wrong against the actual CLI surface in `MxGatewayCli.java`:
- `StreamAlarmsCommand` (line 1060) has no `--session-id` option — the gateway's alarm feed is session-less ("Served by the gateway's always-on alarm monitor — no worker session is opened" per the SDK Javadoc at `MxGatewayClient.java:331`). picocli will reject the unknown option with a non-zero exit before the command body runs.
- `AcknowledgeAlarmCommand` (line 1135) also has no `--session-id` option, AND the option is named `--reference` (line 1136), not `--alarm-reference`. Two unknown-option errors per copy-paste.
A user copying either invocation from the README hits a picocli parse error immediately. The other clients' README updates in the same commit (commit `8738735`) likely have the same issue but are out of scope for this Java review.
**Recommendation:** Drop the `--session-id <id>` token from both documented invocations, and change `--alarm-reference` to `--reference` in the `acknowledge-alarm` line. Optionally also add `--filter-prefix` to the `stream-alarms` example so readers see the scoping option, and align README option names with the actual CLI by either renaming the CLI option `--reference``--alarm-reference` (matches the proto `alarm_full_reference` field semantically) or leaving as is and only fixing the README. Add a small `MxGatewayCliTests` parse-only assertion for both subcommands that exercises every option flag to prevent the same drift the next time the CLI surface or README is touched.
### Client.Java-033
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `clients/java/zb-mom-ww-mxgateway-cli/src/main/java/com/zb/mom/ww/mxgateway/cli/MxGatewayCli.java:1078-1098` |
| Status | Open |
**Description:** `StreamAlarmsCommand.call()` allocates a bounded `ArrayBlockingQueue<Object>(1024)` and the gRPC observer publishes each `AlarmFeedMessage` via `queue.offer(value)`:
```
BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024);
@Override public void onNext(AlarmFeedMessage value) { queue.offer(value); }
```
`BlockingQueue.offer` returns `false` when the queue is full and the message is dropped silently — no exception, no log, no signal to the consumer that data was lost. Combined with gRPC's default auto-inbound flow control (the same regime that Client.Java-011 documented for `MxEventStream`), any consumer stall long enough to fill 1024 slots will silently lose alarms past the boundary. The active-alarm snapshot replay at session open is exactly the burst case that pushes a moderately busy alarm feed close to that limit, and the failure mode is invisible — the CLI prints a truncated feed and exits 0.
The library-side `MxEventStream` (Client.Java-002 resolution) and `DeployEventStream` (Client.Java-021 resolution) explicitly surface overflow as `MxGatewayException` so the consumer can resubscribe with a resume cursor. The new `stream-alarms` CLI path regresses that contract.
`JavaStyleGuide.md` ("Streaming" section, line 57) explicitly prohibits this pattern: "Do not reorder, coalesce, or drop events in client code." `CLAUDE.md` ("fail-fast event backpressure") makes overflow a deliberate cancel-and-surface signal, not a silent drop.
**Recommendation:** Either (a) wrap the gRPC observer in the existing `MxEventStream`-style adaptor that calls `subscription.cancel()` and queues an exception on `queue.offer` returning `false`, then surface that exception from the drain loop — mirroring `MxEventStream.observer().onNext`'s overflow branch; or (b) reuse the library-side fail-fast plumbing by promoting `MxEventStream` (or extracting its terminal-state base) into a public `MxAlarmFeedStream` and have `MxGatewayClient.streamAlarms` return that instead of a bare subscription handle. Option (b) lines up with Client.Java-036 (deduplicate the subscription class family). Add a CLI regression test that overflows the bounded queue and asserts a non-zero exit / overflow exception, mirroring `MxGatewayMediumFindingsTests.eventStreamOverflowExceptionSurvivesASubsequentClose`.
### Client.Java-034
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `clients/java/zb-mom-ww-mxgateway-cli/src/main/java/com/zb/mom/ww/mxgateway/cli/MxGatewayCli.java:182-198` |
| Status | Open |
**Description:** `BatchCommand.call()` reads one CLI invocation per stdin line and tokenises with:
```
String[] args = line.trim().split("\\s+");
int exitCode = cmd.execute(args);
```
`split("\\s+")` does no shell-quoting parsing — it just splits on whitespace runs. Any argument with embedded whitespace gets shredded mid-string. Examples that break:
- `acknowledge-alarm --comment "needs verification" --operator op1` becomes `[acknowledge-alarm, --comment, "needs, verification", --operator, op1]` — picocli sees `"needs` as the comment value and `verification"` as an unknown positional argument.
- `galaxy-watch --json --last-seen-deploy-time "2026-04-28 18:30:00Z"` (any value with a space) fails the same way.
The commit message for `71d2c39` says the protocol expects "one line of stdin = one full CLI invocation" without specifying whether quoted arguments are honoured, but the cross-language matrix (`scripts/run-client-e2e-tests.ps1` per the prior reviews) compares output across all five clients — if other clients (.NET, Go, Rust, Python) implement shell-like tokenisation, the Java CLI will diverge on any command-line value containing whitespace, and the e2e harness will fail or skip those cases inconsistently.
The current `MxGatewayCliTests` test set (`batchCommandExecutesVersionAndEmitsEorMarker`, `batchCommandEmitsEorAfterFailedCommandAndContinues`) only covers whitespace-free args, so the bug is invisible in the test suite.
**Recommendation:** Replace `line.trim().split("\\s+")` with a real shell-style tokeniser that honours single and double quotes and backslash escapes — `picocli.CommandLine.ArgumentParser` doesn't ship one, but Apache Commons Exec's `CommandLine.translateCommandline(String)`, JDK 21's `java.util.spi.ToolProvider` argument parsing, or a small hand-written state machine all work. Cross-check the .NET / Go / Rust / Python `batch` implementations in the same change so all five clients use the same tokenisation; document the contract in the protocol comment in `MxGatewayCli.java` and in `scripts/run-client-e2e-tests.ps1`. Add a CLI test that feeds `acknowledge-alarm --comment "with spaces"` through `batch` and asserts the `--comment` value reaches the gateway as `"with spaces"`.
### Client.Java-035
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/java/zb-mom-ww-mxgateway-client/src/test/java/com/zb/mom/ww/mxgateway/client/MxGatewayClientSessionTests.java` |
| Status | Open |
**Description:** Commit `8a0c59d` added `MxGatewayClient.streamAlarms(StreamAlarmsRequest, StreamObserver<AlarmFeedMessage>)` and a new public `MxGatewayAlarmFeedSubscription` class. No library-side test exercises either: a grep for `streamAlarms` across `zb-mom-ww-mxgateway-client/src/test/...` returns zero matches. The CLI tests (`MxGatewayCliTests.streamAlarmsCommand*`) exercise the path end-to-end, but they route through a `FakeClient.streamAlarms` override that bypasses the production `subscription.wrap(observer)` glue and the `withStreamDeadline(rawAsyncStub()).streamAlarms(...)` call. A regression to either — forgetting `.wrap(observer)`, dropping the deadline interceptor, misnaming the request — would compile and pass the CLI tests but break against a real gateway.
This is the same coverage gap pattern as Client.Java-030 (no fixture test for `QueryActiveAlarmsRequest`) which was resolved by adding `queryActiveAlarmsForwardsRequestAndStreamsSnapshots` to `MxGatewayClientSessionTests`. The new alarm-feed RPC deserves the same shape.
**Recommendation:** Add `streamAlarmsForwardsRequestAndStreamsAlarmFeedMessages` to `MxGatewayClientSessionTests` (in-process gRPC via the existing `InProcessGateway` + `TestGatewayService` fixture): override `TestGatewayService.streamAlarms` to capture the inbound `StreamAlarmsRequest` and emit one `active_alarm` snapshot, one `snapshot_complete`, and one `transition`, then complete. Call `MxGatewayClient.streamAlarms`, drain the observer via a `CountDownLatch`, and assert (a) the server observed the `alarm_filter_prefix`, (b) all three messages arrived in order with the expected payload-case, and (c) `MxGatewayAlarmFeedSubscription.cancel()` aborts the call (latch via `ServerCallStreamObserver.setOnCancelHandler`, mirroring the Client.Java-015 cancellation regression). Optionally also cover the cancel-before-beforeStart race that `MxGatewayAlarmFeedSubscription.wrap` handles, mirroring `mxEventStreamCloseBeforeBeforeStartCancelsStream`.
### Client.Java-036
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/java/zb-mom-ww-mxgateway-client/src/main/java/com/zb/mom/ww/mxgateway/client/MxGatewayAlarmFeedSubscription.java`, `MxGatewayEventSubscription.java`, `MxGatewayActiveAlarmsSubscription.java`, `DeployEventSubscription.java` |
| Status | Open |
**Description:** `MxGatewayAlarmFeedSubscription` is a structural near-copy of `MxGatewayEventSubscription` — same `AtomicReference<ClientCallStreamObserver<…>>` + `AtomicBoolean cancelled` field shape, the same `wrap(observer)` returning a `ClientResponseObserver` that stores `requestStream` in `beforeStart`, the same close-before-beforeStart race handling that Client.Java-014 originally fixed for `MxEventStream`, and the same `cancel()`+`close()` idempotency contract. The four subscription classes (`MxGatewayEventSubscription`, `MxGatewayActiveAlarmsSubscription`, `MxGatewayAlarmFeedSubscription`, `DeployEventSubscription`) are now ~60-line near-clones differing only in the request/response generic parameters and the `cancel` message string.
This is the same maintenance-hazard pattern Client.Java-009 / Client.Java-016 identified for `createChannel` / `withDeadline` / `shutdown` and which was resolved by extracting `MxGatewayChannels`. A future fix to one subscription class (e.g. propagating Client.Java-014's race fix to a future fifth subscription type, or hardening idempotent `cancel`) will silently miss the others — exactly what happened with Client.Java-021 (the `MxEventStream` terminal-state fix that didn't reach `DeployEventStream` for months).
**Recommendation:** Extract a package-private abstract base, e.g. `MxGatewayStreamSubscription<TRequest>`, holding the `AtomicReference` / `AtomicBoolean` pair, the `cancel()` / `close()` implementation, and a `ClientResponseObserver` factory parameterised by the cancel-message string and the response observer. Have all four subscription classes extend it. Behaviour-only refactor — no public API change, existing tests cover the contract.
+268 -3
View File
@@ -5,9 +5,9 @@
| Module | `clients/python` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 5 |
## Checklist coverage
@@ -28,6 +28,38 @@ history. This section reflects categories evaluated in this pass.
| 9 | Testing coverage | Issue found: no test exercises the wheel-build / editable-install flow; the broken `pyproject.toml` (Client.Python-018) was not caught at commit time because the test suite runs from `src/` via `pytest pythonpath` (Client.Python-020). |
| 10 | Documentation & comments | Issue found: cross-client CLI parity gap — the Python CLI ships none of the Galaxy subcommands (`galaxy-test-connection`, `galaxy-last-deploy`, `galaxy-discover`, `galaxy-watch`) the .NET / Go / Rust / Java CLIs all expose, and lacks the new `.NET`-only `bench-stream-events`. README does not flag the gap (Client.Python-021). |
### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037`. The diff against the previous review base
`d692232` is four commits affecting `clients/python`:
- `71d2c39` e2e: port `batch` subcommand to all five client CLIs
- `6add4b4` Python client: port bulk read/write SDK methods + CLI subcommands
- `828e3e6` Python client: port stream-alarms and acknowledge-alarm
- `8738735` clients: document StreamAlarms + AcknowledgeAlarm in each README
Surface area added: `Session.read_bulk` / `write_bulk` / `write2_bulk` /
`write_secured_bulk` / `write_secured2_bulk`; `GatewayClient.stream_alarms`
+ `_canceling_alarm_feed_iterator`; the corresponding CLI subcommands
`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`,
`write-secured2-bulk`, `bench-read-bulk`, `stream-alarms`,
`acknowledge-alarm`, and `batch`; new README CLI examples for the alarm
subcommands; new CLI tests for `stream-alarms` / `acknowledge-alarm`
registration and `batch` semantics.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: `bench-read-bulk` does a function-local `import time` and uses bare `except Exception: pass` in cleanup blocks (Client.Python-026). |
| 2 | mxaccessgw conventions | No new issues found — secured writes still redact, generated code untouched, MXAccess parity preserved. |
| 3 | Concurrency & thread safety | No new issues found — `_canceling_alarm_feed_iterator` follows the same shape as `_canceling_iterator` (Client.Python-007 helper). |
| 4 | Error handling & resilience | No new issues found in the new SDK methods; new RPC mapping `map_rpc_error("stream alarms", ...)` is correct. |
| 5 | Security | Issue found: `_use_plaintext` localhost auto-plaintext branch resolved under Client.Python-013 is back in the renamed CLI module and was carried forward through the new commit untouched (Client.Python-023). |
| 6 | Performance & resource management | No new issues found. |
| 7 | Design-document adherence | No new issues found in the new alarm / bulk surface — matches the cross-client parity matrix expectation. |
| 8 | Code organization & conventions | Issue found: the new `batch` subcommand uses `click.testing.CliRunner` from production code (Client.Python-024). |
| 9 | Testing coverage | Issue found: the new SDK methods and most of the new CLI subcommand bodies have no behavioural tests — only `--help` smoke tests for the alarm CLI (Client.Python-025). |
| 10 | Documentation & comments | Issue found: the README CLI examples for `stream-alarms` and `acknowledge-alarm` use flags that do not exist on the implemented commands (Client.Python-022). |
### 2026-05-24 review (commit d692232)
Re-review pass at `d692232`. Diff against `a020350` is commit `397d3c5`:
@@ -795,3 +827,236 @@ the cross-language benchmark matrix has a record of the asymmetry.
No CLI source change; the implementation of the four Galaxy
subcommands is deferred. Resolved as a doc note rather than a full
parity fix.
### Client.Python-022
| Field | Value |
|---|---|
| Severity | High |
| Category | Documentation & comments |
| Location | `clients/python/README.md:201-202`, `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:389-420` |
| Status | Open |
**Description:** The README CLI examples added by commit `8738735` for the
new alarm subcommands cite flags the CLI does not accept:
```
mxgw-py stream-alarms --session-id <id> --max-messages 1 --json
mxgw-py acknowledge-alarm --session-id <id> --alarm-reference "\\Galaxy\Area001.Pump001.PumpFault" --json
```
Both subcommands are session-less (the alarm feed is served by the gateway
itself, not a worker session — see the docstring on `acknowledge_alarm`,
"Acknowledge an active MXAccess alarm condition (session-less)"). Neither
`@main.command("stream-alarms")` nor `@main.command("acknowledge-alarm")`
declares a `--session-id` option, and `acknowledge-alarm` declares the
ack-target as `--reference`, **not** `--alarm-reference`. A user copy-pasting
either example gets `Error: no such option: --session-id` (or
`--alarm-reference`) and exits non-zero before any RPC is attempted.
This drift is invisible to the test suite because
`tests/test_cli.py::test_acknowledge_alarm_requires_reference` only asserts
that the missing-flag error mentions `--reference` — it does not validate
the README at all. The .NET / Go / Rust / Java alarm CLI examples in the
sibling READMEs are consistent with their actual flag names, so the Python
README is the only one out of step with its implementation.
**Recommendation:** Either fix the README examples to match the implementation
(remove `--session-id` from both lines, rename `--alarm-reference` to
`--reference`), or — if cross-client parity wants the longer flag name —
rename the CLI option to `--alarm-reference` and add a test that copy-pastes
the README examples through `CliRunner` to assert they parse. Option (1) is
the smaller change.
### Client.Python-023
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Security |
| Location | `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:901-906` |
| Status | Open |
**Description:** Client.Python-013 (severity Medium, Security) was marked
**Resolved** on 2026-05-20 with the explicit claim that the silent
`localhost:` / `127.0.0.1:` auto-plaintext branch had been removed from
`_use_plaintext`. The re-review at `d692232` re-asserted this in its
checklist ("`_use_plaintext` now requires explicit `--plaintext` opt-in
(Client.Python-013 resolution verified)").
The branch is still present in the reviewed source at HEAD `42b0037`:
```python
def _use_plaintext(kwargs: dict[str, Any]) -> bool:
if kwargs.get("use_tls"):
return False
if kwargs.get("plaintext"):
return True
return kwargs["endpoint"].startswith("localhost:") or kwargs["endpoint"].startswith("127.0.0.1:")
```
The same code is present in `git show d692232:clients/python/src/zb_mom_ww_mxgateway_cli/commands.py`, so the regression entered at or
before the rename commit `397d3c5` (which created
`src/zb_mom_ww_mxgateway_cli/commands.py` from scratch with the
pre-Client.Python-013 body) and was not noticed in the prior re-review.
The original security argument is unchanged: a user who runs the gateway
behind TLS on loopback for production-shaped local testing and passes
`--api-key mxgw_<secret>` against `localhost:5001` silently gets a plaintext
gRPC channel, with the bearer token attached to it. The other clients
(.NET https-prefix detection, Go / Java explicit `--plaintext`, Rust
opt-in) do not auto-downgrade. The Client.Python-013 resolution also added
six regression tests in `tests/test_cli.py` that asserted the explicit-flag
contract; those tests do not exist in the current `tests/test_cli.py`
either they were lost in the rename or never carried over.
**Recommendation:** Re-apply the Client.Python-013 fix on the current
source: drop the `endpoint.startswith("localhost:") or endpoint.startswith("127.0.0.1:")`
branch, make `--plaintext` and `--tls` mutually exclusive, default to TLS,
and add an assertion-time test that copy-pastes the Client.Python-013
regression-test fixture into `tests/test_cli.py`. Because Client.Python-013
is marked Resolved with a 2026-05-20 commit reference, do **not** silently
re-resolve this finding — keep it Open with a fresh ID so the regression
audit trail is preserved.
### Client.Python-024
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Code organization & conventions |
| Location | `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:13,48-119` |
| Status | Open |
**Description:** The new `batch` subcommand (commit `71d2c39`) implements
the cross-language batch protocol by importing `click.testing.CliRunner`
into production code and calling `runner.invoke(main, args, catch_exceptions=True)`
in a `for raw_line in sys.stdin:` loop. `CliRunner` is documented as a
**testing** helper:
* It replaces `sys.stdin` / `sys.stdout` / `sys.stderr` with `io.StringIO`
during each `invoke()`, swallowing any side-channel output the inner
command writes directly to the real streams (the existing CLI bodies do
not, but any future helper that calls `print()` mid-command will be
silently captured into `result.output` rather than reaching the
harness real-time).
* It captures the inner command into an `Exception` rather than letting
Click's normal exit code propagate, so `result.exit_code` is the
pseudo-exit of a `SystemExit` translation, not the real process exit.
* Click does not guarantee `CliRunner` is stable across versions —
click 9 has already deprecated `runner.invoke(..., mix_stderr=...)`,
and a future Click release could change the return-tuple shape.
* It is recursively re-entrant: `runner.invoke(main, ["batch"], ...)`
inside batch silently spawns a nested batch reading from the same
StringIO-replaced stdin (empty), so a stdin line of `batch` exits
cleanly with no error — almost certainly not the intended semantics.
The other client CLIs in the matrix (.NET, Go, Rust, Java) implement
`batch` by dispatching to their command parser directly, not by
re-invoking the test runner.
**Recommendation:** Replace `CliRunner` with a direct call into the Click
parser, e.g. `main.main(args, standalone_mode=False)` wrapped in a
`try/except click.ClickException` to convert Click exit conditions into
the `{"error": ..., "type": ...}` payload. Capture stdout via a per-line
context manager (e.g. `contextlib.redirect_stdout(io.StringIO())`) so the
batch loop can interleave inner-command output with the
`__MXGW_BATCH_EOR__` sentinel without depending on the testing API. Add
a regression test that drives `batch` with `batch\n` on stdin and asserts
recursive invocation is either rejected or correctly bounded.
### Client.Python-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/python/tests/test_cli.py`, `clients/python/src/zb_mom_ww_mxgateway/{client.py,session.py}`, `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py` |
| Status | Open |
**Description:** Commits `6add4b4` and `828e3e6` added five new SDK
methods (`Session.read_bulk`, `Session.write_bulk`, `Session.write2_bulk`,
`Session.write_secured_bulk`, `Session.write_secured2_bulk`),
`GatewayClient.stream_alarms`, the helper
`_canceling_alarm_feed_iterator`, and eight new CLI subcommands
(`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`,
`write-secured2-bulk`, `bench-read-bulk`, `stream-alarms`,
`acknowledge-alarm`). The only test coverage added in `tests/test_cli.py`
is:
* `test_stream_alarms_is_registered``--help` smoke only.
* `test_acknowledge_alarm_requires_reference` — verifies the missing-flag
Click error contains `--reference`; no happy-path test.
There is no test that:
1. Asserts `Session.read_bulk` / `write_bulk` / `write2_bulk` /
`write_secured_bulk` / `write_secured2_bulk` builds the expected
`MxCommand` shape (kind, sub-message, server_handle, entries) — the
prior Client.Python-009 coverage pattern (`test_add_item2_sends_*`,
`test_write2_sends_value_and_timestamp_value`) is not extended to the
bulk family even though they ship the same wire-shape risk.
2. Exercises `_canceling_alarm_feed_iterator` cancel-on-task-cancellation
(the Client.Python-007 helper test pattern is not extended).
3. Drives `stream-alarms` / `acknowledge-alarm` / `read-bulk` /
`write-bulk` / `write-secured-bulk` happy paths through `CliRunner`
with a fake stub — the existing
`test_cli_register_happy_path_emits_server_handle` pattern is not
extended.
4. Asserts `bench-read-bulk` emits the cross-language schema for the new
`read-bulk`-shaped fields (the Client.Python-015 pattern existed for
the previous bench command but no equivalent exists for this one).
A silent drift in any of the four bulk-write request shapes — or a
schema drift on `bench-read-bulk` — would not be caught.
**Recommendation:** Extend the Client.Python-009 / Client.Python-015 /
Client.Python-016 patterns: add request-shape tests for the four new
bulk methods, a CLI happy-path test for `read-bulk` / `write-bulk` /
`stream-alarms` / `acknowledge-alarm` against fake stubs, and a
cross-language schema test for `bench-read-bulk` mirroring
`test_bench_read_bulk_emits_cross_language_schema` (with `--read-bulk`
applied to the renamed bench). At minimum, add a request-shape test for
`write_secured_bulk` since the secured family is the highest-risk
parity surface.
### Client.Python-026
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:674-738` |
| Status | Open |
**Description:** Two minor quality issues in the new `_bench_read_bulk`
body (commit `6add4b4`):
1. `import time` is done inside the function body (line 676) rather than
at module top. `PythonStyleGuide.md` does not state this explicitly,
but every other helper in `commands.py` imports its dependencies at
module top, and `time` is already imported (transitively) elsewhere
in the package. The function-local import is a vestige of incremental
development and should be hoisted.
2. The `finally` cleanup block uses two consecutive bare
`except Exception: pass` blocks to swallow `unsubscribe_bulk` and
`session.close()` failures (lines 733-734 and 737-738). This silently
discards diagnostic information about cleanup failures — e.g. a
transient gateway crash mid-benchmark or a protocol error during
unsubscribe — and matches an anti-pattern the rest of the module
avoids (the `_bench_read_bulk` analogues in the other clients log the
cleanup failure before swallowing it).
Both are Low severity. The bench command is best-effort by design (the
benchmark output is what matters; cleanup failures are not user-facing),
but a single log line on cleanup failure would make a future regression
visible at the next benchmark run rather than silently corrupting the
worker's subscription bookkeeping until a session-level GC sweep.
**Recommendation:** Move `import time` to the module-level import block.
Replace each `except Exception: pass` with `except Exception as exc:
logger.warning("bench-read-bulk cleanup: %s", exc)` against a
module-level `logger = logging.getLogger(__name__)`. No behavioural
change in the happy path; failure path becomes diagnosable. No new test
required for the import hoist; the logger change is exercised by the
existing bench smoke test once `caplog` is added to the test signature.
+247 -3
View File
@@ -5,9 +5,9 @@
| Module | `clients/rust` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 8 |
## Checklist coverage
@@ -50,6 +50,52 @@ member layout unchanged. `cargo test --workspace` and `cargo clippy
| 9 | Testing coverage | No issues found in this diff — the fake gateway's `stream_alarms` impl is a minimal stub, but no production behaviour relies on it being exercised by a test. |
| 10 | Documentation & comments | Issues found: Client.Rust-021 (design-doc drift). |
### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037`. The diff against `d692232` is dominated by
three feature commits — `71d2c39` (ports the cross-language `batch`
subcommand to every CLI; for Rust this adds a `Batch` clap variant plus a
`run_batch` stdin-driven dispatch loop, and a `.cargo/config.toml` setting
`-C link-arg=/STACK:8388608` so debug builds of the now-large clap-derive
Command enum do not overflow Windows's default 1 MB main-thread stack);
`3251069` (ports `BenchReadBulk` and re-adds the bulk read/write SDK
methods to `session.rs``read_bulk`, `write_bulk`, `write2_bulk`,
`write_secured_bulk`, `write_secured2_bulk` plus the matching CLI
subcommands; this commit silently dropped the `BenchReadBulkStats`
helper, the `failureLatencyMs`/`firstFailure` JSON enrichment, and every
test from Client.Rust-015/016/018 that lived on the divergent branch);
and `7de4efe` (adds `stream_alarms` / `acknowledge_alarm` raw SDK methods
on `GatewayClient`, the `AlarmFeedStream` type alias re-exported at the
crate root, two new CLI subcommands `stream-alarms` /
`acknowledge-alarm`, and fixes `event_to_json` to render `MxEventFamily`
as the protobuf enum name so the cross-language e2e parity matcher
recognises events). `8738735` is a README-only update for the new alarm
SDK methods + CLI examples.
`cargo test --workspace` is clean (26 tests across the library,
client-behavior integration suite, and proto-fixture suite). `cargo
clippy --workspace --all-targets -- -D warnings` **fails at HEAD** with
three errors — the same three the previous reviewer flagged in
Client.Rust-021's resolution as "out of scope" (see Client.Rust-029
below): `options.rs:98,143` missing-docs, `galaxy.rs:282` clone-on-copy,
and `session.rs:664` enum-variant-names on `BulkReplyKind`. All three
were resolved at `a020350` (Client.Rust-001/002/012) and regressed when
the crate rename `397d3c5` reverted parts of the prior fix; the diff
under review does not address them.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: the new bulk read/write SDK methods (`read_bulk`, `write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`) silently return `Vec::new()` on malformed-but-OK replies (Client.Rust-022). |
| 2 | mxaccessgw conventions | Issues found: hard-coded correlation IDs in the new CLI subcommands `stream-alarms`, `acknowledge-alarm`, and `bench-read-bulk`'s session-close (Client.Rust-023); clippy regression at HEAD (Client.Rust-029). |
| 3 | Concurrency & thread safety | No issues found — `stream_alarms`/`acknowledge_alarm` go through the existing `unary_request`/`stream_request` path, no new shared mutable state. The `run_batch` stdin loop uses blocking `std::io::Stdin::lock().lines()` on a tokio main task but each dispatched subcommand is spawned on a fresh task so its future does not share the main-thread stack. |
| 4 | Error handling & resilience | No new issues beyond Client.Rust-022's malformed-OK silent default. `bench-read-bulk`'s session close is best-effort and ordered so a bench-loop error does not mask the close (`let _ = close_result;` after `bench_outcome?`). |
| 5 | Security | No issues found — alarm `comment`/`operator` strings flow through the redaction-aware `Error::Display` path; bench's `tags` list is logged in cleartext but tags are not credentials. |
| 6 | Performance & resource management | Issue found: `run_bench_read_bulk` clones the `Vec<String>` tag list on every warmup + steady-state iteration (Client.Rust-026). |
| 7 | Design-document adherence | Issue found: `RustClientDesign.md` is silent on the new alarm SDK methods, the new bulk read/write SDK methods, every new CLI subcommand, and the new `.cargo/config.toml` workaround (Client.Rust-025). |
| 8 | Code organization & conventions | No new issues — the `event_to_json` family-as-enum-name fix is a small, localised change; the bulk-write helpers are factored consistently with the existing bulk-subscribe helpers. |
| 9 | Testing coverage | Issue found: zero tests cover `stream_alarms` on `GatewayClient`, the new bulk read/write SDK methods, or the `BenchReadBulk` flow; the fake gateway's `stream_alarms` impl drops the sender immediately (Client.Rust-024). |
| 10 | Documentation & comments | Issue found: `.cargo/config.toml`'s comment promises "Release builds are unaffected" but the `link-arg=/STACK:8388608` setting is unconditional under `cfg(windows)` and only applies to the MSVC linker (Client.Rust-027). |
## Findings
### Client.Rust-001
@@ -427,3 +473,201 @@ The CLI integration in Client.Rust-014 works either way; this is solely about de
**Recommendation:** Update `RustClientDesign.md:14-33` to describe the actual flat layout: workspace root at `clients/rust/`, top-level crate `zb-mom-ww-mxgateway-client` (declared in `Cargo.toml`) with `src/lib.rs`, `src/client.rs`, etc. directly under it; `crates/mxgw-cli/` is the single member subcrate. Alternatively label the block as "Aspirational nested layout (not currently adopted)" and add a separate "Current layout" section.
**Resolution:** 2026-05-24 — Rewrote the "Crate Layout" section of `clients/rust/RustClientDesign.md` to describe the actual flat layout on disk instead of the aspirational nested form. The section now opens with a short paragraph stating that the workspace is rooted at `clients/rust/`, that `clients/rust/Cargo.toml` declares the top-level crate `zb-mom-ww-mxgateway-client` itself (flat layout — `src/` sits directly under the workspace root, not under `crates/`), and that the only `[workspace.members]` entry is the `mxgw-cli` binary subcrate under `crates/mxgw-cli/`. The accompanying tree replaces the fictitious `crates/zb-mom-ww-mxgateway-client/` subdirectory with the real files: `Cargo.toml`, `Cargo.lock`, `build.rs`, `README.md`, `RustClientDesign.md`, `src/{lib,client,session,galaxy,options,auth,error,value,version,generated}.rs` plus `src/generated/` (tonic-build output), `tests/{client_behavior,proto_fixtures}.rs`, and `crates/mxgw-cli/` annotated as the sole workspace member producing the `mxgw` binary. `cargo build --workspace` and `cargo test --workspace` are clean at HEAD (29 tests pass across the library, integration, and proto-fixture targets). `cargo clippy --workspace --all-targets -- -D warnings` reports three pre-existing errors at HEAD on this dev box (`options.rs:98,143` missing docs, `galaxy.rs:282` `clone_on_copy`, `session.rs:522` `enum_variant_names`) that exist independently of this doc-only change — verified by running clippy with the design-doc edit stashed; tracking those is out of scope for Client.Rust-021.
### Client.Rust-022
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `clients/rust/src/session.rs:369-391,403-420,427-444,452-469,476-493,631-696,706-724` |
| Status | Open |
**Description:** Commit `3251069` re-introduced the bulk read/write SDK methods (`read_bulk`, `write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`) on `Session`. Each method falls back to `Vec::new()` when an OK reply does not carry the expected typed payload arm:
```rust
Ok(match reply.payload {
Some(mx_command_reply::Payload::ReadBulk(reply)) => reply.results,
_ => Vec::new(),
})
```
and `bulk_write_results` does the same for the four write families. A caller of `write_bulk` / `read_bulk` on a malformed-but-OK reply therefore sees an empty result vector and cannot distinguish "zero items processed" from "gateway returned a shapeless reply". This is exactly the anti-pattern Client.Rust-006 fixed in `0d8a28d` (which added an `Error::MalformedReply` variant and made `bulk_results` return `Result<Vec<…>, Error>`), and which Client.Rust-005 fixed for `register_server_handle` / `add_item_handle` / `add_item2_handle` (those three also now fall back to `unwrap_or_default()` returning handle `0` at `session.rs:631-660`, regressing alongside the new methods). The fix lived on the divergent branch and was lost when the d692232 line of history was re-platformed without picking up the post-`a020350` Resolve commits — `Error::MalformedReply` is absent from `clients/rust/src/error.rs` at HEAD.
**Recommendation:** Re-apply the Client.Rust-005/006 resolutions on top of the new methods: add `Error::MalformedReply { detail: String }` back to `error.rs`, change `register_server_handle` / `add_item_handle` / `add_item2_handle` to return `Result<i32, Error>` (yielding `MalformedReply` when the reply lacks an extractable handle), change `bulk_results` and `bulk_write_results` to return `Result<Vec<_>, Error>` (yielding `MalformedReply` on a mismatched / absent payload arm), and route the same fix through the new `read_bulk` inline branch. Re-add the six `…_returns_malformed_reply_…` tests from the Client.Rust-016 resolution to lock the contract in.
### Client.Rust-023
| Field | Value |
|---|---|
| Severity | Low |
| Category | mxaccessgw conventions |
| Location | `clients/rust/crates/mxgw-cli/src/main.rs:835,872,1476` |
| Status | Open |
**Description:** Three CLI subcommands added since `d692232` hard-code their `client_correlation_id`:
```rust
client_correlation_id: "rust-cli-stream-alarms".to_owned(), // line 835
client_correlation_id: "rust-cli-acknowledge-alarm".to_owned(), // line 872
client_correlation_id: "rust-cli-bench-read-bulk-close".to_owned(), // line 1476
```
Every invocation of these subcommands on the same machine — and every iteration of the cross-language e2e matrix that exercises them concurrently — produces requests with identical correlation IDs in gateway logs, defeating the diagnostic value Client.Rust-011/014 unlocked when `next_correlation_id` was made `pub`. (As noted in the d692232 review-row header, the prior `rust-cli-ping`/`rust-cli-close-session` hard-codes are also still in place at lines 506 and 553 after the rename commit reverted the Client.Rust-014 fix; the three new ones extend the same regression to new CLI surface.)
**Recommendation:** Replace the three string literals with calls to the existing `next_correlation_id` helper (already `pub` and intended for raw-RPC consumers): `zb_mom_ww_mxgateway_client::session::next_correlation_id("cli-stream-alarms")`, `next_correlation_id("cli-acknowledge-alarm")`, `next_correlation_id("cli-bench-read-bulk-close")`. While here, restore the same fix at lines 506 and 553 so the CLI surface as a whole shares the library's correlation-id discipline.
### Client.Rust-024
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Testing coverage |
| Location | `clients/rust/tests/client_behavior.rs:405-415`; `clients/rust/src/session.rs:369-493`; `clients/rust/src/client.rs:265-291`; `clients/rust/crates/mxgw-cli/src/main.rs:1310-1505` |
| Status | Open |
**Description:** The diff under review adds substantial SDK and CLI surface with no positive-path coverage:
1. **`GatewayClient::stream_alarms`** (client.rs:280-291) has no test. The fake gateway's `stream_alarms` impl in `tests/client_behavior.rs:408-415` immediately drops the unique sender and returns an empty `ReceiverStream`, so an attached client would observe a clean end-of-stream — nothing exercises the snapshot → `snapshot_complete``transition` sequence that the doc on `AlarmFeedStream` advertises, mid-stream `Status` mapping through `Error::from`, or the cooperative-cancel-on-drop contract.
2. **Five new bulk SDK methods** (`read_bulk`, `write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk` on `Session`) have **zero** tests. The fake gateway's `Invoke` dispatcher does not handle any of `MxCommandKind::ReadBulk`, `WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, so a happy-path round trip would fail at the gateway-side match. The malformed-reply branches in Client.Rust-022 are therefore also untested.
3. **`run_bench_read_bulk`** has no test. The `BenchReadBulkStats` helper that Client.Rust-015/016 added at `a020350` (with its own dedicated unit test) was dropped in the port and there is no replacement — the percentile / mean math, the success-vs-failure accounting, and the JSON schema that `scripts/bench-read-bulk.ps1` collates are all untested.
**Recommendation:** Extend the fake-gateway `Invoke` dispatcher in `tests/client_behavior.rs` with the five new bulk reply arms, add round-trip tests for each (`write_bulk` / `write2_bulk` / `write_secured_bulk` / `write_secured2_bulk` / `read_bulk`), and re-add the six malformed-reply tests from the Client.Rust-016 resolution. Replace the trivial `stream_alarms` stub with one that emits a synthetic `ActiveAlarm``SnapshotComplete``Transition` sequence and assert the client surfaces them in order. For the bench, factor the percentile / accounting helpers out of `run_bench_read_bulk` into a small struct (matching the previous `BenchReadBulkStats`) and add unit tests asserting `latencyMs.{p50,p95,p99,max,mean}` are computed correctly from a hand-built sample.
### Client.Rust-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | `clients/rust/RustClientDesign.md:92-106,142-153,164-171` |
| Status | Open |
**Description:** CLAUDE.md mandates that "When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs ... must change in the same commit." The diff under review adds the following public surface, none of which is reflected in `RustClientDesign.md`:
- **SDK methods** on `GatewayClient`: `stream_alarms` and `acknowledge_alarm`, plus the new `AlarmFeedStream` type alias re-exported at the crate root in `lib.rs:27`.
- **SDK methods** on `Session`: `read_bulk`, `write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`. The `Session` impl block in `RustClientDesign.md:92-106` still lists only the original six bulk-subscribe helpers, identical to its d692232 state (the Client.Rust-019 resolution that added these to the doc lived on the divergent branch and was lost in the `397d3c5` rename).
- **CLI subcommands**: `stream-alarms`, `acknowledge-alarm`, `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`, and `batch`. The `Commands` block in `RustClientDesign.md:166-171` still lists only `version`, `smoke`, `stream-events`, `write`.
- **Error variants**: `Error::MalformedReply` / `Error::Unavailable` / `Error::InvalidEndpoint` are still missing from the `Error` enum sketch at `RustClientDesign.md:142-153` (Client.Rust-017's expansion was also lost in the rename — and Client.Rust-022 will add `MalformedReply` back when fixed).
- **Build configuration**: the new `clients/rust/.cargo/config.toml` Windows 8 MB stack workaround is not mentioned anywhere in the design doc — a reader reproducing the build on a fresh dev box would not know to expect the stack-overflow failure mode the file documents.
**Recommendation:** Bring `RustClientDesign.md` back in sync with the actual implementation. Restore the `Session` API block to enumerate the five new bulk read/write methods alongside the existing six bulk-subscribe helpers (cross-reference Client.Rust-019's recommendation for the correct signatures — no trailing positional `user_id`/`timestamp`/`current_user_id`/`verifier_user_id`, those live on the per-entry structs). Add `stream_alarms` / `acknowledge_alarm` / `AlarmFeedStream` to a new "Alarms" section adjacent to the existing event-stream section. Expand the CLI command list to enumerate every subcommand the binary exposes today. Expand the `Error` enum sketch to match `clients/rust/src/error.rs`. Add a short "Windows build notes" subsection documenting the `.cargo/config.toml` stack workaround and why clap-derive's large `Command` enum needs it.
### Client.Rust-026
| Field | Value |
|---|---|
| Severity | Low |
| Category | Performance & resource management |
| Location | `clients/rust/crates/mxgw-cli/src/main.rs:1402-1406,1419-1423` |
| Status | Open |
**Description:** `run_bench_read_bulk` clones the `tags: Vec<String>` on every iteration of both the warmup loop and the steady-state measurement loop:
```rust
while Instant::now() < warmup_deadline {
let _ = session
.read_bulk(server_handle, tags.clone(), timeout_ms_param)
.await;
}
// ...
while Instant::now() < steady_deadline {
let call_start = Instant::now();
let result = session
.read_bulk(server_handle, tags.clone(), timeout_ms_param)
.await;
let elapsed = call_start.elapsed();
latencies_ms.push(elapsed.as_secs_f64() * 1000.0);
// ...
}
```
Each clone allocates one fresh `Vec<String>` plus `bulk_size` heap-allocated `String` clones (default 6, but the flag goes higher; the cross-language matrix routinely runs with `bulk_size=64` and `bulk_size=256`). The allocation happens **inside** `call_start ... call_start.elapsed()`, so the per-call latency the bench reports includes the clone cost — both inflating the absolute number and adding allocator-noise variance to the percentile distribution that the cross-language driver compares against the .NET/Go/Java/Python bench numbers (none of which allocate the tag list per call).
**Recommendation:** Change `Session::read_bulk` to take `tag_addresses: &[String]` or generic over `AsRef<str>` (as Client.Rust-019's recommendation noted is what `read_bulk` was at `a020350`) so the bench can pass `&tags` once and avoid the per-call clone. Alternatively keep the `Vec<String>` signature but borrow the underlying buffer for the hot loop — e.g. build the bench's payload once outside the steady-state window and re-use it. The current `read_bulk(..., Vec<String>, ...)` shape forces the clone at the call site.
### Client.Rust-027
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/rust/.cargo/config.toml:1-9` |
| Status | Open |
**Description:** The new build-config file added by `71d2c39` carries this leading comment:
```
[target.'cfg(windows)']
# Bump the default 1 MB Windows stack to 8 MB. clap-derive builds a large
# Command enum in this CLI (one variant per subcommand, each carrying flag
# args); in debug builds the enum is materialized on the stack without
# optimization and overflows the default Windows main-thread stack before
# even reaching our code. Release builds are unaffected but the e2e matrix
# drives the CLI through `cargo run` (debug), so the link-arg ships with
# every dev-time invocation.
rustflags = ["-C", "link-arg=/STACK:8388608"]
```
Two issues with the documentation vs the configuration:
1. The comment claims "Release builds are unaffected" but the `rustflags` setting is unconditional under `cfg(windows)``cargo build --release` on Windows also produces a binary whose `IMAGE_OPTIONAL_HEADER.SizeOfStackReserve` is 8 MB. The 8 MB stack reservation is in fact harmless for release builds (the optimizer removes the enum from the stack), but the comment misleads a reader trying to understand whether release artifacts of `mxgw` ship with a non-default header value (they do).
2. `/STACK:` is an MSVC-linker (`link.exe` / `lld-link`) directive; `cargo build` on Windows with `+gnu` targets (mingw) routes link args through the GNU linker which rejects `/STACK:` and instead expects `-Wl,--stack,8388608`. The current file silently breaks `x86_64-pc-windows-gnu` builds — a configuration `docs/ToolchainLinks.md` lists no expectation about, but a reader picking up the file does not see the constraint.
**Recommendation:** Either tighten the `[target.…]` selector to only the MSVC-linker tier (`cfg(all(windows, target_env = "msvc"))`) and re-word the comment to make the release-build behaviour explicit ("the stack reservation goes into the PE header for both debug and release builds; release is unaffected at runtime because the optimizer elides the enum from the stack frame"), or add the `+gnu` variant as a parallel block using `-Wl,--stack,8388608`. As a complementary fix, see Client.Rust-024's recommendation — boxing the largest clap-derive variants would let the stack workaround be retired entirely.
### Client.Rust-028
| Field | Value |
|---|---|
| Severity | Low |
| Category | mxaccessgw conventions |
| Location | `clients/rust/crates/mxgw-cli/src/main.rs:1126-1166` |
| Status | Open |
**Description:** `run_batch` reads commands from stdin with the blocking `std::io::Stdin::lock().lines()` iterator while the surrounding function is `async fn` and the runtime is `#[tokio::main]` (multi-threaded by default). Each `for line in stdin.lock().lines()` iteration pins one of tokio's worker threads on a blocking syscall (`ReadFile` on Windows), then spawns the dispatch as a separate `tokio::task::spawn(dispatch(cli.command))` — which itself runs on another worker thread — and awaits its `JoinHandle`. The pattern works for a single-threaded harness driver but has two latent problems:
1. **Empty-line sentinel**: `if line.is_empty() { break; }` exits the loop on the first empty line, terminating the batch session. The comment block above declares this as a feature ("stdin EOF (or an empty line) ends the session") but an accidental blank line from any driver — for example a `Write-Output ""` between commands in the PowerShell e2e matrix — silently ends the session and the harness has no way to distinguish "end-of-input" from "the CLI broke after the previous command". The other four language CLIs that implement `batch` use only EOF, not an empty-line sentinel.
2. **Blocking on a tokio worker**: any non-trivial latency in `dispatch` (e.g. a 30-second `bench-read-bulk`) keeps `run_batch` parked on its `JoinHandle.await`, which is fine; but if `dispatch` ever needs to drive the stdin reader (it doesn't today, but the `Batch` subcommand is the natural seam for future "stream input to a subcommand"), the blocking `Stdin::lock()` on the main task would deadlock the runtime worker.
**Recommendation:** Replace the empty-line break with an EOF-only terminator (`if line.trim().is_empty() { println!("{BATCH_EOR}"); stdout.lock().flush().ok(); continue; }`) so accidental blank lines log an empty-EOR-bracketed result instead of ending the session. Optionally wrap the stdin reader in `tokio::task::spawn_blocking` so the runtime can move it onto a dedicated blocking-pool thread; the current shape works for today's harness contract but is brittle if the dispatch future ever needs to share the runtime with stdin reads.
### Client.Rust-029
| Field | Value |
|---|---|
| Severity | High |
| Category | mxaccessgw conventions |
| Location | `clients/rust/src/options.rs:98,143`; `clients/rust/src/galaxy.rs:282`; `clients/rust/src/session.rs:664-671` |
| Status | Open |
**Description:** `cargo clippy --workspace --all-targets -- -D warnings` fails at HEAD `42b0037` with three errors that the prior d692232 reviewer noted as "out of scope for Client.Rust-021" but did not open as a tracked finding:
```
error: missing documentation for a method
--> src\options.rs:98:5
|
98 | pub fn with_max_grpc_message_bytes(mut self, max_grpc_message_bytes: usize) -> Self {
error: missing documentation for a method
--> src\options.rs:143:5
|
143 | pub fn max_grpc_message_bytes(&self) -> usize {
error: using `clone` on type `Option<Timestamp>` which implements the `Copy` trait
--> src\galaxy.rs:282:24
error: all variants have the same postfix: `Bulk`
--> src\session.rs:664:1
|
664 | / enum BulkReplyKind {
665 | | AddItemBulk,
666 | | AdviseItemBulk,
667 | | RemoveItemBulk,
```
All three were resolved at `a020350` (Client.Rust-001, Client.Rust-002, Client.Rust-012 respectively) but regressed when `397d3c5` re-platformed parts of the prior fix and were never re-fixed. CLAUDE.md explicitly mandates that `cargo clippy --workspace --all-targets -- -D warnings` pass cleanly; the documented build/test workflow for the module breaks. The prior reviewer correctly noted the regressions but routed them as "out of scope for Client.Rust-021", which left them invisible to the open-findings count. They are tracked here so they appear in the pending-findings index and are not lost again.
The third error (`BulkReplyKind` enum-variant-names) is also touched by the diff under review: commit `3251069` added a sibling enum `BulkWriteReplyKind` at `session.rs:699-704` whose variants (`Write`, `Write2`, `WriteSecured`, `WriteSecured2`) do not share a suffix and so do not trip the lint — but the pattern is now duplicated. A fix should rename both enums consistently or apply the same scoped `#[allow(clippy::enum_variant_names)]` reason-annotated allow to both.
**Recommendation:** Re-apply Client.Rust-001 (add doc comments on `with_max_grpc_message_bytes` / `max_grpc_message_bytes` in `options.rs`), Client.Rust-002 (drop the `Bulk` suffix from `BulkReplyKind`'s variants so they become `AddItem` / `AdviseItem` / …, or add a narrowly-scoped `#[allow(clippy::enum_variant_names)]` with a reason comment), and Client.Rust-012 (replace `last_deploy.lock().unwrap().clone()` with `*last_deploy.lock().unwrap()` in `galaxy.rs:282`). Verify with `cargo clippy --workspace --all-targets -- -D warnings`. Consider adding a pre-commit / CI gate so the next reviewer never has to discover the regression by running clippy.
+37 -2
View File
@@ -5,8 +5,8 @@
| Module | `src/ZB.MOM.WW.MxGateway.Contracts` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 0 |
## Checklist coverage
@@ -306,3 +306,38 @@ Python and Go descriptors. No fields renumbered or repurposed.
**Recommendation:** Extend the RPC comment with one line: "`QueryActiveAlarmsRequest.alarm_filter_prefix` optionally narrows the snapshot to alarms whose `alarm_full_reference` starts with the given prefix; an empty prefix returns the full set." Comment-only.
**Resolution:** _(2026-05-24)_ Confirmed against `mxaccess_gateway.proto` at `d2d2e5f`: lines 29-34 of the `rpc QueryActiveAlarms` block describe the snapshot order and session-less origin but never mention `alarm_filter_prefix`, which is only documented on the request message itself two lines below — a client author reading the RPC comment alone cannot discover the filter capability. Extended the RPC comment with the finding's recommended one-line addition verbatim: "`QueryActiveAlarmsRequest.alarm_filter_prefix` optionally narrows the snapshot to alarms whose `alarm_full_reference` starts with the given prefix; an empty prefix returns the full set." Comment-only change; the RPC signature, request/reply types, and field numbers are unchanged. The same regression test added under Contracts-016 (`ProtobufContractRoundTripTests.QueryActiveAlarmsRequest_PinsFieldNumbersAndRoundTripsPrefixFilter`) also exercises a round-trip of `QueryActiveAlarmsRequest` with `alarm_filter_prefix` populated and pins `QueryActiveAlarms` on the public `MxAccessGateway` service surface, so the RPC + filter field combination the new comment documents is wire-stable. `dotnet build src/ZB.MOM.WW.MxGateway.slnx` is green (0 warnings, 0 errors).
#### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037` scoped to the contract changes since `d692232`.
The window contains exactly one Contracts commit (`bd1d1f1`, "Resolve
Contracts-016..017"), which is a comment-only edit to
`mxaccess_gateway.proto`: the `rpc QueryActiveAlarms` block gains the
`alarm_filter_prefix` description sentence (Contracts-017 resolution) and
the `QueryActiveAlarmsRequest` header comment replaces the misleading
"reserved" wording with the unambiguous "Clients may leave `session_id`
empty; the gateway currently ignores it…" phrasing (Contracts-016
resolution). The two `Generated/*.cs` edits in the same window are pure
regeneration of those comments into XML doc — no wire-format,
field-number, or type changes. The `StreamAlarms` / `AcknowledgeAlarm`
alarm surface mentioned in the re-review brief was last touched at
`397d3c5` (already covered in the `d692232` pass) and is unchanged in this
window; field numbering across all three protos remains
additive-only with no reuse, renumbering, or type narrowing.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found. The two comment changes are factually correct against the source (`MxAccessGatewayService.QueryActiveAlarms` does ignore `session_id` and does apply the `alarm_filter_prefix` `StartsWith` match). |
| 2 | mxaccessgw conventions | No issues found. The wire-compatibility policy comment block at the top of `mxaccess_gateway.proto` (Contracts-005 resolution) remains intact; the edits are additive comment-only changes consistent with the additive-only contract evolution rule. Generated code under `Generated/` regenerated cleanly from the source comments, not hand-edited. |
| 3 | Concurrency & thread safety | N/A — pure contract definitions and a static constants class. |
| 4 | Error handling & resilience | No issues found. |
| 5 | Security | No issues found — no new credential-bearing fields in the window. |
| 6 | Performance & resource management | No issues found. |
| 7 | Design-document adherence | No drift. `docs/AlarmClientDiscovery.md` already documents the `alarm_filter_prefix` filter and the session-less central-monitor cache; the new RPC-comment line restates what the doc already says, in line with the CLAUDE.md "Update docs in the same change as the source" rule (docs were already current at `d692232`, source caught up at `bd1d1f1`). |
| 8 | Code organization & conventions | No issues found. Field numbers and type names unchanged; additive-only invariant preserved. |
| 9 | Testing coverage | No issues found. `ProtobufContractRoundTripTests.QueryActiveAlarmsRequest_PinsFieldNumbersAndRoundTripsPrefixFilter` (added under Contracts-016) pins the three `QueryActiveAlarmsRequest` field numbers and round-trips an `alarm_filter_prefix`-bearing payload; comment-only proto changes do not need additional test coverage. |
| 10 | Documentation & comments | No issues found. The two edits are themselves documentation fixes that close prior Contracts-016 and Contracts-017 findings. |
Re-review: no new findings. Open finding count remains 0. All seventeen
recorded Contracts findings (Contracts-001..017) remain closed
(Resolved / Won't Fix).
+46 -3
View File
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.MxGateway.IntegrationTests` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 1 |
## Checklist coverage
@@ -59,6 +59,31 @@ a category produced nothing rather than leaving it blank.
| 9 | Testing coverage | Issues found: IntegrationTests-005 (thin MXAccess parity coverage), IntegrationTests-006 (thin LDAP failure-path coverage). |
| 10 | Documentation & comments | Issue found: IntegrationTests-009 (`TestServerCallContext` mislabelled "Mock"). |
### 2026-05-24 re-review (commit `42b0037`)
Re-review pass at `42b0037` scoped to commit `865c22a` (the only commit
touching `src/ZB.MOM.WW.MxGateway.IntegrationTests` between `d692232` and
`42b0037` — all later commits in the range affect dashboard/clients only).
`865c22a` resolved IntegrationTests-022..024: the
`ResolveRepositoryRoot` throw replaces the silent
`Directory.GetCurrentDirectory()` fallback and ships a regression test;
`DashboardLdapLiveTests` adds the `Role: Admin` claim assertion; and
`NullDashboardEventBroadcaster` is extracted into
`TestSupport/NullDashboardEventBroadcaster.cs`.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issue found: IntegrationTests-025 (the new `ResolveRepositoryRoot_NoMarkers_…` test seeds an "isolated" directory under `Path.GetTempPath()` and walks parents, implicitly assuming no ancestor of the temp dir contains `src/` plus `.git`/`*.sln`/`*.slnx`; the assumption is environment-dependent and can flap on machines whose `TMP` is redirected). |
| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, `sealed` types, accurate XML docs, env-var names preserved. |
| 3 | Concurrency & thread safety | No issues found in this diff. |
| 4 | Error handling & resilience | No issues found. The new `InvalidOperationException` only fires from the live-MXAccess worker-exe resolution path, which is gated by `MXGATEWAY_RUN_LIVE_MXACCESS_TESTS`; non-live tests are unaffected. |
| 5 | Security | No issues found — the new live-LDAP claim assertion adds no new credential surface. |
| 6 | Performance & resource management | No issues found. |
| 7 | Design-document adherence | No issues found. The diff is internal behavior hardening + a private-claim assertion; no public API / env-var / opt-in surface changes, so the CLAUDE.md "update docs in the same change" rule does not require a `docs/GatewayTesting.md` edit. |
| 8 | Code organization & conventions | No issues found. `NullDashboardEventBroadcaster`'s placement under `TestSupport/` matches the unit-test project layout; the deliberate duplication is documented in IntegrationTests-024's resolution. |
| 9 | Testing coverage | No issues found — the new role-claim assertion in `DashboardLdapLiveTests` and the new exception-path test in `IntegrationTestEnvironmentTests` both add coverage; no further gaps surface from this diff. |
| 10 | Documentation & comments | No issues found. XML docs on `ResolveRepositoryRoot` accurately describe the new throw, and the inline IntegrationTests-023 comment correctly explains *why* the role claim is asserted. |
### 2026-05-24 review (commit d692232)
Re-review pass at `d692232` scoped to: the rename (`dc9c0c9`) of the
@@ -461,3 +486,21 @@ The Write parity test (IntegrationTests-012's resolution) added exactly this ass
**Recommendation:** When Tests-025 lands the shared `TestSupport/NullDashboardEventBroadcaster.cs`, either reference the same shared helper from this project (a project reference if practical) or accept the duplication as a deliberate isolation between the unit-test and integration-test trees. Either choice is fine; the current state is the only one that should not persist.
**Resolution:** Resolved 2026-05-24 — Chose option (b) (extract within IntegrationTests) so the unit-test and integration-test projects stay independently buildable without a shared test-helpers project. Moved the inline private class out of `WorkerLiveMxAccessSmokeTests.cs` into a new public type at `src/ZB.MOM.WW.MxGateway.IntegrationTests/TestSupport/NullDashboardEventBroadcaster.cs` (namespace `ZB.MOM.WW.MxGateway.IntegrationTests.TestSupport`), mirroring the layout the unit-test Tests project established for Tests-007/Tests-021. The fixture wires through the same `NullDashboardEventBroadcaster.Instance` singleton, but the constructor is now private so future integration tests can rely on a single shared no-op. A new `using ZB.MOM.WW.MxGateway.IntegrationTests.TestSupport;` in `WorkerLiveMxAccessSmokeTests.cs` is the only change to its `EventStreamService` wiring. Build is green (0 warnings, 0 errors); no regression in the existing tests (4 passed / 15 skipped, identical to the pre-change baseline).
### IntegrationTests-025
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | `src/ZB.MOM.WW.MxGateway.IntegrationTests/IntegrationTestEnvironmentTests.cs:57-84` (`ResolveRepositoryRoot_NoMarkers_ThrowsInvalidOperationExceptionNamingStartAndMarkers`) |
| Status | Open |
**Description:** The new regression test for IntegrationTests-022 builds an "isolated" start directory under `Path.GetTempPath()` (e.g. `C:\Users\<user>\AppData\Local\Temp\<random>\nested` on Windows) and calls `ResolveRepositoryRoot(isolatedStart)`, asserting an `InvalidOperationException` is thrown. The walker walks every parent — `<random>`, `Temp`, `Local`, `AppData`, `<user>`, `Users`, `C:\` — and stops only when it either finds a repository root marker or runs out of parents. The test silently assumes none of those ancestor directories satisfies `IsRepositoryRoot` (a `src/` subdirectory next to `.git` / `*.sln` / `*.slnx`). The assumption is environment-dependent:
1. `Path.GetTempPath()` honors the `TMP` / `TEMP` environment variable. A developer or CI runner that redirects `TMP` to a path under a checkout (e.g. `D:\work\repo\tmp`) — or simply has another git checkout at `C:\` (`C:\src` with a `.git`, common on Windows build agents) — will see the walker stop early at the real repo root and return that path instead of throwing. The `Assert.Throws<InvalidOperationException>` then fails with a misleading "No exception was thrown" message that does not name the path the walker actually returned.
2. The sibling test `ResolveRepositoryRoot_AcceptsGitWorktreeFile` (line 25) is unaffected because it walks from a directory it controls upward into an enclosing directory it also controls — the worktree file marker is encountered before any ancestor of `Path.GetTempPath()` is reached. The new test has the opposite shape: it walks past every directory the test owns into directories it does not control.
The current dev box layout (`C:\Users\dohertj2\Desktop\mxaccessgw`) is safe because Temp is at `C:\Users\dohertj2\AppData\Local\Temp` and the walker exits at `C:\` without ever encountering `src/`. The fragility is invisible on this machine and only surfaces if the test ever runs in CI / on a contributor box with a less hermetic file-system layout.
**Recommendation:** Isolate the walker from any ambient ancestor by either (a) constructing an `isolatedRoot` directly under a drive root and pointing the walker at a chain entirely under it (e.g. create `<isolatedRoot>\level1\level2\level3` and start the walk at `level3`, then assert the throw — the walker stops at the drive root regardless of what is on it), (b) refactoring `ResolveRepositoryRoot` to accept an injectable `stopBoundary` parameter for tests and pass `isolatedRoot` as the boundary, or (c) replacing the `Assert.Throws` shape with an explicit upward-walk check that the test owns. Option (a) is the smallest change: prepend a sentinel — e.g. create a dummy `<isolatedRoot>\sentinel-no-markers` and assert nothing about Temp ancestors — and pass the test only when the walker reaches that sentinel without finding a marker. The current shape is acceptable on the documented dev box but should not be the sole regression coverage for IntegrationTests-022.
+54 -12
View File
@@ -10,23 +10,65 @@ Each module's `findings.md` is the source of truth; this file is generated from
| Module | Reviewer | Date | Commit | Status | Open | Total |
|---|---|---|---|---|---|---|
| [Client.Dotnet](Client.Dotnet/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 17 |
| [Client.Go](Client.Go/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 21 |
| [Client.Java](Client.Java/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 31 |
| [Client.Python](Client.Python/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 21 |
| [Client.Rust](Client.Rust/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 21 |
| [Contracts](Contracts/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 17 |
| [IntegrationTests](IntegrationTests/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 24 |
| [Server](Server/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 43 |
| [Tests](Tests/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 26 |
| [Worker](Worker/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 25 |
| [Worker.Tests](Worker.Tests/findings.md) | Claude Code | 2026-05-24 | `d692232` | Reviewed | 0 | 30 |
| [Client.Dotnet](Client.Dotnet/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 4 | 21 |
| [Client.Go](Client.Go/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 6 | 27 |
| [Client.Java](Client.Java/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 5 | 36 |
| [Client.Python](Client.Python/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 5 | 26 |
| [Client.Rust](Client.Rust/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 8 | 29 |
| [Contracts](Contracts/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 0 | 17 |
| [IntegrationTests](IntegrationTests/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 1 | 25 |
| [Server](Server/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 7 | 50 |
| [Tests](Tests/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 5 | 31 |
| [Worker](Worker/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 0 | 25 |
| [Worker.Tests](Worker.Tests/findings.md) | Claude Code | 2026-05-24 | `42b0037` | Re-reviewed | 0 | 30 |
## Pending findings
Findings with status `Open` or `In Progress`, ordered by severity.
_No pending findings._
| ID | Severity | Category | Location | Description |
|---|---|---|---|---|
| Client.Java-032 | High | Documentation & comments | `clients/java/README.md:182-183` | Commit `8738735` ("clients: document StreamAlarms + AcknowledgeAlarm in each README") added two new gradle invocations to the CLI Usage block: ``` gradle :zb-mom-ww-mxgateway-cli:run --args="stream-alarms --endpoint localhost:5000 --api-ke… |
| Client.Python-022 | High | Documentation & comments | `clients/python/README.md:201-202`, `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:389-420` | The README CLI examples added by commit `8738735` for the new alarm subcommands cite flags the CLI does not accept: ``` mxgw-py stream-alarms --session-id <id> --max-messages 1 --json mxgw-py acknowledge-alarm --session-id <id> --alarm-ref… |
| Client.Rust-029 | High | mxaccessgw conventions | `clients/rust/src/options.rs:98,143`; `clients/rust/src/galaxy.rs:282`; `clients/rust/src/session.rs:664-671` | `cargo clippy --workspace --all-targets -- -D warnings` fails at HEAD `42b0037` with three errors that the prior d692232 reviewer noted as "out of scope for Client.Rust-021" but did not open as a tracked finding: ``` error: missing documen… |
| Client.Dotnet-018 | Medium | Documentation & comments | `clients/dotnet/README.md:137-138` | The README example block for the two new alarm CLI subcommands shipped in commit `11cc671` shows: ``` mxgw-dotnet stream-alarms --session-id <id> --max-messages 1 --json mxgw-dotnet acknowledge-alarm --session-id <id> --alarm-reference "\\… |
| Client.Go-022 | Medium | Code organization & conventions | `clients/go/cmd/mxgw-go/main.go:398-412,417-519` | Commit `8aaab82` ("Go client: port bulk read/write SDK methods + CLI subcommands") re-introduces every symptom that Client.Go-015 documented and was marked Resolved against an earlier commit: - `runWriteBulkVariant(ctx, args, stdout, stder… |
| Client.Go-023 | Medium | Concurrency & thread safety | `clients/go/cmd/mxgw-go/main.go:604-606,616-632` | `runBenchReadBulk`'s warm-up and steady-state loops are wall-clock-only again: ```go warmupDeadline := time.Now().Add(time.Duration(*warmupSeconds) * time.Second) timeout := time.Duration(*timeoutMs) * time.Millisecond for time.Now().Befor… |
| Client.Java-033 | Medium | Correctness & logic bugs | `clients/java/zb-mom-ww-mxgateway-cli/src/main/java/com/zb/mom/ww/mxgateway/cli/MxGatewayCli.java:1078-1098` | `StreamAlarmsCommand.call()` allocates a bounded `ArrayBlockingQueue<Object>(1024)` and the gRPC observer publishes each `AlarmFeedMessage` via `queue.offer(value)`: ``` BlockingQueue<Object> queue = new ArrayBlockingQueue<>(1024); … @Over… |
| Client.Java-034 | Medium | Correctness & logic bugs | `clients/java/zb-mom-ww-mxgateway-cli/src/main/java/com/zb/mom/ww/mxgateway/cli/MxGatewayCli.java:182-198` | `BatchCommand.call()` reads one CLI invocation per stdin line and tokenises with: ``` String[] args = line.trim().split("\\s+"); … int exitCode = cmd.execute(args); ``` `split("\\s+")` does no shell-quoting parsing — it just splits on whit… |
| Client.Python-023 | Medium | Security | `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:901-906` | Client.Python-013 (severity Medium, Security) was marked |
| Client.Python-024 | Medium | Code organization & conventions | `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:13,48-119` | The new `batch` subcommand (commit `71d2c39`) implements the cross-language batch protocol by importing `click.testing.CliRunner` into production code and calling `runner.invoke(main, args, catch_exceptions=True)` in a `for raw_line in sys… |
| Client.Rust-022 | Medium | Correctness & logic bugs | `clients/rust/src/session.rs:369-391,403-420,427-444,452-469,476-493,631-696,706-724` | Commit `3251069` re-introduced the bulk read/write SDK methods (`read_bulk`, `write_bulk`, `write2_bulk`, `write_secured_bulk`, `write_secured2_bulk`) on `Session`. Each method falls back to `Vec::new()` when an OK reply does not carry the… |
| Client.Rust-024 | Medium | Testing coverage | `clients/rust/tests/client_behavior.rs:405-415`; `clients/rust/src/session.rs:369-493`; `clients/rust/src/client.rs:265-291`; `clients/rust/crates/mxgw-cli/src/main.rs:1310-1505` | The diff under review adds substantial SDK and CLI surface with no positive-path coverage: 1. **`GatewayClient::stream_alarms`** (client.rs:280-291) has no test. The fake gateway's `stream_alarms` impl in `tests/client_behavior.rs:408-415`… |
| Server-044 | Medium | Correctness & logic bugs | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-254` | `KillWorkerAsync` is the mirror of `CloseSessionCoreAsync` for the new admin-only Kill flow, but its catch path leaks the `mxgateway.sessions.open` gauge — the exact bug that Server-006 closed for `OpenSessionAsync`. The happy path increme… |
| Tests-027 | Medium | Concurrency & thread safety | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:199-240`, `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:8,73,246-251` | The review brief explicitly flagged `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` as a known flake that "passed solo on rerun". The root cause is the `MeterListener` subscribes by `instrument.Meter.Name… |
| Client.Dotnet-019 | Low | Correctness & logic bugs | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:745` | Client.Dotnet-005 / 010 documented (and recorded as resolved) the silent register-handle fallback pattern `reply.Register?.ServerHandle ?? reply.ReturnValue.Int32Value`, where a successful protocol+MX-status reply missing its typed `regist… |
| Client.Dotnet-020 | Low | Error handling & resilience | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:792-810`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:774-780` | `BenchReadBulkAsync`'s steady-state `while (DateTime.UtcNow < steadyDeadline)` loop wraps each `client.InvokeAsync(...)` in a bare `catch`: ```csharp try { reply = await client.InvokeAsync( CreateCommandRequest(sessionId, readBulkMxCommand… |
| Client.Dotnet-021 | Low | Correctness & logic bugs | `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:487`, `clients/dotnet/ZB.MOM.WW.MxGateway.Client.Cli/MxGatewayClientCli.cs:715` | Both new bulk-read CLI handlers cast a signed `--timeout-ms` argument to `uint` without bounds checking: ```csharp // ReadBulkAsync (line 487) TimeoutMs = (uint)arguments.GetInt32("timeout-ms", 0), // BenchReadBulkAsync (line 715) uint tim… |
| Client.Go-024 | Low | Testing coverage | `clients/go/mxgateway/session.go:395-525`, `clients/go/mxgateway/alarms.go:65-76` | The five new bulk SDK methods on `Session` and the new `Client.StreamAlarms` method have **no unit tests** in `clients/go/mxgateway/`: - `Session.WriteBulk` (`session.go:395`) - `Session.Write2Bulk` (`session.go:418`) - `Session.WriteSecur… |
| Client.Go-025 | Low | Correctness & logic bugs | `clients/go/mxgateway/session.go:395-485,495-525` | The five new bulk methods (`WriteBulk`, `Write2Bulk`, `WriteSecuredBulk`, `WriteSecured2Bulk`, `ReadBulk`) each guard with `if entries == nil { return error }` and an upper-bound `ensureBulkSize` check, but accept a non-nil empty slice (e.… |
| Client.Go-026 | Low | Error handling & resilience | `clients/go/cmd/mxgw-go/main.go:1196-1222` | `runBatch` reads command lines with a default `bufio.Scanner`: ```go scanner := bufio.NewScanner(in) for scanner.Scan() { ... } return scanner.Err() ``` The default `bufio.Scanner` token size is 64 KiB (`bufio.MaxScanTokenSize`). One long… |
| Client.Go-027 | Low | Code organization & conventions | `clients/go/cmd/mxgw-go/main.go:1195-1206` | `runBatch`'s doc-comment says the loop "never terminates on command error; only stdin EOF (or an empty line) ends the session", and the implementation matches: ```go for scanner.Scan() { line := scanner.Text() if line == "" { break } ... }… |
| Client.Java-035 | Low | Testing coverage | `clients/java/zb-mom-ww-mxgateway-client/src/test/java/com/zb/mom/ww/mxgateway/client/MxGatewayClientSessionTests.java` | Commit `8a0c59d` added `MxGatewayClient.streamAlarms(StreamAlarmsRequest, StreamObserver<AlarmFeedMessage>)` and a new public `MxGatewayAlarmFeedSubscription` class. No library-side test exercises either: a grep for `streamAlarms` across `… |
| Client.Java-036 | Low | Code organization & conventions | `clients/java/zb-mom-ww-mxgateway-client/src/main/java/com/zb/mom/ww/mxgateway/client/MxGatewayAlarmFeedSubscription.java`, `MxGatewayEventSubscription.java`, `MxGatewayActiveAlarmsSubscription.java`, `DeployEventSubscription.java` | `MxGatewayAlarmFeedSubscription` is a structural near-copy of `MxGatewayEventSubscription` — same `AtomicReference<ClientCallStreamObserver<…>>` + `AtomicBoolean cancelled` field shape, the same `wrap(observer)` returning a `ClientResponse… |
| Client.Python-025 | Low | Testing coverage | `clients/python/tests/test_cli.py`, `clients/python/src/zb_mom_ww_mxgateway/{client.py,session.py}`, `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py` | Commits `6add4b4` and `828e3e6` added five new SDK methods (`Session.read_bulk`, `Session.write_bulk`, `Session.write2_bulk`, `Session.write_secured_bulk`, `Session.write_secured2_bulk`), `GatewayClient.stream_alarms`, the helper `_canceli… |
| Client.Python-026 | Low | Correctness & logic bugs | `clients/python/src/zb_mom_ww_mxgateway_cli/commands.py:674-738` | Two minor quality issues in the new `_bench_read_bulk` body (commit `6add4b4`): 1. `import time` is done inside the function body (line 676) rather than at module top. `PythonStyleGuide.md` does not state this explicitly, but every other h… |
| Client.Rust-023 | Low | mxaccessgw conventions | `clients/rust/crates/mxgw-cli/src/main.rs:835,872,1476` | Three CLI subcommands added since `d692232` hard-code their `client_correlation_id`: ```rust client_correlation_id: "rust-cli-stream-alarms".to_owned(), // line 835 client_correlation_id: "rust-cli-acknowledge-alarm".to_owned(), // line 87… |
| Client.Rust-025 | Low | Design-document adherence | `clients/rust/RustClientDesign.md:92-106,142-153,164-171` | CLAUDE.md mandates that "When public APIs, contracts, configuration, build steps, security behavior, event shapes, value conversion, status mapping, or lifecycle rules change, the affected docs ... must change in the same commit." The diff… |
| Client.Rust-026 | Low | Performance & resource management | `clients/rust/crates/mxgw-cli/src/main.rs:1402-1406,1419-1423` | `run_bench_read_bulk` clones the `tags: Vec<String>` on every iteration of both the warmup loop and the steady-state measurement loop: ```rust while Instant::now() < warmup_deadline { let _ = session .read_bulk(server_handle, tags.clone(),… |
| Client.Rust-027 | Low | Documentation & comments | `clients/rust/.cargo/config.toml:1-9` | The new build-config file added by `71d2c39` carries this leading comment: ``` [target.'cfg(windows)'] # Bump the default 1 MB Windows stack to 8 MB. clap-derive builds a large # Command enum in this CLI (one variant per subcommand, each c… |
| Client.Rust-028 | Low | mxaccessgw conventions | `clients/rust/crates/mxgw-cli/src/main.rs:1126-1166` | `run_batch` reads commands from stdin with the blocking `std::io::Stdin::lock().lines()` iterator while the surrounding function is `async fn` and the runtime is `#[tokio::main]` (multi-threaded by default). Each `for line in stdin.lock().… |
| IntegrationTests-025 | Low | Correctness & logic bugs | `src/ZB.MOM.WW.MxGateway.IntegrationTests/IntegrationTestEnvironmentTests.cs:57-84` (`ResolveRepositoryRoot_NoMarkers_ThrowsInvalidOperationExceptionNamingStartAndMarkers`) | The new regression test for IntegrationTests-022 builds an "isolated" start directory under `Path.GetTempPath()` (e.g. `C:\Users\<user>\AppData\Local\Temp\<random>\nested` on Windows) and calls `ResolveRepositoryRoot(isolatedStart)`, asser… |
| Server-045 | Low | Concurrency & thread safety | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:225,242-245`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:837-841` | `KillWorkerAsync` reads `session.State` once into a local `bool wasClosed` (line 225) before calling `session.KillWorker(reason)`. The read is unsynchronized — `State` is a getter that takes `_syncRoot` internally so the read itself is saf… |
| Server-046 | Low | Error handling & resilience | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:286-307` | `ShutdownAsync` was updated to fall back to `KillWorker` when `CloseSessionCoreAsync` throws (lines 294-305) — a useful resilience improvement on its own. But the fallback's bookkeeping is wrong: `session.KillWorker(GatewayShutdownReason)`… |
| Server-047 | Low | Code organization & conventions | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:324-334`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionsPage.razor:171-195`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionDetailsPage.razor:231-255` | The shared `ConfirmDialog.razor` (added in `0e56b5b` / `24cc5fd`) is wired by three pages, but the pages handle `PendingAction` cleanup inconsistently: - `ApiKeysPage.ConfirmPendingAsync` captures the action, sets `PendingAction = null` sy… |
| Server-048 | Low | Testing coverage | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:463-498` | The two new `KillWorkerAsync_*` tests cover the happy path (`KillWorkerAsync_KillsWorkerAndRemovesSession`) and the missing-session error (`KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound`). Three behaviorally distinct cases are m… |
| Server-049 | Low | Documentation & comments | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs:5-18`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:8-25` | `IDashboardSessionAdminService` declares three members — `CanManage`, `CloseSessionAsync`, `KillWorkerAsync` — none of which carry XML documentation. `DashboardSessionAdminService.CanManage` and the two operation methods are also undocumen… |
| Server-050 | Low | Error handling & resilience | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:42-75,92-125` | `CloseSessionAsync` and `KillWorkerAsync` catch only `SessionManagerException` (the `SessionNotFound` filter, then a general `SessionManagerException` catch). Anything else propagates raw to Blazor's error boundary. The propagation paths e… |
| Tests-028 | Low | Testing coverage | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:466-496,802-807`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-253` | The new `KillWorkerAsync_KillsWorkerAndRemovesSession` (line 466) and `KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound` (line 486) pin the new kill-path entry, but they do not pin the `reason` argument propagating through the chai… |
| Tests-029 | Low | Error handling & resilience | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs:61-106,139-222`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:77-125` | The new `DashboardSessionAdminServiceTests` covers the happy path and the viewer-denial path for both `CloseSessionAsync` and `KillWorkerAsync`, plus `CloseSessionAsync_WhenSessionMissing_ReportsFriendlyError` for the close-side `SessionNo… |
| Tests-030 | Low | Testing coverage | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs:115-163`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:146-177` | The three new `DeleteAsync_*` fixtures cover unauthorised user, success path with audit, and store-refuses-with-friendly-error. They do not exercise two production behaviours: (1) `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` (line 1… |
| Tests-031 | Low | Concurrency & thread safety | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs:22-61` | `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` records `startedAt = DateTimeOffset.UtcNow` *before* calling `publisher.StartAsync(...)`, then asserts `secondSubscribeAt - startedAt >= reconnectDelay - 10ms` (line 59). Th… |
## Closed findings
+140 -3
View File
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.MxGateway.Server` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 7 |
## Checklist coverage
@@ -69,6 +69,34 @@ findings (Server-001 through Server-032) are unchanged by this pass.
| 9 | Testing coverage | Issues found: Server-037 (no test for the corrupt-snapshot restore path or for `PersistSnapshot = false` at the cache level). |
| 10 | Documentation & comments | No issues found — XML docs match behavior; the `GalaxyRepository.md` "On-disk snapshot" section documents the Stale-on-restore lifecycle. |
### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037` scoped to the dashboard destructive-action wave on
top of `d692232`. Seven commits in range: `c5e7479`/`0e56b5b` add the admin-only
Close/Kill flow (new `ISessionManager.KillWorkerAsync`, new
`IDashboardSessionAdminService`, the shared `ConfirmDialog.razor`); `24cc5fd`
adds `IApiKeyAdminStore.DeleteAsync` + the dashboard Delete action on revoked
keys; `c5153d6` chains `base.OnInitializedAsync()` on `ApiKeysPage`; `de7639a`
removes the legacy `MapGet("/", ...)` redirect that was colliding with the
Blazor `@page "/"` (a real 500); `42b0037` is the cosmetic `@using` switch on
`GalaxyPage`/`SessionDetailsPage`. Tests added: `DashboardSessionAdminServiceTests`,
`DashboardSnapshotPublisherTests`, `HubTokenServiceTests`, two new
`SessionManagerTests.KillWorkerAsync_*` cases, plus the API-key-management
delete-path tests.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Issues found: Server-044 (`SessionManager.KillWorkerAsync` catch-path leaks the `mxgateway.sessions.open` gauge — mirror of Server-006 but for the kill path). |
| 2 | mxaccessgw conventions | No issues found — file-scoped namespaces, `sealed` by default, `Async` suffix, primary constructors used by new types (`DashboardSessionAdminService`); no Blazor UI component libraries pulled in (`ConfirmDialog.razor` uses local Bootstrap classes only); no secrets logged. |
| 3 | Concurrency & thread safety | Issues found: Server-045 (`KillWorkerAsync` reads `session.State` without synchronization; two concurrent kills can both observe `wasClosed=false` and double-increment the `sessions.closed` counter). |
| 4 | Error handling & resilience | Issues found: Server-046 (`SessionManager.ShutdownAsync`'s `KillWorker` fallback doesn't call `_metrics.SessionClosed()` — gauge leaks for every session whose graceful close throws), Server-050 (`DashboardSessionAdminService` only catches `SessionManagerException`; any other exception from `RemoveSessionAsync`/`session.DisposeAsync` propagates raw into Blazor). |
| 5 | Security | No issues found — `DashboardSessionAdminService.CanManage` requires `DashboardRoles.Admin` and the Razor pages gate the Close/Kill buttons on it; audit-log events `dashboard-close-session` / `dashboard-kill-worker` / `dashboard-delete-key` write through the existing `IApiKeyAuditStore` pipeline; `IApiKeyAdminStore.DeleteAsync` SQL guards on `revoked_utc IS NOT NULL`. |
| 6 | Performance & resource management | No issues found in the changed code. |
| 7 | Design-document adherence | No issues found in the changed code. |
| 8 | Code organization & conventions | Issues found: Server-047 (`ApiKeysPage.razor` `ConfirmPendingAsync` clears `PendingAction` before awaiting the action while `SessionsPage`/`SessionDetailsPage` clear it in the `finally`; the three consumers of the new `ConfirmDialog` differ on a small but visible UX point). |
| 9 | Testing coverage | Issues found: Server-048 (no test for `KillWorkerAsync` catch-path metric leak, `wasClosed=true` short-circuit, or concurrent-kill double-count). |
| 10 | Documentation & comments | Issues found: Server-049 (`IDashboardSessionAdminService` interface has no XML docs on any of its three members; `DashboardSessionAdminService` has none either — the `IDashboardApiKeyManagementService` peer carries the same gap but the convention is to document new public surfaces per CLAUDE.md "update docs in the same change as the source"). |
### 2026-05-24 review (commit d692232)
Re-review pass at `d692232` scoped to the dashboard refactor wave: the
@@ -778,3 +806,112 @@ Add a regression test that advises N items without an active `StreamEvents` cons
**Recommendation:** Add a `<remarks>` block to `HubTokenService` noting "Registered as a singleton in `AddGatewayDashboard`; the underlying `ITimeLimitedDataProtector` is thread-safe and shared across hub-token issuance and validation." Optionally add a comment near the DI registration explaining the lifetime contract.
**Resolution:** 2026-05-24 — Added a `<remarks>` block to `HubTokenService` (`src/ZB.MOM.WW.MxGateway.Server/Dashboard/HubTokenService.cs`) documenting that the service is registered as a singleton in `DashboardServiceCollectionExtensions.AddGatewayDashboard` and is shared by two consumer scopes — `DashboardHubConnectionFactory` (scoped, per-circuit; calls `Issue` from the cookie-authenticated dashboard) and `HubTokenAuthenticationHandler` (transient, per-request; calls `Validate` from the SignalR negotiate / connection path). Notes that the underlying `ITimeLimitedDataProtector` is thread-safe so concurrent mint/validate from any number of callers is safe, and explicitly asks future maintainers to preserve the singleton lifetime to keep the protector instance stable. Pure documentation change; no test.
## Re-review 2026-05-24 (commit 42b0037)
### Server-044
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-254` |
| Status | Open |
**Description:** `KillWorkerAsync` is the mirror of `CloseSessionCoreAsync` for the new admin-only Kill flow, but its catch path leaks the `mxgateway.sessions.open` gauge — the exact bug that Server-006 closed for `OpenSessionAsync`. The happy path increments `_metrics.SessionClosed()` once after `session.KillWorker(reason)` returns (line 244), which decrements `_openSessions`. The catch path, however, records `_metrics.Fault(...)`, calls `session.MarkFaulted(...)`, and then awaits `RemoveSessionAsync(session)` — but never calls `_metrics.SessionClosed()` (nor `SessionRemoved()`), so a kill that throws from `session.KillWorker` leaves the open-session gauge permanently incremented. `RemoveSessionAsync` only calls `_metrics.RemoveSessionEvents(...)` and `ReleaseSessionSlot()`; neither touches `_openSessions`. Server-006's fix pattern (track whether the open-counter was recorded, and decrement on the failing path) was applied to `OpenSessionAsync` but not propagated to this new write path.
In practice the trigger is narrow — `GatewaySession.KillWorker` calls `_workerClient?.Kill(reason)` and `TransitionTo(SessionState.Closed)`; the worker client's `Kill` method on a faulted client or a worker process that has already exited could throw, and the catch path then leaks the gauge. Sustained operator use of the dashboard Kill action on misbehaving workers would gradually inflate `mxgateway.sessions.open` and corrupt the metric exposed by `/health/metrics` and any Grafana panel keying off it.
**Recommendation:** Mirror Server-006's fix: track whether the session was counted as opened (it always is in `KillWorkerAsync``GetRequiredSession` only succeeds for sessions in the registry, all of which had `SessionOpened()` called), and decrement on the failing path. Concretely, add `_metrics.SessionClosed()` (or `_metrics.SessionRemoved()` if the kill is being treated as an unclean removal) inside the catch block before `RemoveSessionAsync(session)`. The cleanest form is to record `SessionClosed()` once at the top of the method (under a flag), then only re-record if the happy path actually transitions; or to add `_metrics.SessionClosed()` in the catch right after `MarkFaulted`. Add a `SessionManagerTests.KillWorkerAsync_WhenSessionKillThrows_DecrementsOpenSessionGauge` regression test that uses a `FakeWorkerClient.KillThrows = true` to exercise the catch.
### Server-045
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:225,242-245`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:837-841` |
| Status | Open |
**Description:** `KillWorkerAsync` reads `session.State` once into a local `bool wasClosed` (line 225) before calling `session.KillWorker(reason)`. The read is unsynchronized — `State` is a getter that takes `_syncRoot` internally so the read itself is safe, but there is no lock spanning "read state, call KillWorker, conditionally record metric." Two concurrent `KillWorkerAsync` calls on the same session (e.g. one operator clicking Kill on the Sessions page and another clicking Kill on the Session Details page within the same render tick) can both observe `wasClosed = false`, then both call `session.KillWorker(...)` (the second is effectively a no-op because `TransitionTo` refuses to overwrite `Closed`), and both call `_metrics.SessionClosed()` at line 244. The `_openSessions` gauge is bounded at 0 by `GatewayMetrics.SessionClosed`'s `if (_openSessions > 0)` guard, but the `_sessionsClosed` counter (and the `mxgateway.sessions.closed` counter exported by the meter) is double-incremented; `_metrics.Fault` is not used here, so the only mitigation is the SessionsRegistry race — the second call's `GetRequiredSession` could miss if the first already removed the session via `RemoveSessionAsync`, but only if the second arrives after the first's removal completes. The window is small but exists, and the same race exists for "Kill from one tab while the lease-expired sweep is closing the session." `CloseSessionCoreAsync` has the same shape, so this isn't a regression specifically from the kill change — but the new path widens the surface where the issue can fire.
**Recommendation:** Either (a) gate `KillWorkerAsync` on a per-session lock — extending the `_closeLock` pattern that `GatewaySession.CloseAsync` already uses, or introducing a new `_killLock` and accepting that close + kill don't serialize against each other — or (b) accept the metric double-count as harmless and document it on `KillWorkerAsync`'s XML doc. Option (a) is the more defensible long-term fix; option (b) is acceptable for v1 if the metric is purely informational. Adding a test that issues concurrent kills against the same session id and asserts `_sessionsClosed == 1` would pin the chosen behavior either way.
### Server-046
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:286-307` |
| Status | Open |
**Description:** `ShutdownAsync` was updated to fall back to `KillWorker` when `CloseSessionCoreAsync` throws (lines 294-305) — a useful resilience improvement on its own. But the fallback's bookkeeping is wrong: `session.KillWorker(GatewayShutdownReason)` is called and `RemoveSessionAsync(session)` is awaited, but `_metrics.SessionClosed()` is never invoked, so for every session whose graceful close throws, the `mxgateway.sessions.open` gauge stays incremented after shutdown completes. Worse, `CloseSessionCoreAsync`'s `SessionCloseStartedException` catch (line 330) already records `_metrics.SessionRemoved()` (line 334-336) before re-throwing — so for that specific exception type, the gauge is decremented inside the inner catch, then the outer fallback runs and does not double-decrement (good), but `_metrics.SessionClosed()` is never called, so the `_sessionsClosed` counter under-counts by one. For any other exception (the more common case), neither inner catch records anything, so both `_sessionsClosed` and `_openSessions` end up wrong: gauge is left high, counter is left low.
**Recommendation:** Inside the `ShutdownAsync` fallback (after the `KillWorker` call but before/inside the `RemoveSessionAsync`), call `_metrics.SessionClosed()` unless the inner catch already recorded the close. The simplest shape is to propagate a `wasClosed` flag out of `CloseSessionCoreAsync` (or replace the fallback's manual choreography with a single call into `KillWorkerAsync(...)`, which has the right metric path once Server-044 is fixed). The latter is the cleanest — `ShutdownAsync` becomes "try graceful, fall back to `KillWorkerAsync`," and there's exactly one accounting path for each session. Add a `SessionManagerTests.ShutdownAsync_WhenCloseThrows_StillDecrementsOpenSessionGauge` test using a session whose `CloseAsync` throws (e.g. a `BlockingShutdownWorkerClient` configured to throw on `ShutdownAsync`).
### Server-047
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/ApiKeysPage.razor:324-334`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionsPage.razor:171-195`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/Components/Pages/SessionDetailsPage.razor:231-255` |
| Status | Open |
**Description:** The shared `ConfirmDialog.razor` (added in `0e56b5b` / `24cc5fd`) is wired by three pages, but the pages handle `PendingAction` cleanup inconsistently:
- `ApiKeysPage.ConfirmPendingAsync` captures the action, sets `PendingAction = null` synchronously, **then** awaits the action via `RunManagementActionAsync`. The dialog disappears the moment Confirm is clicked, and the user sees no busy indication on the dialog itself (the busy state lives on `RunManagementActionAsync`'s `IsBusy = true`).
- `SessionsPage.ConfirmPendingAsync` and `SessionDetailsPage.ConfirmPendingAsync` keep `PendingAction` set, set `IsBusy = true`, **await** the action, then clear `PendingAction = null` in the `finally`. The dialog stays open during the call and visibly disables its buttons via `IsBusy`.
The user-visible difference: rotating/revoking/deleting a key vs closing/killing a session uses two different dialog-lifecycle patterns. Neither is broken, but `ConfirmDialog` is the shared component and its `IsBusy` parameter exists precisely to render the in-flight state — `ApiKeysPage` discards that signal by closing the dialog before the action runs.
**Recommendation:** Align `ApiKeysPage.ConfirmPendingAsync` with the sessions pages: hold `PendingAction`, set `IsBusy = true`, run the action, then clear `PendingAction` in the `finally`. The current ApiKeysPage shape was inherited from before the dialog existed (when the confirmation was a `confirm()` JS call); the dialog component change can flatten the difference now. As a smaller alternative, document the divergence on the component's XML doc — but the shared component should ideally be used consistently.
### Server-048
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:463-498` |
| Status | Open |
**Description:** The two new `KillWorkerAsync_*` tests cover the happy path (`KillWorkerAsync_KillsWorkerAndRemovesSession`) and the missing-session error (`KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound`). Three behaviorally distinct cases are missing:
1. **Catch path**`session.KillWorker` throws. Today no test exercises the failure branch, so the Server-044 gauge leak ships without coverage. A `FakeWorkerClient.ThrowOnKill = true` (or equivalent) would let `WorkerClient.Kill` throw; the test would assert the session is removed, `_metrics.Fault` is recorded, and the open-session gauge is decremented.
2. **`wasClosed = true` path** — kill on an already-`Closed` session must not re-increment `mxgateway.sessions.closed`. No assertion pins this.
3. **Concurrent kill** — two `KillWorkerAsync` calls for the same session id; Server-045's double-increment lives or dies on whether this is tested.
**Recommendation:** Add the three tests above. The fakes in `MxGateway.Tests/TestSupport/` already cover most of the moving parts; `FakeWorkerClient` needs a single `ThrowOnKill` flag (or the existing `KillThrowing` if any).
### Server-049
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/IDashboardSessionAdminService.cs:5-18`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:8-25` |
| Status | Open |
**Description:** `IDashboardSessionAdminService` declares three members — `CanManage`, `CloseSessionAsync`, `KillWorkerAsync` — none of which carry XML documentation. `DashboardSessionAdminService.CanManage` and the two operation methods are also undocumented (only the constructor parameters are named). The C# style guide requires public-surface XML docs and CLAUDE.md mandates that "docs change with the code." The peer `IDashboardApiKeyManagementService` is also undocumented, so this isn't unique — but the new interface is a fresh public surface being landed in `c5e7479`, and the contract subtleties (CanManage returns false for non-Admin; missing-session paths surface as `Succeeded = false` not as a thrown exception; `KillReason` is fixed at `"dashboard-admin-kill"` and that value reaches the audit log) are exactly what XML docs are for.
**Recommendation:** Add `<summary>` blocks to `IDashboardSessionAdminService.CanManage` (states the Admin-role gate), `CloseSessionAsync` and `KillWorkerAsync` (state that missing sessions return `DashboardSessionAdminResult.Fail(...)` rather than throwing, and that the audit log captures actor + remote IP). Add `<param>` and `<returns>` for the request/response shape. The same sweep can pick up the longstanding gap on `IDashboardApiKeyManagementService` if the team wants — but the new file is the load-bearing one.
### Server-050
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:42-75,92-125` |
| Status | Open |
**Description:** `CloseSessionAsync` and `KillWorkerAsync` catch only `SessionManagerException` (the `SessionNotFound` filter, then a general `SessionManagerException` catch). Anything else propagates raw to Blazor's error boundary. The propagation paths exist:
- `SessionManager.CloseSessionAsync``CloseSessionCoreAsync` catches `OperationCanceledException` and `SessionCloseStartedException`; any other exception (e.g. an `IOException` from worker pipe teardown surfacing through `session.CloseAsync``_workerClient.ShutdownAsync`) propagates raw.
- `SessionManager.KillWorkerAsync` wraps only the `session.KillWorker(reason)` call in a try/catch. Exceptions from `RemoveSessionAsync``session.DisposeAsync` (the new `_closeLock.WaitAsync` / dispose choreography from Server-016) — particularly a `TaskCanceledException` if the caller's CancellationToken fires mid-dispose, or an aggregate exception from concurrent disposal — also propagate raw.
Today neither call site has a Blazor error boundary, so an unhandled exception lands as a generic Blazor circuit error page. The friendlier-error contract that Server-044's commit message advertises ("audit-logs, friendly errors") is incomplete: only `SessionManagerException` gets a friendly error.
**Recommendation:** Add a general `catch (Exception exception)` after the `SessionManagerException` catch in both `CloseSessionAsync` and `KillWorkerAsync`, log a warning (matching the SessionManagerException pattern), and return `DashboardSessionAdminResult.Fail($"{operation} failed unexpectedly. See the gateway log for details.")`. This makes the result type truly the only output the page sees. Add a regression test using a `ThrowingSessionManager` that throws e.g. `InvalidOperationException` from `KillWorkerAsync` and asserts the service returns a failing result rather than propagating.
+97 -3
View File
@@ -5,9 +5,9 @@
| Module | `src/ZB.MOM.WW.MxGateway.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Open findings | 0 |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 5 |
## Checklist coverage
@@ -26,6 +26,35 @@ This pass (commit `a020350`) re-reviews the module after the Tests-013019 bat
| 9 | Testing coverage | Issues found: Tests-020 (`MxAccessGatewayServiceConstraintTests` covers only 2 of 4 `WriteBulkConstraintPlan` switch arms — `Write2Bulk`/`WriteSecured2Bulk` `GetPayload`/`SetPayload` would silently break with no failing test), Tests-022 (the eleven `SessionManagerBulkTests.*_PropagatesCancellation` tests pre-cancel the token, so the fake's first-line `ThrowIfCancellationRequested` handles it before `InvokeBulkInternalAsync` even runs — they do not exercise mid-flight cancellation), Tests-024 (`BulkConstraintPlan.MergeDeniedInto` silently drops or under-fills if the worker reply count diverges from the allowed-count — no test pins this protocol-mismatch edge case). |
| 10 | Documentation & comments | No new issues. Tests-019's `docs/GatewayTesting.md` addition is in place; new test files (`SessionManagerBulkTests`, `MxAccessGatewayServiceConstraintTests`, `PredicateConstraintEnforcer`) all have orienting class-level summaries. |
### 2026-05-24 re-review (commit 42b0037)
Re-review pass at `42b0037` after the dashboard admin-actions wave: the new
`DashboardSessionAdminServiceTests` covers Close/Kill viewer/admin gating, the
extended `SessionManagerTests.KillWorkerAsync_*` tests cover the
SessionManager-side kill path, `DashboardApiKeyManagementServiceTests` gains
DeleteAsync coverage, and `EventStreamServiceTests` adds a recording
broadcaster fixture and a throwing-broadcaster fixture. Four FakeSessionManager
impls were extended with `KillWorkerAsync` stubs to keep the interface compiling.
Also new: `DashboardSnapshotPublisherTests` (Server-042 reconnect loop),
`HubTokenServiceTests` (Server-039 null-identity rejection), and a large
`ProtobufContractRoundTripTests` expansion for the new bulk write/read
payload oneof cases.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in this diff. |
| 2 | mxaccessgw conventions | No issues found in this diff. |
| 3 | Concurrency & thread safety | Issue found: Tests-027 (the known-flake `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` cross-talks via the shared `MxGateway.Server` Meter when parallel tests record `mxgateway.events.stream_send.duration`). |
| 4 | Error handling & resilience | Issue found: Tests-029 (`DashboardSessionAdminService.KillWorkerAsync` has a `SessionNotFound` and a general `SessionManagerException` catch branch; neither is tested. Also `CloseSessionAsync` has no blank-id test even though Kill does — symmetric guard, asymmetric coverage). |
| 5 | Security | No issues found in this diff. |
| 6 | Performance & resource management | No issues found in this diff. |
| 7 | Design-document adherence | No issues found in this diff. |
| 8 | Code organization & conventions | No issues found in this diff — the `NullDashboardEventBroadcaster` consolidation (Tests-025) is in place. |
| 9 | Testing coverage | Issues found: Tests-028 (`KillWorkerAsync_KillsWorkerAndRemovesSession` does not pin the `reason` argument propagating through `SessionManager.KillWorkerAsync``session.KillWorker(reason)``IWorkerClient.Kill(reason)`; the fake's `Kill(string reason)` discards the argument and there is no `KillWorkerAsync_WithBlankReason_ThrowsArgument` test), Tests-030 (`DashboardApiKeyManagementService.DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` does not inject an audit store or assert the `"not-found-or-active"` audit entry the product code emits on the failure path, and there is no `DeleteAsync_BlankKeyId_ReturnsFailure` for the `ValidateKeyId` guard). |
| 10 | Documentation & comments | No issues found in this diff. |
Flake-risk note: Tests-031 (`DashboardSnapshotPublisherTests.ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` measures the reconnect gap from `startedAt` to `secondSubscribeAt` with only 10 ms slack against a 50 ms `reconnectDelay`; Windows `Task.Delay` timer granularity is ~15 ms and the gap baseline includes scheduling overhead, so the lower-bound assertion can spuriously fail on a slow CI agent).
### 2026-05-24 review (commit d692232)
Re-review pass at `d692232` scoped to the test-side fixture churn from
@@ -451,3 +480,68 @@ The cancellation tests for `WorkerClient` in `WorkerClientTests` *do* exercise t
**Recommendation:** Add a fixture to `EventStreamServiceTests` named e.g. `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster`: inject a recording fake that captures `(sessionId, mxEvent)` calls, push two events through the fake session, and assert the broadcaster received both with the correct session id and matching `WorkerSequence`. This pins the broadcast hook and proves the dashboard event mirror is not a no-op.
**Resolution:** 2026-05-24 — Added `src/ZB.MOM.WW.MxGateway.Tests/TestSupport/RecordingDashboardEventBroadcaster.cs` — a thread-safe `IDashboardEventBroadcaster` test double backed by a `ConcurrentQueue<DashboardEventCapture>` that captures every `(sessionId, mxEvent)` invocation. Added `StreamEventsAsync_PublishesEachEventToDashboardBroadcaster` to `EventStreamServiceTests`: pushes two events (`WorkerSequence` 7 / `OnDataChange` and 8 / `OnWriteComplete`) through the fake session, drains the stream, and asserts the recording broadcaster captured exactly two publishes with the matching `SessionId`, `WorkerSequence`, and `Family` for each. TDD red/green confirmed: a deliberate "expected 3 captures" failed (`Expected: 3, Actual: 2`) before flipping to the correct count; the green run passes deterministically. The fixture would have caught a regression that drops or wraps `dashboardEventBroadcaster.Publish` at `EventStreamService.cs:133`. Build clean (0 warnings); full Tests suite green (486/486).
### Tests-027
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Grpc/MxAccessGatewayServiceTests.cs:199-240`, `src/ZB.MOM.WW.MxGateway.Server/Metrics/GatewayMetrics.cs:8,73,246-251` |
| Status | Open |
**Description:** The review brief explicitly flagged `MxAccessGatewayServiceTests.StreamEvents_WhenEventIsWritten_RecordsSendDuration` as a known flake that "passed solo on rerun". The root cause is the `MeterListener` subscribes by `instrument.Meter.Name == GatewayMetrics.MeterName` (a *process-shared* constant `"MxGateway.Server"`), not by the specific `GatewayMetrics` instance constructed in the test. Tests-012 made the xUnit parallelism policy explicit (`parallelizeTestCollections: true`, `maxParallelThreads: -1`), and every other test that builds its own `GatewayMetrics()` and exercises `MxAccessGatewayService.StreamEvents` or `EventStreamService.StreamEventsAsync` (e.g. the new `StreamEventsAsync_*` family added by Tests-026 and Server-041, plus the pre-existing `StreamEventsAsync_YieldsEventsInWorkerOrder` etc.) routes through `GatewayMetrics.RecordEventStreamSend` → the same histogram name `mxgateway.events.stream_send.duration`. When two such tests run concurrently in the same xUnit process, the `MeterListener` in this test sees measurements from *both* meters and `families.Count` grows to >1, breaking `Assert.Equal([MxEventFamily.OnDataChange.ToString()], families)`. Solo reruns pass because no other producer is alive. This is exactly the cross-test mutable-state pattern Tests-012 set the guardrail comment against.
**Recommendation:** Either (a) filter the `MeterListener` callback by the specific `Meter` instance — capture `metrics._meter` (or expose `GatewayMetrics.Meter`) and compare with `ReferenceEquals(instrument.Meter, expectedMeter)` instead of comparing `Meter.Name`; or (b) place this test in a single-threaded `[Collection("GatewayMetrics-Listener")]` so no other `RecordEventStreamSend` producer runs concurrently. Option (a) is preferred because it removes the cross-talk vector permanently and lets the test stay parallelisable.
### Tests-028
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Sessions/SessionManagerTests.cs:466-496,802-807`, `src/ZB.MOM.WW.MxGateway.Server/Sessions/SessionManager.cs:216-253` |
| Status | Open |
**Description:** The new `KillWorkerAsync_KillsWorkerAndRemovesSession` (line 466) and `KillWorkerAsync_WhenSessionMissing_ThrowsSessionNotFound` (line 486) pin the new kill-path entry, but they do not pin the `reason` argument propagating through the chain. `SessionManager.KillWorkerAsync(sessionId, reason, ct)` validates `reason` with `ArgumentException.ThrowIfNullOrWhiteSpace(reason)` (line 221), calls `session.KillWorker(reason)` (line 229), and logs `reason={Reason}` (line 251); but the `FakeWorkerClient.Kill(string reason)` discards the argument (line 803-807) and the assertion is only `Assert.Equal(1, workerClient.KillCount)`. A regression that (a) hard-coded an internal `"unspecified"` reason between `SessionManager` and `GatewaySession`, (b) swapped to a different overload that dropped the reason, or (c) deleted the `ThrowIfNullOrWhiteSpace` guard would all pass the current tests. The dashboard caller (`DashboardSessionAdminService.KillWorkerAsync`) passes a hard-coded `"dashboard-admin-kill"` reason and the only test that observes it (`KillWorkerAsync_AdminKillsWorker`) asserts `!string.IsNullOrWhiteSpace(LastKillReason)` rather than pinning the value — so the same-class drift is also untested.
**Recommendation:** (1) Capture `LastKillReason` on `FakeWorkerClient.Kill` and assert `KillWorkerAsync_KillsWorkerAndRemovesSession` propagates the test-supplied `"test-kill"` string end-to-end. (2) Add `KillWorkerAsync_WithBlankReason_ThrowsArgumentException` (parameterised over `null`, `""`, `" "`) to pin the `ArgumentException.ThrowIfNullOrWhiteSpace` guard. (3) Tighten `DashboardSessionAdminServiceTests.KillWorkerAsync_AdminKillsWorker` to `Assert.Equal("dashboard-admin-kill", sessionManager.LastKillReason)` so a future reason-string change is a deliberate test update.
### Tests-029
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSessionAdminServiceTests.cs:61-106,139-222`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardSessionAdminService.cs:77-125` |
| Status | Open |
**Description:** The new `DashboardSessionAdminServiceTests` covers the happy path and the viewer-denial path for both `CloseSessionAsync` and `KillWorkerAsync`, plus `CloseSessionAsync_WhenSessionMissing_ReportsFriendlyError` for the close-side `SessionNotFound` catch — but the kill-side error branches are not tested. The product code's `KillWorkerAsync` (lines 111-114) has the same `SessionNotFound` catch returning `"Session {id} was not found."` and (lines 115-124) a generic `SessionManagerException` catch returning `"Kill failed: {message}"`; neither is exercised. The fake's `KillWorkerAsync` (lines 200-209) only succeeds — there is no `KillThrowsNotFound` / `KillThrowsGeneric` configuration option matching the existing `CloseThrowsNotFound`. Symmetrically, `CloseSessionAsync` has the same `IsNullOrWhiteSpace(sessionId)` guard (line 37-40) but no blank-id test even though `KillWorkerAsync_BlankSessionId_ReturnsFailure` exists for the parallel kill guard — a guard-removal regression on close would slip through.
**Recommendation:** Mirror the existing close-side fixtures onto the kill side: add `KillThrowsNotFound` / `KillThrowsGeneric` init-flags to the `FakeSessionManager`, then `KillWorkerAsync_WhenSessionMissing_ReportsFriendlyError`, `KillWorkerAsync_WhenSessionManagerThrows_ReportsKillFailedMessage`, and `CloseSessionAsync_BlankSessionId_ReturnsFailure`. These are mechanical copies of the existing patterns and bring close/kill coverage into symmetry.
### Tests-030
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardApiKeyManagementServiceTests.cs:115-163`, `src/ZB.MOM.WW.MxGateway.Server/Dashboard/DashboardApiKeyManagementService.cs:146-177` |
| Status | Open |
**Description:** The three new `DeleteAsync_*` fixtures cover unauthorised user, success path with audit, and store-refuses-with-friendly-error. They do not exercise two production behaviours: (1) `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` (line 151-163) does not construct or inject a `FakeApiKeyAuditStore`, so it never observes that the product code still emits an audit entry with `EventType = "dashboard-delete-key"` and `Details = "not-found-or-active"` on the failure branch (`AppendAuditAsync` runs unconditionally at line 167-172). A regression that placed the `AppendAuditAsync` call inside the `if (deleted)` branch would silently drop the audit trail for refused deletes — a real audit-completeness gap. (2) There is no `DeleteAsync_BlankKeyId_ReturnsFailure` or `DeleteAsync_InvalidKeyId_ReturnsFailure` test, even though `ValidateKeyId(keyId)` (line 156-160) guards on the same conditions as Create/Revoke/Rotate. The `Revoke`/`Rotate` paths have equivalent fixtures (the file's earlier tests cover them); only Delete is missing them.
**Recommendation:** (1) Add a `FakeApiKeyAuditStore` to `DeleteAsync_WhenStoreRefuses_ReportsFriendlyError` and assert it contains exactly one entry with `EventType == "dashboard-delete-key"` and `Details == "not-found-or-active"`. (2) Add `DeleteAsync_BlankKeyId_ReturnsFailure` (parameterised over `null`, `""`, `" "`) and `DeleteAsync_InvalidKeyId_ReturnsFailure` (a keyId with characters the `ValidateKeyId` rules reject) to pin the validation branch end-to-end.
### Tests-031
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `src/ZB.MOM.WW.MxGateway.Tests/Gateway/Dashboard/DashboardSnapshotPublisherTests.cs:22-61` |
| Status | Open |
**Description:** `ExecuteAsync_WhenSnapshotServiceThrowsOnce_ReconnectsAfterDelay` records `startedAt = DateTimeOffset.UtcNow` *before* calling `publisher.StartAsync(...)`, then asserts `secondSubscribeAt - startedAt >= reconnectDelay - 10ms` (line 59). The measured gap is *not* the reconnect delay in isolation — it is `(StartAsync scheduling) + (first WatchSnapshotsAsync setup + Task.Yield) + (throw) + reconnect delay + (second WatchSnapshotsAsync setup)`. On a slow/contended CI agent the first three terms easily dominate (favouring the assertion); but on a fast machine, Windows `Task.Delay(50ms)` rounds up to the next ~15.6 ms tick boundary and may return at ~46-50 ms relative to schedule, while the first three terms can be sub-millisecond — so the gap measurement can land within 1-2 ms of the lower bound, and the 10 ms slack may not absorb a single missed quantum. This is a latent flake of the same flavour as Tests-006 (heartbeat timing) but on a wall-clock dependency the test cannot inject around because `DashboardSnapshotPublisher` uses `Task.Delay(_reconnectDelay)` directly. Tests-006 / Tests-017 moved heartbeat tests onto `ManualTimeProvider`; this test cannot do that without a product change to use a `TimeProvider`-aware delay.
**Recommendation:** (a) The cheap fix: have `ThrowOnceThenYieldSnapshotService` record `_firstThrowAt = DateTimeOffset.UtcNow` immediately before the `throw`, and change the assertion to `secondSubscribeAt - firstThrowAt >= reconnectDelay - 10ms` — the gap then measures only the reconnect delay, eliminating the variable scheduling baseline. (b) The deeper fix: extend `DashboardSnapshotPublisher` to accept an `ITimeProvider`-style delay seam (or a virtual `DelayAsync` hook) so a `ManualTimeProvider` could advance time deterministically. (a) is preferred for now; (b) belongs as a follow-up if more reconnect-loop tests are added.
+6 -2
View File
@@ -5,10 +5,14 @@
| Module | `src/ZB.MOM.WW.MxGateway.Worker.Tests` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 0 |
## 2026-05-24 re-review (commit `42b0037`)
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker.Tests` returns empty — the Worker.Tests module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker.Tests-001..030 remain closed.
## Checklist coverage
### 2026-05-18 review (commit `6c64030`)
+6 -2
View File
@@ -5,10 +5,14 @@
| Module | `src/ZB.MOM.WW.MxGateway.Worker` |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | `d692232` |
| Status | Reviewed |
| Commit reviewed | `42b0037` |
| Status | Re-reviewed |
| Open findings | 0 |
## 2026-05-24 re-review (commit `42b0037`)
**Re-review: no new findings.** `git diff --name-only d692232..42b0037 -- src/ZB.MOM.WW.MxGateway.Worker` returns empty — the Worker module has zero source changes since the previous review. All ten checklist categories therefore inherit "No issues found" from the `d692232` pass. The header is bumped to track the latest reviewed commit; Worker-001..025 remain closed.
## Checklist coverage
This row reflects the 2026-05-20 re-review at commit `a020350`. Worker-001..022 are all closed; the row only summarises new findings filed against this commit. The prior pass's fixes for Worker-016..022 were verified sound: