Code-review 2026-05-20 sweep #2: re-review at a020350, resolve 48 findings

Second re-review pass at commit a020350 caught 48 new findings — including
one High-severity regression I introduced in the prior sweep — and fixed
them all in one parallel wave.

High (1)
- Client.Python-018: prior sweep set `license = "Proprietary"` in
  pyproject.toml. setuptools >= 77 enforces PEP 639 and rejects the
  string (it must be a valid SPDX expression), so `pip wheel .` and
  `pip install -e .` both fail before any source compiles. Tests
  still pass because pytest bypasses the build backend via
  `pythonpath`. Dropped the invalid license string, kept the
  `License :: Other/Proprietary License` classifier, and added
  `tests/test_packaging.py` so a future regression of the same shape
  is caught in CI.

Mediums (6)
- Worker-023: `HeartbeatStuckCeiling` (default 75s = 5x HeartbeatGrace)
  on WorkerPipeSessionOptions bounds the in-flight-command watchdog
  suppression so a truly stuck COM call still triggers StaHung
  instead of permanently defeating the watchdog.
- Client.Rust-018: reverted Rust's `latencyMs` split so the
  cross-language bench comparison is apples-to-apples again;
  `failureLatencyMs` kept as Rust-only enrichment.
- Client.Java-021: applied Client.Java-002's terminal-state
  serialisation pattern to DeployEventStream so close() arriving
  after queue-overflow can't erase the overflow exception.
- IntegrationTests-017: teardown-parity test now uses a two-window
  stability check after UnAdvise instead of strict equality against
  the pre-UnAdvise count (which raced against in-flight events).
- IntegrationTests-019: new RecordingTestOutputHelper wraps every
  log sink the WriteSecured live test owns (worker stdout/stderr,
  gateway logs, direct WriteLine) so the credential is proven
  absent from the full output buffer, not just the diagnostic
  message.
- Tests-020: added MxAccessGatewayServiceConstraintTests coverage
  for the previously-uncovered Write2Bulk and WriteSecured2Bulk
  arms of WriteBulkConstraintPlan.SetPayload.

Lows (41 — highlights)
- Server: Galaxy glob cache eviction is race-free (Server-024);
  GalaxyRepositoryGrpcService takes IGalaxyRepository (Server-025);
  AlarmsOptions validated at startup (Server-026); Authorization.md
  Constraint Enforcement snippet/prose enumerate the bulk write/read
  family (Server-027); bulk-read-commands and bulk-write-commands
  capability tokens added to OpenSession (Server-029);
  NotWiredAlarmRpcDispatcher XML doc and missing scope-resolver and
  state-machine tests cleaned up (023, 028).
- Worker: AlarmCommandHandler now invokes the same STA-affinity
  guard the poll path uses, at every command entry (Worker-024);
  RunAsync null-checks the runtime-session factory result
  (Worker-025).
- Worker.Tests: shared LiveMxAccessOptInVariableName lives on
  GatewayContractInfo (Worker.Tests-025); MxAccessSession.CreateForTesting
  rejects production sinks (Worker.Tests-026); FakeRuntimeSession's
  CancelCommandReturnValue serialised under lock (Worker.Tests-027);
  Probes namespace lifted to MxGateway.Worker.Tests.Probes
  (Worker.Tests-029); cancel-envelope sequence numbers monotonised
  (Worker.Tests-030); docs/GatewayTesting.md gains a "Dev-rig Probes"
  section (Worker.Tests-028).
- Tests: ManualTimeProvider consolidated into one TestSupport/ copy
  (Tests-021); SessionManagerBulkTests adds a mid-flight cancellation
  test backed by a TaskCompletionSource fake (Tests-022); companion
  FakeWorkerProcess.WaitForExitAsync no longer fakes its exit signal
  (Tests-023); constraint plan reply-count divergence pinned
  (Tests-024).
- IntegrationTests: TryGetSession chain carries [MaybeNullWhen(false)]
  end-to-end (IntegrationTests-018); abnormal-exit keyword set
  tightened to pipe-disconnected/end-of-stream and the test now
  asserts streamTask.IsFaulted (020, 021).
- Client.Dotnet: bench commands added to isLongRunning so the
  default 30s wall-clock budget doesn't kill them (015);
  BenchStreamEventsAsync observes the inner stream task on every
  exit path (016).
- Client.Go: parseValue wraps strconv errors with flag context and
  %w (017); bench loops honour ctx.Done() (018); galaxy-watch parses
  RFC3339Nano with fractional seconds (019); runStreamEvents installs
  signal.NotifyContext like runGalaxyWatch (020); five new CLI-level
  table-driven tests cover the bulk/bench subcommands (021).
- Client.Java: toCompletable Javadoc rewritten to match the actual
  cancellation contract Client.Java-015 established (022); stream-events
  text path uses Long.toUnsignedString for worker_sequence (023);
  bench-read-bulk no longer pollutes success-latency histogram with
  failure durations (024); --shutdown-timeout CLI option propagates
  through to ClientOptions (025); seven new MxGatewayCliTests cover
  the bulk and bench commands (026).
- Client.Python: mxgateway_cli ships its own py.typed marker (019);
  wheel-build smoke test added under tests/test_packaging.py (020);
  README documents the Galaxy CLI parity gap explicitly (021).
- Client.Rust: RustClientDesign.md signatures match session.rs and
  document the AsRef<str> read_bulk genericism (019);
  next_correlation_id re-exported at the crate root, with a
  property-style doc contract and an explicit disclaimer that the
  literal textual format is not part of the contract (020).
- Contracts: BulkWriteResult comment names the actual
  IConstraintEnforcer mechanism instead of "tag-allowlist filter"
  (014); BulkReadResult gains explicit per-arm payload-population
  documentation for the success vs failure cases (015).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Joseph Doherty
2026-05-20 10:28:54 -04:00
parent a0203503a7
commit 1aafd6bde4
74 changed files with 3349 additions and 395 deletions
+141 -11
@@ -5,24 +5,27 @@
| Module | `clients/go` |
| Reviewer | Claude Code |
| Review date | 2026-05-20 |
| Commit reviewed | `1cd51bb` |
| Commit reviewed | `a020350` |
| Status | Reviewed |
| Open findings | 0 |
## Checklist coverage
This is a re-review at commit `a020350`, which resolved Client.Go-011..016. `gofmt -l .`,
`go vet ./...`, `go build ./...`, and `go test ./... -count=1` are all clean.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Re-review: previous Client.Go-001/003/007 remain resolved. New issue: a dead/no-op test condition in `alarms_test.go` (Client.Go-011). |
| 2 | mxaccessgw conventions | `gofmt -l ./...` and `go vet ./...` are clean. No new issues. |
| 3 | Concurrency & thread safety | New issue: `runGalaxyWatch` limit-reached path returns without waiting for the WatchDeployEvents goroutine to drain (Client.Go-013). |
| 4 | Error handling & resilience | New issue: direct `err == io.EOF` comparisons should use `errors.Is` for chain robustness (Client.Go-014). |
| 5 | Security | No issues found — TLS-by-default with TLS 1.2 floor, API key redaction in CLI JSON, no secret logging. |
| 6 | Performance & resource management | No issues found — `defer client.Close()` / `defer subscription.Close()` consistently applied across CLI and library; bench-read-bulk preallocates latency slice. |
| 7 | Design-document adherence | No new issues. The lazy `grpc.NewClient` + readiness probe migration (Client.Go-005) was applied uniformly to `Dial` and `DialGalaxy`. |
| 8 | Code organization & conventions | New issue: `runWriteBulkVariant`'s `secured` parameter is computed but unused (Client.Go-015). |
| 9 | Testing coverage | Coverage holes from prior review now filled (Client.Go-008). `fakeGalaxyServer.watchSendInterval` is declared but never set, which is minor test cruft (Client.Go-016). |
| 10 | Documentation & comments | New issue: the CLI `writeUsage` line is missing the six bulk and bench subcommands now wired into `run` (Client.Go-012). |
| 1 | Correctness & logic bugs | Prior Client.Go-001/003/007/011 remain resolved. No new correctness bugs found. |
| 2 | mxaccessgw conventions | `gofmt -l .` and `go vet ./...` clean; Client.Go-004 stays resolved. No new issues. |
| 3 | Concurrency & thread safety | Client.Go-013 resolved. New issue: `runBenchReadBulk`'s warm-up + steady-state wall-clock loops ignore `ctx` cancellation, so a Ctrl+C or parent-cancel keeps spinning ReadBulk calls until the wall-clock deadline (Client.Go-018). |
| 4 | Error handling & resilience | Client.Go-014 resolved. New issue: `parseValue` returns bare `strconv` errors with no `%w` wrap and no CLI-context, so a typo like `-type int32 -value foo` surfaces as `strconv.ParseInt: parsing "foo": invalid syntax` without naming the flag — out of line with the GoStyleGuide "wrap errors with useful context using `%w`" rule (Client.Go-017). |
| 5 | Security | No issues found — TLS-by-default with TLS 1.2 floor, API-key redaction in CLI JSON output, no secret logging. |
| 6 | Performance & resource management | No issues found — `defer client.Close()` / `defer subscription.Close()` applied consistently; bench-read-bulk preallocates the latency slice. |
| 7 | Design-document adherence | No new issues. Lazy `grpc.NewClient` + readiness probe (Client.Go-005) and the shared `dial` helper (Client.Go-009) are applied uniformly across `Dial` and `DialGalaxy`. |
| 8 | Code organization & conventions | Client.Go-015 resolved. New issue: `runStreamEvents` does not install a signal handler (Ctrl+C kills the process abruptly), while `runGalaxyWatch` does — the two long-running stream commands have divergent shutdown UX (Client.Go-020). |
| 9 | Testing coverage | Client.Go-008/016 resolved. New issue: the six new bulk and bench subcommands (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`) have no CLI-level unit tests — in particular the Client.Go-015 secured-flag-gating fix has no regression test (Client.Go-021). |
| 10 | Documentation & comments | Client.Go-010/012 resolved. New issue: `runGalaxyWatch` parses `-last-seen-deploy-time` with `time.RFC3339` (no fractional seconds), while `parseRfc3339Timestamp` for `-timestamp-value` accepts `time.RFC3339Nano` — the CLI advertises "RFC 3339" for both but quietly differs on sub-second support (Client.Go-019). |
## Findings
@@ -278,3 +281,130 @@ gRPC's generated `Recv()` does return the `io.EOF` sentinel directly today, so t
**Recommendation:** Either delete the unused `watchSendInterval` field and its branch in `WatchDeployEvents`, or add the test it was added for — e.g. one that pumps more than 16 events with a small interval and asserts the consumer keeps up without losing or reordering events. Linking the field to a `// for TestX` comment if it stays would also help.
**Resolution:** 2026-05-20 — Removed the unused `watchSendInterval` field from `fakeGalaxyServer` and the corresponding `if s.watchSendInterval > 0 { ... }` branch in `WatchDeployEvents`; no test set the field, so the dead code path is gone and the fake is leaner. `gofmt -w` reflowed the struct to drop the no-longer-needed field-name padding.
### Client.Go-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | `clients/go/cmd/mxgw-go/main.go:954-991` |
| Status | Resolved |
**Description:** `parseValue` returns the raw `strconv.ParseBool` / `strconv.ParseInt` / `strconv.ParseFloat` error verbatim — no wrap with `%w` and no indication of which CLI flag was the source. A user running `mxgw-go write -type int32 -value foo` sees
```
strconv.ParseInt: parsing "foo": invalid syntax
```
with no mention of `-value`, `-type`, or which subcommand failed. The same pattern hits every typed branch (bool, int32, int64, float, double). Compare with the sibling helpers in the same file: `parseInt32List` wraps with `"invalid item handle %q: %w"` (Client.Go-003 resolution) and `parseRfc3339Timestamp` wraps with `"invalid RFC 3339 timestamp %q: %w"`. `parseValue` was missed and is inconsistent with those two. The GoStyleGuide (`docs/style-guides/GoStyleGuide.md`, "Errors" section) requires "Wrap errors with useful context using `%w`."
**Recommendation:** Wrap each `strconv` error with the offending input and type, e.g. `return nil, fmt.Errorf("invalid %s value %q: %w", valueType, valueText, err)`. The wrapper handles all five typed branches uniformly without a per-branch change.
**Resolution:** 2026-05-20 — Each typed branch of `parseValue` now wraps the bare `strconv` error with `%w` and names the offending flag and value (`"invalid -value for -type %s: %q: %w"`), so `mxgw-go write -type int32 -value foo` surfaces the source flag, the requested type, and the bad token while still letting `errors.Is/As` reach the underlying `strconv` sentinel. The new `TestParseValueWrapsStrconvErrorWithFlagContext` table-test pins all five typed branches (bool, int32, int64, float, double) to the new wrapper shape.
### Client.Go-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | `clients/go/cmd/mxgw-go/main.go:593-623` |
| Status | Resolved |
**Description:** `runBenchReadBulk`'s warm-up and steady-state loops are wall-clock-only:
```go
for time.Now().Before(warmupDeadline) {
_, _ = session.ReadBulk(ctx, serverHandle, tags, timeout)
}
...
for time.Now().Before(steadyDeadline) {
callStart := time.Now()
results, err := session.ReadBulk(ctx, serverHandle, tags, timeout)
...
}
```
Neither loop checks `ctx.Done()` / `ctx.Err()`. If the parent context is cancelled (e.g. the operator Ctrl+Cs the benchmark, or the cross-language bench driver `scripts/bench-read-bulk.ps1` times out and kills the child early), the loops keep iterating until their wall-clock deadlines elapse. Each `ReadBulk` call inside fails fast (the gRPC call inherits the cancelled context and returns `context.Canceled`), but the steady-state loop counts those as `failedCalls++` and keeps spinning, wasting CPU and inflating the `failedCalls` and `latencyMs.max` figures the PowerShell driver collates across all five clients. The .NET, Rust, Python, and Java bench drivers should be checked for the same shape, but the Go one is the only one being reviewed here. Note that `runBenchReadBulk` does not register its own signal handler (compare with `runGalaxyWatch`, which does via `signal.NotifyContext`).
**Recommendation:** Drop out of both loops as soon as `ctx.Err() != nil`. Concretely, change the loop conditions to `for time.Now().Before(warmupDeadline) && ctx.Err() == nil` (and the same on `steadyDeadline`), or use a `select { case <-ctx.Done(): break loop; default: }` guard at the top of each iteration. The cross-language bench shape (`durationMs`, `totalCalls`, `failedCalls`, `latencyMs`) stays the same — the bench just exits sooner and reports the truncated window faithfully.
**Resolution:** 2026-05-20 — Both the warm-up and steady-state loops in `runBenchReadBulk` now carry an `&& ctx.Err() == nil` guard alongside the wall-clock check, so a cancelled parent context (Ctrl+C, or the cross-language bench driver killing the child early) breaks the loop instead of spinning failing `ReadBulk` calls until the deadline elapses. The cross-language bench JSON shape is unchanged — the truncated window is just reported faithfully via `durationMs` / `totalCalls`.
### Client.Go-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | `clients/go/cmd/mxgw-go/main.go:710-716`, `clients/go/cmd/mxgw-go/main.go:1204,1213` |
| Status | Resolved |
**Description:** The CLI advertises two timestamp flags as "RFC3339" but parses them with different layouts:
- `-timestamp-value` (write2/write-secured2 bulk): `parseRfc3339Timestamp` uses `time.RFC3339Nano`, which accepts both `2026-04-28T10:00:00Z` and `2026-04-28T10:00:00.123456789Z`.
- `-last-seen-deploy-time` (galaxy-watch): `time.Parse(time.RFC3339, ...)`, which rejects fractional seconds.
A user copy-pasting an `ObservedAt` timestamp from `galaxy-watch -json` (which is emitted as `RFC3339Nano` by `formatDeployEvent`) directly into `-last-seen-deploy-time` will get a parse error if the source value carried a fractional component, even though both flag descriptions say "RFC3339". The flag help string at `main.go:1204` literally says "RFC3339 timestamp", and the README example uses `2026-04-28T10:00:00Z` (whole seconds only), so the issue is silent until a fractional timestamp comes from the gateway.
**Recommendation:** Switch the `galaxy-watch` parse to `time.RFC3339Nano` to match `parseRfc3339Timestamp` (and the gateway's own emit format). One line change at `main.go:1213`. While there, update the flag help string and the README example to say "RFC 3339 (with optional fractional seconds)" so the two flags are documented uniformly.
**Resolution:** 2026-05-20 — `runGalaxyWatch` now parses `-last-seen-deploy-time` with `time.RFC3339Nano`, matching `parseRfc3339Timestamp` and the gateway's own `formatDeployEvent` emit format; the layout is strictly broader than the previous `time.RFC3339` (whole-second values still parse). The flag help string changed to "RFC 3339 timestamp (with optional fractional seconds)" and the `clients/go/README.md` example was extended with an explicit fractional-seconds line so the two flags advertise the same surface.
### Client.Go-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | `clients/go/cmd/mxgw-go/main.go:753-802`, `clients/go/cmd/mxgw-go/main.go:1199-1275` |
| Status | Resolved |
**Description:** The two long-running stream commands have divergent Ctrl+C UX:
- `runGalaxyWatch` registers a signal handler:
```go
signalCtx, stopSignals := signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM)
defer stopSignals()
streamCtx, cancelStream := context.WithCancel(signalCtx)
```
so Ctrl+C drains buffered events and returns cleanly.
- `runStreamEvents` does not register any signal handler — its parent context is `context.Background()` from `runWithIO`, so Ctrl+C abruptly kills the process. The deferred `subscription.Close()` and `client.Close()` never run, leaving the server-side stream to fault out on a torn TCP connection rather than a clean cancel.
The two commands are otherwise structurally identical (subscribe + loop until limit or external stop) — the inconsistency is one half of a pair that was missed when `galaxy-watch` was added. Worth flagging because it directly affects what an integrator who Ctrl+Cs `stream-events` sees in the gateway's logs (a transport reset rather than a `codes.Canceled`).
**Recommendation:** Mirror the `runGalaxyWatch` pattern in `runStreamEvents`: wrap `ctx` in `signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM)`, derive `streamCtx` from it, and let `defer subscription.Close()` / `defer cancelStream()` tear the stream down on signal. The change is roughly six lines and brings the two stream commands into parity. Optionally factor a shared `withSignals(ctx) (context.Context, context.CancelFunc)` helper if a third stream command lands.
**Resolution:** 2026-05-20 — `runStreamEvents` now installs `signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM)` (with a deferred `stopSignals()`) and derives `streamCtx` from the resulting signal-aware context, mirroring `runGalaxyWatch`. Ctrl+C now cancels the gRPC stream cleanly — the gateway sees `codes.Canceled` instead of a torn TCP connection — and the deferred `subscription.Close()` / `client.Close()` actually run on signal. The two long-running stream commands now share the same shutdown UX.
### Client.Go-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | `clients/go/cmd/mxgw-go/main_test.go`, `clients/go/cmd/mxgw-go/main.go:363-520,522-655` |
| Status | Resolved |
**Description:** The six bulk / bench subcommands wired into `run` (`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, `bench-read-bulk`) have **no CLI-level unit tests** in `main_test.go`. In particular, the Client.Go-015 resolution claims:
> `-current-user-id` / `-verifier-user-id` are only registered for the secured variants and `-user-id` only for Write/Write2, so a wrong-variant flag now fails with a clean `flag provided but not defined` usage error instead of silently no-op'ing.
But there is no test asserting that, e.g., `mxgw-go write-bulk -current-user-id 1 ...` returns a "flag provided but not defined" error, or that `mxgw-go write-secured-bulk -user-id 1 ...` does the same. A future refactor of `runWriteBulkVariant` (notably one that re-introduced the `secured` parameter) could silently re-permit the wrong flags without breaking any test. The same gap applies to: parameter validation in `runReadBulk` (bulk size, empty session/items rejection), the value-count vs handle-count mismatch error in `runWriteBulkVariant:447`, and `runBenchReadBulk`'s `bulk-size`/`duration-seconds` positivity checks.
`mxgateway/client_session_test.go` already covers the library-level happy paths (`TestWriteBulkBuildsOneBulkCommandAndReturnsPerEntryResults`, `TestReadBulkForwardsTimeoutAndUnpacksCachedFlag`, `TestSubscribeBulkBuildsOneBulkCommandAndReturnsResults`), so this finding is about CLI surface area only.
**Recommendation:** Add table-driven tests in `cmd/mxgw-go/main_test.go` along the existing `TestParseInt32List*` and `TestParseValueBuildsTypedValue` style:
- `TestRunWriteBulkVariantGatesSecuredFlags`: invoke `runWithIO` with `write-bulk -current-user-id 1 ...` and `write-secured-bulk -user-id 1 ...`, assert each returns an error matching `flag provided but not defined`.
- `TestRunReadBulkRejectsMissingArgs`: invoke `runWithIO` with `read-bulk` (no `-session-id`), assert the documented "session-id and items are required" error.
- `TestRunBenchReadBulkRejectsNonPositiveBulkSize` / `TestRunBenchReadBulkRejectsNonPositiveDuration`: pin the positivity checks at `main.go:544-549`.
- `TestRunWriteBulkVariantRejectsMismatchedHandlesAndValues`: pin the `len(handles) != len(valueTexts)` error at `main.go:447`.
Each is a few lines and routes through the existing `runWithIO` entry point, so it does not need a bufconn fake.
**Resolution:** 2026-05-20 — Added CLI-level table-driven regression tests in `cmd/mxgw-go/main_test.go` routed through `runWithIO`, so they need no bufconn fake: `TestRunWriteBulkVariantGatesSecuredFlags` pins Client.Go-015 by asserting `write-bulk -current-user-id`, `write-bulk -verifier-user-id`, `write2-bulk -current-user-id`, `write-secured-bulk -user-id`, and `write-secured2-bulk -user-id` all surface `flag provided but not defined`; `TestRunReadBulkRejectsMissingArgs` pins the "session-id and items are required" check across no-flags / missing-items / missing-session-id; `TestRunBenchReadBulkRejectsNonPositiveBulkSize` and `TestRunBenchReadBulkRejectsNonPositiveDuration` pin the positivity checks; `TestRunWriteBulkVariantRejectsMismatchedHandlesAndValues` pins the explicit `item-handles count ... does not match values count ...` error. `go test ./...` passes.