# Code Review — Client.Go | Field | Value | |---|---| | Module | `clients/go` | | Reviewer | Claude Code | | Review date | 2026-05-18 | | Commit reviewed | `3cc53a8` | | Status | Reviewed | | Open findings | 0 | ## Checklist coverage | # | Category | Result | |---|---|---| | 1 | Correctness & logic bugs | Issues found: a typed-nil `Unwrap`/`errors.As` trap (Client.Go-001), a CLI `panic` on malformed input (Client.Go-003), empty-string correlation id on rand failure (Client.Go-007). | | 2 | mxaccessgw conventions | Generally good; two test files fail `gofmt`, breaking the documented workflow (Client.Go-004). | | 3 | Concurrency & thread safety | No issues found — stream goroutines and cancellation are sound. | | 4 | Error handling & resilience | Issues found: the compatibility event path silently drops events (Client.Go-002); no transient/permanent classification (Client.Go-006). | | 5 | Security | No issues found — TLS by default with a TLS 1.2 floor, API key redaction, no secret logging. | | 6 | Performance & resource management | No issues found — connections/streams closed via deferred `Close`/`cancel`. | | 7 | Design-document adherence | Issues found: deprecated `grpc.DialContext`+`WithBlock` usage and a missing error taxonomy (Client.Go-005, Client.Go-006). | | 8 | Code organization & conventions | Issue found: duplication between `Client` and `GalaxyClient` (Client.Go-009). | | 9 | Testing coverage | Issue found: TLS path, `callContext` deadline logic, and `NativeValue`/`NativeArray` edges untested (Client.Go-008). | | 10 | Documentation & comments | Issue found: a stale `WithBlock` dial-cancellation claim (Client.Go-010). | ## Findings ### Client.Go-001 | Field | Value | |---|---| | Severity | High | | Category | Correctness & logic bugs | | Location | `clients/go/mxgateway/errors.go:88-93`, `clients/go/mxgateway/errors.go:117-128` | | Status | Resolved | **Description:** `MxAccessError.Unwrap` returns `e.Command` directly. `EnsureMxAccessSuccess` constructs `&MxAccessError{Reply: reply}` with `Command` left nil (the HRESULT / failing-`MxStatusProxy` path). When `Command` is a nil `*CommandError`, `Unwrap()` returns a non-nil `error` interface wrapping a nil pointer. Consequently `errors.As(err, &ce)` for `*CommandError` returns `true` while setting `ce` to nil — a caller writing the idiomatic `if errors.As(err, &commandErr) { use commandErr.Status }` nil-dereferences and panics. Verified empirically; the existing test only exercises the populated-`Command` path. **Recommendation:** Make `Unwrap` return an untyped nil when `Command` is nil: `if e == nil || e.Command == nil { return nil }; return e.Command`. Add a test for the HRESULT-only `MxAccessError` asserting `errors.As(err, &ce)` is `false`. **Resolution:** Resolved 2026-05-18: `MxAccessError.Unwrap` now returns an untyped nil when `Command` is nil, so `errors.As` no longer binds a typed-nil `*CommandError`; added `errors_test.go` regression coverage for the HRESULT-only and populated-`Command` paths. ### Client.Go-002 | Field | Value | |---|---| | Severity | Medium | | Category | Error handling & resilience | | Location | `clients/go/mxgateway/session.go:440-516` | | Status | Resolved | **Description:** For the `Events`/`EventsAfter` compatibility API (`cancelWhenResultBufferFull == true`), when the 16-slot `results` channel is full `sendEventResult` cancels and returns `false`; the goroutine returns and `close(results)` runs — the consumer sees the channel close with **no `EventResult{Err: ...}` ever delivered**. A slow consumer cannot distinguish "stream ended normally" from "events were silently dropped." This contradicts the design doc's "libraries should not reorder, coalesce, or drop events by default", and a test currently pins this lossy behaviour. **Recommendation:** Before cancelling on a full buffer, deliver a terminal `EventResult` carrying an explicit error (e.g. `ErrEventBufferOverflow`). Document the behaviour on `Session.Events`; steer callers to `SubscribeEvents` (which blocks instead of dropping). **Resolution:** Resolved 2026-05-18: confirmed against source — on a full bounded buffer the compatibility path cancelled and closed `results` with no terminal result. Added the exported sentinel `ErrEventBufferOverflow` (`errors.go`); `sendEventResult` now, on a full buffer, cancels the stream then calls the new `deliverTerminalResult` helper, which evicts one of the oldest buffered events to make room and places `EventResult{Err: ErrEventBufferOverflow}` so it becomes the consumer's last item before the channel closes. The previously lossy regression test (`TestEventsAfterCancelsStreamWhenCompatibilityChannelIsAbandoned`) was re-pointed to assert the terminal `ErrEventBufferOverflow` result is delivered. `clients/go/README.md` now documents the bounded-buffer/overflow behaviour and steers no-loss callers to `SubscribeEvents`. ### Client.Go-003 | Field | Value | |---|---| | Severity | Medium | | Category | Correctness & logic bugs | | Location | `clients/go/cmd/mxgw-go/main.go:517-532` | | Status | Resolved | **Description:** `parseInt32List` calls `panic(err)` when an `item-handles` token fails to parse as an int32. The CLI is a documented user-facing tool; a typo like `-item-handles 1,foo` crashes the process with an unrecovered panic and stack trace instead of returning a clean error and exit code 2 like every other validation path in `main.go`. **Recommendation:** Change `parseInt32List` to return `([]int32, error)` and have `runUnsubscribeBulk` propagate the error, matching `parseValue`'s pattern. **Resolution:** Resolved 2026-05-18: confirmed against source — `parseInt32List` called `panic(err)` on a malformed token. It now returns `([]int32, error)`, wrapping the bad token (`invalid item handle %q: %w`); `runUnsubscribeBulk` parses item handles before dialing and returns the error, so a typo flows through `runWithIO` to `os.Exit(2)` like other validation paths. Regression tests `TestParseInt32ListParsesValidTokens` and `TestParseInt32ListReturnsErrorOnMalformedToken` added to `cmd/mxgw-go/main_test.go`. ### Client.Go-004 | Field | Value | |---|---| | Severity | Low | | Category | mxaccessgw conventions | | Location | `clients/go/mxgateway/alarms_test.go:153-154`, `clients/go/mxgateway/galaxy_test.go:58-59` | | Status | Resolved | **Description:** `gofmt -l` flags `alarms_test.go` and `galaxy_test.go` for misaligned struct-literal field padding. The Go client README lists `gofmt` as part of the workflow and the repo enforces style; unformatted committed code breaks `gofmt`-gated checks and CI. **Recommendation:** Run `gofmt -w mxgateway/alarms_test.go mxgateway/galaxy_test.go`. **Resolution:** Resolved 2026-05-18: confirmed `gofmt -l .` flagged both files for misaligned struct-literal padding. Ran `gofmt -w` on `mxgateway/alarms_test.go` and `mxgateway/galaxy_test.go`; `gofmt -l .` is now clean for the whole module. ### Client.Go-005 | Field | Value | |---|---| | Severity | Low | | Category | Design-document adherence | | Location | `clients/go/mxgateway/client.go:64,68`, `clients/go/mxgateway/galaxy.go:83,87` | | Status | Resolved | **Description:** The client uses `grpc.DialContext` with `grpc.WithBlock()`. In current grpc-go both are deprecated in favour of `grpc.NewClient` (lazy connection). `WithBlock` also changes failure semantics: a transient gateway-unavailable at dial time becomes a hard `Dial` error rather than a connection that recovers when the gateway comes up, working against the design doc's resilience intent. **Recommendation:** Migrate to `grpc.NewClient`; if a fail-fast connect probe is still wanted, do an explicit readiness wait bounded by `DialTimeout`, and update the doc comment. **Resolution:** Resolved 2026-05-18: confirmed `Dial`/`DialGalaxy` used the deprecated `grpc.DialContext` + `grpc.WithBlock` pair. Migrated both to the shared `dial(ctx, opts)` helper, which now builds a lazy connection with `grpc.NewClient` and runs an explicit `waitForReady` readiness probe (`Connect` + `WaitForStateChange` until `connectivity.Ready`) bounded by `DialTimeout` — preserving fail-fast behavior while letting an otherwise lazy connection recover when the gateway is briefly down. Note: `grpc.NewClient` defaults the target scheme to `dns`, so the bufconn test harnesses (`client_session_test.go`, `alarms_test.go`, `galaxy_test.go`) were updated to use `passthrough:///bufnet` so the fake target reaches the context dialer. New tests `TestDialFailsFastWhenGatewayUnreachable` and `TestDialReadinessProbeReachesReady` cover the probe; `go vet` reports no deprecation. `clients/go/README.md` documents the lazy-connect + readiness-probe semantics. ### Client.Go-006 | Field | Value | |---|---| | Severity | Low | | Category | Error handling & resilience | | Location | `clients/go/mxgateway/errors.go:9-130` | | Status | Resolved | **Description:** `docs/ClientLibrariesDesign.md` recommends a high-level error taxonomy (`TransportError`, `AuthenticationError`, `TimeoutError`, etc.). The Go client collapses all transport/gRPC failures into a single `GatewayError` with no way to classify transient (`Unavailable`, `DeadlineExceeded`) vs permanent (`Unauthenticated`, `InvalidArgument`) without manually unwrapping and calling `status.Code`. **Recommendation:** Add a helper (e.g. `IsTransient(err) bool`) or expose the gRPC `codes.Code` on `GatewayError`, so retry/timeout/auth handling can be written without re-parsing the wrapped error. **Resolution:** Resolved 2026-05-18: implemented the recommended classification surface in `errors.go` rather than a full parallel type hierarchy (the existing `GatewayError`/`CommandError`/`MxAccessError` chain already separates transport from protocol from MXAccess failures). Added `GatewayError.Code()` (returns the wrapped gRPC `codes.Code`, `OK` for nil, `Unknown` for a non-status error) and the free function `IsTransient(err error) bool`, which unwraps through `*GatewayError` and any gRPC-status chain and reports `true` for `Unavailable`, `DeadlineExceeded`, `ResourceExhausted`, and `Aborted`. Tests `TestGatewayErrorCode` and `TestIsTransient` cover the matrix; `clients/go/README.md` documents both for retry/timeout/auth handling. ### Client.Go-007 | Field | Value | |---|---| | Severity | Low | | Category | Correctness & logic bugs | | Location | `clients/go/mxgateway/session.go:526-532` | | Status | Resolved | **Description:** `newCorrelationID` returns an empty string when `crypto/rand.Read` fails, silently producing an `MxCommandRequest` with no correlation id. `rand.Read` failure is rare, but the failure mode (untraceable command, no error surfaced) is worse than failing loud, and the empty-id path is untested. **Recommendation:** Either propagate the error up through `invokeCommand`, or fall back to a time/counter-based id rather than an empty string. **Resolution:** Resolved 2026-05-18: confirmed `newCorrelationID` returned `""` on a `rand.Read` failure. It now falls back to a non-empty `"fallback--"` id built from `time.Now().UnixNano()` and a process-wide `atomic.Uint64` monotonic counter, so every command stays traceable even without entropy. The `crypto/rand` call was routed through a `randRead` package variable so the failure path is testable; `TestNewCorrelationIDFallsBackOnRandFailure` simulates a `rand.Read` failure and asserts the fallback id is non-empty, `fallback-` prefixed, and unique, and `TestNewCorrelationIDUsesRandEntropy` pins the happy path. ### Client.Go-008 | Field | Value | |---|---| | Severity | Low | | Category | Testing coverage | | Location | `clients/go/mxgateway/` (test files) | | Status | Resolved | **Description:** Several critical paths are untested: TLS credential resolution in `resolveTransportCredentials` (only the `Plaintext` path is exercised); the `callContext` deadline-shortening logic (`client.go:198-204`) including the negative-timeout disable case; and `NativeValue`/`NativeArray` for the array, raw-bytes, null, and unsupported-kind branches. **Recommendation:** Add unit tests for `resolveTransportCredentials` precedence, `callContext` deadline arithmetic, and `NativeValue`/`NativeArray` round-trips for every kind. **Resolution:** Resolved 2026-05-18: added `clients/go/mxgateway/coverage_test.go`. `TestResolveTransportCredentialsPrecedence` exercises every branch (explicit `TransportCredentials`, `Plaintext`, missing `CACertFile` error, `TLSConfig` + `ServerNameOverride`, default TLS floor) and `TestResolveTransportCredentialsDoesNotMutateTLSConfig` confirms the supplied `*tls.Config` is cloned. `TestCallContextDeadlineArithmetic` covers zero/default, negative-disable, positive timeout, caller-deadline-sooner-kept, and caller-deadline-later-shortened. `TestNativeValueEdgeKinds`, `TestNativeArrayEdgeKinds`, and `TestNativeValueUnsupportedKind` cover the null, raw-bytes (including the no-alias copy), array, timestamp-with-nil, and unsupported-kind branches. ### Client.Go-009 | Field | Value | |---|---| | Severity | Low | | Category | Code organization & conventions | | Location | `clients/go/mxgateway/galaxy.go:60-93,241-256`, `clients/go/mxgateway/client.go:41-74,190-205` | | Status | Resolved | **Description:** `DialGalaxy`/`Dial` and `GalaxyClient.callContext`/`Client.callContext` are near-identical duplicates (dial-context setup, credential resolution, dial-option assembly, deadline arithmetic). A fix to one (e.g. the Client.Go-005 dial migration) must be applied twice and can drift. **Recommendation:** Extract a shared unexported `dial(ctx, opts)` and a free `callContext(opts, ctx)` function, and have both client constructors call them. **Resolution:** Resolved 2026-05-18: extracted the shared unexported `dial(ctx, opts) (*grpc.ClientConn, error)` (credential resolution, dial-option assembly, `grpc.NewClient`, readiness probe) and the free `callContext(ctx, callTimeout) (context.Context, context.CancelFunc)` into `client.go`. `Dial`/`DialGalaxy` and both `(*Client).callContext`/`(*GalaxyClient).callContext` methods now delegate to them; the duplicated dial and deadline code in `galaxy.go` was removed (its now-unused `errors` import dropped). This was done together with the Client.Go-005 migration so the `grpc.NewClient` change lives in exactly one place. ### Client.Go-010 | Field | Value | |---|---| | Severity | Low | | Category | Documentation & comments | | Location | `clients/go/mxgateway/client.go:39-40` | | Status | Resolved | **Description:** The `Dial` doc comment states it configures "blocking dial cancellation from ctx." This describes the deprecated `WithBlock` behaviour; once Client.Go-005 is addressed the comment is misleading about how connection establishment and cancellation work. **Recommendation:** Reword to describe the actual connect/timeout semantics after resolving Client.Go-005, and clarify that `DialTimeout` bounds the initial connect attempt. **Resolution:** Resolved 2026-05-18: alongside the Client.Go-005 migration, the `Dial` doc comment was rewritten to describe the lazy `grpc.NewClient` connection, the `DialTimeout`-bounded (default 10s, or ctx deadline when sooner) readiness probe, that a briefly-unavailable gateway recovers instead of producing a hard error, and that cancelling `ctx` aborts the probe. `DialGalaxy` and the new `dial`/`waitForReady`/`callContext` helpers carry matching doc comments.