Restores the `code-reviews/` tree (was unwritten on this working copy)
and re-reviews every module per `REVIEW-PROCESS.md` against HEAD
`d692232`. The diff in scope is the five commits since the last sweep:
`dc9c0c9` (ZB.MOM.WW gateway-side rename + slnx migrate),
`397d3c5` (client SDK rename + the missing alarm-RPC proto types and
the .NET DiscoverHierarchyOptions POCO), `27ed651` (role-based LDAP
auth + HubToken bearer, drop PathBase), `6594359` (sidebar layout +
three SignalR push hubs), and `d692232` (EventsHub publisher + doc
refresh).
Module status
| Module | Open | Total | Delta this pass |
|---|---|---|---|
| Server | 8 | 43 | +6 |
| Contracts | 2 | 17 | +2 |
| Tests | 2 | 26 | +2 |
| IntegrationTests | 3 | 24 | +3 |
| Client.Java | 5 | 31 | +5 |
| Client.Rust | 1 | 21 | +1 |
| Worker | 0 | 25 | 0 (rename-only diff, clean) |
| Worker.Tests | 0 | 30 | 0 (rename-only diff, clean) |
| Client.Dotnet | 0 | 17 | 0 (rename + alarm-fix diff, clean) |
| Client.Python | 0 | 21 | 0 (rename + alarm-fix diff, clean) |
| Client.Go | 0 | 21 | 0 (rename + alarm-fix diff, clean) |
Total new findings: 19. Severity breakdown: 1 Medium-security
(Server-038), 4 Medium-documentation/coverage, 14 Low.
New findings
* Server-038 (Medium / Security) — EventsHub.SubscribeSession accepts
any session id from any Viewer; no per-session ACL guards the
EventsHub group fan-out.
* Server-039 (Low / Error handling) — HubTokenService.Validate
accepts a payload with null Name/NameIdentifier.
* Server-040 (Low / Conventions) — MapGroupsToRoles undocumented
full-vs-RDN lookup precedence.
* Server-041 (Low / Design adherence) — EventStreamService calls
IDashboardEventBroadcaster.Publish without a try/catch — fragile
seam relying on the never-throw contract.
* Server-042 (Low / Performance) — DashboardSnapshotPublisher tight
retry loop with no backoff (vs AlarmsHubPublisher 5s delay).
* Server-043 (Low / Documentation) — HubTokenService singleton
sharing across login + hub-token validation undocumented.
* Contracts-016 (Low / Conventions) — QueryActiveAlarmsRequest.session_id
reserved-for-future-use ambiguity.
* Contracts-017 (Low / Documentation) — rpc QueryActiveAlarms doc
omits the alarm_filter_prefix filter description.
* Tests-025 (Low / Conventions) — duplicate NullDashboardEventBroadcaster
fakes in EventStreamServiceTests and GatewayEndToEndFakeWorkerSmokeTests.
* Tests-026 (Medium / Testing coverage) — no test proves
EventStreamService actually calls IDashboardEventBroadcaster.Publish.
* IntegrationTests-022 (Low / Conventions) — ResolveRepositoryRoot
silent fallback to Directory.GetCurrentDirectory().
* IntegrationTests-023 (Low / Testing coverage) — DashboardLdapLiveTests
success-path asserts ldap_group but not the Role claim.
* IntegrationTests-024 (Low / Conventions) — inline
NullDashboardEventBroadcaster fake duplicates Tests-side copies.
* Client.Java-027 (Medium / Documentation) — README + JavaClientDesign
Gradle task names still use the old short project names.
* Client.Java-028 (Medium / Design adherence) — JavaClientDesign
build-layout shows the old `com/dohertylan/mxgateway/` package paths.
* Client.Java-029 (Low / Documentation) — README installDist path
cites the wrong directory.
* Client.Java-030 (Low / Testing coverage) — no Java test exercises
the regenerated QueryActiveAlarmsRequest RPC.
* Client.Java-031 (Low / Conventions) — README prose uses old short
project names instead of canonical prefixed ones.
* Client.Rust-021 (Low / Design adherence) — RustClientDesign.md
"Crate layout" shows an aspirational nested `crates/zb-mom-ww-mxgateway-client/`
that does not exist; actual layout is the flat top-level crate.
Two pre-existing pending findings (Server-031 lock-contention,
Server-032 bounded event channel) remain unchanged — neither was
touched by this wave of commits.
Process notes
- The `code-reviews/` tree was not in this working copy's git
history (the local extract pre-dates the divergent branch that
carried the reviews). Restored from `dd7ca16` via
`git checkout dd7ca16 -- code-reviews/` before the re-review.
- Some "Resolved" entries in the restored findings.md reference
fixes that landed on the divergent branch (the same one that
carried the reviews) and are not present on the current main
lineage. The re-review treats those statuses as historical;
the new pass only files findings against HEAD's actual state.
- `python code-reviews/regen-readme.py --check` is green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
40 KiB
Code Review — Client.Go
| Field | Value |
|---|---|
| Module | clients/go |
| Reviewer | Claude Code |
| Review date | 2026-05-24 |
| Commit reviewed | d692232 |
| Status | Reviewed |
| Open findings | 0 |
Checklist coverage
A re-review of commit a020350 (which resolved Client.Go-011..016). gofmt -l .,
go vet ./..., go build ./..., and go test ./... -count=1 are all clean.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | Prior Client.Go-001/003/007/011 remain resolved. No new correctness bugs found. |
| 2 | mxaccessgw conventions | gofmt -l . and go vet ./... clean; Client.Go-004 stays resolved. No new issues. |
| 3 | Concurrency & thread safety | Client.Go-013 resolved. New issue: runBenchReadBulk's warm-up + steady-state wall-clock loops ignore ctx cancellation, so a Ctrl+C or parent-cancel keeps spinning ReadBulk calls until the wall-clock deadline (Client.Go-018). |
| 4 | Error handling & resilience | Client.Go-014 resolved. New issue: parseValue returns bare strconv errors with no %w wrap and no CLI-context, so a typo like -type int32 -value foo surfaces as strconv.ParseInt: parsing "foo": invalid syntax without naming the flag — out of line with the GoStyleGuide "wrap errors with useful context using %w" rule (Client.Go-017). |
| 5 | Security | No issues found — TLS-by-default with TLS 1.2 floor, API-key redaction in CLI JSON output, no secret logging. |
| 6 | Performance & resource management | No issues found — defer client.Close() / defer subscription.Close() applied consistently; bench-read-bulk preallocates the latency slice. |
| 7 | Design-document adherence | No new issues. Lazy grpc.NewClient + readiness probe (Client.Go-005) and the shared dial helper (Client.Go-009) are applied uniformly across Dial and DialGalaxy. |
| 8 | Code organization & conventions | Client.Go-015 resolved. New issue: runStreamEvents does not install a signal handler (Ctrl+C kills the process abruptly), while runGalaxyWatch does — the two long-running stream commands have divergent shutdown UX (Client.Go-020). |
| 9 | Testing coverage | Client.Go-008/016 resolved. New issue: the six new bulk and bench subcommands (read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, bench-read-bulk) have no CLI-level unit tests — in particular the Client.Go-015 secured-flag-gating fix has no regression test (Client.Go-021). |
| 10 | Documentation & comments | Client.Go-010/012 resolved. New issue: runGalaxyWatch parses -last-seen-deploy-time with time.RFC3339 (no fractional seconds), while parseRfc3339Timestamp for -timestamp-value accepts time.RFC3339Nano — the CLI advertises "RFC 3339" for both but quietly differs on sub-second support (Client.Go-019). |
2026-05-24 review (commit d692232)
Re-review pass at d692232. Two commits touched clients/go/: 397d3c5
updated the proto root in generate-proto.ps1 to src/ZB.MOM.WW.MxGateway.Contracts/Protos;
d692232 dropped stale SessionId = lines from AcknowledgeAlarmRequest /
AcknowledgeAlarmReply fixtures in alarms_test.go after the proto retired
the field, and substituted CorrelationId: req.GetClientCorrelationId() for
the fake server's reply. Module path (gitea.dohertylan.com/dohertj2/mxaccessgw/clients/go)
and subpackage mxgateway intentionally unchanged per Go convention.
go build ./... and go test ./... are green at HEAD.
| # | Category | Result |
|---|---|---|
| 1 | Correctness & logic bugs | No issues found in the a020350..d692232 diff. |
| 2 | mxaccessgw conventions | No issues found — Go convention preserved (short lowercase package name, gitea-scoped module path). The generated _pb.go descriptor still carries MxGateway.Contracts.Proto in its csharp_namespace bytes; wire-level metadata not used by Go, intentionally not regenerated. |
| 3 | Concurrency & thread safety | No issues found in this diff. |
| 4 | Error handling & resilience | No issues found in this diff. |
| 5 | Security | No issues found in this diff. |
| 6 | Performance & resource management | No issues found in this diff. |
| 7 | Design-document adherence | No issues found in this diff. |
| 8 | Code organization & conventions | No issues found in this diff. |
| 9 | Testing coverage | No issues found in this diff — alarms_test.go correctly drops retired session_id from AcknowledgeAlarmRequest and retains it on QueryActiveAlarmsRequest. |
| 10 | Documentation & comments | No issues found in this diff. |
Findings
Client.Go-001
| Field | Value |
|---|---|
| Severity | High |
| Category | Correctness & logic bugs |
| Location | clients/go/mxgateway/errors.go:88-93, clients/go/mxgateway/errors.go:117-128 |
| Status | Resolved |
Description: MxAccessError.Unwrap returns e.Command directly. EnsureMxAccessSuccess constructs &MxAccessError{Reply: reply} with Command left nil (the HRESULT / failing-MxStatusProxy path). When Command is a nil *CommandError, Unwrap() returns a non-nil error interface wrapping a nil pointer. Consequently errors.As(err, &ce) for *CommandError returns true while setting ce to nil — a caller writing the idiomatic if errors.As(err, &commandErr) { use commandErr.Status } nil-dereferences and panics. Verified empirically; the existing test only exercises the populated-Command path.
Recommendation: Make Unwrap return an untyped nil when Command is nil: if e == nil || e.Command == nil { return nil }; return e.Command. Add a test for the HRESULT-only MxAccessError asserting errors.As(err, &ce) is false.
Resolution: Resolved 2026-05-18: MxAccessError.Unwrap now returns an untyped nil when Command is nil, so errors.As no longer binds a typed-nil *CommandError; added errors_test.go regression coverage for the HRESULT-only and populated-Command paths.
Client.Go-002
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Error handling & resilience |
| Location | clients/go/mxgateway/session.go:440-516 |
| Status | Resolved |
Description: For the Events/EventsAfter compatibility API (cancelWhenResultBufferFull == true), when the 16-slot results channel is full sendEventResult cancels and returns false; the goroutine returns and close(results) runs — the consumer sees the channel close with no EventResult{Err: ...} ever delivered. A slow consumer cannot distinguish "stream ended normally" from "events were silently dropped." This contradicts the design doc's "libraries should not reorder, coalesce, or drop events by default", and a test currently pins this lossy behaviour.
Recommendation: Before cancelling on a full buffer, deliver a terminal EventResult carrying an explicit error (e.g. ErrEventBufferOverflow). Document the behaviour on Session.Events; steer callers to SubscribeEvents (which blocks instead of dropping).
Resolution: Resolved 2026-05-18: confirmed against source — on a full bounded buffer the compatibility path cancelled and closed results with no terminal result. Added the exported sentinel ErrEventBufferOverflow (errors.go); sendEventResult now, on a full buffer, cancels the stream then calls the new deliverTerminalResult helper, which evicts one of the oldest buffered events to make room and places EventResult{Err: ErrEventBufferOverflow} so it becomes the consumer's last item before the channel closes. The previously lossy regression test (TestEventsAfterCancelsStreamWhenCompatibilityChannelIsAbandoned) was re-pointed to assert the terminal ErrEventBufferOverflow result is delivered. clients/go/README.md now documents the bounded-buffer/overflow behaviour and steers no-loss callers to SubscribeEvents.
Client.Go-003
| Field | Value |
|---|---|
| Severity | Medium |
| Category | Correctness & logic bugs |
| Location | clients/go/cmd/mxgw-go/main.go:517-532 |
| Status | Resolved |
Description: parseInt32List calls panic(err) when an item-handles token fails to parse as an int32. The CLI is a documented user-facing tool; a typo like -item-handles 1,foo crashes the process with an unrecovered panic and stack trace instead of returning a clean error and exit code 2 like every other validation path in main.go.
Recommendation: Change parseInt32List to return ([]int32, error) and have runUnsubscribeBulk propagate the error, matching parseValue's pattern.
Resolution: Resolved 2026-05-18: confirmed against source — parseInt32List called panic(err) on a malformed token. It now returns ([]int32, error), wrapping the bad token (invalid item handle %q: %w); runUnsubscribeBulk parses item handles before dialing and returns the error, so a typo flows through runWithIO to os.Exit(2) like other validation paths. Regression tests TestParseInt32ListParsesValidTokens and TestParseInt32ListReturnsErrorOnMalformedToken added to cmd/mxgw-go/main_test.go.
Client.Go-004
| Field | Value |
|---|---|
| Severity | Low |
| Category | mxaccessgw conventions |
| Location | clients/go/mxgateway/alarms_test.go:153-154, clients/go/mxgateway/galaxy_test.go:58-59 |
| Status | Resolved |
Description: gofmt -l flags alarms_test.go and galaxy_test.go for misaligned struct-literal field padding. The Go client README lists gofmt as part of the workflow and the repo enforces style; unformatted committed code breaks gofmt-gated checks and CI.
Recommendation: Run gofmt -w mxgateway/alarms_test.go mxgateway/galaxy_test.go.
Resolution: Resolved 2026-05-18: confirmed gofmt -l . flagged both files for misaligned struct-literal padding. Ran gofmt -w on mxgateway/alarms_test.go and mxgateway/galaxy_test.go; gofmt -l . is now clean for the whole module.
Client.Go-005
| Field | Value |
|---|---|
| Severity | Low |
| Category | Design-document adherence |
| Location | clients/go/mxgateway/client.go:64,68, clients/go/mxgateway/galaxy.go:83,87 |
| Status | Resolved |
Description: The client uses grpc.DialContext with grpc.WithBlock(). In current grpc-go both are deprecated in favour of grpc.NewClient (lazy connection). WithBlock also changes failure semantics: a transient gateway-unavailable at dial time becomes a hard Dial error rather than a connection that recovers when the gateway comes up, working against the design doc's resilience intent.
Recommendation: Migrate to grpc.NewClient; if a fail-fast connect probe is still wanted, do an explicit readiness wait bounded by DialTimeout, and update the doc comment.
Resolution: Resolved 2026-05-18: confirmed Dial/DialGalaxy used the deprecated grpc.DialContext + grpc.WithBlock pair. Migrated both to the shared dial(ctx, opts) helper, which now builds a lazy connection with grpc.NewClient and runs an explicit waitForReady readiness probe (Connect + WaitForStateChange until connectivity.Ready) bounded by DialTimeout — preserving fail-fast behavior while letting an otherwise lazy connection recover when the gateway is briefly down. Note: grpc.NewClient defaults the target scheme to dns, so the bufconn test harnesses (client_session_test.go, alarms_test.go, galaxy_test.go) were updated to use passthrough:///bufnet so the fake target reaches the context dialer. New tests TestDialFailsFastWhenGatewayUnreachable and TestDialReadinessProbeReachesReady cover the probe; go vet reports no deprecation. clients/go/README.md documents the lazy-connect + readiness-probe semantics.
Client.Go-006
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | clients/go/mxgateway/errors.go:9-130 |
| Status | Resolved |
Description: docs/ClientLibrariesDesign.md recommends a high-level error taxonomy (TransportError, AuthenticationError, TimeoutError, etc.). The Go client collapses all transport/gRPC failures into a single GatewayError with no way to classify transient (Unavailable, DeadlineExceeded) vs permanent (Unauthenticated, InvalidArgument) without manually unwrapping and calling status.Code.
Recommendation: Add a helper (e.g. IsTransient(err) bool) or expose the gRPC codes.Code on GatewayError, so retry/timeout/auth handling can be written without re-parsing the wrapped error.
Resolution: Resolved 2026-05-18: implemented the recommended classification surface in errors.go rather than a full parallel type hierarchy (the existing GatewayError/CommandError/MxAccessError chain already separates transport from protocol from MXAccess failures). Added GatewayError.Code() (returns the wrapped gRPC codes.Code, OK for nil, Unknown for a non-status error) and the free function IsTransient(err error) bool, which unwraps through *GatewayError and any gRPC-status chain and reports true for Unavailable, DeadlineExceeded, ResourceExhausted, and Aborted. Tests TestGatewayErrorCode and TestIsTransient cover the matrix; clients/go/README.md documents both for retry/timeout/auth handling.
Client.Go-007
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | clients/go/mxgateway/session.go:526-532 |
| Status | Resolved |
Description: newCorrelationID returns an empty string when crypto/rand.Read fails, silently producing an MxCommandRequest with no correlation id. rand.Read failure is rare, but the failure mode (untraceable command, no error surfaced) is worse than failing loud, and the empty-id path is untested.
Recommendation: Either propagate the error up through invokeCommand, or fall back to a time/counter-based id rather than an empty string.
Resolution: Resolved 2026-05-18: confirmed newCorrelationID returned "" on a rand.Read failure. It now falls back to a non-empty "fallback-<unixnano>-<counter>" id built from time.Now().UnixNano() and a process-wide atomic.Uint64 monotonic counter, so every command stays traceable even without entropy. The crypto/rand call was routed through a randRead package variable so the failure path is testable; TestNewCorrelationIDFallsBackOnRandFailure simulates a rand.Read failure and asserts the fallback id is non-empty, fallback- prefixed, and unique, and TestNewCorrelationIDUsesRandEntropy pins the happy path.
Client.Go-008
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | clients/go/mxgateway/ (test files) |
| Status | Resolved |
Description: Several critical paths are untested: TLS credential resolution in resolveTransportCredentials (only the Plaintext path is exercised); the callContext deadline-shortening logic (client.go:198-204) including the negative-timeout disable case; and NativeValue/NativeArray for the array, raw-bytes, null, and unsupported-kind branches.
Recommendation: Add unit tests for resolveTransportCredentials precedence, callContext deadline arithmetic, and NativeValue/NativeArray round-trips for every kind.
Resolution: Resolved 2026-05-18: added clients/go/mxgateway/coverage_test.go. TestResolveTransportCredentialsPrecedence exercises every branch (explicit TransportCredentials, Plaintext, missing CACertFile error, TLSConfig + ServerNameOverride, default TLS floor) and TestResolveTransportCredentialsDoesNotMutateTLSConfig confirms the supplied *tls.Config is cloned. TestCallContextDeadlineArithmetic covers zero/default, negative-disable, positive timeout, caller-deadline-sooner-kept, and caller-deadline-later-shortened. TestNativeValueEdgeKinds, TestNativeArrayEdgeKinds, and TestNativeValueUnsupportedKind cover the null, raw-bytes (including the no-alias copy), array, timestamp-with-nil, and unsupported-kind branches.
Client.Go-009
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | clients/go/mxgateway/galaxy.go:60-93,241-256, clients/go/mxgateway/client.go:41-74,190-205 |
| Status | Resolved |
Description: DialGalaxy/Dial and GalaxyClient.callContext/Client.callContext are near-identical duplicates (dial-context setup, credential resolution, dial-option assembly, deadline arithmetic). A fix to one (e.g. the Client.Go-005 dial migration) must be applied twice and can drift.
Recommendation: Extract a shared unexported dial(ctx, opts) and a free callContext(opts, ctx) function, and have both client constructors call them.
Resolution: Resolved 2026-05-18: extracted the shared unexported dial(ctx, opts) (*grpc.ClientConn, error) (credential resolution, dial-option assembly, grpc.NewClient, readiness probe) and the free callContext(ctx, callTimeout) (context.Context, context.CancelFunc) into client.go. Dial/DialGalaxy and both (*Client).callContext/(*GalaxyClient).callContext methods now delegate to them; the duplicated dial and deadline code in galaxy.go was removed (its now-unused errors import dropped). This was done together with the Client.Go-005 migration so the grpc.NewClient change lives in exactly one place.
Client.Go-010
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | clients/go/mxgateway/client.go:39-40 |
| Status | Resolved |
Description: The Dial doc comment states it configures "blocking dial cancellation from ctx." This describes the deprecated WithBlock behaviour; once Client.Go-005 is addressed the comment is misleading about how connection establishment and cancellation work.
Recommendation: Reword to describe the actual connect/timeout semantics after resolving Client.Go-005, and clarify that DialTimeout bounds the initial connect attempt.
Resolution: Resolved 2026-05-18: alongside the Client.Go-005 migration, the Dial doc comment was rewritten to describe the lazy grpc.NewClient connection, the DialTimeout-bounded (default 10s, or ctx deadline when sooner) readiness probe, that a briefly-unavailable gateway recovers instead of producing a hard error, and that cancelling ctx aborts the probe. DialGalaxy and the new dial/waitForReady/callContext helpers carry matching doc comments.
Client.Go-011
| Field | Value |
|---|---|
| Severity | Low |
| Category | Correctness & logic bugs |
| Location | clients/go/mxgateway/alarms_test.go:66-73 |
| Status | Resolved |
Description: TestAcknowledgeAlarmRejectsNilRequest contains a no-op if with an empty body whose intent is documented in a comment ("Accept either: the helper returned the literal sentinel, or the generic transport error — both prove nil was rejected"). The condition
if err == nil || !errors.Is(err, errors.Unwrap(err)) && err.Error() != "mxgateway: acknowledge alarm request is required" {
// ...
}
evaluates expressions for side effects only and asserts nothing — Go's && binds tighter than ||, the body is empty, and the actual nil check happens on the very next if err == nil. The block is effectively dead code masquerading as a check. It also evaluates errors.Unwrap(err) regardless of err's shape, and would call err.Error() even when err might be a wrapped status error whose message wording the gateway is free to change — making the apparent assertion brittle on top of being dead.
Recommendation: Drop the empty-body if entirely (the subsequent if err == nil { t.Fatalf(...) } already enforces the contract), or, if the intent is to additionally pin the literal error message for the sentinel path, replace it with a real assertion (if err.Error() != "mxgateway: acknowledge alarm request is required" { t.Fatalf(...) }) and remove the spurious errors.Is(err, errors.Unwrap(err)) clause.
Resolution: 2026-05-20 — Removed the empty-body if in TestAcknowledgeAlarmRejectsNilRequest; the subsequent if err == nil { t.Fatalf(...) } already enforces the nil-rejection contract without the dead, brittle compound predicate.
Client.Go-012
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | clients/go/cmd/mxgw-go/main.go:1063-1065, clients/go/cmd/mxgw-go/main.go:88-104 |
| Status | Resolved |
Description: writeUsage lists the available subcommands as version|open-session|close-session|register|add-item|advise|subscribe-bulk|unsubscribe-bulk|write|stream-events|smoke|galaxy-test-connection|galaxy-last-deploy|galaxy-discover|galaxy-watch. Six subcommands wired into run are missing from this list: read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, and bench-read-bulk. A user invoking mxgw-go with no args or an unknown command (the two paths that print this banner) sees an incomplete CLI surface and may believe the bulk-write / read-bulk families are not implemented. The README does document them, but the inline usage banner is the first source of truth a CLI user consults.
Recommendation: Extend the usage string to include every command registered in the switch args[0] in run, or generate it from a single source-of-truth slice keyed on command name → handler so the two cannot drift again.
Resolution: 2026-05-20 — writeUsage now lists the previously missing read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, and bench-read-bulk subcommands alongside the original surface, so the no-args / unknown-command banner reflects every command wired into run.
Client.Go-013
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | clients/go/cmd/mxgw-go/main.go:1246-1249, clients/go/cmd/mxgw-go/main.go:1257-1262 |
| Status | Resolved |
Description: In runGalaxyWatch, the signal-cancellation branch carefully drains the buffered events channel after cancelStream() so the WatchDeployEvents goroutine can exit (for range events { }). The limit-reached branch (if *limit > 0 && count >= *limit { cancelStream(); return nil }) skips that drain and returns immediately. After the function returns, defer client.Close() runs and tears down the gRPC connection; in the gap before the connection close propagates, the WatchDeployEvents goroutine may still be blocked on case events <- event: (the channel is buffered to 16 but a slow producer can refill it) — the goroutine then exits via <-ctx.Done() because streamCtx was cancelled, so it isn't a permanent leak, but the two cancellation paths behave inconsistently and the limit-reached path can briefly hold a goroutine plus the gRPC stream while the client tears down underneath it.
Recommendation: Factor the drain into a helper and use it from both branches, e.g. after cancelStream() always for range events { } (and let the surrounding select/for re-evaluate <-errs if a terminal error was already buffered). Alternatively, drop the explicit drain in both branches and rely on defer cancelStream() plus defer client.Close() — but pick one model and apply it consistently.
Resolution: 2026-05-20 — The limit-reached branch in runGalaxyWatch now drains the buffered events channel (for range events { }) after cancelStream(), matching the signal-cancel branch. Both cancellation paths now wait for the WatchDeployEvents goroutine to exit before defer client.Close() tears the gRPC connection down.
Client.Go-014
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | clients/go/mxgateway/session.go:602, clients/go/mxgateway/galaxy.go:189 |
| Status | Resolved |
Description: Two stream Recv loops compare end-of-stream with err == io.EOF directly:
session.go:602—if err == io.EOF || status.Code(err) == codes.Canceled || streamCtx.Err() != nil { return }galaxy.go:189—if recvErr == io.EOF { return }
gRPC's generated Recv() does return the io.EOF sentinel directly today, so the comparisons work in practice. However, the Go idiom (and the project's docs/style-guides/GoStyleGuide.md) is to use errors.Is(err, io.EOF) so future wrapping (e.g. an interceptor decorating Recv errors) does not silently flip the loop from "stream finished normally" to "stream produced an error". The mxgateway client itself wraps non-EOF Recv errors in *GatewayError, which errors.Is already supports — using errors.Is keeps both paths consistent.
Recommendation: Replace recvErr == io.EOF / err == io.EOF with errors.Is(err, io.EOF) (the errors package is already imported in both files).
Resolution: 2026-05-20 — Both stream Recv loops now use errors.Is(err, io.EOF): session.go already imported errors, and galaxy.go gained the missing errors import alongside the recvErr == io.EOF → errors.Is(recvErr, io.EOF) change, keeping EOF detection robust against any future Recv-error wrapping.
Client.Go-015
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | clients/go/cmd/mxgw-go/main.go:410-512 |
| Status | Resolved |
Description: runWriteBulkVariant(ctx, args, stdout, stderr, command, withTimestamp, secured bool) accepts secured but never uses it — the routing is keyed on command (the string "write-bulk" / "write2-bulk" / "write-secured-bulk" / "write-secured2-bulk"). The function ends with _ = secured // currently only used for routing above; reserved for future per-variant validation, which is misleading because secured is not in fact used for routing. The four wrapper functions (runWriteBulk, runWrite2Bulk, runWriteSecuredBulk, runWriteSecured2Bulk) all pass a secured argument that has no effect. The four CLI options -current-user-id, -verifier-user-id are unconditionally registered on every variant, including the non-secured ones, so a write-bulk invocation that passes -current-user-id 42 silently does nothing. Either remove secured and the dead _ = secured comment, or use it to gate the registration of secured-only flags so wrong combinations are rejected with a clean error.
Recommendation: Drop the secured parameter (the command switch already distinguishes the four variants) and the misleading _ = secured line; or, if validation is the goal, branch flag registration on secured so secured-only flags are unavailable for the non-secured variants and emit a clean usage error if they appear.
Resolution: 2026-05-20 — Dropped the unused secured parameter from runWriteBulkVariant (the command switch already distinguishes the four variants) and removed the misleading _ = secured line. The variant is now derived locally from command and used to gate flag registration: -current-user-id / -verifier-user-id are only registered for the secured variants and -user-id only for Write/Write2, so a wrong-variant flag now fails with a clean flag provided but not defined usage error instead of silently no-op'ing. The four runWrite*Bulk wrappers were updated to match the new signature.
Client.Go-016
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | clients/go/mxgateway/galaxy_test.go:382-429 |
| Status | Resolved |
Description: fakeGalaxyServer.watchSendInterval is declared on the test fake and consulted inside WatchDeployEvents (if s.watchSendInterval > 0 { ... }) but no test in the package sets a non-zero value. The dead field plus its branch were presumably added to support a backpressure / pacing test that was never landed, and now the only effect is reader confusion ("which test uses this?") and a pointlessly larger fake. Backpressure on the bootstrap-plus-events sequence is also genuinely worth testing, given that WatchDeployEvents writes to a 16-deep buffered channel.
Recommendation: Either delete the unused watchSendInterval field and its branch in WatchDeployEvents, or add the test it was added for — e.g. one that pumps more than 16 events with a small interval and asserts the consumer keeps up without losing or reordering events. Linking the field to a // for TestX comment if it stays would also help.
Resolution: 2026-05-20 — Removed the unused watchSendInterval field from fakeGalaxyServer and the corresponding if s.watchSendInterval > 0 { ... } branch in WatchDeployEvents; no test set the field, so the dead code path is gone and the fake is leaner. gofmt -w reflowed the struct to drop the no-longer-needed field-name padding.
Client.Go-017
| Field | Value |
|---|---|
| Severity | Low |
| Category | Error handling & resilience |
| Location | clients/go/cmd/mxgw-go/main.go:954-991 |
| Status | Resolved |
Description: parseValue returns the raw strconv.ParseBool / strconv.ParseInt / strconv.ParseFloat error verbatim — no wrap with %w and no indication of which CLI flag was the source. A user running mxgw-go write -type int32 -value foo sees
strconv.ParseInt: parsing "foo": invalid syntax
with no mention of -value, -type, or which subcommand failed. The same pattern hits every typed branch (bool, int32, int64, float, double). Compare with the sibling helpers in the same file: parseInt32List wraps with "invalid item handle %q: %w" (Client.Go-003 resolution) and parseRfc3339Timestamp wraps with "invalid RFC 3339 timestamp %q: %w". parseValue was missed and is inconsistent with those two. The GoStyleGuide (docs/style-guides/GoStyleGuide.md, "Errors" section) requires "Wrap errors with useful context using %w."
Recommendation: Wrap each strconv error with the offending input and type, e.g. return nil, fmt.Errorf("invalid %s value %q: %w", valueType, valueText, err). The wrapper handles all five typed branches uniformly without a per-branch change.
Resolution: 2026-05-20 — Each typed branch of parseValue now wraps the bare strconv error with %w and names the offending flag and value ("invalid -value for -type %s: %q: %w"), so mxgw-go write -type int32 -value foo surfaces the source flag, the requested type, and the bad token while still letting errors.Is/As reach the underlying strconv sentinel. The new TestParseValueWrapsStrconvErrorWithFlagContext table-test pins all five typed branches (bool, int32, int64, float, double) to the new wrapper shape.
Client.Go-018
| Field | Value |
|---|---|
| Severity | Low |
| Category | Concurrency & thread safety |
| Location | clients/go/cmd/mxgw-go/main.go:593-623 |
| Status | Resolved |
Description: runBenchReadBulk's warm-up and steady-state loops are wall-clock-only:
for time.Now().Before(warmupDeadline) {
_, _ = session.ReadBulk(ctx, serverHandle, tags, timeout)
}
...
for time.Now().Before(steadyDeadline) {
callStart := time.Now()
results, err := session.ReadBulk(ctx, serverHandle, tags, timeout)
...
}
Neither loop checks ctx.Done() / ctx.Err(). If the parent context is cancelled (e.g. the operator Ctrl+Cs the benchmark, or the cross-language bench driver scripts/bench-read-bulk.ps1 times out and kills the child early), the loops keep iterating until their wall-clock deadlines elapse. Each ReadBulk call inside fails fast (the gRPC call inherits the cancelled context and returns context.Canceled), but the steady-state loop counts those as failedCalls++ and keeps spinning — wasting CPU and inflating the failedCalls and latencyMs.max figures the PowerShell driver collates across all five clients. The .NET, Rust, Python, and Java bench drivers should be checked for the same shape, but the Go one is the only one being reviewed here. Note that runBenchReadBulk is the only Go CLI command that does NOT register its own signal handler (compare with runGalaxyWatch which does via signal.NotifyContext).
Recommendation: Drop out of both loops as soon as ctx.Err() != nil. Concretely, change the loop conditions to for time.Now().Before(warmupDeadline) && ctx.Err() == nil (and the same on steadyDeadline), or use a select { case <-ctx.Done(): break loop; default: } guard at the top of each iteration. The cross-language bench shape (durationMs, totalCalls, failedCalls, latencyMs) stays the same — the bench just exits sooner and reports the truncated window faithfully.
Resolution: 2026-05-20 — Both the warm-up and steady-state loops in runBenchReadBulk now carry an && ctx.Err() == nil guard alongside the wall-clock check, so a cancelled parent context (Ctrl+C, or the cross-language bench driver killing the child early) breaks the loop instead of spinning failing ReadBulk calls until the deadline elapses. The cross-language bench JSON shape is unchanged — the truncated window is just reported faithfully via durationMs / totalCalls.
Client.Go-019
| Field | Value |
|---|---|
| Severity | Low |
| Category | Documentation & comments |
| Location | clients/go/cmd/mxgw-go/main.go:710-716, clients/go/cmd/mxgw-go/main.go:1204,1213 |
| Status | Resolved |
Description: The CLI advertises two timestamp flags as "RFC3339" but parses them with different layouts:
-timestamp-value(write2/write-secured2 bulk):parseRfc3339Timestampusestime.RFC3339Nano, which accepts both2026-04-28T10:00:00Zand2026-04-28T10:00:00.123456789Z.-last-seen-deploy-time(galaxy-watch):time.Parse(time.RFC3339, ...), which rejects fractional seconds.
A user copy-pasting an ObservedAt timestamp from galaxy-watch -json (which is emitted as RFC3339Nano by formatDeployEvent) directly into -last-seen-deploy-time will get a parse error if the source value carried a fractional component, even though both flag descriptions say "RFC3339". The flag help string at main.go:1204 literally says "RFC3339 timestamp", and the README example uses 2026-04-28T10:00:00Z (whole seconds only), so the issue is silent until a fractional timestamp comes from the gateway.
Recommendation: Switch the galaxy-watch parse to time.RFC3339Nano to match parseRfc3339Timestamp (and the gateway's own emit format). One line change at main.go:1213. While there, update the flag help string and the README example to say "RFC 3339 (with optional fractional seconds)" so the two flags are documented uniformly.
Resolution: 2026-05-20 — runGalaxyWatch now parses -last-seen-deploy-time with time.RFC3339Nano, matching parseRfc3339Timestamp and the gateway's own formatDeployEvent emit format; the layout is strictly broader than the previous time.RFC3339 (whole-second values still parse). The flag help string changed to "RFC 3339 timestamp (with optional fractional seconds)" and the clients/go/README.md example was extended with an explicit fractional-seconds line so the two flags advertise the same surface.
Client.Go-020
| Field | Value |
|---|---|
| Severity | Low |
| Category | Code organization & conventions |
| Location | clients/go/cmd/mxgw-go/main.go:753-802, clients/go/cmd/mxgw-go/main.go:1199-1275 |
| Status | Resolved |
Description: The two long-running stream commands have divergent Ctrl+C UX:
-
runGalaxyWatchregisters a signal handler:signalCtx, stopSignals := signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM) defer stopSignals() streamCtx, cancelStream := context.WithCancel(signalCtx)so Ctrl+C drains buffered events and returns cleanly.
-
runStreamEventsdoes not register any signal handler — its parent context iscontext.Background()fromrunWithIO, so Ctrl+C abruptly kills the process. The deferredsubscription.Close()andclient.Close()never run, leaving the server-side stream to fault out on a torn TCP connection rather than a clean cancel.
The two commands are otherwise structurally identical (subscribe + loop until limit or external stop) — the inconsistency is one half of a pair that was missed when galaxy-watch was added. Worth flagging because it directly affects what an integrator who Ctrl+Cs stream-events sees in the gateway's logs (a transport reset rather than a codes.Canceled).
Recommendation: Mirror the runGalaxyWatch pattern in runStreamEvents: wrap ctx in signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM), derive streamCtx from it, and let defer subscription.Close() / defer cancelStream() tear the stream down on signal. The change is roughly six lines and brings the two stream commands into parity. Optionally factor a shared withSignals(ctx) (context.Context, context.CancelFunc) helper if a third stream command lands.
Resolution: 2026-05-20 — runStreamEvents now installs signal.NotifyContext(ctx, os.Interrupt, syscall.SIGTERM) (with a deferred stopSignals()) and derives streamCtx from the resulting signal-aware context, mirroring runGalaxyWatch. Ctrl+C now cancels the gRPC stream cleanly — the gateway sees codes.Canceled instead of a torn TCP connection — and the deferred subscription.Close() / client.Close() actually run on signal. The two long-running stream commands now share the same shutdown UX.
Client.Go-021
| Field | Value |
|---|---|
| Severity | Low |
| Category | Testing coverage |
| Location | clients/go/cmd/mxgw-go/main_test.go, clients/go/cmd/mxgw-go/main.go:363-520,522-655 |
| Status | Resolved |
Description: The six bulk / bench subcommands wired into run (read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, bench-read-bulk) have no CLI-level unit tests in main_test.go. In particular, the Client.Go-015 resolution claims:
-current-user-id/-verifier-user-idare only registered for the secured variants and-user-idonly for Write/Write2, so a wrong-variant flag now fails with a cleanflag provided but not definedusage error instead of silently no-op'ing.
But there is no test asserting that, e.g., mxgw-go write-bulk -current-user-id 1 ... returns a "flag provided but not defined" error, or that mxgw-go write-secured-bulk -user-id 1 ... does the same. A future refactor of runWriteBulkVariant (notably one that re-introduced the secured parameter) could silently re-permit the wrong flags without breaking any test. The same gap applies to: parameter validation in runReadBulk (bulk size, empty session/items rejection), the value-count vs handle-count mismatch error in runWriteBulkVariant:447, and runBenchReadBulk's bulk-size/duration-seconds positivity checks.
mxgateway/client_session_test.go already covers the library-level happy paths (TestWriteBulkBuildsOneBulkCommandAndReturnsPerEntryResults, TestReadBulkForwardsTimeoutAndUnpacksCachedFlag, TestSubscribeBulkBuildsOneBulkCommandAndReturnsResults), so this finding is about CLI surface area only.
Recommendation: Add table-driven tests in cmd/mxgw-go/main_test.go along the existing TestParseInt32List* and TestParseValueBuildsTypedValue style:
TestRunWriteBulkVariantGatesSecuredFlags: invokerunWithIOwithwrite-bulk -current-user-id 1 ...andwrite-secured-bulk -user-id 1 ..., assert each returns an error matchingflag provided but not defined.TestRunReadBulkRejectsMissingArgs: invokerunWithIOwithread-bulk(no-session-id), assert the documented "session-id and items are required" error.TestRunBenchReadBulkRejectsNonPositiveBulkSize/TestRunBenchReadBulkRejectsNonPositiveDuration: pin the positivity checks atmain.go:544-549.TestRunWriteBulkVariantRejectsMismatchedHandlesAndValues: pin thelen(handles) != len(valueTexts)error atmain.go:447.
Each is a few lines and routes through the existing runWithIO entry point, so it does not need a bufconn fake.
Resolution: 2026-05-20 — Added CLI-level table-driven regression tests in cmd/mxgw-go/main_test.go routed through runWithIO, so they need no bufconn fake: TestRunWriteBulkVariantGatesSecuredFlags pins Client.Go-015 by asserting write-bulk -current-user-id, write-bulk -verifier-user-id, write2-bulk -current-user-id, write-secured-bulk -user-id, and write-secured2-bulk -user-id all surface flag provided but not defined; TestRunReadBulkRejectsMissingArgs pins the "session-id and items are required" check across no-flags / missing-items / missing-session-id; TestRunBenchReadBulkRejectsNonPositiveBulkSize and TestRunBenchReadBulkRejectsNonPositiveDuration pin the positivity checks; TestRunWriteBulkVariantRejectsMismatchedHandlesAndValues pins the explicit item-handles count ... does not match values count ... error. go test ./... passes.