Files
mxaccessgw/code-reviews/Client.Java/findings.md
Joseph Doherty 6eb9ea9105 Resolve Client.Java-006..012 code-review findings
Client.Java-006: close() on both clients only called shutdown(). It now
awaits termination up to the connect timeout and shutdownNow()s on timeout.

Client.Java-007: added MxGatewayLowFindingsTests covering the alarm surface,
async streaming, MxEventStream overflow, and TLS channel construction. A
latent bug surfaced: a missing CA file throws IllegalArgumentException, not
SSLException — the channel-builder catch was broadened accordingly.

Client.Java-008: async thenApply sites now route stray RuntimeExceptions
through MxGatewayErrors.fromGrpc via a normalising validator.

Client.Java-009: extracted ~80 duplicated lines (createChannel, withDeadline,
toCompletable, ...) into a shared MxGatewayChannels; both clients delegate.

Client.Java-010 (re-triaged): the README's metadata:read scope was correct;
the acknowledgeAlarm Javadoc's invoke:alarm-ack was wrong — corrected to the
admin scope.

Client.Java-011: documented the intentional fail-fast event-stream
backpressure in Javadoc and the README.

Client.Java-012: replaced CommonOptions.resolved()'s mutate-and-return-this
with side-effect-free resolvedApiKey()/resolvedTimeout() accessors.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-18 22:42:51 -04:00

22 KiB

Code Review — Client.Java

Field Value
Module clients/java
Reviewer Claude Code
Review date 2026-05-18
Commit reviewed 3cc53a8
Status Reviewed
Open findings 0

Checklist coverage

# Category Result
1 Correctness & logic bugs Issues found: register/addItem silently fall back to getReturnValue() masking missing payloads (Client.Java-004); fragile resolved() mutation pattern (Client.Java-012).
2 mxaccessgw conventions Largely adheres; the gateway protocol-version handshake is never verified despite the contract field existing (Client.Java-003).
3 Concurrency & thread safety Issue found: MxEventStream.next is a plain field and terminal-state transitions race (Client.Java-002).
4 Error handling & resilience Issues found: close() can mask the primary exception (Client.Java-005); async/sync error surfaces inconsistent (Client.Java-008).
5 Security Issue found: API-key redaction leaks the trailing 4 secret characters (Client.Java-001).
6 Performance & resource management Issues found: close() does not await termination (Client.Java-006); no stream flow control (Client.Java-011).
7 Design-document adherence Matches JavaClientDesign.md closely; the protocol-version check is undocumented-missing (Client.Java-003).
8 Code organization & conventions Issue found: ~80 duplicated lines across the two clients (Client.Java-009).
9 Testing coverage Issue found: alarm RPCs, TLS setup, async streams, and queue overflow untested (Client.Java-007).
10 Documentation & comments Issue found: README/Javadoc assert undocumented scope names (Client.Java-010).

Findings

Client.Java-001

Field Value
Severity Medium
Category Security
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySecrets.java:30-32
Status Resolved

Description: redactApiKey preserves the leading and trailing four characters of the key. A gateway API key has the form mxgw_<key-id>_<secret>; the last four characters belong to the secret portion, so the "redacted" form leaks 4 characters of the actual secret into logs, CLI JSON output (CommonOptions.redactedJsonMap), and MxGatewayClientOptions.toString(). CLAUDE.md states API keys must never reach logs.

Recommendation: Redact the secret entirely. Show only a stable non-secret prefix (e.g. the mxgw_<key-id>_ portion) and mask everything after it, or emit a fixed mxgw_*** form. Do not echo any trailing characters of the secret.

Resolution: (2026-05-18) Confirmed against source: the old substring(0,4) + stars + substring(len-4) echoed the last four secret characters. redactApiKey now masks the secret entirely: for gateway-shaped keys it returns the non-secret mxgw_<key-id>_ prefix followed by *** (locating the secret separator as the first _ after mxgw_); any non-gateway-shaped token returns <redacted>. No leading/trailing secret characters are ever emitted. The pre-existing MxGatewayCliTests.openSessionJsonRedactsApiKey assertion that hardcoded the leaky mxgw***********cret form was corrected to assert the masked mxgw_visible_*** form. Regression tests: MxGatewayMediumFindingsTests.redactApiKeyDoesNotLeakAnyCharacterOfTheSecret, redactApiKeyForNonGatewayShapedKeyRevealsNothing, redactApiKeyStillHandlesNullAndShortInput.

Client.Java-002

Field Value
Severity Medium
Category Concurrency & thread safety
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:31,66-92
Status Resolved

Description: The next field is a plain (non-volatile) instance field, and MxEventStream exposes no thread-confinement guarantee. More concretely, a queue-overflow offer() and a close() offer(END) can interleave so the overflow exception is enqueued after END and never observed — the contract that "next() throws after overflow" is not guaranteed once close() has been called.

Recommendation: Document single-consumer-thread usage explicitly in the Javadoc, and serialise terminal state transitions (overflow vs END vs close) behind a single guarded flag so the first terminal condition wins deterministically.

Resolution: (2026-05-18) Confirmed against source: the old offer() END-branch did queue.clear(); queue.offer(END) when full, so a close() arriving after an overflow wiped the already-enqueued overflow exception, leaving the consumer with a clean end-of-stream and the overflow silently lost. Terminal transitions are now serialised through a single terminate(MxGatewayException) method guarded by a terminated flag and a terminalLock; the first terminal condition wins and a later close()/END cannot overwrite a published overflow fault. The Javadoc now explicitly documents that the iterator methods are single-consumer-only while close() is safe from any thread. Regression tests: MxGatewayMediumFindingsTests.eventStreamOverflowExceptionSurvivesASubsequentClose (deterministic) and eventStreamConcurrentOverflowAndCloseAlwaysTerminate (300-iteration race stress).

Client.Java-003

Field Value
Severity Medium
Category mxaccessgw conventions
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:119-140
Status Resolved

Description: OpenSessionReply carries gateway_protocol_version (proto field 8), and MxGatewayClientVersion.GATEWAY_PROTOCOL_VERSION exists so the client can reject incompatible generated-code inputs. The client never reads reply.getGatewayProtocolVersion() nor compares it against the compiled-in version. A client built against an older/newer contract issues commands blindly and fails with confusing downstream errors instead of a clear version-mismatch failure.

Recommendation: In openSession/openSessionRaw, compare reply.getGatewayProtocolVersion() with MxGatewayClientVersion.gatewayProtocolVersion() and throw a typed MxGatewayException on mismatch.

Resolution: (2026-05-18) Confirmed against source: neither openSessionRaw nor openSessionAsync read getGatewayProtocolVersion(). Added a private ensureGatewayProtocolCompatible helper, called from both openSessionRaw and openSessionAsync, that throws MxGatewayException with a clear mismatch message when the gateway reports a non-zero version differing from MxGatewayClientVersion.gatewayProtocolVersion(). A gateway that leaves the field unset (value 0, e.g. an older gateway) is accepted unchanged for backward compatibility. clients/java/README.md documents the new fail-fast check. Regression tests: MxGatewayMediumFindingsTests.openSessionRejectsIncompatibleGatewayProtocolVersion and openSessionAcceptsMatchingOrUnsetGatewayProtocolVersion.

Client.Java-004

Field Value
Severity Medium
Category Correctness & logic bugs
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:114-120,157-163,191-197
Status Resolved

Description: register, addItem, and addItem2 check reply.hasRegister()/hasAddItem() and otherwise fall back to reply.getReturnValue().getInt32Value(). If the gateway returns a reply with neither the typed payload nor a return_value set, the method silently returns 0 — indistinguishable from a legitimate handle of 0. This masks a contract violation rather than surfacing it.

Recommendation: If the expected typed payload is absent and no return_value is present, throw MxGatewayException (protocol violation) instead of returning 0.

Resolution: (2026-05-18) Confirmed against source: all three methods returned reply.getReturnValue().getInt32Value() (which yields 0 for an unset message field) when the typed payload was absent. Each method now guards the fallback with reply.hasReturnValue() and throws MxGatewayException describing the protocol violation when neither the typed payload nor a return_value is present. The legitimate return_value fallback is preserved. Regression tests: MxGatewayMediumFindingsTests.registerThrowsWhenReplyHasNeitherTypedPayloadNorReturnValue, addItemThrowsWhenReplyHasNeitherTypedPayloadNorReturnValue, and addItemStillHonoursReturnValueFallback.

Client.Java-005

Field Value
Severity Medium
Category Error handling & resilience
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewaySession.java:92-105
Status Resolved

Description: close() delegates to closeRaw(), which performs a network RPC. When MxGatewaySession is used in try-with-resources and the body throws, a failure inside closeSession (e.g. WORKER_UNAVAILABLE) throws from close() and replaces the original exception as the propagated throwable (the body exception becomes a suppressed exception) — a known try-with-resources footgun for I/O-performing close().

Recommendation: Either make close() swallow/log close-time failures (keeping closeRaw() for callers who want the result), or document clearly that close() performs a network call that can throw.

Resolution: (2026-05-18) Confirmed against source: close() called closeRaw() directly, so a CloseSession RPC failure propagated out of try-with-resources and replaced the body exception. close() now catches MxGatewayException from closeRaw() and logs it at WARNING via System.Logger instead of rethrowing, so a close-time failure never masks the body exception. closeRaw() is unchanged and still throws for callers who want to observe the close result. The behavior change and the recommendation to use closeRaw() for explicit close handling are documented in clients/java/README.md and the close() Javadoc. Regression tests: MxGatewayMediumFindingsTests.closeSuppressesCloseTimeFailureInsteadOfMaskingBodyException and closeRawStillSurfacesCloseTimeFailureForCallersWhoWantIt.

Client.Java-006

Field Value
Severity Low
Category Performance & resource management
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:323-328, clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/GalaxyRepositoryClient.java:279-284
Status Resolved

Description: close() (the AutoCloseable method invoked by try-with-resources) calls only ownedChannel.shutdown() and returns immediately without awaiting termination. In-flight calls and Netty event-loop threads may still be running when the caller assumes the resource is released. closeAndAwaitTermination() does it correctly but is not the method try-with-resources uses, and the README examples all rely on try-with-resources.

Recommendation: Have close() await termination for a bounded time and shutdownNow() on timeout (the logic already in closeAndAwaitTermination()), or document that try-with-resources callers should call closeAndAwaitTermination().

Resolution: (2026-05-18) Confirmed against source: both MxGatewayClient.close() and GalaxyRepositoryClient.close() called only ownedChannel.shutdown(). close() in both clients now performs the bounded-wait logic previously only in closeAndAwaitTermination(): it shuts the channel down, waits up to the configured connect timeout for graceful termination, and calls shutdownNow() on timeout. Because close() cannot throw a checked exception, an InterruptedException while awaiting is handled by forcibly shutting the channel down and restoring the thread interrupt flag. closeAndAwaitTermination() is retained unchanged for callers who want the checked, blocking-aware variant. clients/java/README.md documents the new try-with-resources close() semantics.

Client.Java-007

Field Value
Severity Low
Category Testing coverage
Location clients/java/mxgateway-client/src/test/java/com/dohertylan/mxgateway/client/
Status Resolved

Description: The alarm surface — acknowledgeAlarm/acknowledgeAlarmAsync/queryActiveAlarms and MxGatewayActiveAlarmsSubscription — has zero test coverage. TLS channel construction, the async streamEventsAsync path, MxGatewayEventSubscription pre-start cancellation, and MxEventStream queue overflow are likewise untested. JavaClientDesign.md explicitly lists async stream-observer cancellation and status/error mapping as required tests.

Recommendation: Add in-process gRPC tests for the alarm RPCs, the async streaming/subscription cancellation paths, and at least one TLS-config construction test.

Resolution: (2026-05-18) Confirmed against source: no test referenced acknowledgeAlarm, queryActiveAlarms, streamEventsAsync, TLS construction, or MxEventStream overflow. Added MxGatewayLowFindingsTests (12 tests) covering: acknowledgeAlarm/acknowledgeAlarmAsync (success, typed protocol-failure, async transport-failure normalisation), queryActiveAlarms observer delivery, MxGatewayActiveAlarmsSubscription and MxGatewayEventSubscription pre-start cancellation, streamEventsAsync observer delivery, MxEventStream queue overflow surfacing MxGatewayException, TLS channel construction (missing CA file rejected with a typed exception, system-trust path builds cleanly), and the Client.Java-008 async-validator normalisation. While writing the TLS test a latent bug was found: a missing/unreadable CA file makes GrpcSslContexts throw IllegalArgumentException (not SSLException), which the old catch (SSLException) let escape unwrapped — the catch in the shared channel builder was broadened to also wrap RuntimeException so callers always see one typed MxGatewayException.

Client.Java-008

Field Value
Severity Low
Category Error handling & resilience
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:298-304
Status Resolved

Description: acknowledgeAlarmAsync and openSessionAsync apply ensureProtocolSuccess inside thenApply. If that validator throws a non-MxGatewayException RuntimeException it is wrapped by CompletionException with no fromGrpc normalisation, unlike the synchronous paths which normalise via try/catch. The async and sync error surfaces are therefore inconsistent.

Recommendation: Wrap the thenApply body so any non-MxGatewayException is routed through MxGatewayErrors.fromGrpc, matching the synchronous methods.

Resolution: (2026-05-18) Confirmed against source: the thenApply validators in openSessionAsync, invokeAsync, and acknowledgeAlarmAsync were not normalised — in practice the gateway's own validators (ensureProtocolSuccess, ensureMxAccessSuccess, ensureGatewayProtocolCompatible) only ever throw MxGatewayException, but a stray non-MxGatewayException RuntimeException (e.g. an NPE from a malformed reply) would surface raw inside CompletionException. Added MxGatewayChannels.normalisingValidator(operation, fn): it rethrows MxGatewayException unchanged and routes any other RuntimeException through MxGatewayErrors.fromGrpc, matching the synchronous try/catch paths. All three async thenApply sites now use it. Regression test: MxGatewayLowFindingsTests.openSessionAsyncNormalisesNonGatewayRuntimeExceptionFromValidator.

Client.Java-009

Field Value
Severity Low
Category Code organization & conventions
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/GalaxyRepositoryClient.java:310-391, clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:346-413
Status Resolved

Description: createChannel, withDeadline, withStreamDeadline, and toCompletable are duplicated nearly verbatim across MxGatewayClient and GalaxyRepositoryClient (~80 lines). A fix to one will not propagate to the other.

Recommendation: Extract the channel-builder and future-adaptor helpers into a shared package-private utility class.

Resolution: (2026-05-18) Confirmed against source: the four helpers were duplicated near-verbatim. Added a package-private MxGatewayChannels utility class holding createChannel(options, tlsErrorPrefix), withDeadline(stub, options), withStreamDeadline(stub, options), toCompletable(future, operation), and the new normalisingValidator helper (Client.Java-008). Both MxGatewayClient and GalaxyRepositoryClient now delegate to it and their private copies were deleted, so a future fix lives in one place. Behavior is unchanged except the operation-name carried into MxGatewayErrors.fromGrpc is now the specific RPC name instead of the generic "async call"/"galaxy async call". Verified by the full existing async test suite plus the new MxGatewayLowFindingsTests.

Client.Java-010

Field Value
Severity Low
Category Documentation & comments
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxGatewayClient.java:269-272, clients/java/README.md:76
Status Resolved

Description: The acknowledgeAlarm Javadoc states the gateway authenticates against an invoke:alarm-ack scope, and the README states the Galaxy Repository requires a metadata:read scope. CLAUDE.md's documented scope set names neither — the Javadoc/README assert a scope contract the project's own auth documentation does not corroborate.

Recommendation: Reconcile the scope names with src/MxGateway.Server/Security/ and CLAUDE.md; correct the Javadoc/README to the actual scope strings, or fix CLAUDE.md if sub-scopes were genuinely added.

Resolution: (2026-05-18) Partially re-triaged. Verified against src/MxGateway.Server/Security/Authorization/GatewayScopes.cs and GatewayGrpcScopeResolver.cs: the canonical scope catalog is session:open, session:close, invoke:read, invoke:write, invoke:secure, events:read, metadata:read, admin. (a) The README's metadata:read for the Galaxy Repository is correctTestConnectionRequest/GetLastDeployTimeRequest/DiscoverHierarchyRequest/WatchDeployEventsRequest all resolve to GatewayScopes.MetadataRead; no change needed. CLAUDE.md's prose lists only coarse scope groups, but the canonical resolver does define metadata:read. (b) The acknowledgeAlarm Javadoc's invoke:alarm-ack is wrong — no such scope exists. AcknowledgeAlarmRequest and QueryActiveAlarmsRequest are not special-cased in GatewayGrpcScopeResolver, so they fall through the _ => GatewayScopes.Admin default and require the admin scope. The Javadoc was corrected to state the admin scope; queryActiveAlarms did not assert a scope and was left unchanged. The README does not mention alarms, so no README change was required.

Client.Java-011

Field Value
Severity Low
Category Performance & resource management
Location clients/java/mxgateway-client/src/main/java/com/dohertylan/mxgateway/client/MxEventStream.java:37-63
Status Resolved

Description: The event stream relies on default gRPC auto-inbound flow control: the async stub auto-requests messages, so the server can push faster than the 16-element bounded queue drains. A momentarily slow consumer triggers queue overflow and an immediate stream-fault cancel. This is consistent with the documented fail-fast event-backpressure design, but the client never applies real flow control, so even brief consumer stalls kill the subscription.

Recommendation: Confirm fail-fast is intended (it appears to be); if so, document it on MxEventStream so callers know a slow consumer terminates the stream. Optionally expose the queue capacity or opt-in flow control.

Resolution: (2026-05-18) Confirmed fail-fast is intended — CLAUDE.md ("fail-fast event backpressure") and docs/DesignDecisions.md make a slow consumer losing its subscription a deliberate v1 design choice, so this is documentation-only, not a behavior bug. Added an explicit "Backpressure (fail-fast)" section to the MxEventStream class Javadoc explaining that the adaptor uses gRPC auto-inbound flow control with a fixed 16-element buffer and no client flow control, that a consumer stall long enough to fill the buffer triggers an overflow that cancels the subscription and surfaces an MxGatewayException, and that consumers must drain promptly and be ready to resubscribe with a resume cursor. clients/java/README.md carries the same caveat. The queue capacity was intentionally left non-configurable to keep the v1 surface aligned with the gateway design; overflow behavior is covered by MxGatewayLowFindingsTests.eventStreamQueueOverflowSurfacesExceptionFromNext.

Client.Java-012

Field Value
Severity Low
Category Correctness & logic bugs
Location clients/java/mxgateway-cli/src/main/java/com/dohertylan/mxgateway/cli/MxGatewayCli.java:667-674
Status Resolved

Description: CommonOptions.resolved() mutates this (resolvedApiKey, resolvedTimeout) and returns this, but toClientOptions() and redactedJsonMap() read those mutated fields. If redactedJsonMap() is ever called before resolved(), it silently emits empty-string defaults. The "return this after mutating" pattern is fragile and surprising.

Recommendation: Make resolved() return an immutable resolved value object, or compute resolvedApiKey/resolvedTimeout lazily in their getters so call ordering cannot produce stale output.

Resolution: (2026-05-18) Confirmed against source: resolved() populated the resolvedApiKey/resolvedTimeout mutable fields and toClientOptions()/redactedJsonMap() read them, so calling either before resolved() emitted stale empty/30s defaults. The two mutable fields were removed and replaced with side-effect-free accessor methods resolvedApiKey() and resolvedTimeout() that compute their value on each call (API key from --api-key or the --api-key-env variable; timeout via parseDuration). toClientOptions() and redactedJsonMap() now call those accessors directly, so call ordering can no longer produce stale output. resolved() is retained as a no-op returning this purely for call-site readability (common.resolved()), with its Javadoc updated to state resolution is now lazy. Pure-refactor with no runtime-behavior change for the existing call order, so no new test was added; covered by the existing MxGatewayCliTests JSON-redaction and option-parsing tests.