Files
mxaccessgw/docs/plans/2026-06-16-stillpending-section8-design.md
T
Joseph Doherty 6030bfa18e docs: design for stillpending §8 completion (Approach C)
Also codify targeted-test-per-task rule in CLAUDE.md Source Update Workflow.
2026-06-16 16:19:49 -04:00

8.6 KiB
Raw Blame History

Still-Pending §8 Completion — Design

Status: Approved 2026-06-16. Next step: superpowers-extended-cc:writing-plans.

Goal: Close the actionable items in stillpending.md §8 ("Deferred test-coverage follow-ups, never filed as findings") — the only Bucket-A work that is neither vendor-gated nor live-rig-gated and is not already covered by the session-resilience epic plan.

Scope decision: Bucket A only (actionable code/test work). The session-resilience epic (Tasks 1328) is already planned in docs/plans/2026-06-15-session-resilience.md and is explicitly out of scope here — resume it separately. Vendor-gated (§1.4/§3.4/§3.5) and live-rig/capture-gated (§1.3/§3.x/§5/§6.1) items cannot be completed from this dev box and are out of scope.

Approach: "C" — the complete option, including new in-process gRPC test infrastructure for the Java streaming/galaxy CLI commands and a full bounded ready-wait in the gateway session hot path.


Important correction (verified 2026-06-16)

The three §8 items cite findings marked Resolved in the review backlog, but those resolutions did not survive into the current tree:

  • The Java bulk-family CLI tests that Client.Java-026 (resolved 2026-05-20) describes were written against the old com.dohertylan.mxgateway package. After the rename to com.zb.mom.ww, the current clients/java/zb-mom-ww-mxgateway-cli/.../MxGatewayCliTests.java has zero coverage for read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, bench-read-bulk, stream-events, close-session, galaxy-discover, galaxy-watch. (galaxy-test-connection/galaxy-last-deploy/ galaxy-browse/stream-alarms do have tests now.)
  • Server-030 (both states in the not-ready diagnostic) is done — confirmed at src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:1676. The deferred follow-up — should the gateway briefly wait for worker-Ready before failing fast? — is genuinely unbuilt.
  • Tests-023 extracted a canonical TestSupport/FakeWorkerProcess(int), yet three test files still define private nested copies.

So §8's gap is real and current.


Workstreams

Four independently landable workstreams.

WS Title Files (language) Classification Depends on
A Synchronous Java CLI tests (7 commands) Java CLI test small
B In-process gRPC harness + streaming/galaxy CLI tests (3 commands) Java CLI test + small CLI seam standard A (shares test file)
C Worker-Ready bounded ready-wait C# server session hot path high-risk
D FakeWorkerProcess consolidation C# tests small

A, C, D are mutually independent (disjoint files/languages) and may be dispatched in parallel. B follows A because both edit MxGatewayCliTests.java.


WS-A — Synchronous Java CLI tests

What: Round-trip CLI tests for the 7 commands testable through the existing FakeSession/FakeClient seam (the same seam subscribe-bulk/write already use): read-bulk, write-bulk, write2-bulk, write-secured-bulk, write-secured2-bulk, bench-read-bulk, close-session.

How: Upgrade FakeSession (currently returns empty lists) to per-call recorders that capture the parsed entries (timeout, typed values via the shared parseValue(type, text) switch, user-ids, timestamp) and synthesize one BulkReadResult/BulkWriteResult per requested handle, so JSON-shape assertions exercise the bulkReadResultMap/bulkWriteResultMap serializers. One @Test per command:

  • read-bulk: --timeout-ms reaches session; JSON carries tagAddress/itemHandle/ wasCached/quality.
  • write-bulk: --type int32 --values 111,222 --user-id 5 parses through parseValue; entries built with the expected typed MxValue + userId.
  • write2-bulk: --timestamp …Z reaches the entry as timestampValue (hasTimestampValue() true).
  • write-secured-bulk: --current-user-id/--verifier-user-id both propagate.
  • write-secured2-bulk: timestamp + both user-ids.
  • bench-read-bulk: 1s steady / 0s warmup; assert cross-language schema keys (language=java, command=bench-read-bulk, totalCalls, successfulCalls, failedCalls, callsPerSecond, latencyMs.p50/p95/p99).
  • close-session: CloseSessionReply round-trips through FakeClient.

Verify: gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests.


WS-B — In-process gRPC harness + streaming/galaxy CLI tests

Why infra is required: MxEventStream and DeployEventStream have package-private constructors; GalaxyRepositoryClient is final with a static connect() and GalaxyCommand has no injectable factory. None of stream-events/galaxy-watch/ galaxy-discover can be faked through the FakeSession seam.

What: A JUnit fixture that starts a gRPC InProcessServer hosting scripted MxAccessGateway + GalaxyRepository service implementations and exposes an in-process Channel. The real MxGatewayClient/GalaxyRepositoryClient connect to it, so the real MxEventStream/DeployEventStream queue-draining and GalaxyRepositoryClient paging are exercised end-to-end (highest fidelity; no reflection, no package hacks).

  • Production change (CLI module only, not the library): add a GalaxyClientFactory seam to GalaxyCommand mirroring the existing MxGatewayCliClientFactory, so galaxy commands can target the in-process channel.
  • stream-events: server streams a scripted MxEvent sequence → assert CLI render, including the unsigned-uint64 worker-sequence regression.
  • galaxy-watch: server streams scripted deploy events → assert CLI feed output.
  • galaxy-discover: server returns a paged GalaxyObject hierarchy → assert CLI JSON.

The 7 synchronous commands stay on the lightweight FakeSession seam (YAGNI — no reason to route them through a server).

Verify: gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests.


WS-C — Worker-Ready bounded ready-wait

Problem: GetReadyWorkerClient (GatewaySession.cs:1665) fails fast when the session is Ready but the worker client's WorkerClientState has diverged (Handshaking after a heartbeat blip, etc.). The both-states diagnostic exists; a brief wait does not.

Constraint: the check runs inside the _syncRoot lock — we cannot sleep/poll there.

Design (pinned decisions):

  • New GetReadyWorkerClientAsync: read state under _syncRoot; if session is Ready but worker is transient (Handshaking/Created), release the lock, poll at a short interval (e.g. 25 ms) until the worker reaches Ready or a bounded timeout elapses, then re-check under the lock.
  • Terminal worker states (Faulted/Closing/Closed/null) fail fast immediately — never wait; retrying a faulted worker is pointless and would mask the fault.
  • New config MxGateway:Sessions:WorkerReadyWaitTimeout on GatewaySessionOptions, default 0 = disabled (preserves today's exact fail-fast behavior unless opted in), validated >= 0 by the options validator. Document in docs/GatewayConfiguration.md.
  • The both-states diagnostic is preserved for the final failure. Callers at GatewaySession.cs:918 and :1263 become await.

Tests:

  • Handshaking→Ready within the timeout succeeds (worker invoked once).
  • Faulted fails fast with both states in the message, zero waiting.
  • Timeout elapses → fails with both states.
  • Default 0 → unchanged fail-fast (no wait, no behavior change).

Verify: dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~SessionManager" (plus the options-validator test class).


WS-D — FakeWorkerProcess consolidation

What: Replace the private nested FakeWorkerProcess in SessionWorkerClientFactoryFakeWorkerTests, WorkerProcessLauncherTests, and WorkerClientTests with the canonical TestSupport/FakeWorkerProcess(int) (which already has MarkExited/Kill/TCS-backed WaitForExitAsync). Where a nested copy carries extra behavior the canonical lacks, fold that into the canonical first, then delete the copies.

Verify: dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~WorkerClient | FullyQualifiedName~WorkerProcessLauncher | FullyQualifiedName~SessionWorkerClientFactory".


Testing & sequencing

Per the targeted-test rule in CLAUDE.md (Source Update Workflow): each task runs only its own filtered tests. Run the full gateway suite at most once, after WS-C + WS-D land.

Out-of-scope items remain recorded in stillpending.md (vendor/rig-gated) and the session-resilience epic (oldtasks.md).