From 6030bfa18ee2c81c5d1447cb217cb24f332c1840 Mon Sep 17 00:00:00 2001 From: Joseph Doherty Date: Tue, 16 Jun 2026 16:19:49 -0400 Subject: [PATCH] =?UTF-8?q?docs:=20design=20for=20stillpending=20=C2=A78?= =?UTF-8?q?=20completion=20(Approach=20C)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Also codify targeted-test-per-task rule in CLAUDE.md Source Update Workflow. --- CLAUDE.md | 2 + ...2026-06-16-stillpending-section8-design.md | 171 ++++++++++++++++++ 2 files changed, 173 insertions(+) create mode 100644 docs/plans/2026-06-16-stillpending-section8-design.md diff --git a/CLAUDE.md b/CLAUDE.md index 37165b0..99ac3c3 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -85,6 +85,8 @@ powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1 When source code changes, build and test the affected component before reporting work done. If the change crosses component boundaries, build each affected component — don't rely on a single top-level build: +**Run targeted tests per task, never the full suite each time.** When executing a plan task-by-task, run only the tests that exercise the code that task touched (`dotnet test --filter "FullyQualifiedName~"`, or the per-task test named in the plan). The full gateway suite is slow and leaves orphaned testhost processes — run it at most once per phase (after a related batch of tasks lands), not after every task. + | Changed area | Required verification | |---|---| | Contracts or `.proto` files | regenerate generated code, then build gateway, worker, and every generated client touched by the contract | diff --git a/docs/plans/2026-06-16-stillpending-section8-design.md b/docs/plans/2026-06-16-stillpending-section8-design.md new file mode 100644 index 0000000..e0a04b7 --- /dev/null +++ b/docs/plans/2026-06-16-stillpending-section8-design.md @@ -0,0 +1,171 @@ +# Still-Pending §8 Completion — Design + +> **Status:** Approved 2026-06-16. Next step: `superpowers-extended-cc:writing-plans`. + +**Goal:** Close the actionable items in `stillpending.md` §8 ("Deferred test-coverage +follow-ups, never filed as findings") — the only Bucket-A work that is neither +vendor-gated nor live-rig-gated and is not already covered by the session-resilience +epic plan. + +**Scope decision:** Bucket A only (actionable code/test work). The session-resilience +epic (Tasks 13–28) is already planned in `docs/plans/2026-06-15-session-resilience.md` +and is explicitly **out of scope** here — resume it separately. Vendor-gated +(§1.4/§3.4/§3.5) and live-rig/capture-gated (§1.3/§3.x/§5/§6.1) items cannot be +completed from this dev box and are out of scope. + +**Approach:** "C" — the complete option, including new in-process gRPC test +infrastructure for the Java streaming/galaxy CLI commands and a full bounded +ready-wait in the gateway session hot path. + +--- + +## Important correction (verified 2026-06-16) + +The three §8 items cite findings marked **Resolved** in the review backlog, but those +resolutions did **not** survive into the current tree: + +- The Java bulk-family CLI tests that `Client.Java-026` (resolved 2026-05-20) describes + were written against the old `com.dohertylan.mxgateway` package. After the rename to + `com.zb.mom.ww`, the current + `clients/java/zb-mom-ww-mxgateway-cli/.../MxGatewayCliTests.java` has **zero** coverage + for `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, + `write-secured2-bulk`, `bench-read-bulk`, `stream-events`, `close-session`, + `galaxy-discover`, `galaxy-watch`. (`galaxy-test-connection`/`galaxy-last-deploy`/ + `galaxy-browse`/`stream-alarms` **do** have tests now.) +- `Server-030` (both states in the not-ready diagnostic) **is** done — confirmed at + `src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:1676`. The *deferred + follow-up* — should the gateway briefly wait for worker-Ready before failing fast? — + is genuinely unbuilt. +- `Tests-023` extracted a canonical `TestSupport/FakeWorkerProcess(int)`, yet three test + files still define private nested copies. + +So §8's gap is real and current. + +--- + +## Workstreams + +Four independently landable workstreams. + +| WS | Title | Files (language) | Classification | Depends on | +|----|-------|------------------|----------------|------------| +| A | Synchronous Java CLI tests (7 commands) | Java CLI test | small | — | +| B | In-process gRPC harness + streaming/galaxy CLI tests (3 commands) | Java CLI test + small CLI seam | standard | A (shares test file) | +| C | Worker-Ready bounded ready-wait | C# server session hot path | high-risk | — | +| D | `FakeWorkerProcess` consolidation | C# tests | small | — | + +A, C, D are mutually independent (disjoint files/languages) and may be dispatched in +parallel. B follows A because both edit `MxGatewayCliTests.java`. + +--- + +## WS-A — Synchronous Java CLI tests + +**What:** Round-trip CLI tests for the 7 commands testable through the existing +`FakeSession`/`FakeClient` seam (the same seam `subscribe-bulk`/`write` already use): +`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`, +`bench-read-bulk`, `close-session`. + +**How:** Upgrade `FakeSession` (currently returns empty lists) to per-call recorders +that capture the parsed entries (timeout, typed values via the shared `parseValue(type, +text)` switch, user-ids, timestamp) and synthesize one `BulkReadResult`/`BulkWriteResult` +per requested handle, so JSON-shape assertions exercise the +`bulkReadResultMap`/`bulkWriteResultMap` serializers. One `@Test` per command: + +- `read-bulk`: `--timeout-ms` reaches session; JSON carries `tagAddress`/`itemHandle`/ + `wasCached`/`quality`. +- `write-bulk`: `--type int32 --values 111,222 --user-id 5` parses through `parseValue`; + entries built with the expected typed `MxValue` + `userId`. +- `write2-bulk`: `--timestamp …Z` reaches the entry as `timestampValue` + (`hasTimestampValue()` true). +- `write-secured-bulk`: `--current-user-id`/`--verifier-user-id` both propagate. +- `write-secured2-bulk`: timestamp + both user-ids. +- `bench-read-bulk`: 1s steady / 0s warmup; assert cross-language schema keys + (`language=java`, `command=bench-read-bulk`, `totalCalls`, `successfulCalls`, + `failedCalls`, `callsPerSecond`, `latencyMs.p50/p95/p99`). +- `close-session`: `CloseSessionReply` round-trips through `FakeClient`. + +**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests`. + +--- + +## WS-B — In-process gRPC harness + streaming/galaxy CLI tests + +**Why infra is required:** `MxEventStream` and `DeployEventStream` have package-private +constructors; `GalaxyRepositoryClient` is `final` with a static `connect()` and +`GalaxyCommand` has **no** injectable factory. None of `stream-events`/`galaxy-watch`/ +`galaxy-discover` can be faked through the `FakeSession` seam. + +**What:** A JUnit fixture that starts a gRPC **`InProcessServer`** hosting scripted +`MxAccessGateway` + `GalaxyRepository` service implementations and exposes an in-process +`Channel`. The **real** `MxGatewayClient`/`GalaxyRepositoryClient` connect to it, so the +real `MxEventStream`/`DeployEventStream` queue-draining and `GalaxyRepositoryClient` +paging are exercised end-to-end (highest fidelity; no reflection, no package hacks). + +- **Production change (CLI module only, not the library):** add a `GalaxyClientFactory` + seam to `GalaxyCommand` mirroring the existing `MxGatewayCliClientFactory`, so galaxy + commands can target the in-process channel. +- `stream-events`: server streams a scripted `MxEvent` sequence → assert CLI render, + including the unsigned-uint64 worker-sequence regression. +- `galaxy-watch`: server streams scripted deploy events → assert CLI feed output. +- `galaxy-discover`: server returns a paged `GalaxyObject` hierarchy → assert CLI JSON. + +The 7 synchronous commands stay on the lightweight `FakeSession` seam (YAGNI — no reason +to route them through a server). + +**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests`. + +--- + +## WS-C — Worker-Ready bounded ready-wait + +**Problem:** `GetReadyWorkerClient` (`GatewaySession.cs:1665`) fails fast when the session +is `Ready` but the worker client's `WorkerClientState` has diverged (`Handshaking` after a +heartbeat blip, etc.). The both-states diagnostic exists; a brief wait does not. + +**Constraint:** the check runs inside the `_syncRoot` lock — we cannot sleep/poll there. + +**Design (pinned decisions):** + +- New `GetReadyWorkerClientAsync`: read state under `_syncRoot`; **if** session is `Ready` + but worker is **transient** (`Handshaking`/`Created`), release the lock, poll at a short + interval (e.g. 25 ms) until the worker reaches `Ready` or a bounded timeout elapses, then + re-check under the lock. +- **Terminal worker states (`Faulted`/`Closing`/`Closed`/null) fail fast immediately** — + never wait; retrying a faulted worker is pointless and would mask the fault. +- New config `MxGateway:Sessions:WorkerReadyWaitTimeout` on `GatewaySessionOptions`, + **default `0` = disabled** (preserves today's exact fail-fast behavior unless opted in), + validated `>= 0` by the options validator. Document in `docs/GatewayConfiguration.md`. +- The both-states diagnostic is preserved for the final failure. Callers at + `GatewaySession.cs:918` and `:1263` become `await`. + +**Tests:** +- Handshaking→Ready within the timeout succeeds (worker invoked once). +- Faulted fails fast with both states in the message, zero waiting. +- Timeout elapses → fails with both states. +- Default `0` → unchanged fail-fast (no wait, no behavior change). + +**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~SessionManager"` +(plus the options-validator test class). + +--- + +## WS-D — `FakeWorkerProcess` consolidation + +**What:** Replace the private nested `FakeWorkerProcess` in +`SessionWorkerClientFactoryFakeWorkerTests`, `WorkerProcessLauncherTests`, and +`WorkerClientTests` with the canonical `TestSupport/FakeWorkerProcess(int)` (which already +has `MarkExited`/`Kill`/TCS-backed `WaitForExitAsync`). Where a nested copy carries extra +behavior the canonical lacks, fold that into the canonical first, then delete the copies. + +**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~WorkerClient | FullyQualifiedName~WorkerProcessLauncher | FullyQualifiedName~SessionWorkerClientFactory"`. + +--- + +## Testing & sequencing + +Per the targeted-test rule in `CLAUDE.md` (Source Update Workflow): each task runs only +its own filtered tests. Run the full gateway suite at most once, after WS-C + WS-D land. + +Out-of-scope items remain recorded in `stillpending.md` (vendor/rig-gated) and the +session-resilience epic (`oldtasks.md`).