docs: design for stillpending §8 completion (Approach C)

Also codify targeted-test-per-task rule in CLAUDE.md Source Update Workflow.
This commit is contained in:
Joseph Doherty
2026-06-16 16:19:49 -04:00
parent 82755a3623
commit 6030bfa18e
2 changed files with 173 additions and 0 deletions
+2
View File
@@ -85,6 +85,8 @@ powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1
When source code changes, build and test the affected component before reporting work done. If the change crosses component boundaries, build each affected component — don't rely on a single top-level build:
**Run targeted tests per task, never the full suite each time.** When executing a plan task-by-task, run only the tests that exercise the code that task touched (`dotnet test --filter "FullyQualifiedName~<TestClass>"`, or the per-task test named in the plan). The full gateway suite is slow and leaves orphaned testhost processes — run it at most once per phase (after a related batch of tasks lands), not after every task.
| Changed area | Required verification |
|---|---|
| Contracts or `.proto` files | regenerate generated code, then build gateway, worker, and every generated client touched by the contract |
@@ -0,0 +1,171 @@
# Still-Pending §8 Completion — Design
> **Status:** Approved 2026-06-16. Next step: `superpowers-extended-cc:writing-plans`.
**Goal:** Close the actionable items in `stillpending.md` §8 ("Deferred test-coverage
follow-ups, never filed as findings") — the only Bucket-A work that is neither
vendor-gated nor live-rig-gated and is not already covered by the session-resilience
epic plan.
**Scope decision:** Bucket A only (actionable code/test work). The session-resilience
epic (Tasks 1328) is already planned in `docs/plans/2026-06-15-session-resilience.md`
and is explicitly **out of scope** here — resume it separately. Vendor-gated
(§1.4/§3.4/§3.5) and live-rig/capture-gated (§1.3/§3.x/§5/§6.1) items cannot be
completed from this dev box and are out of scope.
**Approach:** "C" — the complete option, including new in-process gRPC test
infrastructure for the Java streaming/galaxy CLI commands and a full bounded
ready-wait in the gateway session hot path.
---
## Important correction (verified 2026-06-16)
The three §8 items cite findings marked **Resolved** in the review backlog, but those
resolutions did **not** survive into the current tree:
- The Java bulk-family CLI tests that `Client.Java-026` (resolved 2026-05-20) describes
were written against the old `com.dohertylan.mxgateway` package. After the rename to
`com.zb.mom.ww`, the current
`clients/java/zb-mom-ww-mxgateway-cli/.../MxGatewayCliTests.java` has **zero** coverage
for `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`,
`write-secured2-bulk`, `bench-read-bulk`, `stream-events`, `close-session`,
`galaxy-discover`, `galaxy-watch`. (`galaxy-test-connection`/`galaxy-last-deploy`/
`galaxy-browse`/`stream-alarms` **do** have tests now.)
- `Server-030` (both states in the not-ready diagnostic) **is** done — confirmed at
`src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:1676`. The *deferred
follow-up* — should the gateway briefly wait for worker-Ready before failing fast? —
is genuinely unbuilt.
- `Tests-023` extracted a canonical `TestSupport/FakeWorkerProcess(int)`, yet three test
files still define private nested copies.
So §8's gap is real and current.
---
## Workstreams
Four independently landable workstreams.
| WS | Title | Files (language) | Classification | Depends on |
|----|-------|------------------|----------------|------------|
| A | Synchronous Java CLI tests (7 commands) | Java CLI test | small | — |
| B | In-process gRPC harness + streaming/galaxy CLI tests (3 commands) | Java CLI test + small CLI seam | standard | A (shares test file) |
| C | Worker-Ready bounded ready-wait | C# server session hot path | high-risk | — |
| D | `FakeWorkerProcess` consolidation | C# tests | small | — |
A, C, D are mutually independent (disjoint files/languages) and may be dispatched in
parallel. B follows A because both edit `MxGatewayCliTests.java`.
---
## WS-A — Synchronous Java CLI tests
**What:** Round-trip CLI tests for the 7 commands testable through the existing
`FakeSession`/`FakeClient` seam (the same seam `subscribe-bulk`/`write` already use):
`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`,
`bench-read-bulk`, `close-session`.
**How:** Upgrade `FakeSession` (currently returns empty lists) to per-call recorders
that capture the parsed entries (timeout, typed values via the shared `parseValue(type,
text)` switch, user-ids, timestamp) and synthesize one `BulkReadResult`/`BulkWriteResult`
per requested handle, so JSON-shape assertions exercise the
`bulkReadResultMap`/`bulkWriteResultMap` serializers. One `@Test` per command:
- `read-bulk`: `--timeout-ms` reaches session; JSON carries `tagAddress`/`itemHandle`/
`wasCached`/`quality`.
- `write-bulk`: `--type int32 --values 111,222 --user-id 5` parses through `parseValue`;
entries built with the expected typed `MxValue` + `userId`.
- `write2-bulk`: `--timestamp …Z` reaches the entry as `timestampValue`
(`hasTimestampValue()` true).
- `write-secured-bulk`: `--current-user-id`/`--verifier-user-id` both propagate.
- `write-secured2-bulk`: timestamp + both user-ids.
- `bench-read-bulk`: 1s steady / 0s warmup; assert cross-language schema keys
(`language=java`, `command=bench-read-bulk`, `totalCalls`, `successfulCalls`,
`failedCalls`, `callsPerSecond`, `latencyMs.p50/p95/p99`).
- `close-session`: `CloseSessionReply` round-trips through `FakeClient`.
**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests`.
---
## WS-B — In-process gRPC harness + streaming/galaxy CLI tests
**Why infra is required:** `MxEventStream` and `DeployEventStream` have package-private
constructors; `GalaxyRepositoryClient` is `final` with a static `connect()` and
`GalaxyCommand` has **no** injectable factory. None of `stream-events`/`galaxy-watch`/
`galaxy-discover` can be faked through the `FakeSession` seam.
**What:** A JUnit fixture that starts a gRPC **`InProcessServer`** hosting scripted
`MxAccessGateway` + `GalaxyRepository` service implementations and exposes an in-process
`Channel`. The **real** `MxGatewayClient`/`GalaxyRepositoryClient` connect to it, so the
real `MxEventStream`/`DeployEventStream` queue-draining and `GalaxyRepositoryClient`
paging are exercised end-to-end (highest fidelity; no reflection, no package hacks).
- **Production change (CLI module only, not the library):** add a `GalaxyClientFactory`
seam to `GalaxyCommand` mirroring the existing `MxGatewayCliClientFactory`, so galaxy
commands can target the in-process channel.
- `stream-events`: server streams a scripted `MxEvent` sequence → assert CLI render,
including the unsigned-uint64 worker-sequence regression.
- `galaxy-watch`: server streams scripted deploy events → assert CLI feed output.
- `galaxy-discover`: server returns a paged `GalaxyObject` hierarchy → assert CLI JSON.
The 7 synchronous commands stay on the lightweight `FakeSession` seam (YAGNI — no reason
to route them through a server).
**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests`.
---
## WS-C — Worker-Ready bounded ready-wait
**Problem:** `GetReadyWorkerClient` (`GatewaySession.cs:1665`) fails fast when the session
is `Ready` but the worker client's `WorkerClientState` has diverged (`Handshaking` after a
heartbeat blip, etc.). The both-states diagnostic exists; a brief wait does not.
**Constraint:** the check runs inside the `_syncRoot` lock — we cannot sleep/poll there.
**Design (pinned decisions):**
- New `GetReadyWorkerClientAsync`: read state under `_syncRoot`; **if** session is `Ready`
but worker is **transient** (`Handshaking`/`Created`), release the lock, poll at a short
interval (e.g. 25 ms) until the worker reaches `Ready` or a bounded timeout elapses, then
re-check under the lock.
- **Terminal worker states (`Faulted`/`Closing`/`Closed`/null) fail fast immediately** —
never wait; retrying a faulted worker is pointless and would mask the fault.
- New config `MxGateway:Sessions:WorkerReadyWaitTimeout` on `GatewaySessionOptions`,
**default `0` = disabled** (preserves today's exact fail-fast behavior unless opted in),
validated `>= 0` by the options validator. Document in `docs/GatewayConfiguration.md`.
- The both-states diagnostic is preserved for the final failure. Callers at
`GatewaySession.cs:918` and `:1263` become `await`.
**Tests:**
- Handshaking→Ready within the timeout succeeds (worker invoked once).
- Faulted fails fast with both states in the message, zero waiting.
- Timeout elapses → fails with both states.
- Default `0` → unchanged fail-fast (no wait, no behavior change).
**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~SessionManager"`
(plus the options-validator test class).
---
## WS-D — `FakeWorkerProcess` consolidation
**What:** Replace the private nested `FakeWorkerProcess` in
`SessionWorkerClientFactoryFakeWorkerTests`, `WorkerProcessLauncherTests`, and
`WorkerClientTests` with the canonical `TestSupport/FakeWorkerProcess(int)` (which already
has `MarkExited`/`Kill`/TCS-backed `WaitForExitAsync`). Where a nested copy carries extra
behavior the canonical lacks, fold that into the canonical first, then delete the copies.
**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~WorkerClient | FullyQualifiedName~WorkerProcessLauncher | FullyQualifiedName~SessionWorkerClientFactory"`.
---
## Testing & sequencing
Per the targeted-test rule in `CLAUDE.md` (Source Update Workflow): each task runs only
its own filtered tests. Run the full gateway suite at most once, after WS-C + WS-D land.
Out-of-scope items remain recorded in `stillpending.md` (vendor/rig-gated) and the
session-resilience epic (`oldtasks.md`).