docs: design for stillpending §8 completion (Approach C)
Also codify targeted-test-per-task rule in CLAUDE.md Source Update Workflow.
This commit is contained in:
@@ -85,6 +85,8 @@ powershell -ExecutionPolicy Bypass -File scripts/run-client-e2e-tests.ps1
|
||||
|
||||
When source code changes, build and test the affected component before reporting work done. If the change crosses component boundaries, build each affected component — don't rely on a single top-level build:
|
||||
|
||||
**Run targeted tests per task, never the full suite each time.** When executing a plan task-by-task, run only the tests that exercise the code that task touched (`dotnet test --filter "FullyQualifiedName~<TestClass>"`, or the per-task test named in the plan). The full gateway suite is slow and leaves orphaned testhost processes — run it at most once per phase (after a related batch of tasks lands), not after every task.
|
||||
|
||||
| Changed area | Required verification |
|
||||
|---|---|
|
||||
| Contracts or `.proto` files | regenerate generated code, then build gateway, worker, and every generated client touched by the contract |
|
||||
|
||||
@@ -0,0 +1,171 @@
|
||||
# Still-Pending §8 Completion — Design
|
||||
|
||||
> **Status:** Approved 2026-06-16. Next step: `superpowers-extended-cc:writing-plans`.
|
||||
|
||||
**Goal:** Close the actionable items in `stillpending.md` §8 ("Deferred test-coverage
|
||||
follow-ups, never filed as findings") — the only Bucket-A work that is neither
|
||||
vendor-gated nor live-rig-gated and is not already covered by the session-resilience
|
||||
epic plan.
|
||||
|
||||
**Scope decision:** Bucket A only (actionable code/test work). The session-resilience
|
||||
epic (Tasks 13–28) is already planned in `docs/plans/2026-06-15-session-resilience.md`
|
||||
and is explicitly **out of scope** here — resume it separately. Vendor-gated
|
||||
(§1.4/§3.4/§3.5) and live-rig/capture-gated (§1.3/§3.x/§5/§6.1) items cannot be
|
||||
completed from this dev box and are out of scope.
|
||||
|
||||
**Approach:** "C" — the complete option, including new in-process gRPC test
|
||||
infrastructure for the Java streaming/galaxy CLI commands and a full bounded
|
||||
ready-wait in the gateway session hot path.
|
||||
|
||||
---
|
||||
|
||||
## Important correction (verified 2026-06-16)
|
||||
|
||||
The three §8 items cite findings marked **Resolved** in the review backlog, but those
|
||||
resolutions did **not** survive into the current tree:
|
||||
|
||||
- The Java bulk-family CLI tests that `Client.Java-026` (resolved 2026-05-20) describes
|
||||
were written against the old `com.dohertylan.mxgateway` package. After the rename to
|
||||
`com.zb.mom.ww`, the current
|
||||
`clients/java/zb-mom-ww-mxgateway-cli/.../MxGatewayCliTests.java` has **zero** coverage
|
||||
for `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`,
|
||||
`write-secured2-bulk`, `bench-read-bulk`, `stream-events`, `close-session`,
|
||||
`galaxy-discover`, `galaxy-watch`. (`galaxy-test-connection`/`galaxy-last-deploy`/
|
||||
`galaxy-browse`/`stream-alarms` **do** have tests now.)
|
||||
- `Server-030` (both states in the not-ready diagnostic) **is** done — confirmed at
|
||||
`src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:1676`. The *deferred
|
||||
follow-up* — should the gateway briefly wait for worker-Ready before failing fast? —
|
||||
is genuinely unbuilt.
|
||||
- `Tests-023` extracted a canonical `TestSupport/FakeWorkerProcess(int)`, yet three test
|
||||
files still define private nested copies.
|
||||
|
||||
So §8's gap is real and current.
|
||||
|
||||
---
|
||||
|
||||
## Workstreams
|
||||
|
||||
Four independently landable workstreams.
|
||||
|
||||
| WS | Title | Files (language) | Classification | Depends on |
|
||||
|----|-------|------------------|----------------|------------|
|
||||
| A | Synchronous Java CLI tests (7 commands) | Java CLI test | small | — |
|
||||
| B | In-process gRPC harness + streaming/galaxy CLI tests (3 commands) | Java CLI test + small CLI seam | standard | A (shares test file) |
|
||||
| C | Worker-Ready bounded ready-wait | C# server session hot path | high-risk | — |
|
||||
| D | `FakeWorkerProcess` consolidation | C# tests | small | — |
|
||||
|
||||
A, C, D are mutually independent (disjoint files/languages) and may be dispatched in
|
||||
parallel. B follows A because both edit `MxGatewayCliTests.java`.
|
||||
|
||||
---
|
||||
|
||||
## WS-A — Synchronous Java CLI tests
|
||||
|
||||
**What:** Round-trip CLI tests for the 7 commands testable through the existing
|
||||
`FakeSession`/`FakeClient` seam (the same seam `subscribe-bulk`/`write` already use):
|
||||
`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`,
|
||||
`bench-read-bulk`, `close-session`.
|
||||
|
||||
**How:** Upgrade `FakeSession` (currently returns empty lists) to per-call recorders
|
||||
that capture the parsed entries (timeout, typed values via the shared `parseValue(type,
|
||||
text)` switch, user-ids, timestamp) and synthesize one `BulkReadResult`/`BulkWriteResult`
|
||||
per requested handle, so JSON-shape assertions exercise the
|
||||
`bulkReadResultMap`/`bulkWriteResultMap` serializers. One `@Test` per command:
|
||||
|
||||
- `read-bulk`: `--timeout-ms` reaches session; JSON carries `tagAddress`/`itemHandle`/
|
||||
`wasCached`/`quality`.
|
||||
- `write-bulk`: `--type int32 --values 111,222 --user-id 5` parses through `parseValue`;
|
||||
entries built with the expected typed `MxValue` + `userId`.
|
||||
- `write2-bulk`: `--timestamp …Z` reaches the entry as `timestampValue`
|
||||
(`hasTimestampValue()` true).
|
||||
- `write-secured-bulk`: `--current-user-id`/`--verifier-user-id` both propagate.
|
||||
- `write-secured2-bulk`: timestamp + both user-ids.
|
||||
- `bench-read-bulk`: 1s steady / 0s warmup; assert cross-language schema keys
|
||||
(`language=java`, `command=bench-read-bulk`, `totalCalls`, `successfulCalls`,
|
||||
`failedCalls`, `callsPerSecond`, `latencyMs.p50/p95/p99`).
|
||||
- `close-session`: `CloseSessionReply` round-trips through `FakeClient`.
|
||||
|
||||
**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests`.
|
||||
|
||||
---
|
||||
|
||||
## WS-B — In-process gRPC harness + streaming/galaxy CLI tests
|
||||
|
||||
**Why infra is required:** `MxEventStream` and `DeployEventStream` have package-private
|
||||
constructors; `GalaxyRepositoryClient` is `final` with a static `connect()` and
|
||||
`GalaxyCommand` has **no** injectable factory. None of `stream-events`/`galaxy-watch`/
|
||||
`galaxy-discover` can be faked through the `FakeSession` seam.
|
||||
|
||||
**What:** A JUnit fixture that starts a gRPC **`InProcessServer`** hosting scripted
|
||||
`MxAccessGateway` + `GalaxyRepository` service implementations and exposes an in-process
|
||||
`Channel`. The **real** `MxGatewayClient`/`GalaxyRepositoryClient` connect to it, so the
|
||||
real `MxEventStream`/`DeployEventStream` queue-draining and `GalaxyRepositoryClient`
|
||||
paging are exercised end-to-end (highest fidelity; no reflection, no package hacks).
|
||||
|
||||
- **Production change (CLI module only, not the library):** add a `GalaxyClientFactory`
|
||||
seam to `GalaxyCommand` mirroring the existing `MxGatewayCliClientFactory`, so galaxy
|
||||
commands can target the in-process channel.
|
||||
- `stream-events`: server streams a scripted `MxEvent` sequence → assert CLI render,
|
||||
including the unsigned-uint64 worker-sequence regression.
|
||||
- `galaxy-watch`: server streams scripted deploy events → assert CLI feed output.
|
||||
- `galaxy-discover`: server returns a paged `GalaxyObject` hierarchy → assert CLI JSON.
|
||||
|
||||
The 7 synchronous commands stay on the lightweight `FakeSession` seam (YAGNI — no reason
|
||||
to route them through a server).
|
||||
|
||||
**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests *MxGatewayCliTests`.
|
||||
|
||||
---
|
||||
|
||||
## WS-C — Worker-Ready bounded ready-wait
|
||||
|
||||
**Problem:** `GetReadyWorkerClient` (`GatewaySession.cs:1665`) fails fast when the session
|
||||
is `Ready` but the worker client's `WorkerClientState` has diverged (`Handshaking` after a
|
||||
heartbeat blip, etc.). The both-states diagnostic exists; a brief wait does not.
|
||||
|
||||
**Constraint:** the check runs inside the `_syncRoot` lock — we cannot sleep/poll there.
|
||||
|
||||
**Design (pinned decisions):**
|
||||
|
||||
- New `GetReadyWorkerClientAsync`: read state under `_syncRoot`; **if** session is `Ready`
|
||||
but worker is **transient** (`Handshaking`/`Created`), release the lock, poll at a short
|
||||
interval (e.g. 25 ms) until the worker reaches `Ready` or a bounded timeout elapses, then
|
||||
re-check under the lock.
|
||||
- **Terminal worker states (`Faulted`/`Closing`/`Closed`/null) fail fast immediately** —
|
||||
never wait; retrying a faulted worker is pointless and would mask the fault.
|
||||
- New config `MxGateway:Sessions:WorkerReadyWaitTimeout` on `GatewaySessionOptions`,
|
||||
**default `0` = disabled** (preserves today's exact fail-fast behavior unless opted in),
|
||||
validated `>= 0` by the options validator. Document in `docs/GatewayConfiguration.md`.
|
||||
- The both-states diagnostic is preserved for the final failure. Callers at
|
||||
`GatewaySession.cs:918` and `:1263` become `await`.
|
||||
|
||||
**Tests:**
|
||||
- Handshaking→Ready within the timeout succeeds (worker invoked once).
|
||||
- Faulted fails fast with both states in the message, zero waiting.
|
||||
- Timeout elapses → fails with both states.
|
||||
- Default `0` → unchanged fail-fast (no wait, no behavior change).
|
||||
|
||||
**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~SessionManager"`
|
||||
(plus the options-validator test class).
|
||||
|
||||
---
|
||||
|
||||
## WS-D — `FakeWorkerProcess` consolidation
|
||||
|
||||
**What:** Replace the private nested `FakeWorkerProcess` in
|
||||
`SessionWorkerClientFactoryFakeWorkerTests`, `WorkerProcessLauncherTests`, and
|
||||
`WorkerClientTests` with the canonical `TestSupport/FakeWorkerProcess(int)` (which already
|
||||
has `MarkExited`/`Kill`/TCS-backed `WaitForExitAsync`). Where a nested copy carries extra
|
||||
behavior the canonical lacks, fold that into the canonical first, then delete the copies.
|
||||
|
||||
**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~WorkerClient | FullyQualifiedName~WorkerProcessLauncher | FullyQualifiedName~SessionWorkerClientFactory"`.
|
||||
|
||||
---
|
||||
|
||||
## Testing & sequencing
|
||||
|
||||
Per the targeted-test rule in `CLAUDE.md` (Source Update Workflow): each task runs only
|
||||
its own filtered tests. Run the full gateway suite at most once, after WS-C + WS-D land.
|
||||
|
||||
Out-of-scope items remain recorded in `stillpending.md` (vendor/rig-gated) and the
|
||||
session-resilience epic (`oldtasks.md`).
|
||||
Reference in New Issue
Block a user