mxaccessgw/docs/plans/2026-06-16-stillpending-section8-design.md
Joseph Doherty 6030bfa18e docs: design for stillpending §8 completion (Approach C)
Also codify targeted-test-per-task rule in CLAUDE.md Source Update Workflow.
2026-06-16 16:19:49 -04:00

# Still-Pending §8 Completion — Design
> **Status:** Approved 2026-06-16. Next step: `superpowers-extended-cc:writing-plans`.

**Goal:** Close the actionable items in `stillpending.md` §8 ("Deferred test-coverage
follow-ups, never filed as findings") — the only Bucket-A work that is neither
vendor-gated nor live-rig-gated and is not already covered by the session-resilience
epic plan.

**Scope decision:** Bucket A only (actionable code/test work). The session-resilience
epic (Tasks 13–28) is already planned in `docs/plans/2026-06-15-session-resilience.md`
and is explicitly **out of scope** here — resume it separately. Vendor-gated
(§1.4/§3.4/§3.5) and live-rig/capture-gated (§1.3/§3.x/§5/§6.1) items cannot be
completed from this dev box and are out of scope.

**Approach:** "C", the complete option, including new in-process gRPC test
infrastructure for the Java streaming/galaxy CLI commands and a full bounded
ready-wait in the gateway session hot path.

---
## Important correction (verified 2026-06-16)
The three §8 items cite findings marked **Resolved** in the review backlog, but those
resolutions did **not** survive into the current tree:
- The Java bulk-family CLI tests that `Client.Java-026` (resolved 2026-05-20) describes
were written against the old `com.dohertylan.mxgateway` package. After the rename to
`com.zb.mom.ww`, the current
`clients/java/zb-mom-ww-mxgateway-cli/.../MxGatewayCliTests.java` has **zero** coverage
for `read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`,
`write-secured2-bulk`, `bench-read-bulk`, `stream-events`, `close-session`,
`galaxy-discover`, `galaxy-watch`. (`galaxy-test-connection`/`galaxy-last-deploy`/
`galaxy-browse`/`stream-alarms` **do** have tests now.)
- `Server-030` (both states in the not-ready diagnostic) **is** done — confirmed at
`src/ZB.MOM.WW.MxGateway.Server/Sessions/GatewaySession.cs:1676`. The *deferred
follow-up* — should the gateway briefly wait for worker-Ready before failing fast? —
is genuinely unbuilt.
- `Tests-023` extracted a canonical `TestSupport/FakeWorkerProcess(int)`, yet three test
files still define private nested copies.

So §8's gap is real and current.

---
## Workstreams
Four independently landable workstreams.

| WS | Title | Files (language) | Classification | Depends on |
|----|-------|------------------|----------------|------------|
| A | Synchronous Java CLI tests (7 commands) | Java CLI test | small | none |
| B | In-process gRPC harness + streaming/galaxy CLI tests (3 commands) | Java CLI test + small CLI seam | standard | A (shares test file) |
| C | Worker-Ready bounded ready-wait | C# server session hot path | high-risk | none |
| D | `FakeWorkerProcess` consolidation | C# tests | small | none |

A, C, and D are mutually independent (disjoint files and languages) and may be dispatched in
parallel. B follows A because both edit `MxGatewayCliTests.java`.

---
## WS-A — Synchronous Java CLI tests
**What:** Round-trip CLI tests for the 7 commands testable through the existing
`FakeSession`/`FakeClient` seam (the same seam `subscribe-bulk`/`write` already use):
`read-bulk`, `write-bulk`, `write2-bulk`, `write-secured-bulk`, `write-secured2-bulk`,
`bench-read-bulk`, `close-session`.

**How:** Upgrade `FakeSession` (currently returns empty lists) to per-call recorders
that capture the parsed entries (timeout, typed values via the shared `parseValue(type,
text)` switch, user-ids, timestamp) and synthesize one `BulkReadResult`/`BulkWriteResult`
per requested handle, so JSON-shape assertions exercise the
`bulkReadResultMap`/`bulkWriteResultMap` serializers. One `@Test` per command:
- `read-bulk`: `--timeout-ms` reaches session; JSON carries `tagAddress`/`itemHandle`/
`wasCached`/`quality`.
- `write-bulk`: `--type int32 --values 111,222 --user-id 5` parses through `parseValue`;
entries built with the expected typed `MxValue` + `userId`.
- `write2-bulk`: `--timestamp …Z` reaches the entry as `timestampValue`
(`hasTimestampValue()` true).
- `write-secured-bulk`: `--current-user-id`/`--verifier-user-id` both propagate.
- `write-secured2-bulk`: timestamp + both user-ids.
- `bench-read-bulk`: 1s steady / 0s warmup; assert cross-language schema keys
(`language=java`, `command=bench-read-bulk`, `totalCalls`, `successfulCalls`,
`failedCalls`, `callsPerSecond`, `latencyMs.p50/p95/p99`).
- `close-session`: `CloseSessionReply` round-trips through `FakeClient`.
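
The recorder upgrade described under **How** could look roughly like the sketch below. It is a minimal illustration, not the real `FakeSession`: `RecordingBulkSessionSketch`, `WriteEntry`, `WriteResult`, and `buildEntries` are hypothetical stand-ins for the library's entry and result types, and only the `int32` type name comes from this design (the other `parseValue` cases are assumptions).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-ins: the real entry and result types come from the gateway
// client library and are not reproduced here.
final class RecordingBulkSessionSketch {

    static final class WriteEntry {
        final int itemHandle;
        final Object value;
        final int userId;

        WriteEntry(int itemHandle, Object value, int userId) {
            this.itemHandle = itemHandle;
            this.value = value;
            this.userId = userId;
        }
    }

    static final class WriteResult {
        final int itemHandle;
        final boolean success;

        WriteResult(int itemHandle, boolean success) {
            this.itemHandle = itemHandle;
            this.success = success;
        }
    }

    // Arguments captured from the most recent writeBulk call, for test assertions.
    final List<WriteEntry> recordedEntries = new ArrayList<>();
    long recordedTimeoutMs = -1;

    // Simplified analogue of the shared parseValue(type, text) switch.
    // Only "int32" appears in the design; the other type names are assumptions.
    static Object parseValue(String type, String text) {
        switch (type) {
            case "int32": return Integer.parseInt(text.trim());
            case "double": return Double.parseDouble(text.trim());
            case "bool": return Boolean.parseBoolean(text.trim());
            case "string": return text;
            default: throw new IllegalArgumentException("unsupported type: " + type);
        }
    }

    // Mirrors what the CLI would do with --type/--values/--user-id: one entry per handle.
    static List<WriteEntry> buildEntries(int[] handles, String type, String csvValues, int userId) {
        String[] parts = csvValues.split(",");
        if (parts.length != handles.length) {
            throw new IllegalArgumentException("expected one value per handle");
        }
        List<WriteEntry> entries = new ArrayList<>();
        for (int i = 0; i < handles.length; i++) {
            entries.add(new WriteEntry(handles[i], parseValue(type, parts[i]), userId));
        }
        return entries;
    }

    // Records the call, then synthesizes one successful result per requested handle.
    List<WriteResult> writeBulk(List<WriteEntry> entries, long timeoutMs) {
        recordedEntries.clear();
        recordedEntries.addAll(entries);
        recordedTimeoutMs = timeoutMs;
        List<WriteResult> results = new ArrayList<>();
        for (WriteEntry entry : entries) {
            results.add(new WriteResult(entry.itemHandle, true));
        }
        return results;
    }

    public static void main(String[] args) {
        RecordingBulkSessionSketch session = new RecordingBulkSessionSketch();
        List<WriteResult> results =
                session.writeBulk(buildEntries(new int[] {7, 8}, "int32", "111,222", 5), 1500L);
        System.out.println(results.size() + " results, first recorded value = "
                + session.recordedEntries.get(0).value);
    }
}
```

Returning one result per requested handle is what lets a test assert on the serialized JSON shape rather than on an empty array, while the recorded fields let it check that the CLI flags reached the session unchanged.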
**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests '*MxGatewayCliTests'`.

---
## WS-B — In-process gRPC harness + streaming/galaxy CLI tests
**Why infra is required:** `MxEventStream` and `DeployEventStream` have package-private
constructors; `GalaxyRepositoryClient` is `final` with a static `connect()` and
`GalaxyCommand` has **no** injectable factory. None of `stream-events`/`galaxy-watch`/
`galaxy-discover` can be faked through the `FakeSession` seam.

**What:** A JUnit fixture that starts a gRPC **`InProcessServer`** hosting scripted
`MxAccessGateway` + `GalaxyRepository` service implementations and exposes an in-process
`Channel`. The **real** `MxGatewayClient`/`GalaxyRepositoryClient` connect to it, so the
real `MxEventStream`/`DeployEventStream` queue-draining and `GalaxyRepositoryClient`
paging are exercised end-to-end (highest fidelity; no reflection, no package hacks).
- **Production change (CLI module only, not the library):** add a `GalaxyClientFactory`
seam to `GalaxyCommand` mirroring the existing `MxGatewayCliClientFactory`, so galaxy
commands can target the in-process channel.
- `stream-events`: server streams a scripted `MxEvent` sequence → assert CLI render,
including the unsigned-uint64 worker-sequence regression.
- `galaxy-watch`: server streams scripted deploy events → assert CLI feed output.
- `galaxy-discover`: server returns a paged `GalaxyObject` hierarchy → assert CLI JSON.
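
For the unsigned-uint64 case in the `stream-events` bullet, the sketch below shows the kind of rendering a regression test would pin. It assumes the regression concerns a protobuf `uint64` (which protobuf-java carries in a signed `long`) being printed as a negative number; `UnsignedSequenceSketch` and `renderWorkerSequence` are hypothetical names, not the CLI's actual code.

```java
final class UnsignedSequenceSketch {

    // Renders a uint64 carried in a signed Java long as an unsigned decimal string.
    static String renderWorkerSequence(long rawSequence) {
        return Long.toUnsignedString(rawSequence);
    }

    public static void main(String[] args) {
        // 2^64 - 1 does not fit in a signed long, so it is stored as -1L.
        long raw = Long.parseUnsignedLong("18446744073709551615");
        System.out.println(raw);                       // prints -1
        System.out.println(renderWorkerSequence(raw)); // prints 18446744073709551615
    }
}
```

A scripted event whose sequence is above `Long.MAX_VALUE` is enough to tell the two renderings apart in the CLI output.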
The 7 synchronous commands stay on the lightweight `FakeSession` seam (YAGNI — no reason
to route them through a server).
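
The `GalaxyClientFactory` seam proposed above might be shaped as in this sketch. Everything here is hypothetical: the real `GalaxyRepositoryClient` is `final`, so the narrow `GalaxyClient` interface, `DiscoverCommand`, and the `scripted` helper only illustrate how a command that receives its factory can be pointed at a test double. In the actual plan the test-side factory would return the real client bound to the in-process channel rather than a scripted object.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

final class GalaxyFactorySeamSketch {

    // Hypothetical narrow view of the galaxy client used by the command.
    interface GalaxyClient extends AutoCloseable {
        List<String> discoverPage(int pageIndex); // an empty page signals the end

        @Override
        void close();
    }

    // The seam: production wiring connects over the network, tests supply their own.
    interface GalaxyClientFactory {
        GalaxyClient connect(String endpoint);
    }

    static final class DiscoverCommand {
        private final GalaxyClientFactory factory;

        DiscoverCommand(GalaxyClientFactory factory) {
            this.factory = factory;
        }

        // Drains every page from the client and closes it afterwards.
        List<String> run(String endpoint) {
            List<String> names = new ArrayList<>();
            try (GalaxyClient client = factory.connect(endpoint)) {
                for (int page = 0; ; page++) {
                    List<String> batch = client.discoverPage(page);
                    if (batch.isEmpty()) {
                        break;
                    }
                    names.addAll(batch);
                }
            }
            return names;
        }
    }

    // Test-side factory serving pre-scripted pages, standing in for the in-process server.
    static GalaxyClientFactory scripted(List<List<String>> pages) {
        return endpoint -> new GalaxyClient() {
            @Override
            public List<String> discoverPage(int pageIndex) {
                if (pageIndex < pages.size()) {
                    return pages.get(pageIndex);
                }
                return Collections.emptyList();
            }

            @Override
            public void close() {
                // Nothing to release in the scripted double.
            }
        };
    }

    public static void main(String[] args) {
        List<List<String>> pages = new ArrayList<>();
        pages.add(Collections.singletonList("Area1"));
        System.out.println(new DiscoverCommand(scripted(pages)).run("in-process"));
    }
}
```

Passing the factory through the constructor keeps the production default untouched while giving the test fixture a single place to substitute the connection.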
**Verify:** `gradle :zb-mom-ww-mxgateway-cli:test --tests '*MxGatewayCliTests'`.

---
## WS-C — Worker-Ready bounded ready-wait
**Problem:** `GetReadyWorkerClient` (`GatewaySession.cs:1665`) fails fast when the session
is `Ready` but the worker client's `WorkerClientState` has diverged (`Handshaking` after a
heartbeat blip, etc.). The both-states diagnostic exists; a brief wait does not.

**Constraint:** the check runs inside the `_syncRoot` lock, so we cannot sleep or poll there.

**Design (pinned decisions):**
- New `GetReadyWorkerClientAsync`: read state under `_syncRoot`; **if** session is `Ready`
but worker is **transient** (`Handshaking`/`Created`), release the lock, poll at a short
interval (e.g. 25 ms) until the worker reaches `Ready` or a bounded timeout elapses, then
re-check under the lock.
- **Terminal worker states (`Faulted`/`Closing`/`Closed`/null) fail fast immediately** —
never wait; retrying a faulted worker is pointless and would mask the fault.
- New config `MxGateway:Sessions:WorkerReadyWaitTimeout` on `GatewaySessionOptions`,
**default `0` = disabled** (preserves today's exact fail-fast behavior unless opted in),
validated `>= 0` by the options validator. Document in `docs/GatewayConfiguration.md`.
- The both-states diagnostic is preserved for the final failure. Callers at
`GatewaySession.cs:918` and `:1263` become `await`.
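
The pinned decisions above can be sketched as follows. The real change is C# inside `GatewaySession`; this is a Java rendering of the control flow only, kept in the same language as the CLI examples. `ReadyWaitSketch`, its `Supplier`-based state source, and the message format are all assumptions, and the session itself is assumed to be `Ready` throughout.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

final class ReadyWaitSketch {

    // Java stand-in for the WorkerClientState values named in the design.
    enum WorkerState { CREATED, HANDSHAKING, READY, FAULTED, CLOSING, CLOSED }

    private final Object syncRoot = new Object();    // plays the role of _syncRoot
    private final Supplier<WorkerState> workerState; // may return null (no worker)
    private final long timeoutMillis;                // 0 disables waiting
    private final long pollMillis;

    ReadyWaitSketch(Supplier<WorkerState> workerState, long timeoutMillis, long pollMillis) {
        if (timeoutMillis < 0) {
            throw new IllegalArgumentException("timeout must be >= 0");
        }
        if (pollMillis <= 0) {
            throw new IllegalArgumentException("poll interval must be > 0");
        }
        this.workerState = workerState;
        this.timeoutMillis = timeoutMillis;
        this.pollMillis = pollMillis;
    }

    // Only these states are worth waiting on; terminal states and null are not.
    private static boolean isTransient(WorkerState state) {
        return state == WorkerState.CREATED || state == WorkerState.HANDSHAKING;
    }

    private WorkerState readUnderLock() {
        synchronized (syncRoot) {
            return workerState.get();
        }
    }

    // Returns normally once the worker is READY; otherwise throws with both states.
    void awaitWorkerReady() {
        WorkerState state = readUnderLock();
        if (timeoutMillis > 0 && isTransient(state)) {
            long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
            while (isTransient(state) && System.nanoTime() - deadline < 0) {
                sleepOutsideLock();      // the lock is not held while sleeping
                state = readUnderLock(); // re-check under the lock
            }
        }
        if (state != WorkerState.READY) {
            throw new IllegalStateException("session=Ready, worker=" + state);
        }
    }

    private void sleepOutsideLock() {
        try {
            Thread.sleep(pollMillis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for worker", e);
        }
    }
}
```

With the timeout at `0` the loop is never entered, so the method performs exactly one state read and fails as it does today; a terminal or missing worker takes the same path regardless of the configured timeout.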
**Tests:**
- Handshaking→Ready within the timeout succeeds (worker invoked once).
- Faulted fails fast with both states in the message, zero waiting.
- Timeout elapses → fails with both states.
- Default `0` → unchanged fail-fast (no wait, no behavior change).

**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~SessionManager"`
(plus the options-validator test class).

---
## WS-D — `FakeWorkerProcess` consolidation
**What:** Replace the private nested `FakeWorkerProcess` in
`SessionWorkerClientFactoryFakeWorkerTests`, `WorkerProcessLauncherTests`, and
`WorkerClientTests` with the canonical `TestSupport/FakeWorkerProcess(int)` (which already
has `MarkExited`/`Kill`/TCS-backed `WaitForExitAsync`). Where a nested copy carries extra
behavior the canonical lacks, fold that into the canonical first, then delete the copies.
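
As a rough picture of what a single shared fake needs to offer, here is a Java analogue of the shape described above, with `CompletableFuture` standing in for the C# `TaskCompletionSource`. It is not the canonical class: the method signatures, the exit-code parameter, and the `-1` exit code on kill are assumptions made for illustration.

```java
import java.util.concurrent.CompletableFuture;

final class FakeWorkerProcessSketch {

    private final int processId;
    // Completed exactly once, by whichever of markExited or kill happens first.
    private final CompletableFuture<Integer> exit = new CompletableFuture<>();
    private volatile boolean killed;

    FakeWorkerProcessSketch(int processId) {
        this.processId = processId;
    }

    int processId() {
        return processId;
    }

    boolean hasExited() {
        return exit.isDone();
    }

    boolean wasKilled() {
        return killed;
    }

    // Test hook: simulate the process exiting on its own.
    void markExited(int exitCode) {
        exit.complete(exitCode);
    }

    // Records the kill request and releases any waiters (assumed exit code -1).
    void kill() {
        killed = true;
        exit.complete(-1);
    }

    // Waiters observe the same future, so completion order is deterministic in tests.
    CompletableFuture<Integer> waitForExitAsync() {
        return exit;
    }

    public static void main(String[] args) {
        FakeWorkerProcessSketch process = new FakeWorkerProcessSketch(4242);
        process.markExited(0);
        System.out.println("exited=" + process.hasExited());
    }
}
```

Any extra behavior found in a nested copy (for example a counter or a configurable delay) would be added to this one class before the copies are deleted, so every test file exercises the same double.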
**Verify:** `dotnet test src/ZB.MOM.WW.MxGateway.Tests --filter "FullyQualifiedName~WorkerClient | FullyQualifiedName~WorkerProcessLauncher | FullyQualifiedName~SessionWorkerClientFactory"`.

---
## Testing & sequencing
Per the targeted-test rule in `CLAUDE.md` (Source Update Workflow): each task runs only
its own filtered tests. Run the full gateway suite at most once, after WS-C + WS-D land.

Out-of-scope items remain recorded in `stillpending.md` (vendor/rig-gated) and the
session-resilience epic (`oldtasks.md`).