e2e: drive each client CLI through one long-lived batch process

The cross-language e2e matrix spawned one CLI process per operation (~250 per client), paying a process (and, for the Java CLI, a full JVM) cold-start every time. The Java leg alone ran ~16 minutes.

Each client CLI (dotnet, go, rust, python, java) gains a `batch` subcommand: a single process that reads one command line from stdin, runs it through the normal subcommand dispatch, writes the JSON result, then a line containing exactly `__MXGW_BATCH_EOR__`. A failing command writes its `{"error":...}` envelope and the loop continues.

run-client-e2e-tests.ps1 now launches one batch process per client and pings every operation through its stdin/stdout, so startup is paid once per client. The orchestration and assertions are unchanged; the parity and auth phases now read the `{"error":...}` envelope instead of a process exit code.

Full 5-client matrix with -VerifyWrite: ~15 min, down from ~35; the Java leg dropped from ~16 min to ~2-3.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
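
For illustration only, the framing described above can be sketched in Python with in-memory streams. This is not the code from any of the five CLIs or from run-client-e2e-tests.ps1. `run_batch`, `read_result`, and `fake_dispatch` are hypothetical names, and the error value is assumed to be a plain string because the message only shows the envelope as `{"error":...}`.

```python
import io
import json

EOR = "__MXGW_BATCH_EOR__"  # sentinel line named in the commit message


def run_batch(stdin, stdout, dispatch):
    """Toy batch loop: one command line in, JSON result plus sentinel out."""
    for line in stdin:
        command = line.strip()
        if not command:
            continue
        try:
            result = dispatch(command)
        except Exception as exc:  # a failing command must not end the loop
            # Assumed envelope shape: the real error payload is not specified.
            result = {"error": str(exc)}
        stdout.write(json.dumps(result) + "\n")
        stdout.write(EOR + "\n")
        stdout.flush()


def read_result(stream):
    """Harness side: collect lines up to the sentinel and parse them as JSON."""
    lines = []
    for line in stream:
        if line.rstrip("\r\n") == EOR:
            return json.loads("".join(lines))
        lines.append(line)
    raise EOFError("batch process exited before writing the sentinel")


def fake_dispatch(command):
    # Hypothetical stand-in for a CLI's real subcommand dispatch.
    if command == "ping":
        return {"ok": True}
    raise ValueError(f"unknown command: {command}")


# Feed three commands through one "process"; the second one fails.
replies = io.StringIO()
run_batch(io.StringIO("ping\nbogus\nping\n"), replies, fake_dispatch)
replies.seek(0)
results = [read_result(replies) for _ in range(3)]
```

The failing middle command yields an error envelope while the commands on either side still succeed. That is the property the parity and auth phases rely on now that there is no per-operation exit code to inspect.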
2026-05-21 06:20:13 -04:00
parent c1ff8c94e8
commit 6126099cdb
10 changed files with 970 additions and 47 deletions
@@ -294,15 +294,22 @@ path and writes a JSON report under `artifacts/e2e/`:
   write support (`MxAccessCommandExecutor` returning `InvalidRequest` for
   `Write`/`Write2`/`WriteSecured`/`WriteSecured2`).

+Each client CLI is driven through one long-lived `batch` process. Every CLI
+exposes a `batch` subcommand: a process that reads one command line from stdin,
+runs it through the normal subcommand dispatch, writes the JSON result, then a
+line containing exactly `__MXGW_BATCH_EOR__`. The harness launches one such
+process per client and pings the ~250 operations of the flow through it, so the
+process cold start (plus, for Java, the JVM cold start) is paid once per client
+instead of once per operation. A command that fails inside the batch process
+writes its `{"error":...}` envelope and the loop continues; the harness treats
+that envelope as the operation failure (used by the parity and auth phases).
+
 Before the per-client phases run, the script builds the .NET CLI
 (`dotnet build`) and installs the Java CLI (`gradle :mxgateway-cli:installDist`)
-once, then invokes the compiled artifacts directly. The matrix issues several
-hundred CLI calls per client; invoking `dotnet run` / `gradle
-:mxgateway-cli:run` per call rebuilds and cold-starts the toolchain every time,
-which stretches the add-item/advise loop long enough for the worker event
-channel to overflow under `FailFast` backpressure. The Go, Rust, and Python
-clients still build on demand (`go run` / `cargo run` / `python -m`) because
-their per-call startup is already sub-second.
+once, so the `batch` process launches straight from the compiled exe / the
+installed launcher. The Go, Rust, and Python batch processes are launched via
+`go run` / `cargo run` / `python -m`, which compile-or-start once when that
+single per-client process starts.

 Build the gateway and worker, start the gateway, and provide a valid API key
 before running the client e2e script: