perf: 18s/turn -> 2.5s/turn (SQLite busy_timeout, parallel state pairs, OpenRouter Cerebras-pinned classifier)

Four changes that compound:

1) **SQLite busy_timeout 5.0s -> 0.1s** in chat/db/connection.py. Root
   cause of the bulk of the slowness. The embedding worker contends
   for the WAL write lock while the request handler holds an open
   transaction; conn.execute's busy-wait does NOT release the GIL, so
   every state_update LLM call after the narrative was silently
   freezing the asyncio event loop for ~5s. With 0.1s the worker
   fails fast and logs (already handled), the chat keeps moving, and
   any missed embedding can be backfilled out of band. Also takes the
   test suite from ~290s to 13s as a bonus.

2) **Parallel state-update pairs** in multi_state_update.py. Each
   directed (src, tgt) pair becomes a coroutine in asyncio.gather
   instead of a sequential for-loop. Returned order is preserved.

3) **Classifier on OpenRouter, provider-pinned to Cerebras**. New
   prefix-based router: model id starting with mlx-community/ ->
   local MLX, model == narrative_model -> narrative remote, anything
   else -> classifier remote. Settings.classifier_provider_order
   populates extra_body for the classifier client only
   (FeatherlessClient now accepts default_extra_body, which is merged
   into every chat.completions.create call). Llama-3.1-8B on Cerebras
   runs at ~423 tok/s, ~10x the default provider. The narrative model
   still routes to mistral-nemo:nitro (Friendli).

4) **Cap classify max_tokens at 512**. A misbehaving classifier
   (response_format=json_object ignored) could otherwise generate
   thousands of tokens of prose before classify's JSON validation
   trips the retry. 512 is generous; usual completions are 50-150.

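Change 2 in sketch form, assuming a hypothetical `update_pair` coroutine standing in for the per-pair LLM call (the real function in multi_state_update.py is not shown here). `asyncio.gather` returns results in the order the awaitables were passed, regardless of which one finishes first, which is why the returned order is preserved.

```python
import asyncio
from itertools import permutations


async def update_pair(src: str, tgt: str) -> str:
    """Hypothetical stand-in for one state-update LLM call."""
    # Vary the delay so completion order differs from submission order.
    await asyncio.sleep(0.01 * len(tgt))
    return f"{src}->{tgt}"


async def update_all_pairs(names: list[str]) -> list[str]:
    pairs = list(permutations(names, 2))  # every directed (src, tgt) pair
    # All pairs run concurrently; results come back in `pairs` order.
    return await asyncio.gather(*(update_pair(s, t) for s, t in pairs))


results = asyncio.run(update_all_pairs(["ann", "bo", "cleo"]))
```

With N characters there are N*(N-1) directed pairs, so wall-clock time drops from the sum of the call latencies to roughly the slowest single call.
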
CHAT_LLM_TIMING=1 env var enables per-call timing logs on stderr;
zero overhead when unset. Useful for finding the slow link.
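
One way to get per-call timing with no cost when the variable is unset is to decide at decoration time and hand back the original function untouched. The sketch below is an assumption about how such a switch could look, not the code in this commit; `timed`, `_TIMING_ENABLED`, and `fake_llm_call` are hypothetical names.

```python
import asyncio
import functools
import os
import sys
import time

# Read once at import; only the variable name comes from the commit message.
_TIMING_ENABLED = os.environ.get("CHAT_LLM_TIMING") == "1"


def timed(label: str, enabled: bool = _TIMING_ENABLED):
    """Decorator for async callables; a no-op unless timing is enabled."""
    def decorate(fn):
        if not enabled:
            return fn  # unwrapped: no extra frame, no clock reads

        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return await fn(*args, **kwargs)
            finally:
                elapsed = time.perf_counter() - start
                print(f"[llm-timing] {label}: {elapsed:.3f}s", file=sys.stderr)

        return wrapper
    return decorate


async def fake_llm_call() -> str:
    """Hypothetical stand-in for an LLM request."""
    await asyncio.sleep(0.01)
    return "ok"


plain = timed("classify", enabled=False)(fake_llm_call)
wrapped = timed("classify", enabled=True)(fake_llm_call)
```

When disabled, `plain` is the very same function object as `fake_llm_call`, so the hot path is unchanged.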

Suite: 464 passed in 13s (was 290s).
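
For reference, the three-way routing rule described in change 3 can be sketched as below. This is an illustration only: `Route` and `backend_for` are hypothetical names, and the real RoutedLLMClient dispatches to client objects rather than returning labels.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """Hypothetical routing table; field names are illustrative."""
    narrative_model: str
    local_prefixes: tuple[str, ...] = ("mlx-community/",)

    def backend_for(self, model: str) -> str:
        # str.startswith accepts a tuple and matches any of its entries.
        if model.startswith(self.local_prefixes):
            return "local"
        if model == self.narrative_model:
            return "narrative"
        return "classifier"


route = Route(narrative_model="provider/big-model")
```

Checking the local prefix first means a local model id is never sent to a remote backend, even if the rest of its path resembles a remote provider id.
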
Joseph Doherty
2026-04-27 13:51:27 -04:00
parent d656ee8805
commit de7f6624f0
9 changed files with 280 additions and 69 deletions
+32 -18
@@ -36,58 +36,72 @@ class _StubClient:
 
 
 @pytest.mark.asyncio
-async def test_router_generate_dispatches_narrative_to_narrative_backend():
+async def test_router_generate_routes_remote_model_to_remote_backend():
+    """Any model id NOT starting with a local prefix goes to the remote
+    backend — narrative model, remote classifiers, anything else."""
     narrative = _StubClient("narrative")
     local = _StubClient("local")
     router = RoutedLLMClient(
         narrative=narrative,
         local=local,
-        narrative_model="big-model",
+        narrative_model="provider/big-model",
+        local_prefixes=("mlx-community/",),
     )
 
-    out = await router.generate([Message(role="user", content="hi")], model="big-model")
+    out = await router.generate(
+        [Message(role="user", content="hi")], model="provider/big-model"
+    )
 
-    assert out == "narrative:big-model"
-    assert narrative.generate_calls == ["big-model"]
+    assert out == "narrative:provider/big-model"
+    assert narrative.generate_calls == ["provider/big-model"]
     assert local.generate_calls == []
 
 
 @pytest.mark.asyncio
-async def test_router_generate_dispatches_classifier_to_local_backend():
+async def test_router_generate_routes_local_prefix_to_local_backend():
+    """Models prefixed with a local prefix (e.g. ``mlx-community/``)
+    go to the local MLX backend regardless of whether the rest of the
+    path looks like a remote provider id."""
     narrative = _StubClient("narrative")
     local = _StubClient("local")
     router = RoutedLLMClient(
         narrative=narrative,
         local=local,
-        narrative_model="big-model",
+        narrative_model="provider/big-model",
+        local_prefixes=("mlx-community/",),
     )
 
     out = await router.generate(
-        [Message(role="user", content="hi")], model="small-model"
+        [Message(role="user", content="hi")],
+        model="mlx-community/Hermes-3-Llama-3.1-8B-8bit",
     )
 
-    assert out == "local:small-model"
-    assert local.generate_calls == ["small-model"]
+    assert out == "local:mlx-community/Hermes-3-Llama-3.1-8B-8bit"
+    assert local.generate_calls == ["mlx-community/Hermes-3-Llama-3.1-8B-8bit"]
     assert narrative.generate_calls == []
 
 
 @pytest.mark.asyncio
-async def test_router_stream_dispatches_by_model():
+async def test_router_stream_dispatches_by_prefix():
     narrative = _StubClient("narrative")
     local = _StubClient("local")
     router = RoutedLLMClient(
-        narrative=narrative, local=local, narrative_model="big-model"
+        narrative=narrative,
+        local=local,
+        narrative_model="provider/big-model",
+        local_prefixes=("mlx-community/",),
     )
 
-    chunks_big = [c async for c in router.stream(
-        [Message(role="user", content="hi")], model="big-model"
+    chunks_remote = [c async for c in router.stream(
+        [Message(role="user", content="hi")], model="provider/big-model"
     )]
-    chunks_small = [c async for c in router.stream(
-        [Message(role="user", content="hi")], model="other-model"
+    chunks_local = [c async for c in router.stream(
+        [Message(role="user", content="hi")],
+        model="mlx-community/Hermes-3-Llama-3.1-8B-8bit",
     )]
 
-    assert chunks_big == ["narrative:big-model"]
-    assert chunks_small == ["local:other-model"]
+    assert chunks_remote == ["narrative:provider/big-model"]
+    assert chunks_local == ["local:mlx-community/Hermes-3-Llama-3.1-8B-8bit"]
 
 
 @pytest.mark.asyncio