dohertj2/chat

Commit Graph

Author

SHA1

Message

Date

Joseph Doherty

de7f6624f0

perf: 18s/turn -> 2.5s/turn (SQLite busy_timeout, parallel state pairs, OpenRouter Cerebras-pinned classifier)

Four changes that compound:

1) **SQLite busy_timeout 5.0s -> 0.1s** in chat/db/connection.py. Root
   cause of the bulk of the slowness. The embedding worker contends
   for the WAL write lock while the request handler holds an open
   transaction; conn.execute's busy-wait does NOT release the GIL, so
   every state_update LLM call after the narrative was silently
   freezing the asyncio event loop for ~5s. With 0.1s the worker
   fails fast and logs (already handled), the chat keeps moving, and
   any missed embedding can be backfilled out of band. As a bonus, the
   test suite drops from ~290s to 13s.

2) **Parallel state-update pairs** in multi_state_update.py. Each
   directed (src, tgt) pair becomes a coroutine in asyncio.gather
   instead of a sequential for-loop. Returned order is preserved.

3) **Classifier on OpenRouter, provider-pinned to Cerebras**. New
   prefix-based router: a model id starting with mlx-community/ ->
   local MLX, model == narrative_model -> narrative remote, else ->
   classifier
   remote. Settings.classifier_provider_order populates extra_body for
   the classifier client only (FeatherlessClient now accepts
   default_extra_body to merge into every chat.completions.create).
   Llama-3.1-8B on Cerebras runs at ~423 tok/s, ~10x the default
   provider. Narrative traffic still routes to mistral-nemo:nitro (Friendli).

4) **Cap classify max_tokens at 512**. A misbehaving classifier
   (response_format=json_object ignored) could otherwise generate
   thousands of tokens of prose before classify's JSON validation
   trips the retry. 512 is generous; usual completions are 50-150.
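
The fail-fast behaviour described in change 1 can be reproduced with the standard library alone. This is a minimal sketch, not the code in chat/db/connection.py: the `connect` and `try_write` helpers and the table name are hypothetical, and it assumes the timeout is set through the `timeout` argument of `sqlite3.connect` (seconds; the library default is 5.0).

```python
import os
import sqlite3
import tempfile
import time


def connect(path: str, busy_timeout_s: float = 0.1) -> sqlite3.Connection:
    # `timeout` is how long sqlite3 waits on a locked database before raising.
    # isolation_level=None means we manage transactions explicitly.
    conn = sqlite3.connect(path, timeout=busy_timeout_s, isolation_level=None)
    conn.execute("PRAGMA journal_mode=WAL")
    return conn


def try_write(conn: sqlite3.Connection, sql: str, params: tuple = ()) -> bool:
    """Attempt a write; return False instead of blocking when the DB is locked."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.OperationalError as exc:
        if "locked" in str(exc):
            return False  # the caller can log this and backfill later
        raise


path = os.path.join(tempfile.mkdtemp(), "demo.db")
handler = connect(path)  # stands in for the request handler
worker = connect(path)   # stands in for the embedding worker

handler.execute("CREATE TABLE t (x INTEGER)")
handler.execute("BEGIN IMMEDIATE")  # handler takes the write lock
handler.execute("INSERT INTO t VALUES (1)")

start = time.monotonic()
worker_ok = try_write(worker, "INSERT INTO t VALUES (2)")
elapsed = time.monotonic() - start  # roughly the busy timeout, not 5 s

handler.execute("COMMIT")
retry_ok = try_write(worker, "INSERT INTO t VALUES (2)")  # lock is free now
```

With the short timeout the worker gives up after about 0.1s while the handler's transaction is open, and the same write succeeds once the handler commits.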
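
Change 2 relies on `asyncio.gather` returning results in the order the awaitables were passed, regardless of which finishes first. The sketch below illustrates that property; `update_pair` is a hypothetical stand-in for one state-update LLM call, and the delays are arranged so that later pairs finish earlier.

```python
import asyncio
from itertools import permutations


async def update_pair(src: str, tgt: str, delay: float) -> tuple:
    # Hypothetical stand-in for one directed state-update LLM call.
    await asyncio.sleep(delay)
    return (src, tgt)


async def update_all(characters: list) -> list:
    pairs = list(permutations(characters, 2))  # every directed (src, tgt) pair
    # Earlier pairs sleep longer, so completion order is reversed on purpose.
    coros = [
        update_pair(src, tgt, delay=0.01 * (len(pairs) - i))
        for i, (src, tgt) in enumerate(pairs)
    ]
    # gather runs the coroutines concurrently and keeps the input order.
    return await asyncio.gather(*coros)


results = asyncio.run(update_all(["a", "b", "c"]))
```

Total wall time here is bounded by the slowest call rather than the sum of all calls, which is the source of the speedup when each call is an independent network request.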
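
The routing rule in change 3 can be written as a small pure function. This is a sketch under assumptions: the function names and backend labels are hypothetical, and the `{"provider": {"order": [...]}}` shape follows OpenRouter's provider-routing request field as I understand it, not anything confirmed by this repository.

```python
from typing import Any, Dict, List, Optional

LOCAL_PREFIX = "mlx-community/"


def route(model: str, narrative_model: str) -> str:
    """Pick a backend label for a model id (labels are illustrative)."""
    if model.startswith(LOCAL_PREFIX):
        return "local_mlx"
    if model == narrative_model:
        return "narrative_remote"
    return "classifier_remote"


def provider_extra_body(provider_order: List[str]) -> Dict[str, Any]:
    # Assumed shape of the provider preference sent in the request body.
    if not provider_order:
        return {}
    return {"provider": {"order": list(provider_order)}}


def merge_extra_body(
    default: Dict[str, Any], per_call: Optional[Dict[str, Any]] = None
) -> Dict[str, Any]:
    # Per-call keys win over the client-wide defaults (shallow merge).
    merged = dict(default)
    merged.update(per_call or {})
    return merged


narrative = "mistral-nemo:nitro"
classifier_defaults = provider_extra_body(["Cerebras"])
```

Only the classifier client would carry `classifier_defaults`; the narrative and local clients would be built with an empty default so their requests are unchanged.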

CHAT_LLM_TIMING=1 env var enables per-call timing logs on stderr;
zero overhead when unset. Useful for finding the slow link.
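
One way to get env-gated timing like this is a context manager that does nothing measurable unless the flag is set. The sketch below is an assumption about the approach, not the repository's code; the `timed` helper and the log format are hypothetical.

```python
import os
import sys
import time
from contextlib import contextmanager
from typing import Iterator, Optional


@contextmanager
def timed(label: str, enabled: Optional[bool] = None) -> Iterator[None]:
    """Log the wall time of the wrapped block to stderr when timing is on."""
    if enabled is None:
        enabled = os.environ.get("CHAT_LLM_TIMING") == "1"
    if not enabled:
        yield  # no clock reads, no logging
        return
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed = time.perf_counter() - start
        print(f"[llm-timing] {label}: {elapsed:.3f}s", file=sys.stderr)
```

Wrapping each LLM call as `with timed("classify"): ...` then shows which call dominates a turn when the process is started with CHAT_LLM_TIMING=1.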

Suite: 464 passed in 13s (was 290s).

2026-04-27 13:51:27 -04:00

Joseph Doherty

fe9c497038

feat: split classifier + embeddings to local mlx-omni-server, narrative stays on Featherless

Adds RoutedLLMClient that dispatches by model name: requests matching
Settings.narrative_model go to Featherless, everything else (classifier
calls, embed) goes to a local MLX server. The local server is
mlx-omni-server (separate venv at .mlx-venv) and exposes the standard
OpenAI surface at http://127.0.0.1:10240/v1.

LocalMLXClient mirrors FeatherlessClient (AsyncOpenAI under the hood)
but with a working embed() — Featherless's /v1/embeddings always
returns 500 with completions_error, so the router unconditionally
sends embed traffic to the local backend.
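
The dispatch described above can be sketched with stub backends. Everything here is illustrative: `StubBackend` and `RoutedClient` are hypothetical names and method signatures, not the real FeatherlessClient, LocalMLXClient, or RoutedLLMClient interfaces.

```python
import asyncio
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class StubBackend:
    # Records calls instead of talking to a real server.
    name: str
    calls: List[Tuple[str, str]] = field(default_factory=list)

    async def chat(self, model: str, messages: list) -> str:
        self.calls.append(("chat", model))
        return f"{self.name}:{model}"

    async def embed(self, model: str, texts: List[str]) -> List[List[float]]:
        self.calls.append(("embed", model))
        return [[0.0] * 384 for _ in texts]


class RoutedClient:
    def __init__(self, narrative_model: str, remote: StubBackend, local: StubBackend):
        self.narrative_model = narrative_model
        self.remote = remote
        self.local = local

    async def chat(self, model: str, messages: list) -> str:
        # Only the narrative model goes remote; everything else stays local.
        backend = self.remote if model == self.narrative_model else self.local
        return await backend.chat(model, messages)

    async def embed(self, model: str, texts: List[str]) -> List[List[float]]:
        # Embeddings always go to the local backend.
        return await self.local.embed(model, texts)


remote, local = StubBackend("remote"), StubBackend("local")
client = RoutedClient("narrative-model", remote, local)
narrative_reply = asyncio.run(client.chat("narrative-model", []))
classifier_reply = asyncio.run(client.chat("classifier-model", []))
vectors = asyncio.run(client.embed("embed-model", ["a", "b"]))
```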

Production deployment overrides via data/config.toml:
- classifier_model = mlx-community/Hermes-3-Llama-3.1-8B-8bit (~8 GB)
- embedding_model = mlx-community/bge-small-en-v1.5-bf16 (~150 MB,
  384 dim — matches existing schema, no migration)

Defaults stay remote / pseudo so fresh installs and tests need no
external infra. Smoke-tested live: classifier returns expected output,
BGE produces correctly-clustering 384-dim vectors (cat-on-mat closer
to cat-on-rug than to quantum-mechanics).
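
A clustering smoke test of this kind usually comes down to comparing cosine similarities. The sketch below shows the comparison only; the three vectors are hand-made 3-dimensional toys for illustration, not output from BGE or any other model.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def closer_to(anchor: Sequence[float], near: Sequence[float], far: Sequence[float]) -> bool:
    # True when `anchor` is more similar to `near` than to `far`.
    return cosine(anchor, near) > cosine(anchor, far)


# Toy stand-ins for the three sentence embeddings (not real model output).
cat_on_mat = [0.9, 0.1, 0.0]
cat_on_rug = [0.8, 0.2, 0.1]
quantum_mechanics = [0.0, 0.1, 0.9]

clusters_correctly = closer_to(cat_on_mat, cat_on_rug, quantum_mechanics)
```

With real 384-dim embeddings the same `closer_to` check applies unchanged, since the function does not depend on the vector length.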

scripts/start_mlx_server.sh starts the daemon (foreground or --daemon).
.mlx-venv/ added to .gitignore.

Suite: 464 passed (was 457 → +7 new across LocalMLXClient + Router).

2026-04-27 12:05:41 -04:00

2 Commits