Adds RoutedLLMClient that dispatches by model name: requests matching
Settings.narrative_model go to Featherless, everything else (classifier
calls, embed) goes to a local MLX server. The local server is
mlx-omni-server (separate venv at .mlx-venv) and exposes the standard
OpenAI surface at http://127.0.0.1:10240/v1.
LocalMLXClient mirrors FeatherlessClient (AsyncOpenAI under the hood)
but with a working embed() — Featherless's /v1/embeddings always
returns 500 with completions_error, so the router unconditionally
sends embed traffic to the local backend.
Production deployment overrides via data/config.toml:
- classifier_model = mlx-community/Hermes-3-Llama-3.1-8B-8bit (~8 GB)
- embedding_model = mlx-community/bge-small-en-v1.5-bf16 (~150 MB,
384 dim — matches existing schema, no migration)
Defaults stay remote / pseudo so fresh installs and tests need no
external infra. Smoke-tested live: classifier returns expected output,
BGE produces correctly-clustering 384-dim vectors (cat-on-mat closer
to cat-on-rug than to quantum-mechanics).
scripts/start_mlx_server.sh starts the daemon (foreground or --daemon).
.mlx-venv/ added to .gitignore.
Suite: 464 passed (was 457 → +7 new across LocalMLXClient + Router).
- .gitignore: add *.egg-info/ so editable installs don't show in git status.
- pyproject.toml: add [build-system] and [tool.setuptools.packages.find]
scoped to chat*, fixing pip install -e . which was failing on data/
auto-discovery.
- CLAUDE.md: add Phase 1.5 cleanup backlog section under Phase 1 status,
capturing the small follow-ups surfaced in implementer reviews
(open_db refactor, regenerate SSE broadcast, you-activity purge,
drawer edits for deferred fields, NICE trim order).
Resolves the open/deferred decisions from the v1 requirements brainstorm:
runtime stack, classifier model, token budgets, OOC marker, data layout.
- Runtime: FastAPI + HTMX + SSE (multi-tab sync is a Phase 1 requirement,
not a polish item). 127.0.0.1 only, no auth in v1.
- Classifier model: NousResearch/Hermes-3-Llama-3.1-8B with documented
fallback chain (dolphin-2.9.4-llama3-8b, Meta-Llama-3.1-8B-abliterated).
- Token budgets: 8K hard / 6K soft for narrative, 4K hard for classifier;
Must/Should/Nice trimming tiers spelled out in §3.2.
- OOC marker locked to ((double parens)), configurable.
- All runtime data lives under <repo>/data/ (DB, backups, snapshots,
exports, config). Tree is gitignored. CHAT_DB_PATH env var honored.
CLAUDE.md and the requirements doc updated to match. Decisions log in
the requirements doc appendix extended with the new locks (#17–21).
- docs/plans/2026-04-26-v1-requirements-design.md captures the v1
product requirements and behavioral spec from the initial brainstorm
(use case, scope, data model, authoring, play loop, memory, time,
rollback, phase cut, non-negotiable rules).
- README.md introduces the project for the gitea repo.
- CLAUDE.md updated to reference the requirements doc.
- .gitignore added for macOS metadata.