docs: lock remaining v1 design decisions

Resolves the open/deferred decisions from the v1 requirements brainstorm: runtime stack, classifier model, token budgets, OOC marker, data layout.

- Runtime: FastAPI + HTMX + SSE (multi-tab sync is a Phase 1 requirement, not a polish item). 127.0.0.1 only, no auth in v1.
- Classifier model: NousResearch/Hermes-3-Llama-3.1-8B with documented fallback chain (dolphin-2.9.4-llama3-8b, Meta-Llama-3.1-8B-Instruct-abliterated).
- Token budgets: 8K hard / 6K soft for narrative, 4K hard for classifier; Must/Should/Nice trimming tiers spelled out in §3.2.
- OOC marker locked to ((double parens)), configurable.
- All runtime data lives under <repo>/data/ (DB, backups, snapshots, exports, config). Tree is gitignored. CHAT_DB_PATH env var honored.

CLAUDE.md and the requirements doc updated to match. Decisions log in the requirements doc appendix extended with the new locks (#17–21).
2026-04-26 10:56:51 -04:00
parent 2f94ba7291
commit 5869f1c5ce
3 changed files with 83 additions and 13 deletions
@@ -1 +1,4 @@
 .DS_Store
+
+# v1 runtime data (DB, backups, snapshots, exports, config with secrets)
+data/
@@ -23,9 +23,22 @@ The 3-entity cap is load-bearing: it makes the relationship graph fully enumerab
 ## Architecture

 - **Mac (always-on)**: web UI, orchestrator, persistence, event queue, retrieval, prompt construction, all state.
- **Inference endpoint**: stateless `generate(prompt, params) -> text`. Swap implementations (cloud API, rented GPU, local MLX/llama.cpp) behind one interface. The orchestrator never knows which.
+- **Inference endpoint**: stateless `generate(prompt, params) -> text`. Swap implementations behind one interface. The orchestrator never knows which.
 - Streaming required for UX.

+## Runtime stack (locked for v1)
+
+- **Backend**: Python 3.11+ with **FastAPI**.
+- **Frontend**: server-rendered HTML + **HTMX** + minimal vanilla JS/CSS. No JS build chain.
+- **Live updates**: SSE per chat. Per-chat `asyncio.Queue` pub/sub. Multi-tab sync is a Phase 1 requirement — two browser tabs on the same chat must mirror each other live (streamed tokens, drawer state, edge updates).
+- **Inference backend**: **Featherless** (OpenAI-compatible API).
+  - `narrative_model` = `dphn/Dolphin-Mistral-24B-Venice-Edition` (32K ctx, uncensored).
+  - `classifier_model` = `NousResearch/Hermes-3-Llama-3.1-8B` (128K ctx, uncensored, structured-output reliable). Fallbacks: `cognitivecomputations/dolphin-2.9.4-llama3-8b` → `mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated`.
+- **Token budgets**: narrative 8K hard / 6K soft; classifier 4K hard. Trim tiers must / should / nice — never trim must-include.
+- **OOC marker**: `((double parens))` (configurable).
+- **Data layout**: everything under `<repo>/data/` — `chat.db`, `backups/`, `snapshots/`, `exports/`, `config.toml`. The whole tree is `.gitignore`d. `CHAT_DB_PATH` env var honored as override.
+- **Auth**: bind to `127.0.0.1` only in v1. No auth.
+
 ## Core concepts (vocabulary)

 - **Entity**: `you | botA | botB`. Has identity (immutable), state (mood/goals/status), activity, per-POV memory.
@@ -23,8 +23,9 @@ The LLM is treated as a **renderer** for structured world state, not as the stat
 - **One chat per bot.** A second bot can be added as a *guest* into any chat. Hard cap: **2 bots in any scene**.
 - Explicit / mature content allowed.
 - **Featherless** as the LLM backend over its OpenAI-compatible API. Two model slots:
-  - `narrative_model` — Dolphin-Mistral-24B-Venice (uncensored, narrative-grade).
-  - `classifier_model` — small (~3B-class), TBD at Phase 1 start. Used for parsing, significance, interjection, scene-close detection, state-update passes.
+  - `narrative_model` — `dphn/Dolphin-Mistral-24B-Venice-Edition` (uncensored, narrative-grade). 32K context.
+  - `classifier_model` — `NousResearch/Hermes-3-Llama-3.1-8B` (uncensored, tuned for tool use / structured output). 128K context. Fallback chain if it underperforms on JSON: `cognitivecomputations/dolphin-2.9.4-llama3-8b` → `mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated`.
+  - Classifier is used for: turn parsing (dialogue/action/ooc), kickoff prose parsing, scene-close detection, interjection decisions, significance scoring, state-update extraction, jump-skip memory synthesis.

 ### 2.2 Out of scope (v1)

@@ -57,6 +58,41 @@ The orchestrator never knows which model is in use — only `generate(prompt, pa

 API key handling: keys live in a local config file outside the repository. **Never** commit a key to the repo, paste in chat logs, or include in exports.

+### 3.1 Runtime stack
+
+- **Backend**: Python 3.11+ with **FastAPI** as the HTTP server.
+- **Frontend**: server-rendered HTML + **HTMX** + minimal vanilla JS/CSS. No JS build chain.
+- **Live updates**: Server-Sent Events (SSE) per chat. Server keeps a per-chat in-process pub/sub channel (an `asyncio.Queue` per chat_id). Every browser tab on `/chats/<id>` opens an SSE connection to `/chats/<id>/events`. State changes (new turn, streamed tokens, drawer state, edge updates, scene close) publish to the channel; all subscribed tabs receive the event and HTMX swaps the relevant DOM region.
+- **Multi-tab sync** is a Phase 1 requirement, not a polish item. Two browser tabs open to the same chat must mirror each other in real time. Implications:
+  - In-progress typing is tab-local until submit (no collaborative input in v1).
+  - On reconnect/refresh, the server first sends a "current state" snapshot, then resumes streaming.
+  - The same architecture trivially supports a phone or tablet on the LAN later — bind to `0.0.0.0` + add a shared-secret token if/when desired. Default is `127.0.0.1`, no auth.
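A minimal sketch of the per-chat pub/sub described above, not the project's implementation. One detail is an assumption: a single `asyncio.Queue` hands each item to only one consumer, so this sketch keeps one queue per connected tab under each `chat_id`. The names `ChatHub` and `format_sse` are hypothetical.

```python
import asyncio
import json
from collections import defaultdict


class ChatHub:
    """In-process pub/sub: each chat_id maps to one queue per connected tab."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(set)  # chat_id -> set of asyncio.Queue

    def subscribe(self, chat_id: str) -> asyncio.Queue:
        """Register a new tab and return the queue its SSE handler reads from."""
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[chat_id].add(queue)
        return queue

    def unsubscribe(self, chat_id: str, queue: asyncio.Queue) -> None:
        """Drop a tab's queue when its SSE connection closes."""
        self._subscribers[chat_id].discard(queue)
        if not self._subscribers[chat_id]:
            del self._subscribers[chat_id]

    def publish(self, chat_id: str, event: dict) -> int:
        """Fan an event out to every tab on this chat; return the tab count."""
        queues = self._subscribers.get(chat_id, ())
        for queue in queues:
            queue.put_nowait(event)
        return len(queues)


def format_sse(event: dict) -> str:
    """Serialize an event as one SSE frame (event name plus JSON data)."""
    return f"event: {event['type']}\ndata: {json.dumps(event['payload'])}\n\n"
```

In a FastAPI handler for `/chats/<id>/events`, each connection would call `subscribe`, loop on `await queue.get()`, yield `format_sse(...)` frames, and call `unsubscribe` on disconnect. The "current state" snapshot on reconnect would be sent before entering that loop.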
+
+### 3.2 Token budgets and trimming tiers
+
+Token accounting uses `tiktoken`'s `cl100k_base` encoding as the closest available approximation. Mistral and Llama tokenizers diverge ~5%; we accept the drift.
+
+- **Narrative prompt**: 8K hard ceiling, 6K soft target. Leaves ~2-4K headroom for streamed output and avoids long-context performance cliffs. Plenty for our prompt shape.
+- **Classifier prompt**: 4K hard ceiling. Most calls are well under 1K.
+
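+When the assembled prompt exceeds the soft target, trim by tier: NICE-include first, then SHOULD-include. MUST-include is never trimmed. The tiers below are listed from highest to lowest priority: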
+
+- **MUST-include** (always present):
+  - System message + speaker identity
+  - Speaker's edge to the addressee
+  - Activity snapshot for all present entities
+  - Current scene description
+  - Last 4 turns of dialogue
+- **SHOULD-include** (trim when over budget):
+  - Other edges of the speaker (e.g. speaker → other present)
+  - Group node summary (when applicable)
+  - Active threads
+  - Currently active events + props
+- **NICE-include** (trim first):
+  - Retrieved memories beyond top-2 (drop K=4 to K=2)
+  - Dialogue turns beyond the last 4 (replace older turns with a one-line summary)
+  - Per-POV summary of the previous scene
+
 ## 4. Data Model (top-level entities)

 - **Bot** — top-level persistent unit. Has identity (immutable per session), state (mood/goals/status), per-bot clock, kickoff spec.
@@ -113,7 +149,7 @@ A turn is free-form prose with conventional markers:

 - `*walks over*` — action.
 - Quoted or bare text — dialogue.
- `((double parens))` — out-of-character commentary or meta-instruction. Flagged but not sent to the bot. (Default; configurable before play begins.)
+- `((double parens))` — out-of-character commentary or meta-instruction. Flagged but not sent to the bot. (Default; stored as a config field; the user may change it before play begins.)

 A small classifier call splits the turn into segments tagged `dialogue | action | ooc`. Action segments update the user's activity record.
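As an illustration of the configurable OOC marker only: the spec assigns segment tagging to a classifier call, so the regex pass below is an assumed pre-filter, not the documented mechanism. `split_ooc` and its parameters are hypothetical names.

```python
import re


def split_ooc(turn, open_marker="((", close_marker="))"):
    """Return (in_character_text, ooc_segments) for one user turn."""
    # Escape the markers so any configured delimiter is matched literally.
    pattern = re.compile(
        re.escape(open_marker) + r"(.*?)" + re.escape(close_marker),
        re.DOTALL,
    )
    ooc_segments = [segment.strip() for segment in pattern.findall(turn)]
    # OOC spans are removed from the text that would be sent to the bot.
    in_character = pattern.sub("", turn)
    return in_character, ooc_segments
```

Because the markers are passed in rather than hard-coded, changing the OOC marker in config would only change the two arguments.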

@@ -236,9 +272,16 @@ Phase 1 has no skips and no events. Time is set at kickoff and stays put unless
 ## 12. Persistence & Ops (v1 defaults)

 - SQLite WAL mode, foreign keys on, transactional turns.
- Single DB file. Default path TBD (likely `~/Library/Application Support/chat/chat.db`).
+- **Project-folder layout** (DB lives inside the repo, gitignored):
+  - DB: `<repo>/data/chat.db`
+  - Backups: `<repo>/data/backups/` (timestamped copies)
+  - Pre-rewind snapshots: `<repo>/data/snapshots/`
+  - Significant-scene JSON exports: `<repo>/data/exports/`
+  - Config: `<repo>/data/config.toml` (holds Featherless API key, model names, OOC marker, K, budget, etc. Gitignored.)
+  - The entire `data/` tree is in `.gitignore` so secrets and state never get committed.
+  - `CHAT_DB_PATH` env var honored as an override if you want to point at a different file (e.g., a backup or a sibling repo's data).
 - **Auto-backup** nightly via launchd. Timestamped copies. Last 14 retained. Pre-rewind snapshots are separate and not pruned.
- **Significant-scene JSON exports** written to a sibling folder when scenes close at significance ≥ 2.
+- **Significant-scene JSON exports** written to `data/exports/` when scenes close at significance ≥ 2.
 - Schema versioned in a `meta` table; migrations applied on startup.
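A small sketch of how the DB path and connection settings listed above might be resolved at startup, assuming the default `<repo>/data/chat.db` layout and the `CHAT_DB_PATH` override. `resolve_db_path` and `open_db` are hypothetical helper names.

```python
import os
import sqlite3
from pathlib import Path
from typing import Mapping, Optional


def resolve_db_path(repo_root: Path, env: Optional[Mapping[str, str]] = None) -> Path:
    """Use CHAT_DB_PATH if set, otherwise <repo>/data/chat.db."""
    env = os.environ if env is None else env
    override = env.get("CHAT_DB_PATH")
    return Path(override) if override else repo_root / "data" / "chat.db"


def open_db(path: Path) -> sqlite3.Connection:
    """Open the DB with WAL journaling and foreign-key enforcement on."""
    path.parent.mkdir(parents=True, exist_ok=True)
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")
    # foreign_keys is a per-connection setting, so set it on every connection.
    conn.execute("PRAGMA foreign_keys=ON")
    return conn
```

Passing `env` explicitly keeps the override testable without mutating the process environment.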

 ## 13. Phase Cut
@@ -290,13 +333,19 @@ Phase 1 has no skips and no events. Time is set at kickoff and stays put unless

 ## 14. Open / Deferred Decisions

- Exact small classifier model name on Featherless (pick at start of Phase 1: cheapest model that's good enough at structured-output classification).
- Token budget tier strategy (must-include / should-include / nice-to-include) — designed against real prompts during Phase 1.
- UI framework — TBD; local web app is the default direction.
- OOC marker (`((parens))` proposed as default; user may change before play begins).
- DB file location.
- Embedding model choice (Phase 4).
- sqlite-vss vs sqlite-vec (Phase 4).
+Resolved by this brainstorm (now reflected in §3 / §6 / §12 above):
+- ~~Classifier model name~~ → `NousResearch/Hermes-3-Llama-3.1-8B`, with documented fallback chain.
+- ~~Token budget tier strategy~~ → §3.2 (8K / 6K narrative, 4K classifier; must / should / nice tiers).
+- ~~UI framework~~ → FastAPI + HTMX + SSE, multi-tab sync as a Phase 1 requirement (§3.1).
+- ~~OOC marker~~ → `((double parens))`, configurable.
+- ~~DB file location~~ → project-folder `<repo>/data/` tree (§12).
+
+Still deferred:
+- **Embedding model** (Phase 4 — pick whatever's cheap and good enough on Featherless or local at the time).
+- **sqlite-vss vs sqlite-vec** (Phase 4 — pick based on the projects' state at the time).
+- **Significance scoring rubric** — what does 0/1/2/3 mean? Drafted during Phase 1 against real scenes.
+- **Activity-record action verbs** — open vocabulary or constrained list? Decided during Phase 1 implementation.
+- **Drawer edit-affordance UX** — which fields editable in v1, which slip to Phase 1.5 / Phase 4.

 ## 15. Non-Negotiables (rules every implementer must respect)

@@ -331,3 +380,8 @@ Phase 1 has no skips and no events. Time is set at kickoff and stays put unless
 | 14 | Model strategy | Small classifier model + large narrative model |
 | 15 | Reset | Full wipe + hard confirm; chat sits ready for kickoff |
 | 16 | Rollback | Rewind + regenerate (with edit-then-regenerate) |
+| 17 | UI framework | FastAPI + HTMX + SSE; multi-tab sync as a Phase 1 requirement |
+| 18 | Classifier model | `NousResearch/Hermes-3-Llama-3.1-8B` (fallbacks: `dolphin-2.9.4-llama3-8b`, `Meta-Llama-3.1-8B-Instruct-abliterated`) |
+| 19 | Token budgets | Narrative 8K hard / 6K soft; classifier 4K hard. Must/Should/Nice tiers per §3.2 |
+| 20 | OOC marker | `((double parens))`, configurable |
+| 21 | DB location | Project-folder `<repo>/data/` tree (DB, backups, snapshots, exports, config). Gitignored. `CHAT_DB_PATH` env var honored |