Resolves the open/deferred decisions from the v1 requirements brainstorm: runtime stack, classifier model, token budgets, OOC marker, data layout.
- Runtime: FastAPI + HTMX + SSE (multi-tab sync is a Phase 1 requirement, not a polish item). 127.0.0.1 only, no auth in v1.
- Classifier model: NousResearch/Hermes-3-Llama-3.1-8B with documented fallback chain (dolphin-2.9.4-llama3-8b, Meta-Llama-3.1-8B-abliterated).
- Token budgets: 8K hard / 6K soft for narrative, 4K hard for classifier; Must/Should/Nice trimming tiers spelled out in §3.2.
- OOC marker locked to ((double parens)), configurable.
- All runtime data lives under <repo>/data/ (DB, backups, snapshots, exports, config). Tree is gitignored. CHAT_DB_PATH env var honored.
CLAUDE.md and the requirements doc updated to match. Decisions log in the requirements doc appendix extended with the new locks (#17–21).
Roleplay Engine
Local-first roleplay chat app that treats fiction as a simulation, not a chat log. The LLM is a renderer for structured world state — it does not hold state.
See rp-engine-design.md for the architectural design and docs/plans/2026-04-26-v1-requirements-design.md for the v1 product requirements & behavioral spec. This file is the working summary.
Why this exists
Fixes three failure modes of conventional RP chatbots:
- Memory loss — old context drops as history grows
- Quality decay — bots get terse and generic over long conversations
- Stale state pollution — bots fixate on past props (the "picnic basket" problem)
Hard scope constraints
- Single user, single machine (the user's Mac)
- Max 3 entities per scene: you + up to 2 bots (botA, botB)
- Chat-only: no voice, no real-time
The 3-entity cap is load-bearing: it makes the relationship graph fully enumerable (6 directed edges + 1 group node). Don't design for N entities.
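As a quick check on that count, the edge set can be enumerated directly. This is a minimal sketch; the names ENTITIES, DIRECTED_EDGES and GROUP_NODE are illustrative and not taken from the codebase.

```python
from itertools import permutations

ENTITIES = ("you", "botA", "botB")  # the hard 3-entity cap

# Every ordered pair (source, target) with source != target is one directed edge.
DIRECTED_EDGES = list(permutations(ENTITIES, 2))

# The single group node covers all three entities at once.
GROUP_NODE = frozenset(ENTITIES)

for source, target in DIRECTED_EDGES:
    print(f"{source} -> {target}")
print(len(DIRECTED_EDGES), "directed edges + 1 group node")
```

With N entities the edge count grows as N * (N - 1), which is why the cap keeps the graph small enough to load in full.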
Architecture
- Mac (always-on): web UI, orchestrator, persistence, event queue, retrieval, prompt construction, all state.
- Inference endpoint: stateless generate(prompt, params) -> text. Swap implementations behind one interface. The orchestrator never knows which.
- Streaming required for UX.
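One way to express the single-interface rule is a typing Protocol. The sketch below assumes that streaming is modelled as an iterator of text chunks. InferenceBackend, EchoBackend, run_turn and the max_words parameter are hypothetical names used only for illustration.

```python
from typing import Iterator, Protocol


class InferenceBackend(Protocol):
    """Hypothetical interface: stateless, prompt in, text chunks out."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        ...


class EchoBackend:
    """Stand-in backend for local testing; yields the prompt back word by word."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        limit = params.get("max_words")  # None means no limit
        for word in prompt.split()[:limit]:
            yield word + " "


def run_turn(backend: InferenceBackend, prompt: str) -> str:
    # The orchestrator only sees the interface, never the concrete backend.
    return "".join(backend.generate(prompt, {"max_words": 3}))
```

A real backend would wrap the hosted API behind the same generate signature, so swapping it requires no orchestrator changes.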
Runtime stack (locked for v1)
- Backend: Python 3.11+ with FastAPI.
- Frontend: server-rendered HTML + HTMX + minimal vanilla JS/CSS. No JS build chain.
- Live updates: SSE per chat. Per-chat asyncio.Queue pub/sub. Multi-tab sync is a Phase 1 requirement: two browser tabs on the same chat must mirror each other live (streamed tokens, drawer state, edge updates).
- Inference backend: Featherless (OpenAI-compatible API).
  - narrative_model = dphn/Dolphin-Mistral-24B-Venice-Edition (32K ctx, uncensored).
  - classifier_model = NousResearch/Hermes-3-Llama-3.1-8B (128K ctx, uncensored, structured-output reliable). Fallbacks: cognitivecomputations/dolphin-2.9.4-llama3-8b → mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated.
- Token budgets: narrative 8K hard / 6K soft; classifier 4K hard. Trim tiers must / should / nice — never trim must-include.
- OOC marker: ((double parens)) (configurable).
- Data layout: everything under <repo>/data/: chat.db, backups/, snapshots/, exports/, config.toml. The whole tree is gitignored. CHAT_DB_PATH env var honored as override.
- Auth: bind to 127.0.0.1 only in v1. No auth.
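The per-chat asyncio.Queue pub/sub could look roughly like the sketch below, where each open tab holds its own queue and every published message is copied to all of them. ChatHub, sse_frame and demo are hypothetical names; the FastAPI route that would drain a queue into a streaming response is left out.

```python
import asyncio
from collections import defaultdict


class ChatHub:
    """Hypothetical per-chat fan-out: one asyncio.Queue per connected tab."""

    def __init__(self) -> None:
        self._subscribers: dict[str, set[asyncio.Queue]] = defaultdict(set)

    def subscribe(self, chat_id: str) -> asyncio.Queue:
        queue: asyncio.Queue = asyncio.Queue()
        self._subscribers[chat_id].add(queue)
        return queue

    def unsubscribe(self, chat_id: str, queue: asyncio.Queue) -> None:
        self._subscribers[chat_id].discard(queue)

    def publish(self, chat_id: str, message: str) -> None:
        # Every tab on this chat gets its own copy of the message.
        for queue in self._subscribers.get(chat_id, ()):
            queue.put_nowait(message)


def sse_frame(event: str, data: str) -> str:
    """Format one Server-Sent Events frame (single-line data only)."""
    return f"event: {event}\ndata: {data}\n\n"


async def demo() -> list[str]:
    hub = ChatHub()
    tab_one = hub.subscribe("chat-1")
    tab_two = hub.subscribe("chat-1")
    hub.publish("chat-1", sse_frame("token", "Hello"))
    return [await tab_one.get(), await tab_two.get()]
```

Running asyncio.run(demo()) returns the same frame twice, once per tab, which is the mirroring behaviour the multi-tab requirement asks for.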
Core concepts (vocabulary)
- Entity: you | botA | botB. Has identity (immutable), state (mood/goals/status), activity, per-POV memory.
- Container: anything with slots that holds entities (car, booth, room). Has properties (moving, public, audible range). Spatial grounding lives here, separate from the relationship graph.
- Activity record: per-entity live struct — position (container+slot), posture, current action (verb, duration, interruptible, required_attention), holding, attention, status. Always in the prompt as a small structured block.
- Relationship graph: 6 directed edges (asymmetric feelings matter — never collapse to a single shared field) + 1 group node. Edges hold affinity, trust, summary, knowledge-known-about-target, private moments, last-interaction.
- Scene configurations: exactly 4 — solo with botA, solo with botB, all three present, botA+botB without you ("meanwhile…"). Each has a fixed prompt-loading rule.
- Witnessed-by flag: every memory has a 3-bit [you, botA, botB] mask. A speaker only sees memories where their bit is set. This is the mechanism that prevents bots referencing things they can't know.
- Event: scoped lifecycle (planned | active | completed | cancelled | expired) with its own props, preconditions, on_start/on_complete hooks, significance. Solves the picnic-basket problem: props live and die with the event, only narrative gist promotes to memory.
- Active threads: unresolved plot tensions. Sticky in context until resolved/dropped. Cheap, anchor continuity across compressed scenes.
- Scene: closes when container changes meaningfully or significant time passes. Compression boundary.
- Per-POV summary: every witness gets their own record of a closed scene, written from their POV. Different details, different interpretations. This is what gives bots inner lives — never write omniscient narration into per-POV stores.
- Time skip: elision (skip the boring middle of an in-progress activity) vs jump (next morning, a week later). Skips run intervening events forward, compress, reset landing activity.
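The witnessed-by mask from the list above can be sketched as a plain bit filter. The bit assignment in WITNESS_BIT is an assumption (the source fixes only the [you, botA, botB] order), and Memory, witness_mask and visible_to are illustrative names.

```python
from dataclasses import dataclass

# Bit positions are an assumption; the source only fixes the order [you, botA, botB].
WITNESS_BIT = {"you": 0b100, "botA": 0b010, "botB": 0b001}


@dataclass(frozen=True)
class Memory:
    text: str
    witnessed_by: int  # 3-bit mask over [you, botA, botB]


def witness_mask(*entities: str) -> int:
    mask = 0
    for entity in entities:
        mask |= WITNESS_BIT[entity]
    return mask


def visible_to(speaker: str, memories: list[Memory]) -> list[Memory]:
    """Keep only memories whose mask has the speaker's bit set."""
    bit = WITNESS_BIT[speaker]
    return [m for m in memories if m.witnessed_by & bit]


memories = [
    Memory("You told botA a secret in the car.", witness_mask("you", "botA")),
    Memory("All three argued at the booth.", witness_mask("you", "botA", "botB")),
]
```

Here visible_to("botB", memories) returns only the booth argument, so botB has no way to reference the secret.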
What promotes out of an event (and what doesn't)
- Object acquired → inventory
- Knowledge gained → edge knowledge field
- Relationship change → edge summary
- Everything else stays in the closed event record. The blanket, the basket, the specific sandwich do not become memories. This rule is the whole point — don't bypass it.
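A minimal sketch of that rule follows, assuming a closed event carries its promotable outcomes in separate fields from its props. ClosedEvent, its field names and the picnic data are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ClosedEvent:
    """Hypothetical shape of an event record at close time."""
    name: str
    props: list[str] = field(default_factory=list)  # die with the event
    objects_acquired: list[str] = field(default_factory=list)
    # (source, target, fact) and (source, target, note) tuples
    knowledge_gained: list[tuple[str, str, str]] = field(default_factory=list)
    relationship_changes: list[tuple[str, str, str]] = field(default_factory=list)


def promote(event: ClosedEvent) -> dict[str, list]:
    """Return only what outlives the event; props are deliberately not read."""
    return {
        "inventory": list(event.objects_acquired),
        "edge_knowledge": list(event.knowledge_gained),
        "edge_summary": list(event.relationship_changes),
    }


picnic = ClosedEvent(
    name="picnic",
    props=["blanket", "basket", "sandwich"],
    objects_acquired=["pressed flower"],
    knowledge_gained=[("botA", "you", "is allergic to bees")],
)
promoted = promote(picnic)
```

Because promote never touches event.props, the basket cannot leak into memory even by accident.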
Persistence
- SQLite (single file) for everything structured. WAL mode, foreign keys on, each turn in a transaction.
- sqlite-vss or sqlite-vec for embeddings (same DB file). Decide at Phase 4.
- JSON for snapshots, character templates, scene exports.
- No Postgres, Redis, Pinecone, Docker. Single-user; don't over-engineer.
Schema is event-sourced. See design doc § "Persistence Layer" for the full sketch.
Event sourcing — non-negotiable
State is a projection of an append-only event log. State is never mutated directly — append an event, the projector applies it.
Event kinds: user_turn, assistant_turn, time_skip, event_triggered, edge_update, scene_transition, entity_state_change, activity_change.
This buys: free rewind, trivial replay-debugging, schema migrations against the same log, branching ("what if BotA had said yes").
Determinism on replay: LLM calls are nondeterministic. Store the outcome in the event payload — on replay, use the stored outcome. Never re-call the LLM during replay.
Snapshots every N events / M minutes so we don't replay everything on load. Log is source of truth.
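The two rules above (stored outcomes on replay, snapshots as a shortcut) can be sketched together. All function names and the snapshot shape here are assumptions, and fake_llm stands in for the real model call.

```python
import copy
from typing import Callable


def record_assistant_turn(log: list[dict], prompt: str, call_llm: Callable[[str], str]) -> None:
    """Live path: call the model once and store its output in the event payload."""
    outcome = call_llm(prompt)
    log.append({"kind": "assistant_turn", "payload": {"prompt": prompt, "outcome": outcome}})


def apply(state: dict, event: dict) -> dict:
    """Replay path: read the stored outcome; no model call happens here."""
    if event["kind"] == "assistant_turn":
        state["transcript"].append(event["payload"]["outcome"])
    return state


def take_snapshot(log: list[dict]) -> dict:
    state = {"transcript": []}
    for event in log:
        apply(state, event)
    return {"event_count": len(log), "state": state}


def load(snapshot: dict, log: list[dict]) -> dict:
    """Start from the snapshot and replay only the events recorded after it."""
    state = copy.deepcopy(snapshot["state"])
    for event in log[snapshot["event_count"]:]:
        apply(state, event)
    return state


calls: list[str] = []


def fake_llm(prompt: str) -> str:
    calls.append(prompt)
    return f"reply to {prompt}"


log: list[dict] = []
record_assistant_turn(log, "p1", fake_llm)
snapshot = take_snapshot(log)
record_assistant_turn(log, "p2", fake_llm)
state = load(snapshot, log)
```

After load, calls still holds only the two live prompts, showing that replay never reached the model.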
Prompt construction
A speaker's prompt is assembled from their edges and their witnessed memories — never the global state. BotA and BotB are effectively two separate agents who happen to share a scene.
Order (for speaker BotA, with you and BotB present):
- BotA identity + current state
- BotA → You edge
- BotA → BotB edge
- Group node (only if all three present)
- World state (time, weather, location)
- Active scene description
- Activity snapshot for all present entities
- Active threads
- Recent dialogue window
- Retrieved memories (top-K, witness-filtered, BotA-owned)
- Currently active events + their props
After every utterance, run a state-update pass on every present entity, not just the speaker. Silent witnesses still update edges.
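Prompt assembly under the must / should / nice tiers might be sketched as follows. The admission policy (should sections fill up to the hard budget, nice sections only up to the soft budget), the tier assigned to each section and the whitespace token count are all assumptions, since the trimming rules themselves are specified elsewhere.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Section:
    name: str
    text: str
    tier: str  # "must" | "should" | "nice"


def count_tokens(text: str) -> int:
    # Crude stand-in: whitespace words. A real tokenizer would replace this.
    return len(text.split())


def assemble(sections: list[Section], soft: int, hard: int) -> list[Section]:
    """Keep every must section; admit should up to `hard`, nice up to `soft`."""
    used = sum(count_tokens(s.text) for s in sections if s.tier == "must")
    if used > hard:
        raise ValueError("must-include sections alone exceed the hard budget")
    kept = {s.name for s in sections if s.tier == "must"}
    for tier, limit in (("should", hard), ("nice", soft)):
        for section in sections:
            cost = count_tokens(section.text)
            if section.tier == tier and used + cost <= limit:
                kept.add(section.name)
                used += cost
    # Preserve the original prompt order for whatever survived trimming.
    return [s for s in sections if s.name in kept]


sections = [
    Section("identity", "BotA is wry and guarded.", "must"),
    Section("activity", "botA: seated, car, passenger slot", "must"),
    Section("dialogue", "you: ready? botA: almost.", "should"),
    Section("memories", "Last week you argued about the map.", "nice"),
]
trimmed = assemble(sections, soft=15, hard=20)
```

With these toy budgets the retrieved memories are dropped while identity, activity and dialogue survive in their original order.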
Memory retrieval
- Always-loaded: pinned, current scene, active threads, recent N scenes (no retrieval).
- Retrieved: top-K vector search over the speaker's memory store, filtered by witness flag, with recency + significance boosts.
- Keep K small. Bloated retrieval poisons the prompt.
- Phase 1: SQLite FTS5 is enough. Vector search comes at Phase 4.
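The retrieval step can be illustrated without committing to FTS5 or a vector store: filter by witness first, then rank by a base match score plus recency and significance boosts. The weights, the recency formula and the Candidate fields are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    similarity: float    # from FTS5 or vector search, assumed normalised to 0..1
    age_scenes: int      # scenes since the memory was written
    significance: float  # 0..1
    witnessed_by: frozenset[str]


def score(c: Candidate, recency_weight: float = 0.3, significance_weight: float = 0.3) -> float:
    recency = 1.0 / (1.0 + c.age_scenes)  # newer memories score closer to 1
    return c.similarity + recency_weight * recency + significance_weight * c.significance


def retrieve(speaker: str, candidates: list[Candidate], k: int = 3) -> list[Candidate]:
    visible = [c for c in candidates if speaker in c.witnessed_by]  # witness filter first
    return sorted(visible, key=score, reverse=True)[:k]


candidates = [
    Candidate("secret", 0.9, 0, 1.0, frozenset({"you", "botB"})),
    Candidate("argument", 0.5, 0, 0.5, frozenset({"you", "botA"})),
    Candidate("weather chat", 0.6, 9, 0.0, frozenset({"you", "botA", "botB"})),
]
top = retrieve("botA", candidates, k=2)
```

The highest-scoring candidate is excluded for botA because botA did not witness it, and the recent, significant argument outranks the older small talk despite a lower raw match score.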
Implementation phases
- Phase 1 (core loop): schema, entities + edges, single container, event log + projector, single-bot conversation, one LLM backend, streaming UI, manual rollback.
- Phase 2 (multi-entity): second bot, group node, scene configs, witness filtering, per-POV memories, activity/containers, scene transitions with compression.
- Phase 3 (events & skips): event queue with triggers, time skips, active threads, significance classifier.
- Phase 4 (polish): vector retrieval, branching, surgical delete + regenerate, snapshots, backups, impact-preview UI for rewinds.
Don't jump phases. Phase 1 must work end-to-end before Phase 2 lands.
Conventions for working in this repo
- Don't bypass the event log. Any state change goes through an event. If you're tempted to UPDATE a row directly, you're doing it wrong.
- Don't collapse directed edges. botA → botB and botB → botA are independent. Asymmetry is the point.
- Don't promote event props to memory. Only the promotion categories listed above (inventory, edge knowledge, edge summary) survive an event closing.
- Per-POV, not omniscient. When writing scene summaries, write one per witness, from their angle.
- Witness filter every memory read. A bot must never see a memory their bit isn't set on.
- Activity block is always in the prompt. It's the spatial anchor that prevents "leaning on the kitchen counter while in a car" failures.
- Streaming on the inference path; non-blocking bookkeeping (significance classification, embeddings, snapshots) runs while the LLM streams.
- No Docker, no extra services. SQLite + a process. Push back on suggestions to add infrastructure.
Open decisions (deferred — don't pre-decide)
- Token budget strategy (during Phase 1, with real prompts)
- Embedding model (Phase 4)
- sqlite-vss vs sqlite-vec (Phase 4)
- UI framework (local web app / Tauri / Electron / native, TBD)
- Inference hosting (start with a cloud API, re-evaluate later)
- Character template format (during Phase 1)
- Multi-session / multi-character casts: out of scope for v1. Leave cheap schema hooks only.