Roleplay Engine
Local-first roleplay chat app that treats fiction as a simulation, not a chat log. The LLM is a renderer for structured world state — it does not hold state.
See rp-engine-design.md for the architectural design and docs/plans/2026-04-26-v1-requirements-design.md for the v1 product requirements & behavioral spec. This file is the working summary.
Why this exists
Fixes three failure modes of conventional RP chatbots:
- Memory loss — old context drops as history grows
- Quality decay — bots get terse and generic over long conversations
- Stale state pollution — bots fixate on past props (the "picnic basket" problem)
Hard scope constraints
- Single user, single machine (the user's Mac)
- Max 3 entities per scene: you + up to 2 bots (botA, botB)
- Chat-only: no voice, no real-time
The 3-entity cap is load-bearing: it makes the relationship graph fully enumerable (6 directed edges + 1 group node). Don't design for N entities.
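A minimal sketch of why the cap makes the graph enumerable, assuming the three entity names used in this file. The constant names are illustrative only, not taken from the codebase.

```python
from itertools import permutations

ENTITIES = ("you", "botA", "botB")

# Every ordered pair of distinct entities is one directed edge,
# so botA -> botB and botB -> botA are separate entries.
DIRECTED_EDGES = list(permutations(ENTITIES, 2))

# The single group node covers all three entities at once.
GROUP_NODE = frozenset(ENTITIES)

for source, target in DIRECTED_EDGES:
    print(f"{source} -> {target}")
```

With a fixed set of three, the whole graph can be listed up front rather than discovered at runtime.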
Architecture
- Mac (always-on): web UI, orchestrator, persistence, event queue, retrieval, prompt construction, all state.
- Inference endpoint: stateless generate(prompt, params) -> text. Swap implementations behind one interface. The orchestrator never knows which.
- Streaming required for UX.
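One way the single-interface rule could look, as a sketch. InferenceBackend, EchoBackend, render_turn, and the max_chunks parameter are hypothetical names for illustration. Yielding chunks instead of returning one string is an assumption made here to reflect the streaming requirement.

```python
from typing import Iterator, Protocol


class InferenceBackend(Protocol):
    """Hypothetical interface: stateless, prompt in, text chunks out."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        ...


class EchoBackend:
    """Stand-in backend for tests; a real one would call a remote API."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        max_chunks = params.get("max_chunks", 3)
        for word in prompt.split()[:max_chunks]:
            yield word + " "


def render_turn(backend: InferenceBackend, prompt: str, params: dict) -> str:
    # The orchestrator only sees the interface, never the concrete backend.
    return "".join(backend.generate(prompt, params)).strip()
```

Because the backend holds no state, swapping EchoBackend for a hosted model changes nothing in the caller.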
Runtime stack (locked for v1)
- Backend: Python 3.11+ with FastAPI.
- Frontend: server-rendered HTML + HTMX + minimal vanilla JS/CSS. No JS build chain.
- Live updates: SSE per chat. Per-chat asyncio.Queue pub/sub. Multi-tab sync is a Phase 1 requirement: two browser tabs on the same chat must mirror each other live (streamed tokens, drawer state, edge updates).
- Inference backend: Featherless (OpenAI-compatible API). narrative_model = dphn/Dolphin-Mistral-24B-Venice-Edition (32K ctx, uncensored). classifier_model = NousResearch/Hermes-3-Llama-3.1-8B (128K ctx, uncensored, structured-output reliable). Fallbacks: cognitivecomputations/dolphin-2.9.4-llama3-8b → mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated.
- Token budgets: narrative 8K hard / 6K soft; classifier 4K hard. Trim tiers must / should / nice — never trim must-include.
- OOC marker: ((double parens)) (configurable).
- Data layout: everything under <repo>/data/: chat.db, backups/, snapshots/, exports/, config.toml. The whole tree is .gitignored. CHAT_DB_PATH env var honored as override.
- Auth: bind to 127.0.0.1 only in v1. No auth.
Behavioral defaults (locked in v1 brainstorm round 2)
- Significance scale: 0=Routine, 1=Notable, 2=Significant, 3=Pivotal. Score-3 turns auto-pin per witness. Drives retrieval ranking, compression, JSON exports.
- Edge updates: per-turn deltas (affinity_delta, trust_delta, knowledge_facts, last_interaction); per-scene-close summary rewrite. Every mutation goes through the event log as edge_update.
- Classifier failure handling: Pydantic-constrained → 1 retry with stricter reminder → schema-default fallback. 10s timeout. Never block the play loop. Refusals trigger fallback-model swap for that one call. Failures logged to classifier_failures table.
- Activity verbs: open string + classifier-extracted interruptible, required_attention, expected_duration. Attention is optional free-form; omit from prompt when empty.
- Containers: parse-and-extend. Per-chat scoped. Kickoff parse seeds initial; transitions create new.
- Pinning: soft cap 8 / bot. Pivotal (score 3) = auto-pin. Manual pins never auto-evicted.
- Snapshots: periodic every 100 events / 30 min; pre-rewind always. 5 periodic retained; pre-rewind retained 14 days.
- Streaming: Stop button on streaming row; mid-stream disconnect commits partial with truncated: true; Send disabled mid-stream; multi-tab streaming via per-chat SSE channel.
- Display: lightweight markdown; *action* italic; OOC ((parens)) shown dimmed/italic, never sent to bot.
- Multi-entity defaults (Phase 2): when chat.guest_bot_id is None, behavior matches Phase 1 single-bot 1:1. With a guest, all 3 entities are present in the prompt, witness writes, and state-update fan-out (6 directed pairs).
- Addressee detection: simple substring match (whole-word, case-insensitive) over the user turn's body. If both bot names match or neither does, the host gets the floor.
- Interjection: classifier-driven, conservative bias (default false on classifier failure / refusal / parse error). When the classifier returns true, the addressee speaks first, then the non-addressee may interject in a follow-up turn.
- Per-POV summaries (multi-entity): each present witness with a memory store gets their own per-POV summary on scene close. The summary differs per bot based on persona + their edge to "you". The group node summary is updated alongside.
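The addressee rule in the list above (whole-word, case-insensitive match, with the host as tie-break) can be sketched as follows. detect_addressee and its signature are hypothetical; the sample bot names are invented.

```python
import re


def detect_addressee(body: str, host_name: str, guest_name: str) -> str:
    """Return "host" or "guest" for a user turn (hypothetical helper)."""

    def mentioned(name: str) -> bool:
        # \b gives whole-word matching; re.escape keeps any regex
        # metacharacters in the name literal.
        pattern = rf"\b{re.escape(name)}\b"
        return re.search(pattern, body, re.IGNORECASE) is not None

    host_hit = mentioned(host_name)
    guest_hit = mentioned(guest_name)
    # Only an unambiguous guest mention moves the floor away from the host.
    if guest_hit and not host_hit:
        return "guest"
    return "host"
```

Whole-word matching matters here: with a guest named "Mira", a turn that mentions "Miranda" should not hand her the floor.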
Core concepts (vocabulary)
- Entity: you | botA | botB. Has identity (immutable), state (mood/goals/status), activity, per-POV memory.
- Container: anything with slots that holds entities (car, booth, room). Has properties (moving, public, audible range). Spatial grounding lives here, separate from the relationship graph.
- Activity record: per-entity live struct — position (container+slot), posture, current action (verb, duration, interruptible, required_attention), holding, attention, status. Always in the prompt as a small structured block.
- Relationship graph: 6 directed edges (asymmetric feelings matter — never collapse to a single shared field) + 1 group node. Edges hold affinity, trust, summary, knowledge-known-about-target, private moments, last-interaction.
- Scene configurations: exactly 4 — solo with botA, solo with botB, all three present, botA+botB without you ("meanwhile…"). Each has a fixed prompt-loading rule.
- Witnessed-by flag: every memory has a 3-bit [you, botA, botB] mask. A speaker only sees memories where their bit is set. This is the mechanism that prevents bots referencing things they can't know.
- Event: scoped lifecycle (planned | active | completed | cancelled | expired) with its own props, preconditions, on_start/on_complete hooks, significance. Solves the picnic-basket problem: props live and die with the event, only narrative gist promotes to memory.
- Active threads: unresolved plot tensions. Sticky in context until resolved/dropped. Cheap, anchor continuity across compressed scenes.
- Scene: closes when container changes meaningfully or significant time passes. Compression boundary.
- Per-POV summary: every witness gets their own record of a closed scene, written from their POV. Different details, different interpretations. This is what gives bots inner lives — never write omniscient narration into per-POV stores.
- Time skip: elision (skip the boring middle of an in-progress activity) vs jump (next morning, a week later). Skips run intervening events forward, compress, reset landing activity.
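The witnessed-by mask from the vocabulary above could be modelled with a bit flag, as in this sketch. The Witness and Memory types and the bit assignment are assumptions; the actual schema lives in the design doc.

```python
from dataclasses import dataclass
from enum import IntFlag


class Witness(IntFlag):
    # Assumed bit order for the [you, botA, botB] mask.
    YOU = 0b100
    BOT_A = 0b010
    BOT_B = 0b001


@dataclass(frozen=True)
class Memory:
    text: str
    witnessed_by: Witness


def visible_to(memories: list[Memory], speaker: Witness) -> list[Memory]:
    # A memory is readable only if the speaker's bit is set in its mask.
    return [m for m in memories if m.witnessed_by & speaker]


memories = [
    Memory("Argument in the car", Witness.YOU | Witness.BOT_A),
    Memory("Whispered plan at the booth", Witness.BOT_A | Witness.BOT_B),
    Memory("Dinner with everyone", Witness.YOU | Witness.BOT_A | Witness.BOT_B),
]
bot_b_view = [m.text for m in visible_to(memories, Witness.BOT_B)]
```

Here botB never sees the car argument, so it cannot reference it in a later scene.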
What promotes out of an event (and what doesn't)
- Object acquired → inventory
- Knowledge gained → edge knowledge field
- Relationship change → edge summary
- Everything else stays in the closed event record. The blanket, the basket, the specific sandwich do not become memories. This rule is the whole point — don't bypass it.
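A sketch of the promotion rule, assuming a closing event is a plain dict. close_event and every key name here are hypothetical; only the rule itself (props stay behind) comes from the list above.

```python
def close_event(event: dict) -> dict:
    """Split a closing event into promoted facts and what stays behind."""
    promoted = {
        "inventory": list(event.get("objects_acquired", [])),
        "edge_knowledge": list(event.get("knowledge_gained", [])),
        "edge_summary_notes": list(event.get("relationship_changes", [])),
    }
    # Props are deliberately not copied anywhere: they die with the event.
    archived = {
        "name": event["name"],
        "props": list(event.get("props", [])),
        "status": "completed",
    }
    return {"promoted": promoted, "archived": archived}


picnic = {
    "name": "picnic",
    "props": ["basket", "blanket", "sandwich"],
    "objects_acquired": ["pressed flower"],
    "knowledge_gained": ["botA is afraid of bees"],
    "relationship_changes": ["warmer after the picnic"],
}
result = close_event(picnic)
```

The basket and blanket remain readable in the archived record, but nothing downstream of promoted can surface them again.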
Persistence
- SQLite (single file) for everything structured. WAL mode, foreign keys on, each turn in a transaction.
- sqlite-vss or sqlite-vec for embeddings (same DB file). Decide at Phase 4.
- JSON for snapshots, character templates, scene exports.
- No Postgres, Redis, Pinecone, Docker. Single-user; don't over-engineer.
Schema is event-sourced. See design doc § "Persistence Layer" for the full sketch.
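A minimal sketch of the connection settings listed above (WAL, foreign keys on, one transaction per turn), using only the standard sqlite3 module. The open_connection and record_turn helpers and the two toy tables are illustrative, not the project's schema.

```python
import os
import sqlite3
import tempfile


def open_connection(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # WAL is persistent per database file; foreign_keys is per connection.
    conn.execute("PRAGMA journal_mode=WAL")
    conn.execute("PRAGMA foreign_keys=ON")
    return conn


def record_turn(conn: sqlite3.Connection, chat_id: int, body: str) -> None:
    # "with conn" commits on success and rolls back on any exception.
    with conn:
        conn.execute(
            "INSERT INTO turns (chat_id, body) VALUES (?, ?)", (chat_id, body)
        )


db_path = os.path.join(tempfile.mkdtemp(), "chat.db")
conn = open_connection(db_path)
with conn:
    conn.execute("CREATE TABLE chats (id INTEGER PRIMARY KEY)")
    conn.execute(
        "CREATE TABLE turns ("
        "id INTEGER PRIMARY KEY, "
        "chat_id INTEGER NOT NULL REFERENCES chats(id), "
        "body TEXT NOT NULL)"
    )
    conn.execute("INSERT INTO chats (id) VALUES (1)")
record_turn(conn, 1, "hello")
```

The foreign_keys pragma has to be issued on every new connection, which is one reason to centralise connection setup in a single helper.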
Event sourcing — non-negotiable
State is a projection of an append-only event log. State is never mutated directly — append an event, the projector applies it.
Event kinds: user_turn, assistant_turn, time_skip, event_triggered, edge_update, scene_transition, entity_state_change, activity_change.
This buys: free rewind, trivial replay-debugging, schema migrations against the same log, branching ("what if BotA had said yes").
Determinism on replay: LLM calls are nondeterministic. Store the outcome in the event payload — on replay, use the stored outcome. Never re-call the LLM during replay.
Snapshots every N events / M minutes so we don't replay everything on load. Log is source of truth.
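The append-then-project loop can be sketched as below. The table layout, payload keys, and function names (append_event, replay_state) are assumptions for illustration and are not the project's actual projector. The point shown is that replay reads stored outcomes and never calls a model.

```python
import json
import sqlite3


def append_event(conn: sqlite3.Connection, kind: str, payload: dict) -> int:
    with conn:
        cur = conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)",
            (kind, json.dumps(payload)),
        )
    return cur.lastrowid


def replay_state(conn: sqlite3.Connection, upto_id=None) -> dict:
    """Rebuild state from the log alone; no LLM calls happen here."""
    state = {"turns": [], "affinity": {}}
    rows = conn.execute(
        "SELECT kind, payload FROM events "
        "WHERE (? IS NULL OR id <= ?) ORDER BY id",
        (upto_id, upto_id),
    )
    for kind, raw in rows:
        payload = json.loads(raw)
        if kind in ("user_turn", "assistant_turn"):
            # The assistant text was stored when it was generated.
            state["turns"].append(payload["text"])
        elif kind == "edge_update":
            edge = payload["edge"]
            current = state["affinity"].get(edge, 0)
            state["affinity"][edge] = current + payload["affinity_delta"]
    return state


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE events ("
    "id INTEGER PRIMARY KEY AUTOINCREMENT, "
    "kind TEXT NOT NULL, payload TEXT NOT NULL)"
)
first_id = append_event(conn, "user_turn", {"text": "Hi"})
append_event(conn, "assistant_turn", {"text": "Hello there."})
append_event(conn, "edge_update", {"edge": "botA->you", "affinity_delta": 1})
```

Rewind falls out of the same function: replaying up to an earlier event id yields the state as it was at that point.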
Prompt construction
A speaker's prompt is assembled from their edges and their witnessed memories — never the global state. BotA and BotB are effectively two separate agents who happen to share a scene.
Order (for speaker BotA, with you and BotB present):
- BotA identity + current state
- BotA → You edge
- BotA → BotB edge
- Group node (only if all three present)
- World state (time, weather, location)
- Active scene description
- Activity snapshot for all present entities
- Active threads
- Recent dialogue window
- Retrieved memories (top-K, witness-filtered, BotA-owned)
- Currently active events + their props
After every utterance, run a state-update pass on every present entity, not just the speaker. Silent witnesses still update edges.
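The assembly order above, expressed as a sketch that returns section labels only. prompt_sections is a hypothetical helper; real assembly would also apply the token-budget trim tiers, which are left out here.

```python
def prompt_sections(speaker: str, present: list[str]) -> list[str]:
    """List section labels in assembly order for one speaker (sketch only)."""
    others = [entity for entity in present if entity != speaker]
    sections = [f"{speaker} identity + current state"]
    # Only the speaker's own outgoing edges are loaded, never global state.
    sections += [f"{speaker} -> {other} edge" for other in others]
    if len(present) == 3:
        sections.append("group node")
    sections += [
        "world state",
        "active scene description",
        "activity snapshot: " + ", ".join(present),
        "active threads",
        "recent dialogue window",
        f"retrieved memories (witness-filtered, owned by {speaker})",
        "active events + props",
    ]
    return sections


full_scene = prompt_sections("botA", ["you", "botA", "botB"])
solo_scene = prompt_sections("botA", ["you", "botA"])
```

In the solo case the botA -> botB edge and the group node drop out, which matches the rule that the group node loads only when all three are present.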
Memory retrieval
- Always-loaded: pinned, current scene, active threads, recent N scenes (no retrieval).
- Retrieved: top-K vector search over the speaker's memory store, filtered by witness flag, with recency + significance boosts.
- Keep K small. Bloated retrieval poisons the prompt.
- Phase 1: SQLite FTS5 is enough. Vector search comes at Phase 4.
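A sketch of witness-filtered top-K ranking with recency and significance boosts. The scoring formula and both weights are invented for illustration; this file does not specify them. The similarity score stands in for whatever FTS5 or vector search returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    text: str
    similarity: float   # search score, higher is better (assumed scale)
    scenes_ago: int     # 0 = current scene
    significance: int   # 0..3 on the significance scale
    witnessed: bool     # True if the speaker's witness bit is set


def top_k(candidates, k=3, recency_weight=0.1, significance_weight=0.2):
    # Filter first: unwitnessed memories must never reach the ranking step.
    visible = [c for c in candidates if c.witnessed]

    def score(c: Candidate) -> float:
        recency = recency_weight / (1 + c.scenes_ago)
        return c.similarity + recency + significance_weight * c.significance

    return sorted(visible, key=score, reverse=True)[:k]


candidates = [
    Candidate("recent small talk", 0.5, 0, 0, True),
    Candidate("pivotal confession", 0.4, 9, 3, True),
    Candidate("secret botB never saw", 0.9, 0, 3, False),
    Candidate("notable walk home", 0.3, 1, 1, True),
]
best = [c.text for c in top_k(candidates, k=2)]
```

Filtering before ranking keeps the highest-scoring but unwitnessed memory out of the result entirely, and a small k keeps the prompt lean.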
Implementation phases
- Phase 1 (core loop): schema, entities + edges, single container, event log + projector, single-bot conversation, one LLM backend, streaming UI, manual rollback.
- Phase 2 (multi-entity): second bot, group node, scene configs, witness filtering, per-POV memories, activity/containers, scene transitions with compression.
- Phase 3 (events & skips): event queue with triggers, time skips, active threads, significance classifier.
- Phase 4 (polish): vector retrieval, branching, surgical delete + regenerate, snapshots, backups, impact-preview UI for rewinds.
Don't jump phases. Phase 1 must work end-to-end before Phase 2 lands.
Conventions for working in this repo
- Don't bypass the event log. Any state change goes through an event. If you're tempted to UPDATE a row directly, you're doing it wrong.
- Don't collapse directed edges. botA → botB and botB → botA are independent. Asymmetry is the point.
- Don't promote event props to memory. Only the four promotion categories above survive an event closing.
- Per-POV, not omniscient. When writing scene summaries, write one per witness, from their angle.
- Witness filter every memory read. A bot must never see a memory their bit isn't set on.
- Activity block is always in the prompt. It's the spatial anchor that prevents "leaning on the kitchen counter while in a car" failures.
- Streaming on the inference path; non-blocking bookkeeping (significance classification, embeddings, snapshots) runs while the LLM streams.
- No Docker, no extra services. SQLite + a process. Push back on suggestions to add infrastructure.
Open decisions (deferred — don't pre-decide)
- Token budget strategy (during Phase 1, with real prompts)
- Embedding model (Phase 4)
- sqlite-vss vs sqlite-vec (Phase 4)
- UI framework (local web app / Tauri / Electron / native; TBD)
- Inference hosting (start with a cloud API, re-evaluate later)
- Character template format (during Phase 1)
- Multi-session / multi-character casts: out of scope for v1. Leave cheap schema hooks only.
Phase 1 status
Phase 1 shipped end-to-end across 35 tasks (T0–T35). The single-bot core loop is functional: event log + projector, schema + migrations, settings/bot authoring, kickoff confirm, streaming turns, drawer rendering, regenerate/rewind, scene close + per-POV summaries, significance classifier, snapshots/backups, first-run navigation, and friendly 404/500 pages. 168 tests passing.
Deferred to Phase 2: second bot, group node, scene configurations, witness filtering across multi-entity scenes, activity/containers, scene-transition compression. Phase 3: event queue + triggers, time skips, active threads. Phase 4: vector retrieval, branching, surgical delete + regenerate, impact-preview UI.
Known v1 limitations (read before extending)
- Drawer edits scope: only affinity, significance, and pin can be hand-edited from the drawer. Other v1 fields (knowledge, summary text, traits) are deferred to Phase 1.5.
- Cold-load snapshot path is wired and unit-tested but rarely exercised in dev — long-running sessions are the only realistic trigger.
- WAL sidecar files (-wal, -shm) are not captured in nightly backups; the nightly snapshot is a fresh .backup() so this is fine for restore but worth knowing if you copy the db file by hand.
- HTMX SSE event names may need a version check if you bump the htmx CDN URL in base.html, because the swap targets are name-coupled.
- "You" activity rows can linger after bot_reset (the reset purges the bot's chats and the bot's own activity row but not the "you" row that was associated with those chats). Cosmetic, fixed in Phase 1.5.
- Projector replay is non-idempotent for plain INSERT events. After appending, call apply_event(conn, event) for the new row only; calling project(conn) re-runs every handler from scratch and will trip uniqueness or duplicate inserts.
- 8-pin auto-cap eviction is FIFO over the auto-pinned set only. Manual pins survive the eviction; this is by design (manual intent > auto-pin signal).
- Regenerate (T29) does not broadcast turn_html over SSE; the page must refresh to show the regenerated turn. Acceptable for v1 single-tab usage; Phase 1.5 should wire the SSE event.
- First-run middleware fires only on bare / and /chats. Sub-paths like /chats/<id> and /chats/<id>/drawer pass through (correct: HTMX partials should not page-redirect, and a deep-link to a missing chat should 404, not redirect mid-setup).
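The pin-cap eviction rule in the list above (FIFO over auto-pins, manual pins untouched) might look like this sketch. The Pin type and evict_for_cap are hypothetical, and counting manual pins toward the cap is an assumption made here.

```python
from dataclasses import dataclass

PIN_SOFT_CAP = 8


@dataclass(frozen=True)
class Pin:
    memory_id: int
    manual: bool
    pinned_at: int  # monotonically increasing sequence number


def evict_for_cap(pins: list[Pin], cap: int = PIN_SOFT_CAP) -> list[Pin]:
    """Return the pins that survive; oldest auto-pins are dropped first."""
    overflow = len(pins) - cap
    if overflow <= 0:
        return list(pins)
    auto_oldest_first = sorted(
        (p for p in pins if not p.manual), key=lambda p: p.pinned_at
    )
    evicted = {p.memory_id for p in auto_oldest_first[:overflow]}
    # Manual pins are never candidates, so the cap stays soft.
    return [p for p in pins if p.memory_id not in evicted]


# Three manual pins (ids 1-3) followed by six auto-pins (ids 4-9).
pins = [Pin(i, manual=(i <= 3), pinned_at=i) for i in range(1, 10)]
survivors = [p.memory_id for p in evict_for_cap(pins)]
```

With nine pins and a cap of eight, only the oldest auto-pin (id 4) is dropped. If every pin were manual, nothing would be evicted and the count would stay above the cap.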
Phase 1.5 cleanup backlog
All items shipped — see Phase 2.5 status below.
Phase 2 status
Phase 2 shipped end-to-end across 13 tasks (T36–T48 wave). The multi-entity surface is functional: chats can host a guest bot, the prompt assembly is guest-aware, post-turn fans out across all directed pairs, and scene close writes a per-POV summary per present witness plus a group_node summary.
- Multi-entity scene support: chats can now have a guest bot (you + host + guest). The 3-entity cap holds. New event kinds: guest_added, guest_removed, group_node_initialized, group_node_updated. New table: group_node (members, summary, dynamic, threads).
- Drawer guest UX: add/remove guest from the drawer side panel. The "have they met?" prose seed is parsed by the relationship_seed classifier into inter-bot directed edges (host↔guest).
- Multi-entity turn flow: post_turn assembles narrative with the guest-aware prompt; writes memories for all present bot witnesses; runs state updates for all directed pairs (6 with 3 entities); detects interjections via classifier (default false; the addressee gets the floor first).
- Per-POV scene close summaries: each present witness with a memory store gets their own per-POV summary on close; group_node summary updated alongside.
- Bot reset cascade: resetting a bot now also clears chats.guest_bot_id references in other chats (root-cause fix for stale-guest references after T47).
Phase 2.5 / 3 backlog
All items shipped — see Phase 2.5 status below.
Phase 2.5 status
Phase 2.5 cleanup shipped end-to-end across 8 tasks (T68–T75). Two CLAUDE.md backlogs (Phase 1.5 cleanup, Phase 2.5/3) are now empty; deferred follow-ups discovered during execution are tracked in a new "Phase 2.6 / 3 backlog" section below.
- open_db with check_same_thread parameter (T68): refactored chat/db/connection.py so chat/web/bots.py:get_conn no longer duplicates the PRAGMA setup. Default behavior preserved.
- bot_reset cross-chat cleanup (T69): now purges orphaned "you" activity rows. Note: this also fixed a latent FK constraint crash that was lurking in the projector. activity.container_id is FK-referenced and the prior code would have crashed on any reset of a bot whose chat had a non-NULL container_id "you" activity row. The bug was masked because no prior test seeded such a row.
- LLM-merged group meta-summary (T70): replaces Phase 2 T45's naive concat with a classifier merge call. Falls back to the naive concat on classifier failure.
- prompt.py polish (T71): witness role parametric (host vs guest derived from chat membership); single ACTIVITIES: block with bullet-level trim; NICE trim order kept with documented rationale (greedy cheapest-impact-first beats spec-listing order in practice).
- Drawer polish (T72): deferred v1 edits (edge_trust slider, edge_summary textarea, memory pov_summary textarea, knowledge_facts add/remove) + first-meeting gate (Add-guest form disables prose textarea when host→guest edge already exists; "re-seed anyway" toggle re-enables) + witness flag inline-edit (per-memory checkboxes for [you, host, guest] flags). Two new manual_edit projector branches: edge_knowledge_fact and memory_witness.
- Regenerate polish (T73): regenerate now broadcasts turn_html_replace over SSE (NEW event distinct from turn_html to avoid breaking the existing append-semantic consumer); regenerate covers interjection turns (re-detects + re-streams or supersedes); defensive stale-guest degrade removed.
- Turn-flow polish + addressee service (T74): classifier-based addressee detection (substring helper kept as no-guest fast path); SignificanceJob enqueued for interjection memories; scene-close-on-cancel pinned with comment + regression test (close detection is genuinely user-prose-only); defensive stale-guest degrade removed.
Phase 2.6 / 3 backlog
New follow-ups discovered during Phase 2.5 execution. None are blocking; pick up at any time.
- Frontend handler for turn_html_replace SSE event (from T73.1 review): regenerate's backend broadcast lands, but no live tab swaps the regenerated turn until a JS handler is wired. The existing turn_html event uses HTMX sse-swap to append; turn_html_replace ships JSON with supersedes_id for replacement semantics. Phase 2.6 should wire the JS to swap the prior turn's DOM node in place.
- Cancel/stop hook for in-flight regenerate streams (from T73 review): post_turn registers stream tasks in _in_flight_tasks so the user can stop them. Regenerate doesn't. A user clicking "Stop" mid-regenerate has no cancel hook today.
- DRY: regenerate vs post_turn (from T73 review): recent-dialogue assembly and prior-edges block are duplicated between chat/services/regenerate.py and chat/web/turns.py. Extract to shared helpers analogous to _gather_state_update_inputs.
- Sibling-discovery query optimization (from T73 review): regenerate.py's sibling-assistant-turn lookup scans all non-superseded assistant_turn rows globally. Adding a chat_id predicate via JSON extraction (or a denormalized column) bounds the cost to per-chat scale.
- _witness_role_for defensive coding (from T71 review): helper returns "guest" when host_bot_id is None, which is wrong for Phase-1 chats. Defensive: return "host" if host_bot_id is None or speaker_bot_id == host_bot_id else "guest". Not exercised by current tests; harden as a precaution.
- Confidence type tightening (from T74 review): chat/services/addressee.py::AddresseeDecision.confidence could be typed as Literal["high","medium","low"] for stricter validation. Currently str with a comment.
- Scene-close-on-cancel UX revisit: T74.3 pinned the existing behavior (close fires even on cancel). If real play-testing surfaces a regression, revisit.
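The defensive form proposed for _witness_role_for in the backlog above, written out as a standalone sketch. The parameter order and the int id types are assumptions, and the function is renamed here to avoid implying it matches the real helper's signature.

```python
from typing import Literal, Optional


def witness_role_for(
    speaker_bot_id: int, host_bot_id: Optional[int]
) -> Literal["host", "guest"]:
    # A chat with no recorded host is treated as a single-bot chat,
    # so the only bot present must be the host.
    if host_bot_id is None or speaker_bot_id == host_bot_id:
        return "host"
    return "guest"
```

The three cases worth covering in a regression test are: no host recorded, speaker is the host, and speaker differs from the host.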