chat/CLAUDE.md
Joseph Doherty 365dacc0d0 chore: post-Phase-1 cleanup — gitignore, packaging, backlog
- .gitignore: add *.egg-info/ so editable installs don't show in git status.
- pyproject.toml: add [build-system] and [tool.setuptools.packages.find]
  scoped to chat*, fixing pip install -e . which was failing on data/
  auto-discovery.
- CLAUDE.md: add Phase 1.5 cleanup backlog section under Phase 1 status,
  capturing the small follow-ups surfaced in implementer reviews
  (open_db refactor, regenerate SSE broadcast, you-activity purge,
  drawer edits for deferred fields, NICE trim order).
2026-04-26 14:39:10 -04:00


Roleplay Engine

Local-first roleplay chat app that treats fiction as a simulation, not a chat log. The LLM is a renderer for structured world state — it does not hold state.

See rp-engine-design.md for the architectural design and docs/plans/2026-04-26-v1-requirements-design.md for the v1 product requirements & behavioral spec. This file is the working summary.

Why this exists

Fixes three failure modes of conventional RP chatbots:

  1. Memory loss — old context drops as history grows
  2. Quality decay — bots get terse and generic over long conversations
  3. Stale state pollution — bots fixate on past props (the "picnic basket" problem)

Hard scope constraints

  • Single user, single machine (the user's Mac)
  • Max 3 entities per scene: you + up to 2 bots (botA, botB)
  • Chat-only — no voice, no real-time

The 3-entity cap is load-bearing: it makes the relationship graph fully enumerable (6 directed edges + 1 group node). Don't design for N entities.
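
As a quick check of that count, the edge set can be enumerated directly. This is a minimal sketch; the constant names are illustrative and not taken from the repo.

```python
from itertools import permutations

ENTITIES = ("you", "botA", "botB")

# Every ordered pair of distinct entities is one directed edge.
DIRECTED_EDGES = list(permutations(ENTITIES, 2))

# The single group node covers all three entities together.
GROUP_NODE = frozenset(ENTITIES)

for source, target in DIRECTED_EDGES:
    print(f"{source} -> {target}")
```

Because the set is this small, every edge can be stored and loaded explicitly, with no general graph machinery.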

Architecture

  • Mac (always-on): web UI, orchestrator, persistence, event queue, retrieval, prompt construction, all state.
  • Inference endpoint: stateless generate(prompt, params) -> text. Swap implementations behind one interface. The orchestrator never knows which.
  • Streaming required for UX.
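
One way to picture the single-interface rule is a small protocol that every backend satisfies. This is a sketch under assumptions: `InferenceBackend`, `EchoBackend`, and `run_turn` are hypothetical names, and returning an iterator of text chunks is one possible way to reconcile `generate(prompt, params) -> text` with the streaming requirement.

```python
from typing import Iterator, Protocol


class InferenceBackend(Protocol):
    """Hypothetical interface: stateless, prompt in, text chunks out."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        ...


class EchoBackend:
    """Stand-in backend for tests: streams the prompt back word by word."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        max_tokens = params.get("max_tokens", 16)
        for word in prompt.split()[:max_tokens]:
            yield word + " "


def run_turn(backend: InferenceBackend, prompt: str) -> str:
    """Orchestrator side: consumes chunks without knowing which backend it has."""
    chunks = []
    for chunk in backend.generate(prompt, {"max_tokens": 3}):
        chunks.append(chunk)  # a real loop would also push each chunk to the UI
    return "".join(chunks)


print(run_turn(EchoBackend(), "the rain has stopped outside"))
```

Swapping in a real backend would mean adding another class with the same `generate` method; `run_turn` would stay unchanged.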

Runtime stack (locked for v1)

  • Backend: Python 3.11+ with FastAPI.
  • Frontend: server-rendered HTML + HTMX + minimal vanilla JS/CSS. No JS build chain.
  • Live updates: SSE per chat. Per-chat asyncio.Queue pub/sub. Multi-tab sync is a Phase 1 requirement — two browser tabs on the same chat must mirror each other live (streamed tokens, drawer state, edge updates).
  • Inference backend: Featherless (OpenAI-compatible API).
    • narrative_model = dphn/Dolphin-Mistral-24B-Venice-Edition (32K ctx, uncensored).
    • classifier_model = NousResearch/Hermes-3-Llama-3.1-8B (128K ctx, uncensored, structured-output reliable). Fallbacks: cognitivecomputations/dolphin-2.9.4-llama3-8b, mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated.
  • Token budgets: narrative 8K hard / 6K soft; classifier 4K hard. Trim tiers must / should / nice — never trim must-include.
  • OOC marker: ((double parens)) (configurable).
  • Data layout: everything under <repo>/data/: chat.db, backups/, snapshots/, exports/, config.toml. The whole tree is .gitignored. The CHAT_DB_PATH env var is honored as an override.
  • Auth: bind to 127.0.0.1 only in v1. No auth.
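
The per-chat pub/sub behind multi-tab sync could look roughly like the sketch below. `ChatBroker` and its methods are hypothetical names, not the repo's API. The sketch assumes one `asyncio.Queue` per connected tab, with the SSE endpoint (not shown) draining its own queue.

```python
import asyncio
from collections import defaultdict


class ChatBroker:
    """Hypothetical per-chat fan-out: one asyncio.Queue per connected tab."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(set)  # chat_id -> set of queues

    def subscribe(self, chat_id: str) -> asyncio.Queue:
        queue = asyncio.Queue()
        self._subscribers[chat_id].add(queue)
        return queue

    def unsubscribe(self, chat_id: str, queue: asyncio.Queue) -> None:
        self._subscribers[chat_id].discard(queue)
        if not self._subscribers[chat_id]:
            del self._subscribers[chat_id]

    def publish(self, chat_id: str, event: str, data: str) -> int:
        """Copy the message to every tab on this chat; return the tab count."""
        queues = self._subscribers.get(chat_id, ())
        for queue in queues:
            queue.put_nowait((event, data))
        return len(queues)


async def demo() -> list:
    broker = ChatBroker()
    tab_one = broker.subscribe("chat-1")
    tab_two = broker.subscribe("chat-1")
    other_chat = broker.subscribe("chat-2")
    broker.publish("chat-1", "token", "Hello")
    # Both tabs on chat-1 receive the token; the other chat sees nothing.
    return [await tab_one.get(), await tab_two.get(), other_chat.empty()]


print(asyncio.run(demo()))
```

Each tab gets its own copy of every message, so a slow or closed tab does not hold up the others.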

Behavioral defaults (locked in v1 brainstorm round 2)

  • Significance scale: 0=Routine, 1=Notable, 2=Significant, 3=Pivotal. Score-3 turns auto-pin per witness. Drives retrieval ranking, compression, JSON exports.
  • Edge updates: per-turn deltas (affinity_delta, trust_delta, knowledge_facts, last_interaction); per-scene-close summary rewrite. Every mutation goes through the event log as edge_update.
  • Classifier failure handling: Pydantic-constrained → 1 retry with stricter reminder → schema-default fallback. 10s timeout. Never block the play loop. Refusals trigger fallback-model swap for that one call. Failures logged to classifier_failures table.
  • Activity verbs: open string + classifier-extracted interruptible, required_attention, expected_duration. Attention is optional free-form; omit from prompt when empty.
  • Containers: parse-and-extend. Per-chat scoped. Kickoff parse seeds initial; transitions create new.
  • Pinning: soft cap 8 / bot. Pivotal (score 3) = auto-pin. Manual pins never auto-evicted.
  • Snapshots: periodic every 100 events / 30 min; pre-rewind always. 5 periodic retained; pre-rewind retained 14 days.
  • Streaming: Stop button on streaming row; mid-stream disconnect commits partial with truncated: true; Send disabled mid-stream; multi-tab streaming via per-chat SSE channel.
  • Display: lightweight markdown; *action* italic; OOC ((parens)) shown dimmed/italic, never sent to bot.
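
The classifier failure chain above (validate, one stricter retry, schema-default fallback, timeout) can be sketched as follows. Several things here are assumptions: a hand-rolled `parse_significance` stands in for the Pydantic model to keep the example stdlib-only, the 10s timeout is applied per attempt, and the reminder text and function names are made up. The refusal-triggered model swap and the classifier_failures table write are left out.

```python
import asyncio
import json

SCHEMA_DEFAULT = {"significance": 0}  # assumed schema default: 0 = Routine
STRICT_REMINDER = '\nReturn only JSON like {"significance": 0}.'  # hypothetical wording


def parse_significance(raw: str) -> dict:
    """Stand-in for the Pydantic model: accept only an int score in 0..3."""
    data = json.loads(raw)
    score = data["significance"]
    if isinstance(score, bool) or not isinstance(score, int) or not 0 <= score <= 3:
        raise ValueError(f"significance out of range: {score!r}")
    return {"significance": score}


async def classify(call_model, prompt: str, timeout: float = 10.0):
    """Return (result, failures). Never raises, so the play loop is never blocked."""
    failures = []
    for attempt_prompt in (prompt, prompt + STRICT_REMINDER):  # first try + 1 retry
        try:
            raw = await asyncio.wait_for(call_model(attempt_prompt), timeout)
            return parse_significance(raw), failures
        except (asyncio.TimeoutError, ValueError, KeyError, TypeError) as exc:
            failures.append(type(exc).__name__)  # a real version would log a row
    return dict(SCHEMA_DEFAULT), failures
```

`call_model` is any async callable that takes a prompt and returns the raw model text, so tests can pass in a stub.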

Core concepts (vocabulary)

  • Entity: you | botA | botB. Has identity (immutable), state (mood/goals/status), activity, per-POV memory.
  • Container: anything with slots that holds entities (car, booth, room). Has properties (moving, public, audible range). Spatial grounding lives here, separate from the relationship graph.
  • Activity record: per-entity live struct — position (container+slot), posture, current action (verb, duration, interruptible, required_attention), holding, attention, status. Always in the prompt as a small structured block.
  • Relationship graph: 6 directed edges (asymmetric feelings matter — never collapse to a single shared field) + 1 group node. Edges hold affinity, trust, summary, knowledge-known-about-target, private moments, last-interaction.
  • Scene configurations: exactly 4 — solo with botA, solo with botB, all three present, botA+botB without you ("meanwhile…"). Each has a fixed prompt-loading rule.
  • Witnessed-by flag: every memory has a 3-bit [you, botA, botB] mask. A speaker only sees memories where their bit is set. This is the mechanism that prevents bots referencing things they can't know.
  • Event: scoped lifecycle (planned | active | completed | cancelled | expired) with its own props, preconditions, on_start/on_complete hooks, significance. Solves the picnic-basket problem — props live and die with the event, only narrative gist promotes to memory.
  • Active threads: unresolved plot tensions. Sticky in context until resolved/dropped. Cheap, anchor continuity across compressed scenes.
  • Scene: closes when container changes meaningfully or significant time passes. Compression boundary.
  • Per-POV summary: every witness gets their own record of a closed scene, written from their POV. Different details, different interpretations. This is what gives bots inner lives — never write omniscient narration into per-POV stores.
  • Time skip: elision (skip the boring middle of an in-progress activity) vs jump (next morning, a week later). Skips run intervening events forward, compress, reset landing activity.
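
The witnessed-by mask maps naturally onto a bit flag. The sketch below is illustrative only: the bit assignment, the `Witness` enum, and the memory dict shape are assumptions, not the stored schema.

```python
from enum import IntFlag


class Witness(IntFlag):
    # Bit assignment is an assumption for illustration.
    YOU = 0b001
    BOT_A = 0b010
    BOT_B = 0b100


def visible_to(memories, speaker: Witness):
    """Keep only memories whose witnessed-by mask has the speaker's bit set."""
    return [m for m in memories if m["witnessed_by"] & speaker]


memories = [
    {"text": "argument in the car", "witnessed_by": Witness.YOU | Witness.BOT_A},
    {"text": "botB waiting alone in the booth", "witnessed_by": Witness.BOT_B},
    {"text": "dinner with everyone", "witnessed_by": Witness.YOU | Witness.BOT_A | Witness.BOT_B},
]

for memory in visible_to(memories, Witness.BOT_B):
    print(memory["text"])
```

botB never sees the argument in the car, because the BOT_B bit is not set on that memory.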

What promotes out of an event (and what doesn't)

  • Object acquired → inventory
  • Knowledge gained → edge knowledge field
  • Relationship change → edge summary
  • Everything else stays in the closed event record. The blanket, the basket, the specific sandwich do not become memories. This rule is the whole point — don't bypass it.
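
The promotion rule can be expressed as a small allowlist. This is a sketch with hypothetical names: the outcome kinds and destination labels are not the repo's identifiers.

```python
# Allowlist of outcome kinds that leave a closing event (names are hypothetical).
PROMOTIONS = {
    "object_acquired": "inventory",
    "knowledge_gained": "edge.knowledge",
    "relationship_change": "edge.summary",
}


def promote(outcomes):
    """Split a closing event's outcomes into promoted items and event-only residue."""
    promoted, stays_in_event = [], []
    for outcome in outcomes:
        destination = PROMOTIONS.get(outcome["kind"])
        if destination is None:
            stays_in_event.append(outcome)  # props die with the event
        else:
            promoted.append((destination, outcome["value"]))
    return promoted, stays_in_event


picnic = [
    {"kind": "prop", "value": "picnic basket"},
    {"kind": "object_acquired", "value": "pressed flower"},
    {"kind": "knowledge_gained", "value": "botA is afraid of bees"},
]
promoted, residue = promote(picnic)
print(promoted)
print(residue)
```

Anything not on the allowlist stays in the closed event record by default, so a new prop type cannot leak into memory unless it is added deliberately.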

Persistence

  • SQLite (single file) for everything structured. WAL mode, foreign keys on, each turn in a transaction.
  • sqlite-vss or sqlite-vec for embeddings (same DB file). Decide at Phase 4.
  • JSON for snapshots, character templates, scene exports.
  • No Postgres, Redis, Pinecone, Docker. Single-user; don't over-engineer.
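
A connection helper applying those settings might look like the sketch below. The `open_db(path, *, check_same_thread=True)` signature follows the refactor proposed in the Phase 1.5 backlog; the body is an assumption and may differ from the actual helper.

```python
import sqlite3
from contextlib import contextmanager


@contextmanager
def open_db(path, *, check_same_thread=True):
    """Yield a connection with WAL and foreign keys enabled (sketch only)."""
    conn = sqlite3.connect(path, check_same_thread=check_same_thread)
    try:
        conn.execute("PRAGMA journal_mode=WAL")  # persists in the database file
        conn.execute("PRAGMA foreign_keys=ON")   # must be set on every connection
        yield conn
    finally:
        conn.close()
```

Inside the block, wrapping each turn in `with conn:` gives the one-transaction-per-turn behavior, since the sqlite3 connection context manager commits on success and rolls back on an exception.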

Schema is event-sourced. See design doc § "Persistence Layer" for the full sketch.

Event sourcing — non-negotiable

State is a projection of an append-only event log. State is never mutated directly — append an event, the projector applies it.

Event kinds: user_turn, assistant_turn, time_skip, event_triggered, edge_update, scene_transition, entity_state_change, activity_change.

This buys: free rewind, trivial replay-debugging, schema migrations against the same log, branching ("what if BotA had said yes").

Determinism on replay: LLM calls are nondeterministic. Store the outcome in the event payload — on replay, use the stored outcome. Never re-call the LLM during replay.

Snapshots every N events / M minutes so we don't replay everything on load. Log is source of truth.
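
The append-then-project cycle, and rewind as a rebuild from the log, can be sketched in a few functions. Everything here is illustrative: the table layout and the names `apply_one`, `record`, and `rebuild` are assumptions, only two event kinds are handled, and the real projector and schema live in the repo and design doc.

```python
import json
import sqlite3


def init_schema(conn):
    conn.executescript(
        """
        CREATE TABLE events (
            id INTEGER PRIMARY KEY AUTOINCREMENT,
            kind TEXT NOT NULL,
            payload TEXT NOT NULL
        );
        CREATE TABLE edges (
            source TEXT NOT NULL,
            target TEXT NOT NULL,
            affinity INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (source, target)
        );
        CREATE TABLE turns (event_id INTEGER PRIMARY KEY, text TEXT NOT NULL);
        """
    )


def apply_one(conn, event_id, kind, payload):
    """Projector step: fold a single event into the projection tables."""
    if kind == "edge_update":
        key = (payload["source"], payload["target"])
        conn.execute("INSERT OR IGNORE INTO edges (source, target) VALUES (?, ?)", key)
        conn.execute(
            "UPDATE edges SET affinity = affinity + ? WHERE source = ? AND target = ?",
            (payload["affinity_delta"], *key),
        )
    elif kind == "assistant_turn":
        # The generated text was stored at append time; replay never calls the LLM.
        conn.execute(
            "INSERT INTO turns (event_id, text) VALUES (?, ?)", (event_id, payload["text"])
        )


def record(conn, kind, payload):
    """Append to the log, then project only the new row, in one transaction."""
    with conn:
        cur = conn.execute(
            "INSERT INTO events (kind, payload) VALUES (?, ?)", (kind, json.dumps(payload))
        )
        apply_one(conn, cur.lastrowid, kind, payload)
    return cur.lastrowid


def rebuild(conn, upto=None):
    """Rewind or replay: clear the projection, then re-apply the log up to an event id."""
    with conn:
        conn.execute("DELETE FROM edges")
        conn.execute("DELETE FROM turns")
        rows = conn.execute("SELECT id, kind, payload FROM events ORDER BY id").fetchall()
        for event_id, kind, payload in rows:
            if upto is not None and event_id > upto:
                break
            apply_one(conn, event_id, kind, json.loads(payload))
```

In this sketch a rewind only rebuilds the projection to an earlier point and leaves the log untouched. Whether rewound events are truncated, tombstoned, or branched is a separate design decision.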

Prompt construction

A speaker's prompt is assembled from their edges and their witnessed memories — never the global state. BotA and BotB are effectively two separate agents who happen to share a scene.

Order (for speaker BotA, with you and BotB present):

  1. BotA identity + current state
  2. BotA → You edge
  3. BotA → BotB edge
  4. Group node (only if all three present)
  5. World state (time, weather, location)
  6. Active scene description
  7. Activity snapshot for all present entities
  8. Active threads
  9. Recent dialogue window
  10. Retrieved memories (top-K, witness-filtered, BotA-owned)
  11. Currently active events + their props

After every utterance, run a state-update pass on every present entity, not just the speaker. Silent witnesses still update edges.
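
The section order above can be sketched as a function that only reads the speaker's own edges. `build_prompt_sections` and the shape of `state` are hypothetical, and `state["memories"]` is assumed to be already witness-filtered for the speaker.

```python
def build_prompt_sections(speaker, present, state):
    """Return (label, content) pairs in prompt order for one speaker."""
    others = [entity for entity in present if entity != speaker]
    sections = [("identity", state["identity"][speaker])]
    for other in others:
        # Only edges that start at the speaker; never the reverse direction.
        sections.append((f"edge:{speaker}->{other}", state["edges"][(speaker, other)]))
    if len(present) == 3:
        sections.append(("group", state["group"]))
    for key in ("world", "scene", "activity", "threads", "dialogue", "memories", "events"):
        sections.append((key, state[key]))
    return sections


state = {
    "identity": {"botA": "botA: wry, guarded"},
    "edges": {("botA", "you"): "warm, cautious", ("botA", "botB"): "rivalry"},
    "group": "old friends on a road trip",
    "world": "dusk, light rain",
    "scene": "parked car outside the diner",
    "activity": "botA: driver seat; you: passenger seat; botB: back seat",
    "threads": "who took the letter?",
    "dialogue": "...",
    "memories": "...",
    "events": "...",
}
labels = [label for label, _ in build_prompt_sections("botA", ["you", "botA", "botB"], state)]
print(labels)
```

With only `you` and `botA` present, the botA->botB edge and the group section drop out and the rest of the order is unchanged.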

Memory retrieval

  • Always-loaded: pinned, current scene, active threads, recent N scenes (no retrieval).
  • Retrieved: top-K vector search over the speaker's memory store, filtered by witness flag, with recency + significance boosts.
  • Keep K small. Bloated retrieval poisons the prompt.
  • Phase 1: SQLite FTS5 is enough. Vector search comes at Phase 4.
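
The retrieval step (witness filter, then relevance with recency and significance boosts, then top-K) might be ranked roughly as below. The scoring formula, the weights, and the field names are assumptions chosen for illustration; `relevance` stands in for whatever FTS5 or vector search returns.

```python
def rank_memories(candidates, speaker_bit, now_turn, k=3):
    """Witness-filter, score, and return the ids of the top-k memories."""
    scored = []
    for memory in candidates:
        if not memory["witness_mask"] & speaker_bit:
            continue  # the speaker did not witness this; never eligible
        age = now_turn - memory["turn"]
        score = (
            memory["relevance"]
            + 0.5 / (1 + age)                # recency boost (assumed weight)
            + 0.25 * memory["significance"]  # significance boost (assumed weight)
        )
        scored.append((score, memory["id"]))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [memory_id for _, memory_id in scored[:k]]


candidates = [
    {"id": "m1", "witness_mask": 0b011, "relevance": 0.9, "turn": 10, "significance": 0},
    {"id": "m2", "witness_mask": 0b010, "relevance": 0.5, "turn": 19, "significance": 3},
    {"id": "m3", "witness_mask": 0b101, "relevance": 1.0, "turn": 19, "significance": 3},
    {"id": "m4", "witness_mask": 0b111, "relevance": 0.2, "turn": 1, "significance": 1},
]
print(rank_memories(candidates, speaker_bit=0b010, now_turn=20, k=2))
```

The witness filter runs before scoring, so an unwitnessed memory cannot be ranked in no matter how relevant it is.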

Implementation phases

  1. Core loop: schema, entities + edges, single container, event log + projector, single-bot conversation, one LLM backend, streaming UI, manual rollback.
  2. Multi-entity: second bot, group node, scene configs, witness filtering, per-POV memories, activity/containers, scene transitions with compression.
  3. Events & skips: event queue with triggers, time skips, active threads, significance classifier.
  4. Polish: vector retrieval, branching, surgical delete + regenerate, snapshots, backups, impact-preview UI for rewinds.

Don't jump phases. Phase 1 must work end-to-end before Phase 2 lands.

Conventions for working in this repo

  • Don't bypass the event log. Any state change goes through an event. If you're tempted to UPDATE a row directly, you're doing it wrong.
  • Don't collapse directed edges. botA → botB and botB → botA are independent. Asymmetry is the point.
  • Don't promote event props to memory. Only the promotion categories listed above survive an event closing.
  • Per-POV, not omniscient. When writing scene summaries, write one per witness, from their angle.
  • Witness filter every memory read. A bot must never see a memory their bit isn't set on.
  • Activity block is always in the prompt. It's the spatial anchor that prevents "leaning on the kitchen counter while in a car" failures.
  • Streaming on the inference path; non-blocking bookkeeping (significance classification, embeddings, snapshots) runs while the LLM streams.
  • No Docker, no extra services. SQLite + a process. Push back on suggestions to add infrastructure.

Open decisions (deferred — don't pre-decide)

  • Token budget strategy (during Phase 1, with real prompts)
  • Embedding model (Phase 4)
  • sqlite-vss vs sqlite-vec (Phase 4)
  • UI framework (local web app / Tauri / Electron / native — TBD)
  • Inference hosting (start with a cloud API, re-evaluate later)
  • Character template format (during Phase 1)
  • Multi-session / multi-character casts: out of scope for v1. Leave cheap schema hooks only.

Phase 1 status

Phase 1 shipped end-to-end across 35 tasks (T0–T35). The single-bot core loop is functional: event log + projector, schema + migrations, settings/bot authoring, kickoff confirm, streaming turns, drawer rendering, regenerate/rewind, scene close + per-POV summaries, significance classifier, snapshots/backups, first-run navigation, and friendly 404/500 pages. 168 tests passing.

Deferred to Phase 2: second bot, group node, scene configurations, witness filtering across multi-entity scenes, activity/containers, scene-transition compression. Phase 3: event queue + triggers, time skips, active threads. Phase 4: vector retrieval, branching, surgical delete + regenerate, impact-preview UI.

Known v1 limitations (read before extending)

  • Drawer edits scope: only affinity, significance, and pin can be hand-edited from the drawer. Other v1 fields (knowledge, summary text, traits) are deferred to Phase 1.5.
  • Cold-load snapshot path is wired and unit-tested but rarely exercised in dev — long-running sessions are the only realistic trigger.
  • WAL sidecar files (-wal, -shm) are not captured in nightly backups; the nightly snapshot is a fresh .backup() so this is fine for restore but worth knowing if you copy the db file by hand.
  • HTMX SSE event names may need a version check if you bump the htmx CDN URL in base.html — the swap targets are name-coupled.
  • "You" activity rows can linger after bot_reset (the reset purges the bot's chats and the bot's own activity row but not the "you" row that was associated with those chats). Cosmetic, fixed in Phase 1.5.
  • Projector replay is non-idempotent for plain INSERT events. After appending, call apply_event(conn, event) for the new row only. Calling project(conn) re-runs every handler from scratch and will trip uniqueness constraints or insert duplicate rows.
  • 8-pin auto-cap eviction is FIFO over the auto-pinned set only. Manual pins survive the eviction; this is by design (manual intent > auto-pin signal).
  • Regenerate (T29) does not broadcast turn_html over SSE — the page must refresh to show the regenerated turn. Acceptable for v1 single-tab usage; Phase 1.5 should wire the SSE event.
  • First-run middleware fires only on bare / and /chats. Sub-paths like /chats/<id> and /chats/<id>/drawer pass through (correct: HTMX partials should not page-redirect, and a deep-link to a missing chat should 404, not redirect mid-setup).
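
The pin eviction rule in the list above can be sketched as follows. Two assumptions are worth flagging: the cap is counted against auto-pins only, and pins are kept as an oldest-first list of `(memory_id, kind)` pairs. Neither is confirmed by the implementation, and `add_auto_pin` is a hypothetical name.

```python
def add_auto_pin(pins, memory_id, cap=8):
    """Add an auto-pin, then evict the oldest auto-pins until the cap holds.

    pins is an oldest-first list of (memory_id, kind), kind in {"auto", "manual"}.
    Returns (new_pins, evicted_ids). Manual pins are never evicted.
    """
    pins = pins + [(memory_id, "auto")]
    evicted = []
    while sum(kind == "auto" for _, kind in pins) > cap:
        oldest_auto = next(i for i, (_, kind) in enumerate(pins) if kind == "auto")
        evicted.append(pins.pop(oldest_auto)[0])
    return pins, evicted


pins = [("m0", "manual")] + [(f"a{i}", "auto") for i in range(8)]
pins, evicted = add_auto_pin(pins, "a8")
print(evicted)
```

The manual pin `m0` is older than every auto-pin but is skipped, so the oldest auto-pin is the one evicted.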

Phase 1.5 cleanup backlog

Small follow-ups identified during Phase 1 reviews. Pick up at any time; none are blocking.

  • open_db refactor. chat/web/bots.py:get_conn() duplicates the context-manager body to add check_same_thread=False. Extend open_db(path, *, check_same_thread=True) and have get_conn call it directly — eliminates the duplicated PRAGMA setup and ensures any future PRAGMA tweak only happens in one place.
  • Regenerate broadcasts turn_html over SSE. Currently a refresh is needed (see T29 limitation above). Mirror the broadcast logic from chat/web/turns.py:post_turn after the new assistant_turn lands.
  • bot_reset purges orphaned "you" activity rows (see limitation above). Either delete activity rows by chat-membership or accept the noise indefinitely; the projection-layer fix is one extra DELETE FROM activity WHERE entity_id='you' AND container_id IN (SELECT id FROM containers WHERE chat_id IN (...)) clause inside _apply_bot_reset.
  • Drawer edits for the deferred v1 fields: edge_trust slider, edge_summary textarea, memory pov_summary textarea, knowledge_facts add/remove. The manual_edit projector already supports edge_trust / edge_summary / memory_pov_summary target_kinds — only the routes are missing. Knowledge_facts needs a new dispatch branch.
  • NICE trim order in prompt assembly drops previous-scene first instead of last (T18 review). Greedy-cuts heuristic vs spec listing order; revisit if v1 play surfaces a real regression.
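
For context on the last item, tiered trimming can be sketched as below. This is illustrative only: `trim_sections` is a hypothetical name, word count stands in for a real tokenizer, and dropping in reverse listing order within a tier is an assumption that may match neither the spec nor the current greedy heuristic.

```python
def trim_sections(sections, budget, count_tokens=lambda text: len(text.split())):
    """Drop nice, then should sections until the prompt fits; never drop must.

    sections is a list of (tier, name, text) in prompt order.
    """
    kept = list(sections)

    def total():
        return sum(count_tokens(text) for _, _, text in kept)

    for tier in ("nice", "should"):
        # Assumed drop order: last-listed section in the tier goes first.
        for section in [s for s in reversed(sections) if s[0] == tier]:
            if total() <= budget:
                return kept
            kept.remove(section)
    return kept  # may still exceed the budget: must-include is never trimmed


sections = [
    ("must", "identity", "a b c"),
    ("should", "dialogue", "d e f g"),
    ("nice", "previous_scene", "h i"),
    ("nice", "older_scene", "j k l"),
]
print([name for _, name, _ in trim_sections(sections, budget=9)])
```

Making the drop order an explicit, testable property would let the T18 discrepancy be pinned down with a single assertion.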