Joseph Doherty 70a5ad3ecc docs: add T66-discovered consume_pending_meanwhile_digests backlog item

2026-04-26 21:19:11 -04:00

Roleplay Engine

Local-first roleplay chat app that treats fiction as a simulation, not a chat log. The LLM is a renderer for structured world state — it does not hold state.

See rp-engine-design.md for the architectural design and docs/plans/2026-04-26-v1-requirements-design.md for the v1 product requirements & behavioral spec. This file is the working summary.

Why this exists

Fixes three failure modes of conventional RP chatbots:

Memory loss — old context drops as history grows
Quality decay — bots get terse and generic over long conversations
Stale state pollution — bots fixate on past props (the "picnic basket" problem)

Hard scope constraints

Single user, single machine (the user's Mac)
Max 3 entities per scene: you + up to 2 bots (botA, botB)
Chat-only — no voice, no real-time

The 3-entity cap is load-bearing: it makes the relationship graph fully enumerable (6 directed edges + 1 group node). Don't design for N entities.
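A quick way to see why the cap makes the graph enumerable: with three entities, the directed edges are the ordered pairs of distinct entities. The sketch below is illustrative only; the constant names are assumptions, not identifiers from the codebase.

```python
from itertools import permutations

ENTITIES = ("you", "botA", "botB")

# Every ordered (source, target) pair with source != target.
DIRECTED_EDGES = list(permutations(ENTITIES, 2))

for source, target in DIRECTED_EDGES:
    print(f"{source} -> {target}")
```

Because `("botA", "botB")` and `("botB", "botA")` are separate entries, asymmetric feelings get separate storage by construction.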

Architecture

Mac (always-on): web UI, orchestrator, persistence, event queue, retrieval, prompt construction, all state.
Inference endpoint: stateless generate(prompt, params) -> text. Swap implementations behind one interface. The orchestrator never knows which.
Streaming required for UX.
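One way the "one interface" rule could look in Python is a `typing.Protocol`. This is a minimal sketch under assumptions: `InferenceBackend`, `EchoBackend`, and `run_one_turn` are hypothetical names, and yielding chunks (rather than returning one string) is an assumed way to satisfy the streaming requirement.

```python
from typing import Iterator, Protocol


class InferenceBackend(Protocol):
    """Stateless text generation: everything the model needs is in the prompt."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        """Yield text chunks so the UI can stream them."""
        ...


class EchoBackend:
    """Hypothetical stand-in for tests; it holds no state between calls."""

    def generate(self, prompt: str, params: dict) -> Iterator[str]:
        limit = params.get("max_tokens", 3)
        for word in prompt.split()[:limit]:
            yield word + " "


def run_one_turn(backend: InferenceBackend, prompt: str) -> str:
    # The orchestrator depends only on the interface, never on a concrete backend.
    return "".join(backend.generate(prompt, {"max_tokens": 2}))
```

Swapping Featherless for another provider would then mean writing one more class with the same `generate` method; nothing in the orchestrator changes.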

Runtime stack (locked for v1)

Backend: Python 3.11+ with FastAPI.
Frontend: server-rendered HTML + HTMX + minimal vanilla JS/CSS. No JS build chain.
Live updates: SSE per chat. Per-chat asyncio.Queue pub/sub. Multi-tab sync is a Phase 1 requirement — two browser tabs on the same chat must mirror each other live (streamed tokens, drawer state, edge updates).
Inference backend: Featherless (OpenAI-compatible API).
- narrative_model = dphn/Dolphin-Mistral-24B-Venice-Edition (32K ctx, uncensored).
- classifier_model = NousResearch/Hermes-3-Llama-3.1-8B (128K ctx, uncensored, structured-output reliable). Fallbacks: cognitivecomputations/dolphin-2.9.4-llama3-8b → mlabonne/Meta-Llama-3.1-8B-Instruct-abliterated.
Token budgets: narrative 8K hard / 6K soft; classifier 4K hard. Trim tiers must / should / nice — never trim must-include.
OOC marker: ((double parens)) (configurable).
Data layout: everything under <repo>/data/ — chat.db, backups/, snapshots/, exports/, config.toml. The whole tree is .gitignored. CHAT_DB_PATH env var honored as override.
Auth: bind to 127.0.0.1 only in v1. No auth.
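The must / should / nice trim tiers can be sketched as below. This is an illustration, not the logic in prompt.py: `Block`, `trim_to_budget`, and the "drop later blocks first" ordering are assumptions, and token counts are assumed to be precomputed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    tier: str      # "must", "should", or "nice"
    tokens: int    # token count, assumed precomputed by a tokenizer


def trim_to_budget(blocks: list[Block], budget: int) -> list[Block]:
    """Drop "nice" blocks, then "should" blocks, until the total fits.

    "must" blocks are never dropped, even if the result still exceeds budget.
    """
    kept = list(blocks)
    for tier in ("nice", "should"):
        # Walk from the end so later (assumed lower-priority) blocks go first.
        for block in reversed([b for b in kept if b.tier == tier]):
            if sum(b.tokens for b in kept) <= budget:
                return kept
            kept.remove(block)
    return kept
```

For example, trimming a 3000-token must block, a 2000-token should block, and a 2500-token nice block to the 6K soft budget drops only the nice block.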

Behavioral defaults (locked in v1 brainstorm round 2)

Significance scale: 0=Routine, 1=Notable, 2=Significant, 3=Pivotal. Score-3 turns auto-pin per witness. Drives retrieval ranking, compression, JSON exports.
Edge updates: per-turn deltas (affinity_delta, trust_delta, knowledge_facts, last_interaction); per-scene-close summary rewrite. Every mutation goes through the event log as edge_update.
Classifier failure handling: Pydantic-constrained → 1 retry with stricter reminder → schema-default fallback. 10s timeout. Never block the play loop. Refusals trigger fallback-model swap for that one call. Failures logged to classifier_failures table.
Activity verbs: open string + classifier-extracted interruptible, required_attention, expected_duration. Attention is optional free-form; omit from prompt when empty.
Containers: parse-and-extend. Per-chat scoped. Kickoff parse seeds initial; transitions create new.
Pinning: soft cap 8 / bot. Pivotal (score 3) = auto-pin. Manual pins never auto-evicted.
Snapshots: periodic every 100 events / 30 min; pre-rewind always. 5 periodic retained; pre-rewind retained 14 days.
Streaming: Stop button on streaming row; mid-stream disconnect commits partial with truncated: true; Send disabled mid-stream; multi-tab streaming via per-chat SSE channel.
Display: lightweight markdown; *action* italic; OOC ((parens)) shown dimmed/italic, never sent to bot.
Multi-entity defaults (Phase 2): when chat.guest_bot_id is None, behavior matches Phase 1 single-bot 1:1. With a guest, all 3 entities are present in the prompt, witness writes, and state-update fan-out (6 directed pairs).
Addressee detection: simple name match (whole-word, case-insensitive) over the user turn's body. If both bot names match or neither does, the host gets the floor.
Interjection: classifier-driven, conservative bias (default false on classifier failure / refusal / parse error). When the classifier returns true, the addressee speaks first, then the non-addressee may interject in a follow-up turn.
Per-POV summaries (multi-entity): each present witness with a memory store gets their own per-POV summary on scene close. The summary differs per bot based on persona + their edge to "you". The group node summary is updated alongside.
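The addressee rule above is small enough to sketch directly. The function name and signature here are hypothetical; only the matching rule (whole-word, case-insensitive, host wins ties) comes from this document.

```python
import re


def detect_addressee(body: str, host_name: str, guest_name: str) -> str:
    """Return "host" or "guest" for a user turn in a three-entity scene."""

    def mentioned(name: str) -> bool:
        # \b anchors make this a whole-word match; re.escape guards odd names.
        return re.search(rf"\b{re.escape(name)}\b", body, re.IGNORECASE) is not None

    host_hit, guest_hit = mentioned(host_name), mentioned(guest_name)
    if guest_hit and not host_hit:
        return "guest"
    # Both matched, neither matched, or only the host matched: host gets the floor.
    return "host"
```

The whole-word anchor matters: with a guest named "Mira", a turn mentioning "Miranda" should not hand the guest the floor.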

Core concepts (vocabulary)

Entity: you | botA | botB. Has identity (immutable), state (mood/goals/status), activity, per-POV memory.
Container: anything with slots that holds entities (car, booth, room). Has properties (moving, public, audible range). Spatial grounding lives here, separate from the relationship graph.
Activity record: per-entity live struct — position (container+slot), posture, current action (verb, duration, interruptible, required_attention), holding, attention, status. Always in the prompt as a small structured block.
Relationship graph: 6 directed edges (asymmetric feelings matter — never collapse to a single shared field) + 1 group node. Edges hold affinity, trust, summary, knowledge-known-about-target, private moments, last-interaction.
Scene configurations: exactly 4 — solo with botA, solo with botB, all three present, botA+botB without you ("meanwhile…"). Each has a fixed prompt-loading rule.
Witnessed-by flag: every memory has a 3-bit [you, botA, botB] mask. A speaker only sees memories where their bit is set. This is the mechanism that prevents bots referencing things they can't know.
Event: scoped lifecycle (planned | active | completed | cancelled | expired) with its own props, preconditions, on_start/on_complete hooks, significance. Solves the picnic-basket problem — props live and die with the event, only narrative gist promotes to memory.
Active threads: unresolved plot tensions. Sticky in context until resolved/dropped. Cheap, anchor continuity across compressed scenes.
Scene: closes when container changes meaningfully or significant time passes. Compression boundary.
Per-POV summary: every witness gets their own record of a closed scene, written from their POV. Different details, different interpretations. This is what gives bots inner lives — never write omniscient narration into per-POV stores.
Time skip: elision (skip the boring middle of an in-progress activity) vs jump (next morning, a week later). Skips run intervening events forward, compress, reset landing activity.
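The witnessed-by mask lends itself to a bit-flag sketch. The bit assignments, the `Witness` enum, and the dict shape of a memory are assumptions for illustration; the real schema may store the mask differently.

```python
from enum import IntFlag


class Witness(IntFlag):
    YOU = 0b001
    BOT_A = 0b010
    BOT_B = 0b100


def visible_to(memories: list[dict], speaker: Witness) -> list[dict]:
    """Keep only memories whose witness mask has the speaker's bit set."""
    return [m for m in memories if m["witnessed_by"] & speaker]


memories = [
    {"text": "You and botA argued in the car", "witnessed_by": Witness.YOU | Witness.BOT_A},
    {"text": "botB waited alone in the booth", "witnessed_by": Witness.BOT_B},
]
```

Calling `visible_to(memories, Witness.BOT_B)` returns only the booth memory, so botB has no way to reference the argument in the car.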

What promotes out of an event (and what doesn't)

Object acquired → inventory
Knowledge gained → edge knowledge field
Relationship change → edge summary
Everything else stays in the closed event record. The blanket, the basket, the specific sandwich do not become memories. This rule is the whole point — don't bypass it.
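As a rough sketch, the promotion rule is an allow-list: only listed outcome kinds map to a destination, and everything else is left behind. The outcome kinds, destination labels, and `promote` helper are hypothetical names chosen for this example.

```python
PROMOTIONS = {
    "object_acquired": "inventory",
    "knowledge_gained": "edge.knowledge",
    "relationship_change": "edge.summary",
}


def promote(outcomes: list[dict]) -> list[tuple[str, str]]:
    """Return (destination, detail) pairs; anything unlisted stays in the event."""
    return [
        (PROMOTIONS[o["kind"]], o["detail"])
        for o in outcomes
        if o["kind"] in PROMOTIONS
    ]
```

An allow-list fails safe: a new kind of event prop is dropped by default instead of leaking into memory.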

Persistence

SQLite (single file) for everything structured. WAL mode, foreign keys on, each turn in a transaction.
sqlite-vss or sqlite-vec for embeddings (same DB file). Decide at Phase 4.
JSON for snapshots, character templates, scene exports.
No Postgres, Redis, Pinecone, Docker. Single-user; don't over-engineer.
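A minimal sketch of the connection setup described above, using only the standard-library `sqlite3` module. The function names and the `event_log (kind, payload)` columns are assumptions for illustration, not the contents of chat/db/connection.py.

```python
import sqlite3


def open_connection(path: str) -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("PRAGMA journal_mode=WAL")   # readers don't block the writer
    conn.execute("PRAGMA foreign_keys=ON")    # off by default in SQLite
    return conn


def record_turn(conn: sqlite3.Connection, kind: str, payload: str) -> None:
    # `with conn` commits on success and rolls back if the body raises.
    with conn:
        conn.execute(
            "INSERT INTO event_log (kind, payload) VALUES (?, ?)", (kind, payload)
        )
```

Note that `foreign_keys` is a per-connection setting, so it has to be set every time a connection is opened, not once per database file.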

Schema is event-sourced. See design doc § "Persistence Layer" for the full sketch.

Event sourcing — non-negotiable

State is a projection of an append-only event log. State is never mutated directly — append an event, the projector applies it.
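The idea can be shown with an in-memory toy. This is a sketch under assumptions: the real projector works against SQLite, and the `World` class, payload keys, and list-backed log here are made up for the example.

```python
from dataclasses import dataclass, field


@dataclass
class World:
    affinity: dict[tuple[str, str], int] = field(default_factory=dict)


def apply_event(world: World, event: dict) -> None:
    """Projector step: fold one event into the derived state."""
    if event["kind"] == "edge_update":
        p = event["payload"]
        key = (p["source"], p["target"])
        world.affinity[key] = world.affinity.get(key, 0) + p["affinity_delta"]


def project(log: list[dict]) -> World:
    """Rebuild state from scratch by replaying the whole log."""
    world = World()
    for event in log:
        apply_event(world, event)
    return world


log: list[dict] = []
log.append({"kind": "edge_update",
            "payload": {"source": "botA", "target": "you", "affinity_delta": 2}})
log.append({"kind": "edge_update",
            "payload": {"source": "botA", "target": "you", "affinity_delta": -1}})
```

Rewind falls out for free: `project(log[:1])` is the world as it stood after the first event.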

Event kinds: user_turn, assistant_turn, time_skip, event_triggered, edge_update, scene_transition, entity_state_change, activity_change.

This buys: free rewind, trivial replay-debugging, schema migrations against the same log, branching ("what if BotA had said yes").

Determinism on replay: LLM calls are nondeterministic. Store the outcome in the event payload — on replay, use the stored outcome. Never re-call the LLM during replay.
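A small sketch of that split between the live path and the replay path. The function names and payload shape are hypothetical; the point is that the replay function has no way to reach the model.

```python
def live_turn(log: list[dict], prompt: str, generate) -> str:
    """Live path: call the model once and persist its output in the event."""
    text = generate(prompt)
    log.append({"kind": "assistant_turn", "payload": {"prompt": prompt, "text": text}})
    return text


def replay_transcript(log: list[dict]) -> list[str]:
    """Replay path: read stored outcomes; there is no `generate` parameter at all."""
    return [e["payload"]["text"] for e in log if e["kind"] == "assistant_turn"]
```

Keeping `generate` out of the replay signature makes an accidental re-call a type error rather than a silent divergence.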

Snapshots every N events / M minutes so we don't replay everything on load. Log is source of truth.

Prompt construction

A speaker's prompt is assembled from their edges and their witnessed memories — never the global state. BotA and BotB are effectively two separate agents who happen to share a scene.

Order (for speaker BotA, with you and BotB present):

BotA identity + current state
BotA → You edge
BotA → BotB edge
Group node (only if all three present)
World state (time, weather, location)
Active scene description
Activity snapshot for all present entities
Active threads
Recent dialogue window
Retrieved memories (top-K, witness-filtered, BotA-owned)
Currently active events + their props
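The ordering above could be encoded as a fixed list of section keys. This is a sketch, not the assembly code in prompt.py: the key names and `assemble_prompt` are assumptions, and each section is assumed to arrive already rendered as text.

```python
PROMPT_ORDER = [
    "identity_and_state",
    "edge_to_you",
    "edge_to_other_bot",
    "group_node",
    "world_state",
    "scene_description",
    "activity_snapshot",
    "active_threads",
    "recent_dialogue",
    "retrieved_memories",
    "active_events",
]


def assemble_prompt(sections: dict[str, str], all_three_present: bool) -> str:
    """Join the speaker's sections in the fixed order, skipping empty ones."""
    parts = []
    for key in PROMPT_ORDER:
        if key == "group_node" and not all_three_present:
            continue  # the group node only loads when all three are present
        text = sections.get(key, "").strip()
        if text:
            parts.append(text)
    return "\n\n".join(parts)
```

Since the caller passes in only the speaker's own sections, there is no global state for the function to leak.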

After every utterance, run a state-update pass on every present entity, not just the speaker. Silent witnesses still update edges.

Memory retrieval

Always-loaded: pinned, current scene, active threads, recent N scenes (no retrieval).
Retrieved: top-K vector search over the speaker's memory store, filtered by witness flag, with recency + significance boosts.
Keep K small. Bloated retrieval poisons the prompt.
Phase 1: SQLite FTS5 is enough. Vector search comes at Phase 4.
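A hedged sketch of the retrieval re-rank: filter by witness bit first, then add recency and significance boosts to the base similarity. The weights, the decay formula, and the candidate dict shape are invented for illustration and are not the values used by search_memories.

```python
def rank_memories(candidates, speaker_bit, now, k=3,
                  recency_weight=0.3, significance_weight=0.2):
    """Witness-filter, then re-rank by similarity plus recency/significance boosts.

    Each candidate is a dict with: similarity (0..1), witnessed_by (int mask),
    significance (0..3), and created_at (seconds). The weights are made up.
    """
    visible = [m for m in candidates if m["witnessed_by"] & speaker_bit]

    def score(m):
        age_days = (now - m["created_at"]) / 86400
        recency = 1.0 / (1.0 + age_days)          # 1.0 today, decays with age
        return (m["similarity"]
                + recency_weight * recency
                + significance_weight * m["significance"] / 3)

    return sorted(visible, key=score, reverse=True)[:k]
```

Filtering before scoring means an unwitnessed memory can never make the top-K, however similar it is to the query.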

Implementation phases

Core loop: schema, entities + edges, single container, event log + projector, single-bot conversation, one LLM backend, streaming UI, manual rollback.
Multi-entity: second bot, group node, scene configs, witness filtering, per-POV memories, activity/containers, scene transitions with compression.
Events & skips: event queue with triggers, time skips, active threads, significance classifier.
Polish: vector retrieval, branching, surgical delete + regenerate, snapshots, backups, impact-preview UI for rewinds.

Don't jump phases. Phase 1 must work end-to-end before Phase 2 lands.

Conventions for working in this repo

Don't bypass the event log. Any state change goes through an event. If you're tempted to UPDATE a row directly, you're doing it wrong.
Don't collapse directed edges. botA → botB and botB → botA are independent. Asymmetry is the point.
Don't promote event props to memory. Only the three promotion categories above (object acquired, knowledge gained, relationship change) survive an event closing.
Per-POV, not omniscient. When writing scene summaries, write one per witness, from their angle.
Witness filter every memory read. A bot must never see a memory their bit isn't set on.
Activity block is always in the prompt. It's the spatial anchor that prevents "leaning on the kitchen counter while in a car" failures.
Streaming on the inference path; non-blocking bookkeeping (significance classification, embeddings, snapshots) runs while the LLM streams.
No Docker, no extra services. SQLite + a process. Push back on suggestions to add infrastructure.

Open decisions (deferred — don't pre-decide)

Token budget strategy (during Phase 1, with real prompts)
Embedding model (Phase 4)
sqlite-vss vs sqlite-vec (Phase 4)
UI framework (local web app / Tauri / Electron / native — TBD)
Inference hosting (start with a cloud API, re-evaluate later)
Character template format (during Phase 1)
Multi-session / multi-character casts: out of scope for v1. Leave cheap schema hooks only.

Phase 1 status

Phase 1 shipped end-to-end across 36 tasks (T0–T35). The single-bot core loop is functional: event log + projector, schema + migrations, settings/bot authoring, kickoff confirm, streaming turns, drawer rendering, regenerate/rewind, scene close + per-POV summaries, significance classifier, snapshots/backups, first-run navigation, and friendly 404/500 pages. 168 tests passing.

Deferred to Phase 2: second bot, group node, scene configurations, witness filtering across multi-entity scenes, activity/containers, scene-transition compression. Phase 3: event queue + triggers, time skips, active threads. Phase 4: vector retrieval, branching, surgical delete + regenerate, impact-preview UI.

Known v1 limitations (read before extending)

Drawer edits scope: only affinity, significance, and pin can be hand-edited from the drawer. Other v1 fields (knowledge, summary text, traits) are deferred to Phase 1.5.
Cold-load snapshot path is wired and unit-tested but rarely exercised in dev — long-running sessions are the only realistic trigger.
WAL sidecar files (-wal, -shm) are not captured in nightly backups; the nightly snapshot is a fresh .backup() so this is fine for restore but worth knowing if you copy the db file by hand.
HTMX SSE event names may need a version check if you bump the htmx CDN URL in base.html — the swap targets are name-coupled.
"You" activity rows can linger after bot_reset (the reset purges the bot's chats and the bot's own activity row but not the "you" row that was associated with those chats). Cosmetic; slated for the Phase 1.5 cleanup and fixed by T69 (see Phase 2.5 status).
Projector replay is non-idempotent for plain INSERT events. After appending, call apply_event(conn, event) for the new row only. Calling project(conn) re-runs every handler from scratch and will trip uniqueness constraints or insert duplicate rows.
8-pin auto-cap eviction is FIFO over the auto-pinned set only. Manual pins survive the eviction; this is by design (manual intent > auto-pin signal).
Regenerate (T29) does not broadcast turn_html over SSE — the page must refresh to show the regenerated turn. Acceptable for v1 single-tab usage; Phase 1.5 should wire the SSE event.
First-run middleware fires only on bare / and /chats. Sub-paths like /chats/<id> and /chats/<id>/drawer pass through (correct: HTMX partials should not page-redirect, and a deep-link to a missing chat should 404, not redirect mid-setup).
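The projector replay limitation is easy to reproduce in miniature. The schema and handler below are toy stand-ins (the real tables and handlers differ); only the apply_event / project distinction mirrors the repo.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event_log (id INTEGER PRIMARY KEY, kind TEXT, body TEXT)")
conn.execute("CREATE TABLE turns (event_id INTEGER PRIMARY KEY, body TEXT)")


def apply_event(conn, event):
    """Plain INSERT handler: applying the same event twice violates the PK."""
    event_id, kind, body = event
    if kind == "user_turn":
        conn.execute("INSERT INTO turns (event_id, body) VALUES (?, ?)", (event_id, body))


def project(conn):
    """Full replay: only safe against empty projection tables."""
    rows = conn.execute("SELECT id, kind, body FROM event_log ORDER BY id").fetchall()
    for event in rows:
        apply_event(conn, event)


def append_and_apply(conn, kind, body):
    cur = conn.execute("INSERT INTO event_log (kind, body) VALUES (?, ?)", (kind, body))
    apply_event(conn, (cur.lastrowid, kind, body))  # new row only


append_and_apply(conn, "user_turn", "hello")
append_and_apply(conn, "user_turn", "still there?")
```

At this point the projection holds two rows. Calling `project(conn)` now raises `sqlite3.IntegrityError`, because it tries to re-insert event 1 into a table that already contains it.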

Phase 1.5 cleanup backlog

All items shipped — see Phase 2.5 status below.

Phase 2 status

Phase 2 shipped end-to-end across 13 tasks (T36–T48 wave). The multi-entity surface is functional: chats can host a guest bot, the prompt assembly is guest-aware, post-turn fans out across all directed pairs, and scene close writes a per-POV summary per present witness plus a group_node summary.

Multi-entity scene support: chats can now have a guest bot (you + host + guest). The 3-entity cap holds. New event kinds: guest_added, guest_removed, group_node_initialized, group_node_updated. New table: group_node (members, summary, dynamic, threads).
Drawer guest UX: add/remove guest from the drawer side panel. The "have they met?" prose seed is parsed by the relationship_seed classifier into inter-bot directed edges (host↔guest).
Multi-entity turn flow: post_turn assembles narrative with the guest-aware prompt; writes memories for all present bot witnesses; runs state updates for all directed pairs (6 with 3 entities); detects interjections via classifier (default false; the addressee gets the floor first).
Per-POV scene close summaries: each present witness with a memory store gets their own per-POV summary on close; group_node summary updated alongside.
Bot reset cascade: resetting a bot now also clears chats.guest_bot_id references in other chats (root-cause fix for stale-guest references after T47).

Phase 2.5 / 3 backlog

All items shipped — see Phase 2.5 status below.

Phase 2.5 status

Phase 2.5 cleanup shipped end-to-end across 8 tasks (T68–T75). Two CLAUDE.md backlogs (Phase 1.5 cleanup, Phase 2.5/3) are now empty; deferred follow-ups discovered during execution are tracked in a new "Phase 2.6 / 3 backlog" section below.

open_db with check_same_thread parameter (T68): refactored chat/db/connection.py so chat/web/bots.py:get_conn no longer duplicates the PRAGMA setup. Default behavior preserved.
bot_reset cross-chat cleanup (T69): now purges orphaned "you" activity rows. Note: this also fixed a latent FK constraint crash that was lurking in the projector — activity.container_id is FK-referenced and the prior code would have crashed on any reset of a bot whose chat had a non-NULL container_id "you" activity row. The bug was masked because no prior test seeded such a row.
LLM-merged group meta-summary (T70): replaces Phase 2 T45's naive concat with a classifier merge call. Falls back to the naive concat on classifier failure.
prompt.py polish (T71): witness role parametric (host vs guest derived from chat membership); single ACTIVITIES: block with bullet-level trim; NICE trim order kept with documented rationale (greedy cheapest-impact-first beats spec-listing order in practice).
Drawer polish (T72): deferred v1 edits (edge_trust slider, edge_summary textarea, memory pov_summary textarea, knowledge_facts add/remove) + first-meeting gate (Add-guest form disables prose textarea when host→guest edge already exists; "re-seed anyway" toggle re-enables) + witness flag inline-edit (per-memory checkboxes for [you, host, guest] flags). Two new manual_edit projector branches: edge_knowledge_fact and memory_witness.
Regenerate polish (T73): regenerate now broadcasts turn_html_replace over SSE (NEW event distinct from turn_html to avoid breaking the existing append-semantic consumer); regenerate covers interjection turns (re-detects + re-streams or supersedes); defensive stale-guest degrade removed.
Turn-flow polish + addressee service (T74): classifier-based addressee detection (substring helper kept as no-guest fast path); SignificanceJob enqueued for interjection memories; scene-close-on-cancel pinned with comment + regression test (close detection is genuinely user-prose-only); defensive stale-guest degrade removed.

Phase 2.6 / 3 backlog

New follow-ups discovered during Phase 2.5 execution. None are blocking; pick up at any time.

Frontend handler for turn_html_replace SSE event (from T73.1 review): regenerate's backend broadcast lands, but no live tab swaps the regenerated turn until a JS handler is wired. The existing turn_html event uses HTMX sse-swap to append; turn_html_replace ships JSON with supersedes_id for replacement semantics. Phase 2.6 should wire the JS to swap the prior turn's DOM node in place.
Cancel/stop hook for in-flight regenerate streams (from T73 review): post_turn registers stream tasks in _in_flight_tasks so the user can stop them. Regenerate doesn't. A user clicking "Stop" mid-regenerate has no cancel hook today.
DRY: regenerate vs post_turn (from T73 review): recent-dialogue assembly and prior-edges block are duplicated between chat/services/regenerate.py and chat/web/turns.py. Extract to shared helpers analogous to _gather_state_update_inputs.
Sibling-discovery query optimization (from T73 review): regenerate.py's sibling-assistant-turn lookup scans all non-superseded assistant_turn rows globally. Adding a chat_id predicate via JSON extraction (or a denormalized column) bounds the cost to per-chat scale.
_witness_role_for defensive coding (from T71 review): helper returns "guest" when host_bot_id is None, which is wrong for Phase-1 chats. Defensive: return "host" if host_bot_id is None or speaker_bot_id == host_bot_id else "guest". Not exercised by current tests; harden as a precaution.
Confidence type tightening (from T74 review): chat/services/addressee.py::AddresseeDecision.confidence could be typed as Literal["high","medium","low"] for stricter validation. Currently str with a comment.
Scene-close-on-cancel UX revisit: T74.3 pinned the existing behavior (close fires even on cancel). If real play-testing surfaces a regression, revisit.
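The defensive form suggested for _witness_role_for could look like this. The standalone function name and the integer ID types are assumptions; the real helper's signature may differ.

```python
from typing import Optional


def witness_role_for(speaker_bot_id: int, host_bot_id: Optional[int]) -> str:
    # Phase-1 chats may have no host recorded; treat the lone bot as the host.
    if host_bot_id is None or speaker_bot_id == host_bot_id:
        return "host"
    return "guest"
```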

Phase 3 status

Phase 3 shipped end-to-end across 19 tasks (T49–T67). Events with full lifecycle, time skips, active threads, significance refinements, and meanwhile scenes are functional. Schema baseline is now version 11 (migrations 0009 events, 0010 threads, 0011 meanwhile_scenes). Test count grew from ~247 (Phase 2) to ~315 (+68 new tests across the wave).

Wave 1 — schema + lifecycle handlers (parallel):
- T49 events table + lifecycle handlers (event_planned, event_started, event_completed, event_cancelled, event_expired).
- T50 time_skip event handlers (elision and jump variants).
- T51 threads table + handlers (thread_opened, thread_updated, thread_closed).
Wave 2 — detection / narration services (parallel):
- T52 event-lifecycle detection service (planned→active→completed transitions inferred from narration).
- T53 skip narration service (elision + jump prose).
- T54 synthesized-memories service for jump skips (LLM-summarized intervening time).
- T55 thread-detection service (open/update/close inferred from recent dialogue).
Wave 3 — promotion + ranking (parallel):
- T56 event-completion promotion service (objects → inventory, knowledge → edge knowledge, relationship deltas → edge summary; everything else stays in the closed event).
- T57 significance-aware retrieval ranking — SQL-side SIGNIFICANCE_RANK_BIAS plus the existing Python composite re-rank.
- T58 scene compression keeps key quotes when significance ≥ 2; thread emission piggybacks on scene close.
Wave 4 — drawer UX (single):
- T59 drawer additions: events panel, threads panel, skip controls.
Wave 5a — prompt + turn flow integration (parallel):
- T60 prompt assembly includes active events + open threads in the speaker's prompt.
- T61 turn flow invokes event-detection + completion promotion alongside existing post-turn fan-out.
Wave 5b — natural-language skip surface (single):
- T62 classifier-driven skip command at the user-input layer; shared skip controllers extracted into chat/web/skip.py.
Wave 6a — meanwhile schema (single):
- T63 meanwhile-scene schema + state (scene config 4: host+guest, no "you").
Wave 6b — meanwhile turn flow (parallel):
- T64 meanwhile turn flow (host+guest, no "you" in the prompt or witness writes).
- T65 meanwhile summary digest surfaces to the next "you"-present scene.
Wave 7 — integration + docs (parallel):
- T66 cross-feature integration tests covering events × skips × threads × meanwhile.
- T67 documentation (this section).

Phase 3.5 / 4 backlog

New follow-ups discovered during Phase 3 reviews and execution. None are blocking; pick up at any time.

From T53 review

narrate_skip timeout_s not piped through to client.generate: parameter accepted but ignored. Fix: pass timeout_s=timeout_s to client.generate(**...), or drop the parameter entirely if Featherless's client doesn't honor it.

From T57 review

search_memories docstring should mention SQL-side significance bias: the function docstring still describes only the Python composite re-rank; add a one-line note about SIGNIFICANCE_RANK_BIAS.

From T58 review

Scene close re-close suffix bloat risk: _build_key_quotes_suffix reads from memories.pov_summary. If a scene close runs twice, the second pass would read the rewritten text plus the previous "Key quotes:" suffix and append a second one. Either guard for double-suffix or source quotes from event_log assistant_turn/user_turn text instead.
Thread detection transcript scoping: _read_recent_dialogue returns chat-wide history with no scene_id filter (Phase 1 turns lack one). Feeding chat-wide history to detect_threads will misattribute threads to the closing scene when the scene boundary falls inside the last 50 turns. Scope by scene_id once turns carry it, or by started_at against scene-open timestamp.
Swallowed exceptions in detect_threads try/except: the blanket except Exception swallows programmer errors silently. Log at debug level so silent regressions can be diagnosed.
Scene close closed_at clock divergence: T58 uses datetime.now(timezone.utc).isoformat() instead of chat-clock time. Diverges from chat-clock semantics elsewhere; revisit if event reconstructions need chat-clock ordering.
Test coverage gaps in T58: no test for 200-char quote truncation; no test for thread_updated/thread_closed candidate paths; no test for the try/except fallback.
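For the re-close suffix risk, one possible guard is to strip any existing suffix before appending a new one, which makes the operation idempotent. The marker string, bullet format, and `with_key_quotes` name are assumptions; `_build_key_quotes_suffix` itself is not shown in this document.

```python
KEY_QUOTES_MARKER = "\n\nKey quotes:"


def with_key_quotes(summary: str, quotes: list[str]) -> str:
    """Append a "Key quotes:" suffix, replacing any suffix already present."""
    base = summary.split(KEY_QUOTES_MARKER, 1)[0]   # drop a previous suffix
    if not quotes:
        return base
    lines = "\n".join(f'- "{q[:200]}"' for q in quotes)  # 200-char truncation
    return f"{base}{KEY_QUOTES_MARKER}\n{lines}"
```

Sourcing quotes from the event log instead would avoid the problem at the root; this guard is the smaller change.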

From T61 review

Regenerate doesn't roll back lifecycle transitions from superseded turn: event_started/event_completed rows from a superseded turn remain. Phase 3.5 should add a lifecycle-undo step. Caveat: regenerate-after-completion may double-emit promotion artifacts if the new text re-completes the same event.
Asymmetry in event-detection ordering: post_turn runs lifecycle BETWEEN interjection and scene-close; regenerate runs lifecycle at the END. Benign because regenerate has no scene-close path, but worth tidying.

From T62 review

Error-message prefix sniff for 404 vs 400 routing: drawer skip routes use str(exc).startswith("chat not found") to distinguish 404 from 400. Fragile if error wording changes. Use a typed exception subclass.
Skip command bypasses scene close detection: a user typing "fade out, skip an hour" would skip without closing the scene. Acceptable for Phase 3 but worth noting.
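A typed-exception version of the 404 vs 400 routing might look like the sketch below. The class names and `status_for` helper are hypothetical; in the FastAPI routes the same `isinstance` checks would choose the response status.

```python
class SkipError(ValueError):
    """Bad skip request; maps to HTTP 400."""


class ChatNotFoundError(SkipError):
    """The target chat does not exist; maps to HTTP 404."""


def status_for(exc: Exception) -> int:
    # Check the subclass first; isinstance survives any rewording of the message.
    if isinstance(exc, ChatNotFoundError):
        return 404
    if isinstance(exc, SkipError):
        return 400
    return 500
```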

From T63 review

participants_json JSON injection (fixed in T63; kept here as an audit reminder): T63 originally built the JSON string with f-string interpolation and was changed to json.dumps. Audit other state modules for similar JSON-string-building patterns.
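A small illustration of why f-string JSON building is fragile and what to look for in the audit. The participant name is invented for the example.

```python
import json

name = 'Mira "the Fox"'

# Fragile: quotes inside the value break out of the JSON string.
unsafe = f'["{name}"]'

# Safe: json.dumps escapes quotes, backslashes, and control characters.
safe = json.dumps([name])
```

The `safe` string round-trips through `json.loads`; the `unsafe` one fails to parse at all, and a more carefully crafted name could inject extra array elements instead.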

From T64 review

record_meanwhile_memory and record_turn_memory_for_present share private _write_one_memory helper: minor DRY note; both helpers are similar enough that a unified API with a you_present: bool kwarg might be cleaner long-term.
Stop button cancellation for meanwhile turns: T64 fix-up registered tasks in _in_flight_tasks; verify the /turns/cancel endpoint actually cancels meanwhile streams (the test pins registration but not the cancel-from-route path).

From cross-feature interactions discovered in Wave 6b merge

Cross-feature canned-queue brittleness: meanwhile-scene close test required a canned response for T65's digest call after T64+T65 merge. Future close-path additions will keep extending the queue; consider a structured fixture builder rather than positional canned arrays.

From T66 integration tests

consume_pending_meanwhile_digests is defined but NOT wired into post_turn: the helper lives in chat/services/prompt.py (T65) but chat/web/turns.py never calls it. Meanwhile digests stay pending forever in production. Phase 3.5 should call the helper after the first you-turn following a meanwhile close — probably right after the assistant_turn lands but before the next prompt assembly. Pinned by tests/test_phase3_integration.py::test_meanwhile_close_digest_surfaces_then_consumed which currently calls the helper directly.

Discovered during Phase 3 execution

_witness_role_for defensive handling of host_bot_id is None (carry-over from the Phase 2.5 backlog, T71 review): still pending.
