Joseph Doherty a06f90a164 docs: add Phase 4.5 cleanup plan (all 24 backlog items)

16 tasks across 9 waves consolidating all 24 items in CLAUDE.md
Phase 4.5/5 backlog. Mix of:

- Wave 1 (parallel 6-way): trivial polish across 6 different files
- Wave 2 (single): schema migration 0014 (FK CASCADE + memories.event_id)
- Wave 3 (single): drawer bundle (event_id guard + html.escape + modal
  partial + bulk significance re-rate)
- Wave 4 (single): search UX (FTS snippet highlight + deep-link)
- Wave 5 (single): real embedding model swap (LLMClient.embed protocol)
- Wave 6 (single): branching read-side filter (riskiest — cross-cutting)
- Wave 7 (single): regenerate lifecycle rollback
- Wave 8 (single): sqlite-vec swap [ENVIRONMENTAL — may defer to Phase 5
  if Python rebuild / apsw not feasible]
- Wave 9 (parallel 3-way): structured fixture builder + integration tests + docs

Schema baseline 13 -> 14 (or 15 with T115). Big tasks (T112 real embed,
T113 branching filter, T114 lifecycle rollback) advance the engine
beyond Phase 4's metadata-only state. T115 environmental decision
captured in pre-flight; the other 13 tasks ship without it.

Uses task ids T103-T118 to avoid collision with prior phases.

2026-04-27 04:22:08 -04:00

Roleplay Engine — Phase 4.5 Cleanup Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task. Use the parallel-dispatch pattern documented under "Parallel-Execution Strategy" for parallel waves.

Goal: Burn down all 24 items in CLAUDE.md §"Phase 4.5 / 5 backlog". Mix of small defensive cleanups (most), three big features (real embedding model swap, branching read-side filter, lifecycle rollback in regenerate), one environment-dependent feature (sqlite-vec swap), and the long-deferred carry-overs (scene-close-on-cancel revisit, structured test-fixture builder).

Architecture: No new architecture. Two new schema migrations (0014 schema polish, 0015 sqlite-vec virtual tables). New external dependency optional (apsw if Python rebuild isn't possible). All other changes are polish / refactor / observability.

Tech Stack:

Existing — same as Phase 4.
OPTIONAL: rebuild Python with --enable-loadable-sqlite-extensions OR install apsw to enable the T115 sqlite-vec swap. T115 is the only task that requires this; the other 15 tasks land without it. If neither is available, T115 is deferred to Phase 5.

Source-of-truth references:

Backlog: CLAUDE.md §"Phase 4.5 / 5 backlog" (24 items grouped by review source + deferred).
Phase 3.5 / Phase 2.5 cleanup plans (pattern reference): 2026-04-26-v3.5-phase3.5-cleanup.md, 2026-04-26-v2.5-phase2.5-cleanup.md.
Conventions: CLAUDE.md §"Behavioral defaults" + §"Phase 4 status".

Pre-flight

Branch: create phase-4.5 from the latest main:

git checkout main && git pull && git checkout -b phase-4.5

Schema baseline: Phase 4 leaves the DB at version 13. Phase 4.5 adds two migrations: 0014_phase45_schema.sql (T109) and 0015_vec0_virtual_tables.sql (T115 — only lands if T115 ships). Final schema version: 14 or 15.

Optional pre-flight for T115 (sqlite-vec swap):

The host Python build needs enable_load_extension. Two options:

Rebuild Python via pyenv with PYTHON_CONFIGURE_OPTS="--enable-loadable-sqlite-extensions" pyenv install 3.12.0 --force and recreate the venv.
Add apsw as a dependency and migrate chat/db/connection.py to use apsw.Connection (significant refactor — the entire codebase uses stdlib sqlite3).

If neither is acceptable, defer T115 to Phase 5 and ship Phase 4.5 with 15 tasks instead of 16. The other tasks are unaffected.
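Before choosing between the two options, it may help to check whether the current interpreter already supports loadable extensions. The following is a minimal sketch that assumes only the stdlib sqlite3 module; the function name supports_loadable_extensions is illustrative and not part of the codebase.

```python
import sqlite3


def supports_loadable_extensions() -> bool:
    """Return True if this Python's sqlite3 module can load extensions.

    Builds configured without --enable-loadable-sqlite-extensions omit
    Connection.enable_load_extension, so attribute presence is checked
    first. The call itself is then tried on a throwaway in-memory
    connection in case the build exposes the method but rejects it.
    """
    if not hasattr(sqlite3.Connection, "enable_load_extension"):
        return False
    conn = sqlite3.connect(":memory:")
    try:
        conn.enable_load_extension(True)
        conn.enable_load_extension(False)
        return True
    except (sqlite3.Error, AttributeError):
        return False
    finally:
        conn.close()


if __name__ == "__main__":
    print("loadable extensions supported:", supports_loadable_extensions())
```

A False result means T115 needs either the pyenv rebuild or the apsw migration described above.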

Pinned non-negotiables (carried forward):

State changes go through the event log. Use append_and_apply for the live path.
Witness filter every memory read at SQL level.
TDD: every task starts with a failing test (or a regression test pinning existing contract before refactor).
One commit per task minimum. Bundled tasks split internally.

Verification before claiming done: Use superpowers-extended-cc:verification-before-completion — run the test command, paste actual output.

Backlog item → task mapping

24 items consolidated into 14 tasks (T103–T116) by file ownership; T117 (integration tests) and T118 (docs sweep) bring the plan to 16 tasks in total:

#	Item	Source	Task
1	`embeddings` FK lacks `ON DELETE CASCADE`	T88	T109 (schema migration)
2	`list_branches(chat_id=...)` global-branch leak — document	T89	T103
3	Branch-switch silently leaves zero active — log warning	T89	T103
4	Real embedding model swap	T91 / deferred	T112
5	`timeout_s` fallback-path logging	T91	T107
6	Duplicate `MAX(id)` lookup in retrieval ranking	T96	T104
7	`fts_rank=None` for vector-only rows — document	T96	T104
8	`event_id <= 0` guard in `delete_turn`	T98	T110
9	`html.escape()` on delete-impact modal output	T98	T110
10	Extract delete-impact modal to Jinja partial	T98	T110
11	Hoist `datetime`/`timezone` imports in `snapshots.py`	T99	T105
12	Strict `kind` validation in snapshot routes	T99	T105
13	`created_at` from file mtime — document drift risk	T99	T105
14	Hardcoded `k=50` → module constant	T100	T106
15	N+1 lookups in search results	T100	T106
16	FTS highlighting via `snippet()`	T100	T111
17	Result links chat-level only — add deep-link via memories.event_id	T100	T109 + T111
18	sqlite-vec swap when host Python supports loadable extensions	deferred	T115
19	Branching read-side filter (consult `is_active`)	deferred	T113
20	Bulk significance re-rate in drawer	deferred	T110
21	Vector index optimization (HNSW)	deferred	T115 (post-ship note)
22	Scene-close-on-cancel UX revisit	Phase 2.5 carry-over	T108
23	Cross-feature canned-queue brittleness fixture builder	Phase 3 carry-over	T116
24	Full lifecycle-rollback in regenerate	Phase 3.5 carry-over	T114

Parallel-Execution Strategy

Same pattern as Phase 3.5 / Phase 2.5 / Phase 4. Nine waves: parallel within each wave (file-disjoint), serial across waves.

How to dispatch a wave in parallel

Use the Agent tool with isolation: "worktree". (If the controlling session's working directory is not the chat repo, create worktrees manually with git worktree add .worktrees/<wave>-<task> -b <wave>/<task> phase-4.5.)

After a wave completes

Each subagent returns its worktree path and commit SHA(s).
Run a spec + code-quality reviewer subagent on each completed task. Combined review acceptable for trivial tasks (T103–T108); separate spec + quality reviewers for big tasks (T112, T113, T114, T115).
Merge the wave into phase-4.5 in any order (file-disjointness guarantees no conflict). Use --no-ff.
Run the full test suite on the merged phase-4.5.
Push phase-4.5 to gitea.
Optionally clean up worktrees.

Conflict prevention checklist

For each parallel wave, verify the Files sections of all tasks have no overlapping paths. Hot files in this plan (each touched by at most one task within any single wave, though several are revisited in later waves): chat/state/memory.py, chat/web/drawer.py, chat/web/search.py, chat/services/regenerate.py, chat/services/turn_common.py, chat/services/embeddings.py, chat/db/migrations/.

Why each wave is parallel-safe

Wave	Tasks	Hot files	Disjoint?
1	T103, T104, T105, T106, T107, T108	6 different files; no overlap	✅
2	T109	new migration + minor projector update	(single task)
3	T110	`chat/web/drawer.py` (bundle)	(single task)
4	T111	`chat/services/cross_chat_search.py` + `chat/web/search.py` + template	(single task; depends on T109)
5	T112	`chat/services/embeddings.py` + `chat/llm/*.py` (Protocol + Featherless + Mock)	(single task)
6	T113	`chat/services/turn_common.py` + multiple readers (cross-cutting)	(single task)
7	T114	`chat/services/regenerate.py` + projector handler	(single task)
8	T115	new migration + `chat/services/vector_search.py` + `chat/db/connection.py`	(single task; environmental)
9	T116, T117, T118	new test fixture file (T116); new test file (T117); CLAUDE.md (T118)	✅

Task overview

Wave 1 ─┬─ T103: branches polish (global-branch doc + branch-switch warning)
        ├─ T104: state/memory.py polish (DRY MAX(id) + fts_rank doc)
        ├─ T105: snapshots.py polish (datetime hoist + kind validation + mtime doc)
        ├─ T106: search.py polish (k constant + N+1 batched lookups)
        ├─ T107: embeddings.py timeout_s fallback-path logging
        └─ T108: scene-close-on-cancel UX revisit (pin behavior with regression test)

Wave 2 ─── T109: 0014 schema migration (FK CASCADE + memories.event_id column)

Wave 3 ─── T110: drawer Phase 4.5 bundle (event_id guard + html.escape + modal partial + bulk sig re-rate)

Wave 4 ─── T111: search UX enhancements (FTS snippet() highlighting + deep-link via memories.event_id)

Wave 5 ─── T112: real embedding model swap (LLMClient.embed protocol + Featherless impl + generate_embedding routing + backfill)

Wave 6 ─── T113: branching read-side filter (event readers consult is_active branch range)

Wave 7 ─── T114: regenerate lifecycle rollback (back-reference field + compensating events on supersede)

Wave 8 ─── T115: sqlite-vec swap (vec0 virtual tables + MATCH-based vector_search) [ENVIRONMENTAL — see pre-flight]

Wave 9 ─┬─ T116: structured test-fixture builder (canned-queue brittleness)
        ├─ T117: Phase 4.5 cross-feature integration tests
        └─ T118: docs sweep — Phase 4.5 status, prune backlog, capture Phase 5 residuals

Critical path: 9 sequential merge points. Total tasks: 16 (or 15 if T115 deferred). Parallelism: Waves 1 (6-way) and 9 (3-way) dispatch concurrently. Waves 2–8 are single-task by hot-file constraint.

Wave 1 — Independent small fixes (parallel, 6 tasks)

All trivial, file-disjoint. Each is 1-line + 1-test or similar.

Task 103: branches polish

Files:

Modify: chat/state/branches.py
Modify: tests/test_branches_state.py

Spec (2 sub-fixes, single commit):

Document global-branch leak: list_branches(chat_id=...) filter chat_id = ? OR chat_id IS NULL returns global/null-chat branches (like "main") in every chat scope. Add a docstring note explaining this is intentional ("main" is global by design; per-chat branches are scoped).
Warn on branch-switch to nonexistent name: in _apply_branch_switched, before the SQL UPDATE, check if a branch with the given name exists. If not, emit logging.getLogger(__name__).warning(...) rather than silently leaving zero active branches.

Test: test_branch_switched_unknown_name_warns — capture log via caplog, append branch_switched for nonexistent name, assert warning message + no active branch (existing behavior preserved, just observable).

Commit: chore: branches polish — global-leak docs + unknown-name warning (T103).

Task 104: state/memory.py polish

Files:

Modify: chat/state/memory.py
Modify: tests/test_memory_search.py (no new tests; just add docstring assertions if needed)

Spec (2 sub-fixes):

DRY MAX(id) lookup: _composite_rerank (Phase 3.5 T57) and _rrf_fuse_and_rerank (Phase 4 T96) both query SELECT MAX(id) FROM event_log for the recency boost. Extract a _max_event_id(conn) helper.
fts_rank=None documentation: search_memories docstring should note that vector-only rows have fts_rank=None. Downstream consumers must accept None (they currently do, but contract is implicit).

Test: existing tests cover both via the public API; no new test needed unless docstring assertion is desired.

Commit: chore: memory.py DRY MAX(id) helper + document fts_rank=None contract (T104).

Task 105: snapshots.py polish

Files:

Modify: chat/web/snapshots.py
Modify: tests/test_snapshot_ux.py (1 new test)

Spec (3 sub-fixes):

Hoist datetime/timezone imports to module level (currently inside _list_all_snapshots).
Strict kind validation in restore/preview routes: currently kind defaults to "periodic". If a rewind snapshot is requested without explicit kind, the lookup silently 404s. Reject missing kind with a 400 instead of silently defaulting.
Document created_at mtime drift risk in module docstring: snapshot timestamps come from file mtime, not the encoded filename timestamp. Files copied via cp -p preserve mtime; cp without -p resets it. Add a one-line note.
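The strict kind check might look like the sketch below. The set of allowed kinds ("periodic" and "rewind") is inferred from the spec text and may be incomplete, and BadRequest is a stand-in for whatever HTTP 400 exception the web framework provides.

```python
from typing import Optional

# Assumption: these are the only snapshot kinds; extend as needed.
ALLOWED_SNAPSHOT_KINDS = frozenset({"periodic", "rewind"})


class BadRequest(ValueError):
    """Stand-in for the framework's HTTP 400 error."""


def require_kind(kind: Optional[str]) -> str:
    """Validate the `kind` form field instead of defaulting to "periodic"."""
    if not kind:
        raise BadRequest("missing required field: kind")
    if kind not in ALLOWED_SNAPSHOT_KINDS:
        raise BadRequest(f"unknown snapshot kind: {kind!r}")
    return kind
```

In the route, the handler would call require_kind(kind) before the snapshot lookup and translate BadRequest into a 400 response.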

Test: test_restore_without_kind_returns_400 — POST /snapshots/restore/<id> without kind; assert 400.

Commit: chore: snapshots.py polish — hoisted imports + strict kind + mtime doc (T105).

Task 106: search.py polish

Files:

Modify: chat/web/search.py
Modify: tests/test_search_ux.py (1 new test)

Spec (2 sub-fixes):

Hardcoded k=50 → module constant: extract DEFAULT_SEARCH_K = 50 at module level, so the limit can be tuned in one place instead of at each call site.
N+1 lookup batching: GET /search?q=... currently calls get_bot(conn, owner_id), get_chat(conn, chat_id), get_scene(conn, scene_id) per result row (worst case 50×3 = 150 individual queries). Batch via WHERE id IN (...) queries: collect distinct ids first, fetch in 3 batched queries, then map back per row.
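The batching step can be sketched as one generic helper called three times. The table names (bots, chats, scenes) and the fetch_by_ids name are assumptions made for illustration; the connection is assumed to use sqlite3.Row as its row factory.

```python
import sqlite3
from typing import Iterable, Optional

# Only these tables may be interpolated into SQL; ids are always bound.
_LOOKUP_TABLES = frozenset({"bots", "chats", "scenes"})


def fetch_by_ids(
    conn: sqlite3.Connection, table: str, ids: Iterable[Optional[int]]
) -> dict[int, sqlite3.Row]:
    """Fetch all rows of `table` whose id is in `ids` with a single query."""
    if table not in _LOOKUP_TABLES:
        raise ValueError(f"unexpected lookup table: {table!r}")
    # Deduplicate and drop NULL foreign keys (e.g. results with no scene).
    distinct = sorted({i for i in ids if i is not None})
    if not distinct:
        return {}
    placeholders = ",".join("?" * len(distinct))
    rows = conn.execute(
        f"SELECT * FROM {table} WHERE id IN ({placeholders})", distinct
    ).fetchall()
    return {row["id"]: row for row in rows}
```

A route would collect the ids from the result rows, call fetch_by_ids once per table, and then look each row's bot, chat, and scene up in the returned dicts. With k capped at 50 the placeholder count stays far below SQLite's bound-parameter limit.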

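Test: test_search_results_use_batched_lookups: mock get_bot/get_chat/get_scene and assert none of them is called per row (at most once each, or not at all if the batched queries replace them). Alternative: count the SQL statements issued during the request with sqlite3's Connection.set_trace_callback and assert the count stays constant as the number of results grows; this is more robust than a wall-clock timing assertion.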

Commit: perf: search.py N+1 batching + k constant extraction (T106).

Task 107: embeddings.py timeout_s fallback-path logging

Files:

Modify: chat/services/embeddings.py
Modify: tests/test_embeddings.py (1 new test)

Spec:

When model != DEFAULT_EMBEDDING_MODEL and falls through to fallback (zero-vector with model="fallback"), log a warning so misconfigured callers (e.g., a Phase 4.5+ caller pointing at a real model that doesn't exist) don't silently degrade.

if model != DEFAULT_EMBEDDING_MODEL:
    _log.warning(
        "generate_embedding: non-default model %r returned fallback "
        "(model client.embed() not yet implemented in Phase 4.5+); "
        "downstream search will degrade silently. Configure a supported model.",
        model,
    )
    return EmbeddingResult(...)  # fallback

The Phase 4 default path (model == DEFAULT_EMBEDDING_MODEL → pseudo-embedding) is silent; only non-default models trigger the warning.

Test: test_generate_embedding_non_default_model_logs_warning — call with model="real-model"; capture log via caplog; assert the warning message appears.

Commit: chore: embeddings.py warns on fallback for non-default models (T107).

Task 108: scene-close-on-cancel UX revisit

Files:

Modify: tests/test_turn_flow.py (extend the existing pin test added in Phase 2.5 T74.3 OR add a new one)
Optionally modify: chat/web/turns.py if a real bug surfaces during investigation

Spec:

This carry-over has been pending since Phase 2.5 T74.3. The pinned behavior: scene close fires even when the primary turn is cancelled mid-stream, because detect_scene_close consults user prose (fully present at cancel time), not bot output.

Action:

Re-investigate by reading the post_turn cancellation path. Confirm the rationale still holds (it should — nothing about the close-detection logic changed in Phase 3 or 4).
Strengthen the regression test in tests/test_turn_flow.py (the existing test_cancelled_turn_still_closes_scene_when_user_prose_signals_close). Add an assertion that the user prose IS present at the moment scene_close_decision fires (even though the bot output isn't).
If investigation surfaces an actual UX issue (e.g., the close fires too eagerly on prose like "fade out... actually wait"), this becomes a real fix — but default action is documentation-only.

Default outcome: add a docstring comment to the post_turn close-detection branch explaining the rationale. No behavioral change.

Test (extend existing): assert ordering — scene_closed event lands AFTER the user_turn event but BEFORE any potential assistant_turn (which is cancelled). Pin the contract.

Commit: chore: scene-close-on-cancel — strengthen regression test + document rationale (T108).

Wave 2 — Schema migration (single)

Task 109: 0014 schema migration

Files:

Create: chat/db/migrations/0014_phase45_schema.sql
Modify: chat/state/memory.py or chat/services/memory_write.py (populate the new event_id column on memory_written)
Modify: tests/test_world.py (bump schema_version assertion to 14)
Modify: tests/test_memory_write.py (assert event_id populated)

Spec:

Two schema changes bundled into a single migration:

embeddings.memory_id FK gets ON DELETE CASCADE (T88 review nit). SQLite cannot change an existing foreign-key constraint in place (there is no ALTER TABLE ... ALTER COLUMN), so the standard pattern is a table rebuild: create the new table, copy the data, drop the old table, rename the new one into place, then recreate indices. Alternatively, since the table is recent (Phase 4 added it) and the change is purely defensive, document it as "WONTFIX in 4.5; deindex events remain the only deletion path; ON DELETE CASCADE remains a Phase 5 candidate when we do a broader migration cleanup". Choose pragmatically and state in the commit message which option was taken.
Add memories.event_id INTEGER column (NULL allowed for backward compat) referencing event_log.id. This is the foundation for T111's deep-linking from cross-chat search results to specific turns. Migration adds the column; the projector for memory_written populates it from the event id when projecting.

Production code change: in the memory_written projector handler (in chat/state/memory.py or wherever it lives), populate the new event_id column with the projecting event's id. The Event object has id available in the projector context.
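If the CASCADE change is applied rather than deferred, the migration and the projector change could look roughly like this. The memory_id and vector_json columns come from the embeddings backfill query later in this plan; any other columns or indices the real tables have are omitted here and would need to be carried over. The project_memory_written function and its payload shape are hypothetical.

```python
import sqlite3

# Sketch of 0014: add memories.event_id, rebuild embeddings with CASCADE.
MIGRATION_0014 = """
PRAGMA foreign_keys = OFF;
BEGIN;
ALTER TABLE memories ADD COLUMN event_id INTEGER REFERENCES event_log(id);
CREATE TABLE embeddings_new (
    memory_id INTEGER PRIMARY KEY REFERENCES memories(id) ON DELETE CASCADE,
    vector_json TEXT NOT NULL
);
INSERT INTO embeddings_new (memory_id, vector_json)
    SELECT memory_id, vector_json FROM embeddings;
DROP TABLE embeddings;
ALTER TABLE embeddings_new RENAME TO embeddings;
COMMIT;
PRAGMA foreign_keys = ON;
"""


def project_memory_written(
    conn: sqlite3.Connection, event_id: int, payload: dict
) -> int:
    """Insert a memory row stamped with the id of the projecting event."""
    cur = conn.execute(
        "INSERT INTO memories (pov_summary, event_id) VALUES (?, ?)",
        (payload["pov_summary"], event_id),
    )
    return cur.lastrowid
```

Rows that existed before the migration keep event_id NULL, which is the backward-compatible case the tests below pin.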

Tests:

test_schema_version_after_migration_is_14 (rename + bump from 13).
test_memory_written_populates_event_id — append memory_written; project; query memories table; assert event_id is the projecting event's id.
(Backward compat) older memories from existing seed data have NULL event_id — the column is nullable.

Commit: feat: 0014 schema — embeddings FK CASCADE (deferred or applied) + memories.event_id column (T109).

Wave 3 — Drawer Phase 4.5 bundle (single)

Task 110: drawer polish + bulk significance re-rate

Files:

Modify: chat/web/drawer.py
Modify: chat/templates/_drawer.html
Create: chat/templates/_delete_impact_modal.html (extracted partial)
Modify: chat/state/manual_edit.py (potentially — if bulk re-rate emits a new manual_edit kind)
Modify: tests/test_drawer_phase4.py (extend with 4-5 new tests)

Spec (4 sub-fixes, 4 commits):

event_id <= 0 guard in delete_turn (T98 nit): currently silently rewinds everything if event_id is 0. Add if event_id <= 0: raise HTTPException(400, "...").
html.escape() on delete-impact modal (T98 nit): the rendered HTML in compute_delete_impact output is built via raw f-strings from model-controlled strings. Wrap user-controllable fields with html.escape(). This is defense-in-depth: the output is currently safe, but if event payload fields ever appear in descriptions, escaping would prevent XSS.
Extract delete-impact modal HTML to a Jinja partial: create chat/templates/_delete_impact_modal.html; render via templates.TemplateResponse(...) instead of f-string concatenation. Inherits Jinja2 autoescape automatically. Tests use the existing TestClient pattern.
Bulk significance re-rate (T98.2 deferral): drawer panel showing memory significance distribution per chat. New POST route /chats/{chat_id}/drawer/memory/significance/bulk accepting {level_from, level_to} form fields. Updates ALL memories in the chat at level_from to level_to via a sequence of manual_edit events (one per memory — preserves the audit trail).
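The bulk re-rate in the last sub-fix could be structured as below. This is a sketch under assumptions: the memories columns (chat_id, significance), the event_log columns (kind, payload), and the manual_edit payload shape are all illustrative, and the two inline statements stand in for the real append_and_apply path.

```python
import json
import sqlite3


def bulk_rerate_significance(
    conn: sqlite3.Connection, chat_id: int, level_from: int, level_to: int
) -> int:
    """Move every memory in a chat from level_from to level_to.

    Emits one manual_edit event per memory so the audit trail records each
    change individually. Returns the number of memories re-rated.
    """
    if level_from == level_to:
        return 0  # nothing to change; avoid logging no-op edits
    # Materialize the ids first so the loop does not read while it writes.
    memory_ids = [
        row[0]
        for row in conn.execute(
            "SELECT id FROM memories WHERE chat_id = ? AND significance = ? "
            "ORDER BY id",
            (chat_id, level_from),
        )
    ]
    for memory_id in memory_ids:
        payload = {
            "target": "memory",
            "memory_id": memory_id,
            "field": "significance",
            "old": level_from,
            "new": level_to,
        }
        # Stand-in for append_and_apply: append the event, then project it.
        conn.execute(
            "INSERT INTO event_log (kind, payload) VALUES ('manual_edit', ?)",
            (json.dumps(payload),),
        )
        conn.execute(
            "UPDATE memories SET significance = ? WHERE id = ?",
            (level_to, memory_id),
        )
    return len(memory_ids)
```

Recording the old level in each payload keeps every edit reversible from the log alone.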

Tests:

test_delete_turn_with_event_id_zero_returns_400.
test_delete_impact_modal_uses_jinja_partial (assert response renders the partial template; verify with assert b"<div class=\"delete-impact-modal\">" in response.content or similar).
test_delete_impact_modal_escapes_user_controllable_strings — seed an event with a payload containing <script> in a description-bound field; render preview; assert it appears HTML-escaped.
test_bulk_significance_re_rate_emits_manual_edit_per_memory — seed 5 memories at significance 0; bulk re-rate to 2; assert 5 manual_edit events landed.

Commits (4):

fix: drawer delete_turn guards event_id <= 0 (T110.1)
fix: drawer delete-impact modal HTML escapes user-controllable fields (T110.2)
refactor: drawer delete-impact modal extracted to Jinja partial (T110.3)
feat: drawer bulk significance re-rate per chat (T110.4)

Wave 4 — Search UX enhancements (single)

Task 111: FTS highlighting + deep-link to turn

Files:

Modify: chat/services/cross_chat_search.py
Modify: chat/web/search.py
Modify: chat/templates/search.html
Modify: tests/test_search_ux.py

Spec (2 sub-fixes, 2 commits):

FTS highlighting via snippet() (T100 nit): replace the pov_summary column in search_all_memories's SELECT with snippet(memories_fts, 0, '<mark>', '</mark>', '…', 32) to return a highlighted snippet around the match. Caution: snippet() inserts the marker strings verbatim and does not HTML-escape the indexed text, so rendering its output via |safe would let any markup stored in a memory reach the page. Either pass non-HTML sentinel markers to snippet(), escape the whole string, and only then swap the sentinels for <mark> tags, or escape the content before it is indexed.
Deep-link to turn via memories.event_id (T100 nit + T109 dependency): now that memories.event_id exists (from T109), each search result row knows the originating event id. The chat page uses turn-id stamping (Phase 3.5 T86 added id="turn-{event_id}"). Build result links as /chats/{chat_id}#turn-{event_id}. The chat page DOM scrolls to the anchor on load (browser default).
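FTS5's snippet() inserts its marker strings verbatim and leaves the surrounding text unescaped, so one cautious way to highlight matches is to ask SQLite for control-character sentinels and convert them after escaping. The sketch below assumes the FTS table is named memories_fts with the summary in column 0, as in the spec; render_snippet and the sentinel constants are illustrative names.

```python
import html

# Control characters passed to snippet() in place of literal <mark> tags.
# html.escape() leaves them untouched, so they survive the escaping step.
MARK_OPEN = "\x02"
MARK_CLOSE = "\x03"

# char(2) and char(3) produce the same sentinels on the SQLite side.
SNIPPET_SQL = (
    "SELECT snippet(memories_fts, 0, char(2), char(3), '…', 32) AS snip "
    "FROM memories_fts WHERE memories_fts MATCH ?"
)


def render_snippet(raw: str) -> str:
    """Escape a sentinel-marked snippet, then turn sentinels into <mark>."""
    escaped = html.escape(raw)
    return escaped.replace(MARK_OPEN, "<mark>").replace(MARK_CLOSE, "</mark>")
```

Only the output of render_snippet would then be passed through |safe in the template; stored markup such as a script tag arrives as inert text.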

Tests:

test_search_results_include_fts_snippet_with_highlight — seed memory with text containing "rabbit"; search for "rabbit"; assert response body contains <mark>rabbit</mark> (or whatever marker the snippet uses).
test_search_result_link_includes_turn_anchor — seed memory with known event_id; search; assert link href contains #turn-{event_id}.

Commits (2):

feat: cross-chat search FTS snippet highlighting (T111.1)
feat: cross-chat search deep-links to turn via memories.event_id (T111.2)

Wave 5 — Real embedding model (single)

Task 112: Real embedding model swap

Files:

Modify: chat/llm/client.py (Protocol: add async embed(text, *, model) -> list[float] method)
Modify: chat/llm/featherless.py (FeatherlessClient — implement embed against Featherless /v1/embeddings endpoint OR equivalent)
Modify: chat/llm/mock.py (MockLLMClient — accept canned embedding vectors)
Modify: chat/services/embeddings.py (route non-default model through client.embed())
Modify: chat/config.py (add embedding_model: str setting; default to current pseudo)
Modify: scripts/backfill_embeddings.py (re-embed-all option for model swaps)
Modify: tests/test_embeddings.py + tests/test_llm_mock.py + tests/test_featherless.py (if exists)

Spec:

Phase 4 ships a SHA-256 pseudo-embedding (deterministic but semantically meaningless). T112 wires the path for a real embedding model.

Steps:

Extend LLMClient Protocol with async def embed(self, text: str, *, model: str) -> list[float].

Implement on FeatherlessClient: call the Featherless OpenAI-compatible /v1/embeddings endpoint:

response = await self._http.post(
    "/v1/embeddings",
    json={"model": model, "input": text},
    headers={"Authorization": f"Bearer {self._api_key}"},
)
data = response.json()
return data["data"][0]["embedding"]

Handle rate limits (existing 2-conn semaphore covers this).

Implement on MockLLMClient: embed pops a canned vector from a new canned_embeddings queue. Tests configure this queue.
Update generate_embedding: when model != DEFAULT_EMBEDDING_MODEL, call client.embed(text, model=model) instead of falling through to fallback. Wrap in try/except — failures fall back to zero vector (existing fallback path).
Settings: add embedding_model: str = "pseudo-sha256-384" to Settings. App reads this at startup; the embedding worker (chat/services/embedding_worker.py) passes it through.
Backfill script: add --re-embed-all flag that walks ALL memories (regardless of existing embeddings_meta rows) and re-embeds with the configured model. Useful for swapping models.
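The protocol, mock, and fallback steps above fit together roughly as follows. This is a simplified sketch: the class names are hypothetical, generate_embedding here returns a plain (vector, model) tuple instead of the real EmbeddingResult, and the default pseudo-embedding path is left out.

```python
import asyncio
from collections import deque
from typing import Iterable, Protocol

EMBEDDING_DIM = 384  # matches the "pseudo-sha256-384" default setting


class EmbeddingClient(Protocol):
    async def embed(self, text: str, *, model: str) -> list[float]: ...


class MockEmbeddingClient:
    """Test double that pops canned vectors in FIFO order."""

    def __init__(self, canned_embeddings: Iterable[list[float]] = ()):
        self.canned_embeddings = deque(canned_embeddings)

    async def embed(self, text: str, *, model: str) -> list[float]:
        if not self.canned_embeddings:
            raise RuntimeError("no canned embedding queued")
        return self.canned_embeddings.popleft()


async def generate_embedding(
    client: EmbeddingClient, text: str, *, model: str
) -> tuple[list[float], str]:
    """Return (vector, model_used); fall back to a zero vector on failure."""
    try:
        vector = await client.embed(text, model=model)
    except Exception:
        # Existing fallback path: zero vector tagged with model="fallback".
        return [0.0] * EMBEDDING_DIM, "fallback"
    return vector, model
```

A test can queue a vector with MockEmbeddingClient([[0.1, 0.2]]) and drive the coroutine with asyncio.run; an empty queue exercises the fallback branch.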

Tests:

test_embed_routes_to_client_when_non_default_model — mock client with canned vector; call generate_embedding(model="bge-small-en-v1.5"); assert vector matches the canned response.
test_embed_falls_back_on_client_failure — mock client to raise; assert returns zero vector with model="fallback".
test_mock_llm_client_embed_pops_canned.
test_featherless_embed_calls_correct_endpoint (if there's an existing featherless test pattern; otherwise mock the HTTP layer).

Commits:

feat: LLMClient Protocol gains embed() method (T112.1)
feat: FeatherlessClient.embed() against /v1/embeddings (T112.2)
feat: generate_embedding routes non-default models through client.embed (T112.3)
feat: backfill_embeddings --re-embed-all flag for model swaps (T112.4)

Wave 6 — Branching read-side filter (single, BIG)

Task 113: Branching read-side filter

Files (cross-cutting):

Modify: chat/services/turn_common.py::read_recent_dialogue — filter events to active branch's range
Modify: chat/services/scene_summarize.py::_read_recent_dialogue (similar)
Modify: chat/state/memory.py::search_memories — memories should be filtered to active branch (memories.event_id from T109 enables this)
Modify: chat/state/branches.py — add helper active_branch_event_ids(conn) -> tuple[int, int] returning (origin, head)
Add tests across multiple files
Modify: tests/test_branching.py — add cross-feature tests

Spec:

Phase 4 T89 + T94 shipped branching as metadata-only (the table tracks branches; the drawer UI can switch). But event readers DON'T consult is_active — they read the entire event_log. So switching branches has no functional effect.

T113 wires the filter:

Helper active_branch_event_ids(conn) -> tuple[int, int]: returns (origin_event_id, head_event_id) for the currently active branch. For "main" with origin=0 + head=N, returns (0, N) meaning "all events visible".
Apply filter in every event reader that returns historical state:
- read_recent_dialogue: WHERE clause adds id BETWEEN ? AND ? (the active branch's range).
- search_memories: WHERE clause adds m.event_id BETWEEN ? AND ? (uses T109's column).
- scene_summarize._read_recent_dialogue: same as turn_common.
- Other readers TBD — grep for event_log SELECT patterns and audit each one.
Branches that diverge: suppose branch B is created from event 10 and then accumulates events 11-15, while main also accumulates events of its own after event 10. Because branches share event_log ids globally, the two timelines' id ranges overlap. This is accepted for the first cut: event reads filter by id <= active_branch.head_event_id, so each branch's head defines which ids are visible.
Events written under branch B carry an implicit branch tag — but the event_log table has no branch_id column today. T113 punts on cross-branch event writes (they all land in the global log) and relies on the head_event_id filter to scope reads. This is a Phase 4.5+ first cut; full branch-isolated event_log is Phase 5+.

Edge cases:

Active branch has head_event_id = 0 (just created): readers return empty.
No active branch: readers fall through to "all events visible" (defensive).
Switching branches mid-flight: each read_recent_dialogue call re-queries active_branch, so it's always current. No caching.
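The helper and one filtered reader could be sketched as follows. The branches columns (origin_event_id, head_event_id, is_active) follow the names used above, while the event_log layout is simplified to (id, payload). One deviation from the spec is flagged here: the helper returns None when no branch is active, rather than a tuple, so the "all events visible" fall-through is explicit.

```python
import sqlite3
from typing import Optional


def active_branch_event_ids(conn: sqlite3.Connection) -> Optional[tuple[int, int]]:
    """Return (origin_event_id, head_event_id) of the active branch, if any."""
    row = conn.execute(
        "SELECT origin_event_id, head_event_id FROM branches "
        "WHERE is_active = 1 LIMIT 1"
    ).fetchone()
    return (row[0], row[1]) if row else None


def read_recent_dialogue(
    conn: sqlite3.Connection, limit: int = 20
) -> list[tuple[int, str]]:
    """Return up to `limit` most recent visible events, oldest first."""
    bounds = active_branch_event_ids(conn)  # re-queried on every call
    if bounds is None:
        # Defensive fall-through: no active branch means no filtering.
        rows = conn.execute(
            "SELECT id, payload FROM event_log ORDER BY id DESC LIMIT ?",
            (limit,),
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT id, payload FROM event_log "
            "WHERE id BETWEEN ? AND ? ORDER BY id DESC LIMIT ?",
            (*bounds, limit),
        ).fetchall()
    rows.reverse()
    return rows
```

With head_event_id = 0 the BETWEEN range matches no rows (ids start at 1), which gives the "just created" edge case without special handling.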

Tests: 5+ minimum.

test_read_recent_dialogue_respects_active_branch_head — seed 10 events; active branch head = 5; assert only first 5 returned.
test_search_memories_respects_active_branch_head — same.
test_branch_switch_changes_visible_events — switch branches; immediately read; assert different result sets.
test_main_branch_with_head_zero_returns_empty — defensive.
test_no_active_branch_falls_through_to_all_events — defensive.

Commit: feat: branching read-side filter — event readers consult active branch range (T113).

This is the largest task in Phase 4.5. Estimate 200-400 lines across multiple files. Implementer should split commits if it helps clarity (one per affected reader).

Wave 7 — Lifecycle rollback in regenerate (single)

Task 114: Lifecycle rollback

Files:

Modify: chat/services/regenerate.py
Modify: chat/db/migrations/0014_phase45_schema.sql (T109's migration) — add column? OR
Add new migration — see decision below
Modify: tests in tests/test_regenerate.py

Spec:

Phase 3.5 T83.4 shipped a warning log when regenerate detects un-rolled-back lifecycle transitions. T114 implements actual rollback.

Schema decision:

Option A: extend lifecycle event payloads with triggered_by_assistant_turn_id (no schema change needed, just a payload convention). Production code (T61 turn flow) populates it when emitting event_started/event_completed/event_cancelled. Existing events lack the field, so rollback skips them with a debug log.

Option B: add a column to event_log for stronger invariants. Significant migration cost.

Recommended: Option A. Safer, no migration, backward compatible (older events skip rollback). Document in commit body.

Rollback semantics:

When regenerate detects lifecycle events triggered by the superseded turn:

event_started → emit event_cancelled (or a NEW event_started_undone event kind that reverts status to "planned") with the same event_id.
event_completed → emit event_uncompleted (NEW event kind that reverts status from "completed" to "active").
event_cancelled → emit event_uncancelled (reverts to prior status — which we'd need to track; or simpler: emit event_started again to restore "active").

Simpler approach (recommended): add ONE new event kind event_status_reverted with payload {event_id, prior_status}. The projector sets events.status = prior_status for the event_id. Rollback emits this event for each affected lifecycle transition, looking up the prior status from the row's history (via event_log scan) or accepting it as a payload field.

Production code change: in chat/web/turns.py::post_turn (and chat/services/regenerate.py), when emitting event_started/event_completed/event_cancelled, populate triggered_by_assistant_turn_id: <id> in the payload. Forward-only — older code doesn't need updating.
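The recommended single-event approach might be sketched like this. Table layouts are simplified to event_log(kind, payload) and events(id, status), the function name rollback_lifecycle is hypothetical, and the prior status is taken from an optional prior_status payload field or inferred from the transition kind; the inline INSERT and UPDATE stand in for the real append-and-project path.

```python
import json
import sqlite3

# Status each transition moved away from, per the rollback semantics above.
_PRIOR_STATUS = {"event_started": "planned", "event_completed": "active"}


def rollback_lifecycle(conn: sqlite3.Connection, superseded_turn_id: int) -> int:
    """Emit event_status_reverted for transitions the given turn triggered.

    Transitions are undone newest-first, so an event that was started and
    then completed by the same turn ends up back at "planned".
    """
    rows = conn.execute(
        "SELECT kind, payload FROM event_log "
        "WHERE kind IN ('event_started', 'event_completed', 'event_cancelled') "
        "ORDER BY id DESC"
    ).fetchall()
    reverted = 0
    for kind, raw in rows:
        payload = json.loads(raw)
        if payload.get("triggered_by_assistant_turn_id") != superseded_turn_id:
            continue  # includes legacy events that lack the back-reference
        prior = payload.get("prior_status") or _PRIOR_STATUS.get(kind)
        if prior is None:
            continue  # e.g. event_cancelled with no recorded prior status
        revert = {"event_id": payload["event_id"], "prior_status": prior}
        conn.execute(
            "INSERT INTO event_log (kind, payload) "
            "VALUES ('event_status_reverted', ?)",
            (json.dumps(revert),),
        )
        # Projector effect: restore the status the event had before.
        conn.execute(
            "UPDATE events SET status = ? WHERE id = ?",
            (prior, payload["event_id"]),
        )
        reverted += 1
    return reverted
```

Scanning the whole log is acceptable for a sketch; a real implementation would presumably bound the scan to events after the superseded turn.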

Tests: 3 minimum.

test_regenerate_rolls_back_event_started_from_superseded_turn — seed an event; play a turn that starts it; regenerate; assert event_status_reverted event landed with prior_status="planned" and the events row is back to "planned".
test_regenerate_rolls_back_event_completed_to_active — same but completed → active rollback.
test_regenerate_skips_events_without_back_reference — older events without triggered_by_assistant_turn_id are not rolled back (debug log). Pin the backward-compat behavior.

Commits:

feat: lifecycle events carry triggered_by_assistant_turn_id back-reference (T114.1)
feat: event_status_reverted event kind + projector handler (T114.2)
feat: regenerate rolls back lifecycle transitions on supersede (T114.3)

Wave 8 — sqlite-vec swap (single, ENVIRONMENTAL)

Task 115: sqlite-vec swap (optional)

Files:

Create: chat/db/migrations/0015_vec0_virtual_tables.sql
Modify: chat/db/connection.py (load extension on every connection)
Modify: chat/services/vector_search.py (rewrite to use vec0 MATCH instead of pure-Python cosine)
Modify: chat/state/embeddings.py (writer needs to populate vec0 table)
Modify: pyproject.toml (add sqlite-vec dependency)

Pre-flight:

This task REQUIRES one of:

Python rebuilt with --enable-loadable-sqlite-extensions (pyenv reinstall).
apsw migration of chat/db/connection.py.

If neither is feasible at the time of execution: SKIP THIS TASK and document the deferral in T118 docs sweep. The other 15 Phase 4.5 tasks ship without it.

Spec:

Migration 0015_vec0_virtual_tables.sql:

CREATE VIRTUAL TABLE embeddings_vec USING vec0(
    memory_id INTEGER PRIMARY KEY,
    embedding FLOAT[384]
);
-- Backfill from existing JSON embeddings table.
INSERT INTO embeddings_vec (memory_id, embedding)
SELECT memory_id, vec_f32(vector_json) FROM embeddings;

chat/db/connection.py loads sqlite_vec extension on every connection:

import sqlite_vec
def open_db(...):
    conn = sqlite3.connect(...)
    conn.enable_load_extension(True)
    sqlite_vec.load(conn)
    conn.enable_load_extension(False)
    ...

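Rewrite vector_search.py to use embeddings_vec MATCH ? syntax with a k = ? constraint. Note that vec0 selects the k nearest rows before the joined owner and witness filters apply, so k may need to be larger than the final LIMIT to leave enough rows after filtering. witness_<role> stands for the role-specific witness column: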

SELECT m.id, m.pov_summary, m.significance, e.distance
FROM embeddings_vec e
JOIN memories m ON m.id = e.memory_id
WHERE e.embedding MATCH ? AND k = ?
  AND m.owner_id = ?
  AND m.witness_<role> = 1
ORDER BY e.distance ASC
LIMIT ?

HNSW note: T115 ships vec0's default flat (brute-force) scan, which is sufficient for fewer than a few thousand memories. Whether the pinned sqlite-vec release offers HNSW or another approximate index must be verified before relying on it; document the upgrade path in CLAUDE.md in case memory counts outgrow the flat scan.
Old embeddings JSON table: keep alongside embeddings_vec (data redundancy is fine; the JSON table is the source of truth and embeddings_vec is the index). Backfill on migration. Keep the embedding_indexed projector populating both.
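
The dual-write rule can be sketched as follows. This is an illustration under stated assumptions, not the projector's actual code: the function name `index_embedding` is hypothetical, the column names follow the migration above, and the demo uses an ordinary table as a stand-in for the vec0 virtual table so that it runs without the extension installed.

```python
import json
import sqlite3
import struct


def index_embedding(conn, memory_id, vector):
    """Write one embedding to the JSON table and to the vector index together."""
    blob = struct.pack(f"<{len(vector)}f", *vector)
    with conn:  # single transaction: both writes commit or neither does
        conn.execute(
            "INSERT OR REPLACE INTO embeddings (memory_id, vector_json) VALUES (?, ?)",
            (memory_id, json.dumps(vector)),
        )
        # Delete-then-insert avoids relying on upsert support in the index table.
        conn.execute("DELETE FROM embeddings_vec WHERE memory_id = ?", (memory_id,))
        conn.execute(
            "INSERT INTO embeddings_vec (memory_id, embedding) VALUES (?, ?)",
            (memory_id, blob),
        )


def demo():
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE embeddings (memory_id INTEGER PRIMARY KEY, vector_json TEXT)"
    )
    # Stand-in for the vec0 virtual table (assumption for this demo only).
    conn.execute(
        "CREATE TABLE embeddings_vec (memory_id INTEGER PRIMARY KEY, embedding BLOB)"
    )
    index_embedding(conn, 1, [0.1, 0.2])
    index_embedding(conn, 1, [0.3, 0.4])  # re-index the same memory
    return conn


if __name__ == "__main__":
    print(demo().execute("SELECT COUNT(*) FROM embeddings_vec").fetchone())
```

Because the JSON table remains the source of truth, the index table can always be rebuilt from it with the same backfill statement the migration uses.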

Tests: update tests/test_vector_search.py only where it depends on implementation details (for example, setup that assumes pure-Python cosine). The observable contract is unchanged, so all 5 existing tests should pass post-swap.

Commit: feat: sqlite-vec swap (vec0 virtual tables + MATCH-based search) (T115).

Wave 9 — Polish (parallel, 3 tasks)

Task 116: Structured test-fixture builder

Files:

Create: tests/fixtures.py (or extend tests/conftest.py)
Modify: existing test files that use brittle canned-queue arrays (selectively)

Spec:

Phase 3 carry-over. Tests across test_turn_flow.py, test_meanwhile_turn_flow.py, test_phase3_integration.py, test_phase4_integration.py use positional canned-response arrays for MockLLMClient. Adding a new classifier call to a code path requires updating canned arrays in many tests.

Solution: structured fixture builder that lets tests declare their classifier expectations by name, not position:

# tests/fixtures.py
import json

class CannedQueue:
    def __init__(self):
        self._queue = []
    # Each method appends one canned response to self._queue and returns self,
    # so calls can be chained as in the usage example.
    def parse_turn(self, **fields): ...
    def narrative(self, text: str): ...
    def state_update(self, **fields): ...
    def detect_scene_close(self, should_close: bool): ...
    def detect_event_transitions(self, transitions: list[dict]): ...
    def summarize_scene(self, summary: str, **fields): ...
    def detect_threads(self, candidates: list[dict]): ...
    # ... one method per classifier service
    def build(self) -> list[str]:
        return [json.dumps(item) for item in self._queue]

Usage:

def test_post_turn_with_event_transition(...):
    canned = (
        CannedQueue()
            .parse_turn(intent="narrative")
            .narrative("BotA speaks.")  # narrative is a stream, but for simplicity treat it like a canned response
            .state_update(affinity_delta=0, trust_delta=0)
            .state_update(affinity_delta=0, trust_delta=0)
            .detect_event_transitions([{"event_id": "evt_1", "new_status": "completed"}])
            .detect_scene_close(should_close=False)
            .build()
    )
    mock = MockLLMClient(canned=canned)
    # ...
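
For reference, a trimmed-down but runnable version of the builder might look like the sketch below. The payload shapes (for example `{"should_close": ...}`) and the `_push` and `kinds` helpers are assumptions made for illustration; the real classifier schemas should be taken from the services themselves.

```python
import json


class CannedQueue:
    """Collects canned classifier responses in call order, declared by name."""

    def __init__(self):
        self._queue = []

    def _push(self, kind, payload):
        # Store the classifier name next to its payload so a test failure can
        # name the call; returning self is what makes chaining possible.
        self._queue.append((kind, payload))
        return self

    def parse_turn(self, **fields):
        return self._push("parse_turn", fields)

    def state_update(self, **fields):
        return self._push("state_update", fields)

    def detect_scene_close(self, should_close: bool):
        return self._push("detect_scene_close", {"should_close": should_close})

    def kinds(self):
        """Names of the queued classifier calls, in order."""
        return [kind for kind, _ in self._queue]

    def build(self):
        """JSON strings in call order, ready to hand to the mock client."""
        return [json.dumps(payload) for _, payload in self._queue]


if __name__ == "__main__":
    queue = CannedQueue().parse_turn(intent="narrative").detect_scene_close(False)
    print(queue.kinds(), queue.build())
```

Keeping the classifier name alongside each payload means a new classifier call in a code path shows up as a named mismatch rather than a shifted array index.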

Migration scope: don't migrate ALL existing tests at once — that's a separate massive refactor. Instead, ship the fixture builder + migrate 2-3 representative tests as proof of concept. Document the migration path in the fixture's docstring.

Tests: the fixture builder itself doesn't need extensive testing — it's just a builder. Add 1-2 sanity tests that the JSON output matches expected shapes.

Commit: test: structured CannedQueue fixture builder for classifier mocks (T116).

Task 117: Phase 4.5 cross-feature integration tests

Files:

Create: tests/test_phase45_integration.py

Spec:

End-to-end multi-feature flows specific to Phase 4.5 changes. 5 tests minimum.

Real embedding swap + retrieval — configure embedding_model="bge-small-en-v1.5" (mocked); write a memory; backfill or wait for worker; assert vector search returns the memory via client.embed-derived vector (not pseudo).
Branching read-side filter end-to-end — create a branch from turn 5; switch; play 3 turns on the branch; switch back to main; assert main's recent dialogue is missing the branch turns (read filter respects active branch's head).
Lifecycle rollback — start an event via a turn; regenerate that turn; assert lifecycle reverted (event back to "planned").
Search deep-link — write memories; search; click a result; verify the chat page renders with the right turn anchored (assert via TestClient response — either the browser anchor OR a server-side scroll-to-anchor mechanism).
Bulk significance re-rate end-to-end — seed 5 memories at significance 0; bulk re-rate via drawer; verify significance histogram updates.

Commit: test: phase 4.5 cross-feature integration coverage (T117).

Task 118: Phase 4.5 documentation update

Files:

Modify: CLAUDE.md
Modify: docs/plans/2026-04-26-v1-requirements-design.md (annotate §13 Phase 4 entries — though they're already shipped per Phase 4 T102)

Spec:

Mirror the Phase 3.5 / 2.5 status sections. Document:

All shipped items per task (T103–T117).
Empty out the Phase 4.5 / 5 backlog (replace with single "All items shipped" line).
Add new "Phase 5 backlog" section if any Phase 4.5 reviews surfaced new follow-ups.

Phase 5 backlog candidates (default, if no new follow-ups discovered):

Vector index optimization (HNSW) when memory counts grow past flat-index feasibility.
Branch-isolated event_log (each branch has its own physical event_log range vs the current shared id space + head filter).
Embedding model swap migration tooling — when changing models, need to re-embed everything; T112 added --re-embed-all but a more orchestrated swap (drain old worker, re-seed all memories, swap config) is Phase 5+.
Real-time collaborative branching (multi-user) — out of scope for v1.
Avatars / portraits (multimodality) — deferred indefinitely per design §14.

Commit: docs: phase 4.5 status, prune backlog, capture phase 5 candidates (T118).

Wrap-up

After Wave 9 lands:

Run full suite on phase-4.5: should be ~430+ tests passing (413 from Phase 4 + ~20 new across Phase 4.5).
Manual smoke (recommended before opening the PR):
- Configure embedding_model="bge-small-en-v1.5" (or whatever real model is chosen); restart server; play a turn; verify embedding_indexed events use the real model and search returns semantically-relevant memories.
- Create a branch, switch, play turns, switch back — verify main's history is unaffected.
- Plan an event, complete it via a turn, regenerate that turn — verify event reverts to "planned".
- Use cross-chat search; click a result; verify it lands on the right turn in the chat page.
- Bulk re-rate a chat's significance distribution.
Push phase-4.5 to gitea.
Open PR phase-4.5 → main.

Notes for the controller running this plan

T115 (sqlite-vec swap) is environmental. If pre-flight fails (no rebuilt Python, no apsw), defer to Phase 5 and ship Phase 4.5 with the remaining 15 tasks. T118 docs sweep should note the deferral.
T112 (real embedding swap) assumes Featherless or similar exposes an /v1/embeddings endpoint. If not available, document the gap and ship the Protocol + Mock impl only (Featherless impl deferred). The pseudo path remains the default in that case — same as Phase 4.
T113 (branching read-side filter) is the riskiest task. Cross-cutting. Land it on a quiet branch, test thoroughly. If integration tests break in unexpected ways, bisect the affected reader and add coverage.
After each parallel wave, run a code-review subagent. Combined spec+quality acceptable for trivial tasks (T103–T108); separate spec + quality reviewers for big tasks (T112, T113, T114, T115).
Token-spend rough estimate: Phase 4.5 should be ~50% the size of Phase 4 (similar number of tasks, mostly smaller). Big tasks (T112, T113, T114) bring the per-task spend up but parallelism in Wave 1 + Wave 9 brings the wall-clock down.
DO NOT break existing v1/v2/v3/v3.5/v4 surface contracts. Every test file that was green at the start of Phase 4.5 must stay green at the end. The cross-feature integration tests (tests/test_phase4_integration.py, tests/test_phase3_integration.py) are particularly load-bearing.
