8 tasks across 5 waves consolidating the 15-item backlog tracked in CLAUDE.md (5 from Phase 1.5 cleanup + 10 from Phase 2.5/3). Items are grouped by file ownership so each wave stays file-disjoint:
- Wave 1 (parallel): open_db refactor, bot_reset orphan cleanup, LLM-merged group meta-summary
- Wave 2 (single): prompt.py polish (witness role parametric, single ACTIVITIES block, NICE trim documented)
- Wave 3 (single): drawer polish (deferred v1 edits, first-meeting gate, witness flag editing)
- Wave 4 (parallel): regenerate.py polish (SSE + interjection regenerate + stale-guest cleanup); turn-flow polish + new addressee service (classifier addressee + significance for interjection + scene-close-on-cancel pinned + stale-guest cleanup)
- Wave 5 (single): docs sweep

No schema migrations. Bundled tasks split into per-item sub-commits for clean review bisection. Uses task ids T68-T75 to avoid collision with Phase 3 plan (T49-T67) regardless of merge order.
Roleplay Engine — Phase 2.5 Cleanup Plan
For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task. Use the parallel-dispatch pattern documented under "Parallel-Execution Strategy" for waves that fan out to multiple subagents.
Goal: Burn down the combined Phase 1.5 + Phase 2.5/3 backlog tracked in CLAUDE.md §"Phase 1.5 cleanup backlog" and §"Phase 2.5 / 3 backlog". 15 follow-up items consolidated into 8 tasks (file-disjoint across waves) so several can run in parallel.
Architecture: No new architecture. Every change here is either a refactor (T68 open_db), a polish on an existing service/route (most tasks), or a UI affordance for state that already exists (T72 drawer edits, witness-flag editing). No new tables, no new event kinds, no schema migrations.
Tech Stack: Same as Phase 2. No new dependencies.
Source-of-truth references:
- Backlog list: CLAUDE.md §"Phase 1.5 cleanup backlog" (5 items) + §"Phase 2.5 / 3 backlog" (10 items) = 15 items total.
- Conventions: CLAUDE.md §"Behavioral defaults" + §"Phase 2 status".
- Phase 2 plan (style, TDD pattern, parallel-dispatch mechanics): 2026-04-26-v2-phase2-implementation.md.
- Phase 3 plan (in flight on a separate branch): 2026-04-26-v3-phase3-implementation.md.
When a task says "see §X", that's the requirements doc unless stated otherwise.
Pre-flight
Branch: create phase-2.5 from the latest main after Phase 2 has merged. If Phase 2 is still in PR review, branch off phase-2 directly:
# Option A: after main has phase-2 merged
git checkout main && git pull && git checkout -b phase-2.5
# Option B: continue from phase-2 directly
git checkout phase-2 && git pull && git checkout -b phase-2.5
Schema baseline: Phase 2 leaves the DB at version 8. Phase 2.5 adds no migrations. Schema-version assertion in tests/test_world.py stays at 8.
Relationship to Phase 3: Phase 3 (phase-3 branch, plan committed but not yet executed) uses task ids T49–T67. Phase 2.5 uses T68–T75 to avoid collision regardless of merge order.
Pinned non-negotiables (carried forward from Phases 1 + 2):
- State changes go through the event log. Use append_and_apply(conn, kind, payload) for the live path; apply_event only after a fresh append_event returning the new id.
- Witness filter every memory read at SQL level (hard WHERE constraint; never a soft signal).
- Edges are directed; botA → botB and botB → botA are independent records.
- Per-POV scene summaries: never write omniscient narration.
- TDD: every task starts with a failing test (or, for refactors that preserve behavior, a regression test that pins the existing contract before any change).
- One commit per task minimum. Tasks that bundle 3+ small backlog items SHOULD split commits within the task — one commit per backlog item — so review can bisect cleanly.
Verification before claiming done: Use superpowers-extended-cc:verification-before-completion — run the test command, paste actual output. Don't assume green.
Backlog item → task mapping
15 items consolidated into 8 tasks by file ownership (so each wave's tasks stay file-disjoint). Bundled tasks may split commits internally.
| # | Backlog item | Source | Task |
|---|---|---|---|
| 1 | open_db refactor with check_same_thread parameter | Phase 1.5 | T68 |
| 2 | Regenerate broadcasts turn_html over SSE | Phase 1.5 | T73 |
| 3 | bot_reset purges orphaned "you" activity rows | Phase 1.5 | T69 |
| 4 | Drawer edits for deferred v1 fields (edge_trust, edge_summary, memory pov_summary, knowledge_facts) | Phase 1.5 | T72 |
| 5 | NICE trim order in prompt assembly | Phase 1.5 | T71 |
| 6 | Interjection regenerate | Phase 2.5 | T73 |
| 7 | Classifier-based addressee detection | Phase 2.5 | T74 |
| 8 | LLM-merged group meta-summary | Phase 2.5 | T70 |
| 9 | First-meeting gate (drawer "have they met?" toggle) | Phase 2.5 | T72 |
| 10 | Witness flag editing in drawer | Phase 2.5 | T72 |
| 11 | Significance for interjection memories | Phase 2.5 | T74 |
| 12 | Stale guest reference defensive degrade removal | Phase 2.5 | T73 + T74 (split by file) |
| 13 | Scene close on cancel review | Phase 2.5 | T74 |
| 14 | Dual ACTIVITIES: block consolidation | Phase 2.5 | T71 |
| 15 | Witness role hardcode in prompt assembly | Phase 2.5 | T71 |
| — | Docs sweep — remove shipped items from CLAUDE.md | (this plan) | T75 |
Parallel-Execution Strategy
Same pattern as Phases 2 and 3. Five waves: parallel within each wave (file-disjoint), serial across waves. Cross-wave merges keep phase-2.5 green between dispatches.
How to dispatch a wave in parallel
Use the Agent tool with isolation: "worktree" so each subagent gets its own git worktree. (If the controlling session's working directory is not the chat repo, create worktrees manually with git worktree add .worktrees/<wave>-<task> -b <wave>/<task> phase-2.5 from inside the chat repo and pass the worktree path explicitly into each subagent prompt — that is the pattern Phase 2 used.)
In a single message, dispatch all tasks in the wave:
Agent({
description: "Wave 1 — T68 open_db refactor",
subagent_type: "general-purpose",
isolation: "worktree",
prompt: "<full task text from below>",
})
Agent({ ...T69... })
Agent({ ...T70... })
After a wave completes
- Each subagent returns its worktree path and commit SHA(s).
- Run a spec + code-quality reviewer subagent on each completed task. Combined review is acceptable for purely mechanical refactors (T68, T69); separate spec + quality reviewers for tasks that bundle multiple backlog items (T71, T72, T74).
- Merge the wave into phase-2.5 in any order (file-disjointness guarantees no conflict). Use --no-ff:
    git checkout phase-2.5
    for branch in <wave-branches>; do
      git merge --no-ff "$branch" -m "merge: <task description>"
    done
- Run the full test suite on the merged phase-2.5. If it's red, the wave's mutual-independence assumption was violated: bisect the offending pair, fix, re-merge.
- Push phase-2.5 to gitea so the work is durable before the next wave starts.
- Optionally clean up worktrees: git worktree remove .worktrees/<branch> and git branch -D <branch>.
Conflict prevention checklist (apply before dispatch)
For each parallel wave, verify the Files sections of all tasks have no overlapping paths. The waves below are designed to satisfy this; if you decide to add or merge tasks, re-check.
The hot files in this plan are: chat/web/turns.py, chat/services/regenerate.py, chat/web/drawer.py, chat/templates/_drawer.html, chat/services/prompt.py. Each is owned by exactly one task in this plan.
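The checklist above can be mechanized. The following is a minimal sketch, assuming each task's Files section has been transcribed by hand into a dict; the helper name find_overlaps and the dict shape are illustrative and not part of this plan's tooling.

```python
from itertools import combinations


def find_overlaps(files_by_task: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return (task_a, task_b, shared_paths) for every task pair touching the same file."""
    overlaps = []
    for (task_a, files_a), (task_b, files_b) in combinations(sorted(files_by_task.items()), 2):
        shared = files_a & files_b
        if shared:
            overlaps.append((task_a, task_b, shared))
    return overlaps


# Wave 4 as listed in this plan's Files sections.
wave_4 = {
    "T73": {"chat/services/regenerate.py", "tests/test_regenerate.py"},
    "T74": {
        "chat/web/turns.py",
        "chat/services/addressee.py",
        "tests/test_addressee.py",
        "tests/test_turn_flow.py",
    },
}

# An empty result means the wave is safe to dispatch in parallel.
print(find_overlaps(wave_4))
```

Running the check again after adding or merging tasks is cheaper than discovering a conflict at merge time.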
Failure recovery
If one subagent fails: cancel it, merge the others' successful work, re-dispatch the failed task as a single follow-up. Don't block the wave.
If a failure exposes a bad assumption shared by multiple tasks (e.g., a refactor that requires a wider blast radius than the plan accounted for), pause the wave and revisit.
Why each wave is parallel-safe
| Wave | Tasks | Hot files touched | Disjoint? |
|---|---|---|---|
| 1 | T68, T69, T70 | chat/db/connection.py + chat/web/bots.py (T68); chat/state/entities.py (T69); chat/services/scene_summarize.py (T70) | ✅ |
| 2 | T71 | chat/services/prompt.py | (single task) |
| 3 | T72 | chat/web/drawer.py + chat/templates/_drawer.html | (single task) |
| 4 | T73, T74 | chat/services/regenerate.py (T73); chat/web/turns.py + new chat/services/addressee.py (T74) | ✅ |
| 5 | T75 | CLAUDE.md | (single task) |
Task overview
Wave 1 ─┬─ T68: open_db refactor with check_same_thread param
├─ T69: bot_reset purges orphaned "you" activity rows
└─ T70: LLM-merged group meta-summary
Wave 2 ─── T71: prompt.py polish (NICE trim order + dual ACTIVITIES + witness role parametric)
Wave 3 ─── T72: drawer.py polish (deferred v1 edits + first-meeting gate + witness flag editing)
Wave 4 ─┬─ T73: regenerate.py polish (turn_html SSE + interjection regenerate + stale-guest cleanup)
└─ T74: turn-flow polish + addressee service (classifier addressee detection +
significance for interjection + scene close on cancel + stale-guest cleanup)
Wave 5 ─── T75: docs sweep — remove shipped items from CLAUDE.md backlogs
Critical path: 5 sequential merge points. Total tasks: 8. Wall-clock parallelism advantage: Waves 1 and 4 each fan out their own tasks concurrently (3 and 2 tasks respectively); Waves 2, 3, and 5 are single-task by file constraint.
Wave 1 — Independent small fixes (parallel)
Three tasks, fully file-disjoint.
Task 68: open_db refactor with check_same_thread parameter
Files:
- Modify: chat/db/connection.py (extend open_db(path, *, check_same_thread=True) so callers can opt out of the sqlite3 same-thread check, which rejects use of a connection from any thread other than the one that created it)
- Modify: chat/web/bots.py (use the new parameter in get_conn rather than hand-rolling its own context-manager body)
- Modify: tests in tests/test_connection.py (or wherever open_db is tested; add 1 test for the new parameter)
Spec: Currently chat/web/bots.py:get_conn() duplicates the body of open_db so it can pass check_same_thread=False. Extend open_db to accept this as a kwarg (default True, preserving existing behavior). Then have get_conn call open_db(...) directly. The PRAGMA setup (WAL, foreign_keys, synchronous, etc.) stays in one place.
Step 1: failing test — add a regression test that pins the existing contract:
def test_open_db_default_uses_check_same_thread_true(tmp_path):
    db = tmp_path / "t.db"
    apply_migrations(db)
    with open_db(db) as conn:
        # Default is check_same_thread=True; calling from another thread should fail.
        ...


def test_open_db_can_disable_check_same_thread(tmp_path):
    db = tmp_path / "t.db"
    apply_migrations(db)
    with open_db(db, check_same_thread=False) as conn:
        # Same conn callable from another thread now.
        ...
Step 3: implementation — add check_same_thread: bool = True to open_db. Pass through to sqlite3.connect. Then in chat/web/bots.py, replace the duplicated context-manager body with open_db(path, check_same_thread=False).
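As an illustration of the pass-through and of how the cross-thread behavior might be asserted, here is a minimal sketch. It is not the project's open_db: the single PRAGMA stands in for the real shared setup, and call_from_other_thread is a hypothetical test helper.

```python
import sqlite3
import threading
from contextlib import contextmanager


@contextmanager
def open_db(path, *, check_same_thread: bool = True):
    """Sketch only: the real open_db also applies the project's full PRAGMA set."""
    conn = sqlite3.connect(str(path), check_same_thread=check_same_thread)
    try:
        conn.execute("PRAGMA foreign_keys = ON")  # stand-in for the shared PRAGMA setup
        yield conn
    finally:
        conn.close()


def call_from_other_thread(fn):
    """Run fn in a worker thread; return ("ok", value) or ("error", exception)."""
    outcome = []

    def target():
        try:
            outcome.append(("ok", fn()))
        except Exception as exc:  # captured so the caller can assert on it
            outcome.append(("error", exc))

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return outcome[0]
```

With the default, a query issued from the worker thread surfaces as sqlite3.ProgrammingError; with check_same_thread=False the same query returns normally. The two placeholder tests in Step 1 could assert exactly that pair of outcomes.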
Step 5: commit — refactor: open_db with check_same_thread parameter (T68).
Notes for implementer:
- This is a refactor: the full test suite must be GREEN before AND after. Run before to baseline, run after to confirm no regressions. Pay special attention to tests/test_bots.py if it exercises the get_conn path.
- Do NOT change the default. Existing callers don't pass check_same_thread and must continue to get True.
Task 69: bot_reset purges orphaned "you" activity rows
Files:
- Modify: chat/state/entities.py (extend _apply_bot_reset with one more DELETE clause for "you" activity rows tied to chats that this bot hosted)
- Modify: tests in tests/test_reset.py (add 2 tests)
Spec: Currently _apply_bot_reset purges the bot's chats, the bot's own activity rows, the bot's memories, and edges involving the bot. Phase 2 T47 added a chats.guest_bot_id cascade. Still missing: when bot A's chats are deleted, "you"-owned activity rows that were associated with those chats' containers are not cleaned up. They linger as orphaned activity entries pointing at deleted containers.
The fix per the existing CLAUDE.md note:
DELETE FROM activity
WHERE entity_id = 'you'
  AND container_id IN (
    SELECT id FROM containers WHERE chat_id IN (
      SELECT id FROM chats WHERE host_bot_id = ?
    )
  );
Order matters: this DELETE must run BEFORE the DELETE FROM containers and DELETE FROM chats clauses — otherwise the subqueries return no rows. Verify ordering in the existing handler before placing the new line.
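The ordering constraint can be demonstrated on a throwaway in-memory database. This is a sketch under a heavily simplified, hypothetical schema (three tables with only the columns the DELETE references); the real tables carry more columns and the real handler issues more deletes.

```python
import sqlite3

SCHEMA = """
CREATE TABLE chats (id TEXT PRIMARY KEY, host_bot_id TEXT);
CREATE TABLE containers (id TEXT PRIMARY KEY, chat_id TEXT);
CREATE TABLE activity (entity_id TEXT, container_id TEXT);
"""

PURGE_YOU_ACTIVITY = """
DELETE FROM activity
WHERE entity_id = 'you'
  AND container_id IN (SELECT id FROM containers WHERE chat_id IN (
        SELECT id FROM chats WHERE host_bot_id = ?
  ))
"""


def seeded_conn() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO chats VALUES ('chat_a', 'bot_a')")
    conn.execute("INSERT INTO containers VALUES ('room_a', 'chat_a')")
    conn.execute("INSERT INTO activity VALUES ('you', 'room_a')")
    return conn


def reset_bot(conn: sqlite3.Connection, bot_id: str, *, purge_first: bool) -> int:
    """Run the deletes in either order; return how many 'you' activity rows are left."""
    if purge_first:
        conn.execute(PURGE_YOU_ACTIVITY, (bot_id,))
    conn.execute(
        "DELETE FROM containers WHERE chat_id IN (SELECT id FROM chats WHERE host_bot_id = ?)",
        (bot_id,),
    )
    conn.execute("DELETE FROM chats WHERE host_bot_id = ?", (bot_id,))
    if not purge_first:
        # Too late: the subqueries now match nothing, so the row survives.
        conn.execute(PURGE_YOU_ACTIVITY, (bot_id,))
    return conn.execute("SELECT COUNT(*) FROM activity WHERE entity_id = 'you'").fetchone()[0]
```

Calling reset_bot with purge_first=True leaves no "you" rows, while purge_first=False leaves the orphan behind, which is the failure mode the new DELETE is meant to close.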
Tests: 2 added.
- test_reset_purges_orphaned_you_activity_rows: seed bot_a, chat_bot_a, a container in chat_bot_a, and a "you" activity row pointing at that container. Reset bot_a. Assert SELECT COUNT(*) FROM activity WHERE entity_id = 'you' is 0.
- test_reset_does_not_purge_you_activity_in_other_chats: seed bot_a + bot_b, both with chats and "you" activity in each. Reset bot_a. Assert "you" activity in chat_bot_a is gone, but "you" activity in chat_bot_b is preserved.
Commit: fix: bot_reset purges orphaned 'you' activity rows (T69).
Task 70: LLM-merged group meta-summary
Files:
- Modify: chat/services/scene_summarize.py (replace the naive f"{host_name}: {host_summary}\n\n{guest_name}: {guest_summary}" with an LLM-merged group view via a new classifier wrapper)
- Modify: tests in tests/test_per_pov_summary.py (replace the regression test for naive concat with one that asserts the merged text uses the classifier output; keep the existing per-POV memory tests intact)
Spec: Phase 2 T45 wrote a stub for group_node.summary that just concatenated the two per-POV summaries. Replace it with a small classifier call that produces a coherent group-level summary from both POVs.
Add a new helper at the bottom of scene_summarize.py:
class GroupMetaSummary(BaseModel):
    summary: str = ""
    dynamic: str = ""


async def merge_group_summary(
    client: LLMClient,
    *,
    classifier_model: str,
    host_name: str,
    host_pov_summary: str,
    guest_name: str,
    guest_pov_summary: str,
    timeout_s: float = 30.0,
) -> GroupMetaSummary:
    """Merge two per-POV scene summaries into a coherent group-level
    summary + group-dynamic note. Falls back to the naive concat on
    classifier failure."""
System prompt: "Given two per-POV scene summaries from a 3-entity scene (you + host + guest), produce a coherent group-level summary capturing the shared events as both witnesses experienced them, plus a brief 'dynamic' note describing the trio's group dynamic during the scene." Output strict JSON matching schema. Default = GroupMetaSummary(summary=f"{host_name}: {host_pov_summary}\n\n{guest_name}: {guest_pov_summary}", dynamic="") (the existing naive concat preserved as fallback so a classifier failure doesn't degrade behavior).
In apply_scene_close_summary, replace the naive concat call site (the existing summary= kwarg of the group_node_updated event) with await merge_group_summary(...) and use its .summary and .dynamic outputs.
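The fallback contract described above can be sketched as follows. To stay self-contained, this version swaps in a dataclass for the pydantic model and takes a hypothetical call_classifier coroutine in place of LLMClient; the attempts parameter mirrors the "3 retries" mentioned in the tests and is otherwise an assumption.

```python
import asyncio
import json
from dataclasses import dataclass


@dataclass
class GroupMetaSummary:  # dataclass stand-in for the pydantic model in the plan
    summary: str = ""
    dynamic: str = ""


async def merge_group_summary(
    call_classifier,  # hypothetical: async callable returning the raw JSON string
    *,
    host_name: str,
    host_pov_summary: str,
    guest_name: str,
    guest_pov_summary: str,
    timeout_s: float = 30.0,
    attempts: int = 3,
) -> GroupMetaSummary:
    # The old naive concat, kept as the fallback so failure never degrades behavior.
    fallback = GroupMetaSummary(
        summary=f"{host_name}: {host_pov_summary}\n\n{guest_name}: {guest_pov_summary}"
    )
    for _ in range(attempts):
        try:
            raw = await asyncio.wait_for(call_classifier(), timeout=timeout_s)
            data = json.loads(raw)
            return GroupMetaSummary(
                summary=str(data["summary"]),
                dynamic=str(data.get("dynamic", "")),
            )
        except (asyncio.TimeoutError, json.JSONDecodeError, KeyError, TypeError, AttributeError):
            continue  # bad or late output: try again, then fall back
    return fallback
```

Building the fallback before the first attempt keeps the failure path free of any further work, so a timeout on the last attempt still returns immediately.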
Tests: 3 in tests/test_per_pov_summary.py.
- test_group_summary_merges_per_pov_via_classifier_when_guest_present: mock the classifier with GroupMetaSummary(summary="merged summary", dynamic="warm rapport"). Close a scene with guest. Assert get_group_node(...).summary == "merged summary" and .dynamic == "warm rapport".
- test_group_summary_falls_back_to_naive_concat_on_classifier_failure: mock classifier with bad JSON across all 3 retries. Close scene. Assert summary matches the old naive concat format and dynamic is empty.
- test_group_summary_skipped_when_no_guest: no-guest path unchanged: group_node_updated not emitted at all (existing behavior).
Commit: feat: LLM-merged group meta-summary (T70).
Wave 2 — prompt.py polish (single task)
T71 bundles three prompt-assembly cleanups. All touch chat/services/prompt.py. Single task because the file is hot; the implementer SHOULD split into 3 commits within the task for clean review bisection.
Task 71: prompt.py polish (NICE trim order + dual ACTIVITIES + witness role parametric)
Files:
- Modify: chat/services/prompt.py
- Modify: tests/test_prompt.py (add tests; preserve existing 10 tests)
Spec: Three independent cleanups bundled because the file is hot.
71.1 — Witness role parametric (Phase 2.5 backlog #15)
chat/services/prompt.py:436 (or wherever the call site is; verify) calls search_memories(conn, speaker_bot_id, "host", query, k=4) with witness_role="host" hardcoded. This is wrong when the speaker is the guest: a guest speaker's query should pass witness_role="guest", which hits a different SQL filter.
Fix: derive the role from chat membership.
def _witness_role_for(speaker_bot_id: str, host_bot_id: str) -> str:
    return "host" if speaker_bot_id == host_bot_id else "guest"
Apply at the call site. The test contract is already pinned in tests/test_witness_filter_multi.py from Phase 2 T46 — those tests will continue to pass; this change unblocks guest-as-speaker in production.
Commit: fix: witness role parametric in prompt assembly (T71.1).
71.2 — Dual ACTIVITIES: block consolidation (Phase 2.5 backlog #14)
T43 (Phase 2) added a second ACTIVITIES: block to render guest activity separately from you+speaker activity (so the trim ladder could drop guest activity first under tight budget). Two consecutive ACTIVITIES: headers can read as a duplicate-section bug to the LLM.
Refactor to a single ACTIVITIES: block with three bullets (you, speaker, guest), where each bullet is independently trimmable: under tight budget, drop the guest bullet first, then the you bullet, keeping the speaker bullet (the speaker's own current activity is MUST-tier).
Implementation: the existing trim machinery uses block-level granularity. Extend it to bullet-level granularity for this block (one new helper or one new tier name like MUST-bullet / SHOULD-bullet / NICE-bullet — pick whichever is least disruptive).
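One possible shape for bullet-level trimming is sketched below. It is not the existing trim machinery: ActivityBullet, DROP_ORDER and render_activities are hypothetical names, and a character budget stands in for the real token budget.

```python
from dataclasses import dataclass


@dataclass
class ActivityBullet:
    owner: str  # "you", "speaker", or "guest"
    text: str


# Drop order under pressure, per the plan: guest first, then you; speaker is never dropped.
DROP_ORDER = ("guest", "you")


def render_activities(bullets: list[ActivityBullet], budget_chars: int) -> str:
    """Render one ACTIVITIES: block, dropping bullets in DROP_ORDER until it fits.

    budget_chars is a stand-in for the real token budget.
    """
    kept = list(bullets)

    def render() -> str:
        lines = ["ACTIVITIES:"] + [f"- {b.owner}: {b.text}" for b in kept]
        return "\n".join(lines)

    for owner in DROP_ORDER:
        if len(render()) <= budget_chars:
            break
        kept = [b for b in kept if b.owner != owner]
    # The speaker bullet is MUST-tier, so it is returned even if still over budget.
    return render()
```

Because the header is emitted once regardless of how many bullets survive, the duplicate-section reading that motivated this change cannot recur.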
Commit: refactor: single ACTIVITIES: block with bullet-level trim (T71.2).
71.3 — NICE trim order revisit (Phase 1.5 backlog #5)
Per T18 review: the NICE trim drops previous-scene first instead of last (the spec's listing order puts previous-scene last). The implementation follows a greedy-cuts heuristic that diverges from the spec.
Revisit: review the trim ordering carefully. If real play surfaces a regression (the previous-scene block is genuinely important to bot continuity), reverse the NICE order so previous-scene drops last. If not, document the intentional deviation in a code comment and call it done.
This is a judgment call. Default action: leave the order as-is and add a comment explaining why (the heuristic is "drop the cheapest-impact thing first; greedy lookahead is more expensive than the marginal narrative loss"). If review feedback during execution disagrees, reverse the order.
Commit: chore: document NICE trim order rationale (T71.3) OR fix: NICE trim order drops previous-scene last (T71.3).
Tests for T71
Add to tests/test_prompt.py:
- test_speaker_is_guest_uses_guest_witness_role: speaker=guest_id. Patch search_memories to record its witness_role argument. Assert called with "guest", not "host".
- test_single_activities_block_with_three_bullets_when_3_entities: 3-entity prompt. Assert exactly one ACTIVITIES: header present. Assert bullets for you, speaker, guest.
- test_tight_budget_drops_guest_activity_bullet_first: 3-entity prompt with budget tight enough to force trim. Assert speaker activity bullet survives, guest activity bullet is dropped.
- (Optional, depends on 71.3 outcome) test_nice_trim_order_drops_previous_scene_last: only add if you choose to fix the order.
Verification gates:
- pytest tests/test_prompt.py -v: 10 existing + 3-4 new all pass.
- pytest tests/test_witness_filter_multi.py -v: Phase 2 T46 tests still pass (proves the witness-role fix didn't break anything).
- Full suite green.
Wave 3 — drawer.py polish (single task)
T72 bundles three drawer affordances. All touch chat/web/drawer.py and chat/templates/_drawer.html. Single task by file constraint; implementer SHOULD split into 3 commits.
Task 72: drawer polish (deferred v1 edits + first-meeting gate + witness flag editing)
Files:
- Modify: chat/web/drawer.py (add 4-5 new POST routes for the deferred v1 edits + 1 GET extension for first-meeting gate + 1 POST for witness flag editing)
- Modify: chat/templates/_drawer.html (forms for each new edit affordance)
- Create: tests/test_drawer_edits_extended.py (new tests for the new routes; existing tests/test_drawer_edits.py and tests/test_drawer_guest.py stay unchanged)
Spec: Three independent backlog items.
72.1 — Deferred v1 drawer edits (Phase 1.5 backlog #4)
The manual_edit projector already supports target_kind values for edge_trust, edge_summary, memory_pov_summary. These work end-to-end at the state layer; only the drawer routes are missing.
Add 4 new POST routes:
- POST /chats/{chat_id}/drawer/edge/trust: form {source_id, target_id, new_value} (0–100 int). Appends manual_edit with target_kind="edge_trust", prior_value=current_trust, new_value=.... Validate range; 400 on out-of-bounds.
- POST /chats/{chat_id}/drawer/edge/summary: form {source_id, target_id, new_summary} (text). Appends manual_edit with target_kind="edge_summary". No validation beyond non-empty + reasonable length cap (e.g., 2000 chars).
- POST /chats/{chat_id}/drawer/memory/pov-summary: form {memory_id, new_summary}. Appends manual_edit with target_kind="memory_pov_summary". 404 if memory not in this chat or not owned by a present bot.
- POST /chats/{chat_id}/drawer/edge/knowledge-facts: form {source_id, target_id, action: 'add'|'remove', fact: str}. Knowledge_facts needs a NEW dispatch branch in the manual_edit projector; add it as part of this task: target_kind="edge_knowledge_fact" with payload action + fact.
The existing drawer template has read-only renders for these fields. Replace with editable forms (textarea + slider + button).
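The validation and the knowledge-fact add/remove semantics for these routes can be kept in small pure functions that the route handlers call. This is a sketch only: parse_trust and apply_knowledge_fact are hypothetical helper names, and treating add/remove as idempotent is an assumption the plan does not state.

```python
def parse_trust(raw: str) -> int:
    """Parse a form value as an integer trust score in [0, 100]; raise ValueError otherwise."""
    try:
        value = int(raw)
    except (TypeError, ValueError):
        raise ValueError(f"trust must be an integer, got {raw!r}") from None
    if not 0 <= value <= 100:
        raise ValueError(f"trust must be within 0-100, got {value}")
    return value


def apply_knowledge_fact(facts: list[str], action: str, fact: str) -> list[str]:
    """Return a new fact list after an 'add' or 'remove'.

    Assumption: adding an existing fact or removing a missing one is a no-op.
    """
    if action == "add":
        return list(facts) if fact in facts else facts + [fact]
    if action == "remove":
        return [f for f in facts if f != fact]
    raise ValueError(f"unknown action {action!r}")
```

A route handler would translate the ValueError into the 400 response the spec asks for, which keeps the range check testable without an HTTP client.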
Tests in tests/test_drawer_edits_extended.py:
- One test per route (4 tests minimum) asserting: the manual_edit event lands; the projected state changes; the response contains the updated drawer partial.
Commit: feat: drawer edits for edge_trust / edge_summary / memory_pov_summary / knowledge_facts (T72.1).
72.2 — First-meeting gate (Phase 2.5 backlog #9)
Currently the "Add guest" form's relationship_prose textarea fires every time. Phase 2 T42's notes describe the intended behavior: "fire it every time a (host, guest) pair has no existing host → guest edge."
Implement the gate: when the user opens the Add-guest form, check whether get_edge(conn, host_bot_id, guest_bot_id) already exists. If yes:
- Render the textarea disabled with the message "they already know each other (edge exists from a prior chat)" + a small "re-seed anyway" toggle that re-enables the textarea.
- If the user submits without toggling, skip the relationship-seed call (existing edge content stays).
- If the user toggles re-seed and submits prose, the existing flow runs: seed_inter_bot_edges produces deltas, two edge_update events fire on top of the existing edge content.
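The three cases above reduce to a small decision table. Here is a sketch, with GuestFormGate and first_meeting_gate as hypothetical names; in the real route, edge_exists would presumably be derived from the result of get_edge(conn, host_bot_id, guest_bot_id).

```python
from dataclasses import dataclass

ALREADY_MET_MESSAGE = "they already know each other (edge exists from a prior chat)"


@dataclass(frozen=True)
class GuestFormGate:
    textarea_disabled: bool  # how the template should render the prose textarea
    message: str             # notice shown next to the textarea, empty on first meeting
    run_seed: bool           # whether submit should trigger the relationship-seed call


def first_meeting_gate(edge_exists: bool, reseed_toggled: bool = False) -> GuestFormGate:
    if not edge_exists:
        # First meeting: behave exactly as the form does today.
        return GuestFormGate(textarea_disabled=False, message="", run_seed=True)
    if reseed_toggled:
        # User explicitly asked to re-seed on top of the existing edge.
        return GuestFormGate(textarea_disabled=False, message=ALREADY_MET_MESSAGE, run_seed=True)
    # Edge exists and no toggle: keep existing edge content, skip the classifier.
    return GuestFormGate(textarea_disabled=True, message=ALREADY_MET_MESSAGE, run_seed=False)
```

Keeping render state and submit behavior in one value makes it harder for the template and the POST handler to disagree about whether the gate is active.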
Tests:
- test_add_guest_form_disables_prose_when_edge_exists: pre-seed a host→guest edge from a prior chat; render the form; assert the textarea has the disabled attribute AND the "they already know each other" message is in the body.
- test_add_guest_with_existing_edge_skips_seed_call: pre-seed edge; submit form without toggling re-seed; assert classifier mock was NOT called (count check on canned-response queue).
Commit: feat: first-meeting gate on drawer Add-guest form (T72.2).
72.3 — Witness flag editing (Phase 2.5 backlog #10)
Memories show witness flags [you, host, guest] read-only in the drawer. Add an inline-edit affordance: each flag becomes a checkbox; toggling submits a manual_edit event with target_kind="memory_witness", payload {memory_id, flag: 'you'|'host'|'guest', new_value: bool}.
The manual_edit projector needs a new dispatch branch for memory_witness — same as the knowledge_facts branch in 72.1; do them together if cleaner.
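A sketch of what the memory_witness branch might do at the SQL level follows. Only witness_guest appears in this plan; the witness_you and witness_host column names are assumed by analogy, and apply_memory_witness_edit is a hypothetical function name rather than the projector's actual dispatch signature.

```python
import sqlite3

# Whitelist mapping flag names to columns; the raw flag is never interpolated into SQL.
# witness_you / witness_host are assumed names (only witness_guest is named in the plan).
WITNESS_COLUMNS = {"you": "witness_you", "host": "witness_host", "guest": "witness_guest"}


def apply_memory_witness_edit(conn: sqlite3.Connection, payload: dict) -> None:
    """Apply a payload of shape {memory_id, flag, new_value} to the memories table."""
    column = WITNESS_COLUMNS.get(payload["flag"])
    if column is None:
        raise ValueError(f"unknown witness flag {payload['flag']!r}")
    conn.execute(
        f"UPDATE memories SET {column} = ? WHERE id = ?",
        (1 if payload["new_value"] else 0, payload["memory_id"]),
    )
```

The whitelist matters because the column name has to be spliced into the statement text; binding it as a parameter is not possible in SQLite.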
Tests: 2.
- test_witness_flag_toggle_updates_memory_row: seed memory with witness [1, 1, 0]. POST toggle on guest flag → 1. Project. Assert memories.witness_guest = 1.
- test_witness_flag_toggle_emits_manual_edit_event: same setup; assert the manual_edit event has the right target_kind and prior_value/new_value.
Commit: feat: drawer witness flag inline-edit (T72.3).
Wave 4 — Turn-flow polish (parallel)
Two tasks, file-disjoint. T73 owns chat/services/regenerate.py; T74 owns chat/web/turns.py + adds a new addressee-detection service.
Each task bundles multiple backlog items. Implementer should split commits within each task.
Task 73: regenerate.py polish
Files:
- Modify: chat/services/regenerate.py
- Modify: tests/test_regenerate.py (add tests; existing tests preserved)
Spec: Three regenerate-related backlog items.
73.1 — Regenerate broadcasts turn_html over SSE (Phase 1.5 backlog #2)
After the new assistant_turn lands, broadcast a turn_html event over the chat's pub/sub channel — mirror the broadcast logic in chat/web/turns.py:post_turn. The existing post_turn does this via publish(chat_id, {"event": "turn_html", "html": ...}) (or similar — verify). Use the same render path so connected tabs swap the regenerated turn live, no refresh required.
Test: test_regenerate_broadcasts_turn_html_over_sse — mock publish and assert it was called with the new assistant_turn's rendered HTML.
Commit: feat: regenerate broadcasts turn_html over SSE (T73.1).
73.2 — Interjection regenerate (Phase 2.5 backlog #6)
Phase 2 T44 deferred interjection regenerate: regenerate currently only acts on the addressee turn. Extend so that when a turn group has both a primary assistant_turn and an assistant_turn flagged as interjection_of=..., regenerate redoes BOTH — the primary first, then the interjection (using the same interjection-decision classifier path as post_turn). The interjection branch may decide should_interject=False on the regenerate, in which case the previous interjection_turn is superseded but no new interjection is appended.
Test: test_regenerate_with_interjection_redoes_both_turns — seed a 3-entity scene with a prior primary + interjection; regenerate; assert two new assistant_turns land (or one new + a supersede-without-replace if the regenerated decision was "no interjection").
Commit: feat: regenerate covers interjection turns (T73.2).
73.3 — Stale-guest defensive degrade cleanup in regenerate.py (Phase 2.5 backlog #12, partial)
Phase 2 T44 added a defensive degrade-to-1:1 in regenerate.py when chat.guest_bot_id points at a deleted bot. T47 fixed the root cause (resets clear the reference). The defensive degrade is now dead code.
Remove the degrade block; let the function trust that chat.guest_bot_id is either valid or NULL. The corresponding existing test for the defensive degrade can be removed (the bot_reset cascade test in tests/test_reset.py already covers the root-cause behavior).
Commit: chore: remove defensive stale-guest degrade in regenerate.py (T73.3).
Verification gates
- pytest tests/test_regenerate.py -v: existing + new all pass.
- Full suite green.
Task 74: turn-flow polish + new addressee-detection service
Files:
- Modify: chat/web/turns.py
- Create: chat/services/addressee.py (new classifier wrapper for addressee detection)
- Create: tests/test_addressee.py
- Modify: tests/test_turn_flow.py (add tests; existing 8 tests preserved)
Spec: Four turn-flow backlog items.
74.1 — Classifier-based addressee detection (Phase 2.5 backlog #7)
Phase 2 T44's _detect_addressee_id uses a substring whole-word regex match. This is brittle: bot names that are common English words (e.g., a bot named "Sam"), names appearing inside a quoted aside ("Did you see what Sam wrote in his letter?" — addressed to host, not Sam), or fuzzy references all break it.
Replace with a small classifier call. New module chat/services/addressee.py:
class AddresseeDecision(BaseModel):
    addressee_id: str  # bot id, "you", or "host" as fallback
    confidence: str = "medium"  # "high" | "medium" | "low"
    reason: str = ""


async def detect_addressee(
    client: LLMClient,
    *,
    classifier_model: str,
    user_prose: str,
    host_id: str,
    host_name: str,
    guest_id: str | None,
    guest_name: str | None,
    timeout_s: float = 30.0,
) -> AddresseeDecision:
    """Classify which present bot the user is addressing in this turn.
    Defaults to host on failure or low confidence."""
System prompt: "Given a user's turn prose and the names of present bots, decide which bot the user is addressing. If the user is speaking to no specific bot (descriptive narration, action without dialogue), default to the host. Output strict JSON."
Default fallback (classifier failure) = AddresseeDecision(addressee_id=host_id, confidence="low", reason="fallback").
In chat/web/turns.py, replace _detect_addressee_id with a call to detect_addressee. Keep the substring helper as a low-confidence pre-filter for the no-guest case (no LLM call needed when only one bot is present — preserves throughput).
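The control flow around the classifier (skip when no guest, fall back to host on failure or low confidence) might look like the sketch below. It uses a dataclass in place of the pydantic model and a hypothetical call_classifier coroutine in place of LLMClient; restricting accepted ids to the present bots is an assumption of this sketch, not something the plan specifies.

```python
import asyncio
import json
from dataclasses import dataclass


@dataclass
class AddresseeDecision:  # dataclass stand-in for the pydantic model in the plan
    addressee_id: str
    confidence: str = "medium"
    reason: str = ""


async def detect_addressee(
    call_classifier,  # hypothetical: async callable returning the raw JSON string
    *,
    host_id: str,
    guest_id: "str | None",
    timeout_s: float = 30.0,
) -> AddresseeDecision:
    fallback = AddresseeDecision(addressee_id=host_id, confidence="low", reason="fallback")
    if guest_id is None:
        # Only one bot present: no LLM call needed.
        return AddresseeDecision(addressee_id=host_id, reason="no guest present")
    try:
        raw = await asyncio.wait_for(call_classifier(), timeout=timeout_s)
        data = json.loads(raw)
        decision = AddresseeDecision(
            addressee_id=str(data["addressee_id"]),
            confidence=str(data.get("confidence", "medium")),
            reason=str(data.get("reason", "")),
        )
    except (asyncio.TimeoutError, json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return fallback
    # Assumption: only ids of present bots are accepted; anything else falls back to host.
    if decision.confidence == "low" or decision.addressee_id not in (host_id, guest_id):
        return fallback
    return decision
```

Short-circuiting the no-guest case before touching the classifier preserves 1:1 throughput, since those turns never pay for the extra call.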
Tests:
- tests/test_addressee.py (new file): 3 tests (classifier returns guest, classifier returns host, classifier failure falls back to host).
- tests/test_turn_flow.py: update test_addressee_detection_routes_to_named_bot from Phase 2 T44 to use the new classifier path. (Existing test should keep passing with the new mock orchestration; canned-response queue may need an extra slot for the addressee decision.)
Commit: feat: classifier-based addressee detection (T74.1).
74.2 — Significance for interjection memories (Phase 2.5 backlog #11)
Phase 2 T44 noted: the interjection branch's memory_written event doesn't enqueue a SignificanceJob. Wire it in: after the interjection memory write (the record_turn_memory_for_present call in the interjection branch), enqueue a SignificanceJob with the interjection's host memory id (mirror the primary turn's enqueue at the end of the primary branch).
If both host and guest memory ids exist for the interjection (as they will when both are present), enqueue once for the host id (the existing pattern for primary turns — the score applies to both POVs since the prose is identical at the time of write).
Test: test_interjection_enqueues_significance_job — mock the worker; trigger an interjection; assert SignificanceJob was enqueued with the interjection memory id.
Commit: fix: enqueue significance for interjection memories (T74.2).
74.3 — Scene close on cancel review (Phase 2.5 backlog #13)
Phase 2 T44 review noted: when a primary turn is cancelled mid-stream, scene close still runs. Behavior may be intentional (close detection looks at user prose, not bot output) or wrong (a cancelled turn is incomplete; closing the scene on it is premature).
Decision for this task: review the call path. If the close detection truly only consults user prose AND the user prose is fully present at the moment of cancel (it is — user prose is appended before the stream starts), the existing behavior is correct: a cancelled turn doesn't invalidate the user's intent to close the scene. Document this in a code comment near the close-detection branch.
If a play-test surfaces a regression (e.g., a user cancels because the bot misread their close intent), revisit. Default: document and close as a no-op.
Test: test_cancelled_turn_still_closes_scene_when_user_prose_signals_close — pin the existing behavior so a future refactor doesn't quietly change it.
Commit: chore: pin scene-close-on-cancel behavior + comment rationale (T74.3).
74.4 — Stale-guest defensive degrade cleanup in turns.py (Phase 2.5 backlog #12, partial)
Same as T73.3 but for chat/web/turns.py: T44's defensive degrade-to-1:1 in post_turn (lines 235-242 per the T44 implementer note) is dead code now that T47 fixed the root cause. Remove it.
Commit: chore: remove defensive stale-guest degrade in turns.py (T74.4).
Verification gates
- pytest tests/test_addressee.py -v: 3/3 new tests pass.
- pytest tests/test_turn_flow.py -v: existing 8 + new 2-3 all pass.
- pytest tests/test_reset.py -v: Phase 2 T47 root-cause cascade still green.
- Full suite green.
Wave 5 — Docs sweep (single task)
Task 75: Remove shipped items from CLAUDE.md backlogs
Files:
- Modify: CLAUDE.md
Spec: Walk through the 15 backlog items in CLAUDE.md §"Phase 1.5 cleanup backlog" and §"Phase 2.5 / 3 backlog". For each item shipped during Phase 2.5 (T68–T74), remove it from the backlog list. Add a new section "Phase 2.5 status" near the existing "Phase 2 status" section listing what shipped:
- open_db refactor (T68).
- bot_reset purges orphaned "you" activity rows (T69).
- LLM-merged group meta-summary (T70).
- Prompt assembly polish: witness role parametric, single ACTIVITIES block, NICE trim documented (T71).
- Drawer edits for deferred v1 fields, first-meeting gate, witness flag editing (T72).
- Regenerate over SSE + interjection regenerate + stale-guest cleanup (T73).
- Classifier-based addressee detection + significance for interjection + scene-close-on-cancel pinned + stale-guest cleanup (T74).
If any task during execution chose NOT to ship a sub-item (e.g., T71.3 left NICE trim unchanged with a documented rationale), keep that sub-item in a "Phase 3.5+ deferred" section with the rationale. The goal is for the backlog list to reflect actual repo state, not aspirational scope.
If any new follow-ups were discovered during T68–T74 reviews, add them to the appropriate backlog section.
Commit: docs: phase 2.5 status, prune shipped backlog items (T75).
Wrap-up
After Wave 5 lands:
- Run full suite on phase-2.5: should be ~225+ tests passing (212 from Phase 2 + ~15 new across the 8 tasks).
- Manual smoke (recommended before opening the PR):
  - Drawer: edit edge_trust on a chat; verify the new value sticks after refresh.
  - Drawer: edit edge_summary on a chat; refresh; verify.
  - Drawer: toggle a memory's witness flag; refresh; verify.
  - Drawer: open Add-guest form for a (host, guest) pair that already shares an edge; verify the gate disables the prose textarea.
  - Drawer: open Add-guest form for a fresh pair; verify the textarea is enabled.
  - Reset a bot; verify "you" activity rows for that bot's chats are gone (run sqlite3 data/db.sqlite "SELECT * FROM activity WHERE entity_id='you'" before/after).
  - Multi-tab: open two tabs on the same chat; click Regenerate on one; verify the other tab sees the new turn live (no refresh).
  - Trigger an interjection turn; check the worker queue or significance_jobs table; verify a job was enqueued for the interjection memory.
  - Use a bot with a name that's a common word ("Sam"); ask "did you see what Sam wrote?"; verify host gets the floor (classifier addressee detection, not substring).
- Push phase-2.5 to gitea.
- Open PR phase-2.5 → main.
- No new Phase 3+ backlog items expected; if review surfaces any, add to CLAUDE.md.
Notes for the controller running this plan
- Don't dispatch Wave 4 until Wave 3 is merged AND tested green on phase-2.5. T74 references the new addressee service path that's stand-alone, but the existing tests in tests/test_turn_flow.py may have shifted from Wave 3 if the drawer-test fixture interactions touch shared state. Verify green before fanning out.
- After each parallel wave, run a code-review subagent (subagent-driven-development skill's two-stage review pattern) on each task. For purely mechanical tasks (T68, T69), combined spec+quality is acceptable. For bundled tasks (T71, T72, T74), use separate spec + quality reviewers, since the surface area is larger.
- If Phase 3 (phase-3 branch) is in flight in parallel, T75 (the docs sweep) should land on phase-2.5 only; Phase 3's docs sweep (T67) is independent. Both will resolve when the two branches merge to main in some order; expect a small CLAUDE.md merge to reconcile any overlapping backlog edits.
- If a task's "split commits" guidance proves impractical (e.g., bundling means a test pins 3 fixes at once), one consolidated commit is acceptable. The split is an aid for review bisection, not a hard rule.
- Token-spend rough estimate: Phase 2.5 should be ~50% the size of Phase 2 (smaller scope, all reuse). Per-task token spend similar to Phase 2's smaller tasks (T36, T37, T47).
- DO NOT break existing v1 / v2 surface contracts. Every test file that was green at the start of Phase 2.5 must stay green at the end. The tests/test_witness_filter_multi.py contracts pinned in Phase 2 T46 are particularly load-bearing for T71.1; verify them after the witness-role parametric fix lands.