Joseph Doherty e05f28e9d5 docs: add Phase 2.5 cleanup plan (Phase 1.5 + 2.5/3 backlog)

8 tasks across 5 waves consolidating the 15-item backlog tracked in
CLAUDE.md (5 from Phase 1.5 cleanup + 10 from Phase 2.5/3). Items are
grouped by file ownership so each wave stays file-disjoint:

- Wave 1 (parallel): open_db refactor, bot_reset orphan cleanup,
  LLM-merged group meta-summary
- Wave 2 (single): prompt.py polish — witness role parametric, single
  ACTIVITIES block, NICE trim documented
- Wave 3 (single): drawer polish — deferred v1 edits, first-meeting
  gate, witness flag editing
- Wave 4 (parallel): regenerate.py polish (SSE + interjection
  regenerate + stale-guest cleanup); turn-flow polish + new addressee
  service (classifier addressee + significance for interjection +
  scene-close-on-cancel pinned + stale-guest cleanup)
- Wave 5 (single): docs sweep

No schema migrations. Bundled tasks split into per-item sub-commits
for clean review bisection. Uses task ids T68-T75 to avoid collision
with Phase 3 plan (T49-T67) regardless of merge order.

2026-04-26 17:02:46 -04:00

Roleplay Engine — Phase 2.5 Cleanup Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task. Use the parallel-dispatch pattern documented under "Parallel-Execution Strategy" for waves that fan out to multiple subagents.

Goal: Burn down the combined Phase 1.5 + Phase 2.5/3 backlog tracked in CLAUDE.md §"Phase 1.5 cleanup backlog" and §"Phase 2.5 / 3 backlog". 15 follow-up items consolidated into 8 tasks (file-disjoint across waves) so several can run in parallel.

Architecture: No new architecture. Every change here is either a refactor (T68 open_db), a polish on an existing service/route (most tasks), or a UI affordance for state that already exists (T72 drawer edits, witness-flag editing). No new tables, no new event kinds, no schema migrations.

Tech Stack: Same as Phase 2. No new dependencies.

Source-of-truth references:

Backlog list: CLAUDE.md §"Phase 1.5 cleanup backlog" (5 items) + §"Phase 2.5 / 3 backlog" (10 items) = 15 items total.
Conventions: CLAUDE.md §"Behavioral defaults" + §"Phase 2 status".
Phase 2 plan (style, TDD pattern, parallel-dispatch mechanics): 2026-04-26-v2-phase2-implementation.md.
Phase 3 plan (in flight on a separate branch): 2026-04-26-v3-phase3-implementation.md.

When a task says "see §X", that's the requirements doc unless stated otherwise.

Pre-flight

Branch: create phase-2.5 from the latest main after Phase 2 has merged. If Phase 2 is still in PR review, branch off phase-2 directly:

# Option A: after main has phase-2 merged
git checkout main && git pull && git checkout -b phase-2.5

# Option B: continue from phase-2 directly
git checkout phase-2 && git pull && git checkout -b phase-2.5

Schema baseline: Phase 2 leaves the DB at version 8. Phase 2.5 adds no migrations. Schema-version assertion in tests/test_world.py stays at 8.

Relationship to Phase 3: Phase 3 (phase-3 branch, plan committed but not yet executed) uses task ids T49–T67. Phase 2.5 uses T68–T75 to avoid collision regardless of merge order.

Pinned non-negotiables (carried forward from Phases 1 + 2):

State changes go through the event log. Use append_and_apply(conn, kind, payload) for the live path; call apply_event only after a fresh append_event has returned the new event id.
Witness-filter every memory read at the SQL level (a hard WHERE constraint, never a soft signal).
Edges are directed; botA → botB and botB → botA are independent records.
Per-POV scene summaries — never write omniscient narration.
TDD: every task starts with a failing test (or, for refactors that preserve behavior, a regression test that pins the existing contract before any change).
One commit per task minimum. Tasks that bundle 3+ small backlog items SHOULD split commits within the task — one commit per backlog item — so review can bisect cleanly.

Verification before claiming done: Use superpowers-extended-cc:verification-before-completion — run the test command, paste actual output. Don't assume green.

Backlog item → task mapping

15 items consolidated into 8 tasks by file ownership (so each wave's tasks stay file-disjoint). Bundled tasks may split commits internally.

#	Backlog item	Source	Task
1	`open_db` refactor with `check_same_thread` parameter	Phase 1.5	T68
2	Regenerate broadcasts `turn_html` over SSE	Phase 1.5	T73
3	`bot_reset` purges orphaned "you" activity rows	Phase 1.5	T69
4	Drawer edits for deferred v1 fields (edge_trust, edge_summary, memory pov_summary, knowledge_facts)	Phase 1.5	T72
5	NICE trim order in prompt assembly	Phase 1.5	T71
6	Interjection regenerate	Phase 2.5	T73
7	Classifier-based addressee detection	Phase 2.5	T74
8	LLM-merged group meta-summary	Phase 2.5	T70
9	First-meeting gate (drawer "have they met?" toggle)	Phase 2.5	T72
10	Witness flag editing in drawer	Phase 2.5	T72
11	Significance for interjection memories	Phase 2.5	T74
12	Stale guest reference defensive degrade removal	Phase 2.5	T73 + T74 (split by file)
13	Scene close on cancel review	Phase 2.5	T74
14	Dual `ACTIVITIES:` block consolidation	Phase 2.5	T71
15	Witness role hardcode in prompt assembly	Phase 2.5	T71
—	Docs sweep — remove shipped items from CLAUDE.md	(this plan)	T75

Parallel-Execution Strategy

Same pattern as Phases 2 and 3. Five waves: parallel within each wave (file-disjoint), serial across waves. Cross-wave merges keep phase-2.5 green between dispatches.

How to dispatch a wave in parallel

Use the Agent tool with isolation: "worktree" so each subagent gets its own git worktree. (If the controlling session's working directory is not the chat repo, create worktrees manually with git worktree add .worktrees/<wave>-<task> -b <wave>/<task> phase-2.5 from inside the chat repo and pass the worktree path explicitly into each subagent prompt — that is the pattern Phase 2 used.)

In a single message, dispatch all tasks in the wave:

Agent({
  description: "Wave 1 — T68 open_db refactor",
  subagent_type: "general-purpose",
  isolation: "worktree",
  prompt: "<full task text from below>",
})
Agent({ ...T69... })
Agent({ ...T70... })

After a wave completes

Each subagent returns its worktree path and commit SHA(s).
Run a spec + code-quality reviewer subagent on each completed task. Combined review is acceptable for purely mechanical refactors (T68, T69); use separate spec + quality reviewers for tasks that bundle multiple backlog items (T71, T72, T74).

Merge the wave into phase-2.5 in any order (file-disjointness guarantees no conflict). Use --no-ff:

git checkout phase-2.5
for branch in <wave-branches>; do
  git merge --no-ff "$branch" -m "merge: <task description>"
done

Run the full test suite on the merged phase-2.5. If it's red, the wave's mutual-independence assumption was violated — bisect the offending pair, fix, re-merge.
Push phase-2.5 to gitea so the work is durable before the next wave starts.
Optionally clean up worktrees: git worktree remove .worktrees/<branch> and git branch -D <branch>.

Conflict prevention checklist (apply before dispatch)

For each parallel wave, verify the Files sections of all tasks have no overlapping paths. The waves below are designed to satisfy this; if you decide to add or merge tasks, re-check.

The hot files in this plan are: chat/web/turns.py, chat/services/regenerate.py, chat/web/drawer.py, chat/templates/_drawer.html, chat/services/prompt.py. Each is owned by exactly one task in this plan.
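The disjointness check can be run mechanically before dispatch. The following is a minimal sketch, assuming each task's hot files have been transcribed by hand into a dict (here, the Wave 1 row of the "Why each wave is parallel-safe" table). The names find_overlaps and WAVE_1 are illustrative and do not exist in the repo.

```python
from itertools import combinations

# Hot files per task, transcribed by hand from the Wave 1 row of the plan's table.
WAVE_1 = {
    "T68": {"chat/db/connection.py", "chat/web/bots.py"},
    "T69": {"chat/state/entities.py"},
    "T70": {"chat/services/scene_summarize.py"},
}


def find_overlaps(wave: dict[str, set[str]]) -> list[tuple[str, str, set[str]]]:
    """Return (task_a, task_b, shared_paths) for every pair of tasks sharing a file."""
    overlaps = []
    for (task_a, files_a), (task_b, files_b) in combinations(sorted(wave.items()), 2):
        shared = files_a & files_b
        if shared:
            overlaps.append((task_a, task_b, shared))
    return overlaps


if __name__ == "__main__":
    # An empty list means the wave is safe to dispatch in parallel.
    print(find_overlaps(WAVE_1))
```

An empty result is the pass condition; any returned triple names the pair of tasks that would have to be serialized or merged.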

Failure recovery

If one subagent fails: cancel it, merge the others' successful work, re-dispatch the failed task as a single follow-up. Don't block the wave.

If a failure exposes a bad assumption shared by multiple tasks (e.g., a refactor that requires a wider blast radius than the plan accounted for), pause the wave and revisit.

Why each wave is parallel-safe

Wave	Tasks	Hot files touched	Disjoint?
1	T68, T69, T70	`chat/db/connection.py` + `chat/web/bots.py` (T68); `chat/state/entities.py` (T69); `chat/services/scene_summarize.py` (T70)	✅
2	T71	`chat/services/prompt.py`	(single task)
3	T72	`chat/web/drawer.py` + `chat/templates/_drawer.html`	(single task)
4	T73, T74	`chat/services/regenerate.py` (T73); `chat/web/turns.py` + new `chat/services/addressee.py` (T74)	✅
5	T75	`CLAUDE.md`	(single task)

Task overview

Wave 1 ─┬─ T68: open_db refactor with check_same_thread param
        ├─ T69: bot_reset purges orphaned "you" activity rows
        └─ T70: LLM-merged group meta-summary

Wave 2 ─── T71: prompt.py polish (NICE trim order + dual ACTIVITIES + witness role parametric)

Wave 3 ─── T72: drawer.py polish (deferred v1 edits + first-meeting gate + witness flag editing)

Wave 4 ─┬─ T73: regenerate.py polish (turn_html SSE + interjection regenerate + stale-guest cleanup)
        └─ T74: turn-flow polish + addressee service (classifier addressee detection +
                significance for interjection + scene close on cancel + stale-guest cleanup)

Wave 5 ─── T75: docs sweep — remove shipped items from CLAUDE.md backlogs

Critical path: 5 sequential merge points. Total tasks: 8. Wall-clock parallelism advantage: the tasks inside Wave 1 (three) and inside Wave 4 (two) dispatch concurrently. The waves themselves still run in order, and Waves 2, 3, and 5 are single-task by file constraint.

Wave 1 — Independent small fixes (parallel)

Three tasks, fully file-disjoint.

Task 68: `open_db` refactor with `check_same_thread` parameter

Files:

Modify: chat/db/connection.py (extend open_db(path, *, check_same_thread=True) so callers can opt out of sqlite3's same-thread check, which by default restricts a connection to the thread that created it)
Modify: chat/web/bots.py (use the new parameter in get_conn rather than hand-rolling its own context-manager body)
Modify: tests in tests/test_connection.py (or wherever open_db is tested; add 1 test for the new parameter)

Spec: Currently chat/web/bots.py:get_conn() duplicates the body of open_db so it can pass check_same_thread=False. Extend open_db to accept this as a kwarg (default True, preserving existing behavior). Then have get_conn call open_db(...) directly. The PRAGMA setup (WAL, foreign_keys, synchronous, etc.) stays in one place.

Step 1: failing test — add a regression test that pins the existing contract:

def test_open_db_default_uses_check_same_thread_true(tmp_path):
    db = tmp_path / "t.db"
    apply_migrations(db)
    with open_db(db) as conn:
        # Default is check_same_thread=True; calling from another thread should fail.
        ...

def test_open_db_can_disable_check_same_thread(tmp_path):
    db = tmp_path / "t.db"
    apply_migrations(db)
    with open_db(db, check_same_thread=False) as conn:
        # Same conn callable from another thread now.
        ...

Step 3: implementation — add check_same_thread: bool = True to open_db. Pass through to sqlite3.connect. Then in chat/web/bots.py, replace the duplicated context-manager body with open_db(path, check_same_thread=False).
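As a rough illustration of the refactor and of what the two test stubs could assert, here is a self-contained sketch using only the standard library. It is not the repo's open_db: the PRAGMA list is an assumed subset, and error_from_other_thread is a hypothetical test helper.

```python
import sqlite3
import threading
from contextlib import contextmanager
from typing import Callable, Iterator, Optional


@contextmanager
def open_db(path: str, *, check_same_thread: bool = True) -> Iterator[sqlite3.Connection]:
    """Open a connection, apply PRAGMAs in one place, and close on exit."""
    conn = sqlite3.connect(path, check_same_thread=check_same_thread)
    try:
        # Assumed subset of the real PRAGMA setup.
        conn.execute("PRAGMA journal_mode=WAL")
        conn.execute("PRAGMA foreign_keys=ON")
        yield conn
    finally:
        conn.close()


def error_from_other_thread(fn: Callable[[], object]) -> Optional[BaseException]:
    """Run fn in a worker thread and return the exception it raised, if any."""
    errors: list[BaseException] = []

    def target() -> None:
        try:
            fn()
        except Exception as exc:  # captured so the caller can assert on it
            errors.append(exc)

    worker = threading.Thread(target=target)
    worker.start()
    worker.join()
    return errors[0] if errors else None
```

With the default, a cross-thread conn.execute(...) surfaces as sqlite3.ProgrammingError; with check_same_thread=False the same call succeeds, which is the behavior get_conn relies on.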

Step 5: commit — refactor: open_db with check_same_thread parameter (T68).

Notes for implementer:

This is a refactor — the full test suite must be GREEN before AND after. Run before to baseline, run after to confirm no regressions. Pay special attention to tests/test_bots.py if it exercises the get_conn path.
Do NOT change the default. Existing callers don't pass check_same_thread and must continue to get True.

Task 69: `bot_reset` purges orphaned "you" activity rows

Files:

Modify: chat/state/entities.py (extend _apply_bot_reset with one more DELETE clause for "you" activity rows tied to chats that this bot hosted)
Modify: tests in tests/test_reset.py (add 2 tests)

Spec: Currently _apply_bot_reset purges the bot's chats, the bot's own activity rows, the bot's memories, and edges involving the bot. Phase 2 T47 added a chats.guest_bot_id cascade. Still missing: when bot A's chats are deleted, "you"-owned activity rows that were associated with those chats' containers are not cleaned up. They linger as orphaned activity entries pointing at deleted containers.

The fix per the existing CLAUDE.md note:

DELETE FROM activity
WHERE entity_id = 'you'
  AND container_id IN (SELECT id FROM containers WHERE chat_id IN (
      SELECT id FROM chats WHERE host_bot_id = ?
  ));

Order matters: this DELETE must run BEFORE the DELETE FROM containers and DELETE FROM chats clauses — otherwise the subqueries return no rows. Verify ordering in the existing handler before placing the new line.
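The ordering constraint can be demonstrated against an in-memory database. This is a sketch under stated assumptions: the schema below is hypothetical and contains only the columns the DELETE statements reference, and reset_bot stands in for the relevant part of _apply_bot_reset rather than reproducing it.

```python
import sqlite3

# Hypothetical minimal schema: only the columns the DELETE statements touch.
SCHEMA = """
CREATE TABLE chats (id TEXT PRIMARY KEY, host_bot_id TEXT NOT NULL);
CREATE TABLE containers (id TEXT PRIMARY KEY, chat_id TEXT NOT NULL);
CREATE TABLE activity (entity_id TEXT NOT NULL, container_id TEXT NOT NULL);
"""

PURGE_YOU_ACTIVITY = """
DELETE FROM activity
WHERE entity_id = 'you'
  AND container_id IN (SELECT id FROM containers WHERE chat_id IN (
      SELECT id FROM chats WHERE host_bot_id = ?
  ))
"""


def reset_bot(conn: sqlite3.Connection, bot_id: str) -> None:
    """Purge 'you' activity first, while the chat and container rows still exist."""
    conn.execute(PURGE_YOU_ACTIVITY, (bot_id,))
    conn.execute(
        "DELETE FROM containers WHERE chat_id IN "
        "(SELECT id FROM chats WHERE host_bot_id = ?)",
        (bot_id,),
    )
    conn.execute("DELETE FROM chats WHERE host_bot_id = ?", (bot_id,))


def seed(conn: sqlite3.Connection) -> None:
    """Create two bots, each hosting one chat with one container and one 'you' row."""
    conn.executescript(SCHEMA)
    for bot in ("bot_a", "bot_b"):
        conn.execute("INSERT INTO chats VALUES (?, ?)", (f"chat_{bot}", bot))
        conn.execute("INSERT INTO containers VALUES (?, ?)", (f"room_{bot}", f"chat_{bot}"))
        conn.execute("INSERT INTO activity VALUES ('you', ?)", (f"room_{bot}",))
```

If the purge ran after the containers and chats deletes, its subqueries would match nothing and the bot_a row would survive as an orphan.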

Tests: 2 added.

test_reset_purges_orphaned_you_activity_rows: seed bot_a, chat_bot_a, a container in chat_bot_a, and a "you" activity row pointing at that container. Reset bot_a. Assert SELECT COUNT(*) FROM activity WHERE entity_id = 'you' is 0.
test_reset_does_not_purge_you_activity_in_other_chats: seed bot_a + bot_b, both with chats and "you" activity in each. Reset bot_a. Assert "you" activity in chat_bot_a is gone, but "you" activity in chat_bot_b is preserved.

Commit: fix: bot_reset purges orphaned 'you' activity rows (T69).

Task 70: LLM-merged group meta-summary

Files:

Modify: chat/services/scene_summarize.py (replace the naive f"{host_name}: {host_summary}\n\n{guest_name}: {guest_summary}" with an LLM-merged group view via a new classifier wrapper)
Modify: tests in tests/test_per_pov_summary.py (replace the regression test for naive concat with one that asserts the merged text uses the classifier output; keep the existing per-POV memory tests intact)

Spec: Phase 2 T45 wrote a stub for group_node.summary that just concatenated the two per-POV summaries. Replace it with a small classifier call that produces a coherent group-level summary from both POVs.

Add a new helper at the bottom of scene_summarize.py:

class GroupMetaSummary(BaseModel):
    summary: str = ""
    dynamic: str = ""

async def merge_group_summary(
    client: LLMClient,
    *,
    classifier_model: str,
    host_name: str,
    host_pov_summary: str,
    guest_name: str,
    guest_pov_summary: str,
    timeout_s: float = 30.0,
) -> GroupMetaSummary:
    """Merge two per-POV scene summaries into a coherent group-level
    summary + group-dynamic note. Falls back to the naive concat on
    classifier failure."""

System prompt: "Given two per-POV scene summaries from a 3-entity scene (you + host + guest), produce a coherent group-level summary capturing the shared events as both witnesses experienced them, plus a brief 'dynamic' note describing the trio's group dynamic during the scene." Output strict JSON matching schema. Default = GroupMetaSummary(summary=f"{host_name}: {host_pov_summary}\n\n{guest_name}: {guest_pov_summary}", dynamic="") (the existing naive concat preserved as fallback so a classifier failure doesn't degrade behavior).

In apply_scene_close_summary, replace the naive concat call site (the existing summary= kwarg of the group_node_updated event) with await merge_group_summary(...) and use its .summary and .dynamic outputs.
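The retry-then-fallback behavior can be sketched independently of the real LLMClient. In this sketch a dataclass stands in for the pydantic model, call_classifier is a hypothetical zero-argument coroutine that returns the raw JSON text, and the retry count of 3 is taken from the fallback test description.

```python
import asyncio
import json
from dataclasses import dataclass
from typing import Awaitable, Callable


@dataclass
class GroupMetaSummary:  # dataclass stand-in for the pydantic model
    summary: str = ""
    dynamic: str = ""


def naive_concat(host_name: str, host_pov_summary: str,
                 guest_name: str, guest_pov_summary: str) -> GroupMetaSummary:
    """The Phase 2 stub format, preserved as the fallback value."""
    text = f"{host_name}: {host_pov_summary}\n\n{guest_name}: {guest_pov_summary}"
    return GroupMetaSummary(summary=text, dynamic="")


async def merge_with_fallback(
    call_classifier: Callable[[], Awaitable[str]],  # hypothetical: returns raw JSON text
    fallback: GroupMetaSummary,
    *,
    retries: int = 3,
    timeout_s: float = 30.0,
) -> GroupMetaSummary:
    """Try the classifier up to `retries` times; return `fallback` if every try fails."""
    for _ in range(retries):
        try:
            raw = await asyncio.wait_for(call_classifier(), timeout=timeout_s)
            data = json.loads(raw)
            summary = data["summary"]
            if isinstance(summary, str) and summary.strip():
                return GroupMetaSummary(summary=summary, dynamic=str(data.get("dynamic", "")))
        except (asyncio.TimeoutError, json.JSONDecodeError, KeyError, TypeError):
            continue  # malformed or late response: try again
    return fallback
```

Because the fallback is computed up front from the two per-POV summaries, a classifier outage leaves group_node.summary exactly as it was under the Phase 2 stub.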

Tests: 3 in tests/test_per_pov_summary.py.

test_group_summary_merges_per_pov_via_classifier_when_guest_present: mock the classifier with GroupMetaSummary(summary="merged summary", dynamic="warm rapport"). Close a scene with guest. Assert get_group_node(...).summary == "merged summary" and .dynamic == "warm rapport".
test_group_summary_falls_back_to_naive_concat_on_classifier_failure: mock classifier with bad JSON across all 3 retries. Close scene. Assert summary matches the old naive concat format. dynamic is empty.
test_group_summary_skipped_when_no_guest: no-guest path unchanged — group_node_updated not emitted at all (existing behavior).

Commit: feat: LLM-merged group meta-summary (T70).

Wave 2 — `prompt.py` polish (single task)

T71 bundles three prompt-assembly cleanups. All touch chat/services/prompt.py. Single task because the file is hot; the implementer SHOULD split into 3 commits within the task for clean review bisection.

Task 71: prompt.py polish (NICE trim order + dual ACTIVITIES + witness role parametric)

Files:

Modify: chat/services/prompt.py
Modify: tests/test_prompt.py (add tests; preserve existing 10 tests)

Spec: Three independent cleanups bundled because the file is hot.

71.1 — Witness role parametric (Phase 2.5 backlog #15)

chat/services/prompt.py:436 (or wherever the call site is; verify) calls search_memories(conn, speaker_bot_id, "host", query, k=4) with witness_role="host" hardcoded. This is wrong when the speaker is the guest: a guest speaker should query with witness_role="guest", which selects a different SQL filter.

Fix: derive the role from chat membership.

def _witness_role_for(speaker_bot_id: str, host_bot_id: str) -> str:
    return "host" if speaker_bot_id == host_bot_id else "guest"

Apply at the call site. The test contract is already pinned in tests/test_witness_filter_multi.py from Phase 2 T46 — those tests will continue to pass; this change unblocks guest-as-speaker in production.

Commit: fix: witness role parametric in prompt assembly (T71.1).

71.2 — Dual `ACTIVITIES:` block consolidation (Phase 2.5 backlog #14)

T43 (Phase 2) added a second ACTIVITIES: block to render guest activity separately from you+speaker activity (so the trim ladder could drop guest activity first under tight budget). Two consecutive ACTIVITIES: headers can read as a duplicate-section bug to the LLM.

Refactor to a single ACTIVITIES: block with three bullets (you, speaker, guest), where each bullet is independently trimmable: under tight budget, drop the guest bullet first, then the you bullet, keeping the speaker bullet (the speaker's own current activity is MUST-tier).

Implementation: the existing trim machinery uses block-level granularity. Extend it to bullet-level granularity for this block (one new helper or one new tier name like MUST-bullet / SHOULD-bullet / NICE-bullet — pick whichever is least disruptive).
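One possible shape for bullet-level trimming is sketched below. It assumes a character budget as a stand-in for whatever unit the real trim ladder uses, and ActivityBullet, DROP_ORDER, and render_activities are illustrative names rather than existing prompt.py symbols.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ActivityBullet:
    owner: str  # "you" | "speaker" | "guest"
    text: str


# Drop order under pressure; the speaker bullet is MUST-tier and is never dropped.
DROP_ORDER = ("guest", "you")


def _render(bullets: list[ActivityBullet]) -> str:
    if not bullets:
        return ""
    lines = [f"- {b.owner}: {b.text}" for b in bullets]
    return "ACTIVITIES:\n" + "\n".join(lines)


def render_activities(bullets: list[ActivityBullet], budget_chars: int) -> str:
    """Render one ACTIVITIES: block, dropping bullets in DROP_ORDER until it fits."""
    kept = list(bullets)
    for owner in DROP_ORDER:
        if len(_render(kept)) <= budget_chars:
            break
        kept = [b for b in kept if b.owner != owner]
    return _render(kept)
```

The speaker bullet is returned even when the budget is smaller than its rendered length, which mirrors the MUST-tier rule stated above.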

Commit: refactor: single ACTIVITIES: block with bullet-level trim (T71.2).

71.3 — NICE trim order revisit (Phase 1.5 backlog #5)

Per T18 review: the NICE trim drops previous-scene first instead of last, whereas the spec's listing order puts previous-scene last. In other words, the implementation's greedy-cuts heuristic disagrees with the spec.

Revisit: review the trim ordering carefully. If real play surfaces a regression (the previous-scene block is genuinely important to bot continuity), reverse the NICE order so previous-scene drops last. If not, document the intentional deviation in a code comment and call it done.

This is a judgment call. Default action: leave the order as-is and add a comment explaining why (the heuristic is "drop the cheapest-impact thing first; greedy lookahead is more expensive than the marginal narrative loss"). If review feedback during execution disagrees, reverse the order.

Commit: chore: document NICE trim order rationale (T71.3) OR fix: NICE trim order drops previous-scene last (T71.3).

Tests for T71

Add to tests/test_prompt.py:

test_speaker_is_guest_uses_guest_witness_role: speaker=guest_id. Patch search_memories to record its witness_role argument. Assert called with "guest", not "host".
test_single_activities_block_with_three_bullets_when_3_entities: 3-entity prompt. Assert exactly one ACTIVITIES: header present. Assert bullets for you, speaker, guest.
test_tight_budget_drops_guest_activity_bullet_first: 3-entity prompt with budget tight enough to force trim. Assert speaker activity bullet survives, guest activity bullet is dropped.
(Optional, depends on 71.3 outcome) test_nice_trim_order_drops_previous_scene_last: only add if you choose to fix the order.

Verification gates:

pytest tests/test_prompt.py -v — 10 existing + 3-4 new all pass.
pytest tests/test_witness_filter_multi.py -v — Phase 2 T46 tests still pass (proves the witness-role fix didn't break anything).
Full suite green.

Wave 3 — `drawer.py` polish (single task)

T72 bundles three drawer affordances. All touch chat/web/drawer.py and chat/templates/_drawer.html. Single task by file constraint; implementer SHOULD split into 3 commits.

Task 72: drawer polish (deferred v1 edits + first-meeting gate + witness flag editing)

Files:

Modify: chat/web/drawer.py (add 4 new POST routes for the deferred v1 edits + 1 GET extension for first-meeting gate + 1 POST for witness flag editing)
Modify: chat/templates/_drawer.html (forms for each new edit affordance)
Create: tests/test_drawer_edits_extended.py (new tests for the new routes; existing tests/test_drawer_edits.py and tests/test_drawer_guest.py stay unchanged)

Spec: Three independent backlog items.

72.1 — Deferred v1 drawer edits (Phase 1.5 backlog #4)

The manual_edit projector already supports target_kind values for edge_trust, edge_summary, memory_pov_summary. These work end-to-end at the state layer; only the drawer routes are missing.

Add 4 new POST routes:

POST /chats/{chat_id}/drawer/edge/trust — form {source_id, target_id, new_value} (0–100 int). Appends manual_edit with target_kind="edge_trust", prior_value=current_trust, new_value=.... Validate range; 400 on out-of-bounds.
POST /chats/{chat_id}/drawer/edge/summary — form {source_id, target_id, new_summary} (text). Appends manual_edit with target_kind="edge_summary". No validation beyond non-empty + reasonable length cap (e.g., 2000 chars).
POST /chats/{chat_id}/drawer/memory/pov-summary — form {memory_id, new_summary}. Appends manual_edit with target_kind="memory_pov_summary". 404 if memory not in this chat or not owned by a present bot.
POST /chats/{chat_id}/drawer/edge/knowledge-facts: form {source_id, target_id, action: 'add'|'remove', fact: str}. The knowledge_facts field needs a NEW dispatch branch in the manual_edit projector; add it as part of this task: target_kind="edge_knowledge_fact" with payload action + fact.

The existing drawer template has read-only renders for these fields. Replace with editable forms (textarea + slider + button).
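The validation rules for the trust and summary routes can be kept framework-free and unit-tested on their own. The sketch below assumes the route handler maps FormError to a 400 response; FormError, parse_trust, parse_summary, and build_edge_trust_edit are hypothetical names, and the payload keys beyond target_kind, prior_value, and new_value are guesses based on the form fields listed above.

```python
MAX_SUMMARY_CHARS = 2000  # the "reasonable length cap" suggested in the route list


class FormError(ValueError):
    """Raised for invalid form input; the route would translate this to HTTP 400."""


def parse_trust(raw: str) -> int:
    """Parse an edge_trust form value and enforce the 0-100 integer range."""
    try:
        value = int(raw.strip())
    except ValueError as exc:
        raise FormError(f"trust must be an integer, got {raw!r}") from exc
    if not 0 <= value <= 100:
        raise FormError(f"trust must be between 0 and 100, got {value}")
    return value


def parse_summary(raw: str) -> str:
    """Enforce non-empty text under the length cap."""
    text = raw.strip()
    if not text:
        raise FormError("summary must not be empty")
    if len(text) > MAX_SUMMARY_CHARS:
        raise FormError(f"summary exceeds {MAX_SUMMARY_CHARS} characters")
    return text


def build_edge_trust_edit(source_id: str, target_id: str,
                          *, prior_value: int, raw_new_value: str) -> dict:
    """Build a manual_edit payload for an edge_trust change (key names assumed)."""
    return {
        "target_kind": "edge_trust",
        "source_id": source_id,
        "target_id": target_id,
        "prior_value": prior_value,
        "new_value": parse_trust(raw_new_value),
    }
```

Keeping the parsing outside the route body means the out-of-bounds cases can be covered without spinning up a test client.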

Tests in tests/test_drawer_edits_extended.py:

One test per route (4 tests minimum) asserting: the manual_edit event lands; the projected state changes; the response contains the updated drawer partial.

Commit: feat: drawer edits for edge_trust / edge_summary / memory_pov_summary / knowledge_facts (T72.1).

72.2 — First-meeting gate (Phase 2.5 backlog #9)

The "Add guest" form's relationship_prose textarea currently fires on every add. Phase 2 T42's notes call for a narrower rule: "fire it every time a (host, guest) pair has no existing host → guest edge."

Implement the gate: when the user opens the Add-guest form, check whether get_edge(conn, host_bot_id, guest_bot_id) returns an existing edge. If it does:

Render the textarea disabled with the message "they already know each other (edge exists from a prior chat)" + a small "re-seed anyway" toggle that re-enables the textarea.
If the user submits without toggling, skip the relationship-seed call (existing edge content stays).
If the user toggles re-seed and submits prose, the existing flow runs — seed_inter_bot_edges produces deltas, two edge_update events fire on top of the existing edge content.
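The three cases above reduce to a small decision function. This is a sketch only: GateDecision and first_meeting_gate are hypothetical names, and it assumes that the first-meeting case always defers to the existing Phase 2 flow regardless of the prose content.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateDecision:
    textarea_disabled: bool  # initial render state of the relationship_prose textarea
    run_seed: bool           # whether submit should invoke the relationship-seed call


def first_meeting_gate(*, edge_exists: bool, reseed_toggled: bool = False,
                       prose: str = "") -> GateDecision:
    """Decide how the Add-guest form renders and whether submit seeds the edge."""
    if not edge_exists:
        # First meeting: existing flow, unchanged (assumption: it handles empty prose).
        return GateDecision(textarea_disabled=False, run_seed=True)
    # Edge already exists: seed only on an explicit re-seed with actual prose.
    run_seed = reseed_toggled and bool(prose.strip())
    return GateDecision(textarea_disabled=True, run_seed=run_seed)
```

In this shape the route only needs the result of get_edge plus the two form fields, so the classifier-skip test can assert on run_seed without inspecting template output.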

Tests:

test_add_guest_form_disables_prose_when_edge_exists: pre-seed a host→guest edge from a prior chat; render the form; assert the textarea has disabled attribute AND the "they already know each other" message is in the body.
test_add_guest_with_existing_edge_skips_seed_call: pre-seed edge; submit form without toggling re-seed; assert classifier mock was NOT called (count check on canned-response queue).

Commit: feat: first-meeting gate on drawer Add-guest form (T72.2).

72.3 — Witness flag editing (Phase 2.5 backlog #10)

Memories show witness flags [you, host, guest] read-only in the drawer. Add an inline-edit affordance: each flag becomes a checkbox; toggling submits a manual_edit event with target_kind="memory_witness", payload {memory_id, flag: 'you'|'host'|'guest', new_value: bool}.

The manual_edit projector needs a new dispatch branch for memory_witness — same as the knowledge_facts branch in 72.1; do them together if cleaner.
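A possible shape for the memory_witness dispatch branch is sketched below. The witness_guest column name comes from the test description that follows; witness_you and witness_host are assumed by analogy, and apply_memory_witness_edit is a hypothetical function name. The column is chosen from a fixed mapping so the flag value never reaches the SQL text directly.

```python
import sqlite3

# witness_guest appears in the plan; witness_you and witness_host are assumed.
WITNESS_COLUMNS = {
    "you": "witness_you",
    "host": "witness_host",
    "guest": "witness_guest",
}


def apply_memory_witness_edit(conn: sqlite3.Connection, payload: dict) -> None:
    """Project a manual_edit with target_kind='memory_witness' onto the memories row."""
    column = WITNESS_COLUMNS.get(payload["flag"])
    if column is None:
        raise ValueError(f"unknown witness flag: {payload['flag']!r}")
    # Column name comes from the whitelist above; values are bound parameters.
    conn.execute(
        f"UPDATE memories SET {column} = ? WHERE id = ?",
        (int(bool(payload["new_value"])), payload["memory_id"]),
    )
```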

Tests: 2.

test_witness_flag_toggle_updates_memory_row: seed memory with witness [1, 1, 0]. POST toggle on guest flag → 1. Project. Assert memories.witness_guest = 1.
test_witness_flag_toggle_emits_manual_edit_event: same setup; assert the manual_edit event has the right target_kind and prior_value/new_value.

Commit: feat: drawer witness flag inline-edit (T72.3).

Wave 4 — Turn-flow polish (parallel)

Two tasks, file-disjoint. T73 owns chat/services/regenerate.py; T74 owns chat/web/turns.py + adds a new addressee-detection service.

Each task bundles multiple backlog items. Implementer should split commits within each task.

Task 73: `regenerate.py` polish

Files:

Modify: chat/services/regenerate.py
Modify: tests/test_regenerate.py (add tests; existing tests preserved)

Spec: Three regenerate-related backlog items.

73.1 — Regenerate broadcasts `turn_html` over SSE (Phase 1.5 backlog #2)

After the new assistant_turn lands, broadcast a turn_html event over the chat's pub/sub channel — mirror the broadcast logic in chat/web/turns.py:post_turn. The existing post_turn does this via publish(chat_id, {"event": "turn_html", "html": ...}) (or similar — verify). Use the same render path so connected tabs swap the regenerated turn live, no refresh required.

Test: test_regenerate_broadcasts_turn_html_over_sse — mock publish and assert it was called with the new assistant_turn's rendered HTML.
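The mock-based assertion could look roughly like the sketch below. It assumes publish is a synchronous callable taking (chat_id, message) and that the message shape matches the one quoted above; both need to be verified against post_turn, and if the real publish is async the mock would need to be an AsyncMock instead. broadcast_regenerated_turn and check_broadcast are hypothetical names.

```python
from typing import Any, Callable
from unittest.mock import Mock

Publish = Callable[[str, dict[str, Any]], None]


def broadcast_regenerated_turn(
    publish: Publish,
    render_turn: Callable[[dict], str],
    chat_id: str,
    turn: dict,
) -> str:
    """Render the regenerated turn and publish it on the chat's channel."""
    html = render_turn(turn)
    publish(chat_id, {"event": "turn_html", "html": html})
    return html


def check_broadcast() -> None:
    """Shape of the test: inject a Mock for publish and assert on the single call."""
    publish = Mock()
    html = broadcast_regenerated_turn(
        publish, lambda t: f"<div>{t['text']}</div>", "chat-1", {"text": "hi"}
    )
    publish.assert_called_once_with("chat-1", {"event": "turn_html", "html": html})
```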

Commit: feat: regenerate broadcasts turn_html over SSE (T73.1).

73.2 — Interjection regenerate (Phase 2.5 backlog #6)

Phase 2 T44 deferred interjection regenerate: regenerate currently only acts on the addressee turn. Extend so that when a turn group has both a primary assistant_turn and an assistant_turn flagged as interjection_of=..., regenerate redoes BOTH — the primary first, then the interjection (using the same interjection-decision classifier path as post_turn). The interjection branch may decide should_interject=False on the regenerate, in which case the previous interjection_turn is superseded but no new interjection is appended.

Test: test_regenerate_with_interjection_redoes_both_turns — seed a 3-entity scene with a prior primary + interjection; regenerate; assert two new assistant_turns land (or one new + a supersede-without-replace if the regenerated decision was "no interjection").

Commit: feat: regenerate covers interjection turns (T73.2).

73.3 — Stale-guest defensive degrade cleanup in regenerate.py (Phase 2.5 backlog #12, partial)

Phase 2 T44 added a defensive degrade-to-1:1 in regenerate.py when chat.guest_bot_id points at a deleted bot. T47 fixed the root cause (resets clear the reference). The defensive degrade is now dead code.

Remove the degrade block; let the function trust that chat.guest_bot_id is either valid or NULL. The corresponding existing test for the defensive degrade can be removed (the bot_reset cascade test in tests/test_reset.py already covers the root-cause behavior).

Commit: chore: remove defensive stale-guest degrade in regenerate.py (T73.3).

Verification gates

pytest tests/test_regenerate.py -v — existing + new all pass.
Full suite green.

Task 74: turn-flow polish + new addressee-detection service

Files:

Modify: chat/web/turns.py
Create: chat/services/addressee.py (new classifier wrapper for addressee detection)
Create: tests/test_addressee.py
Modify: tests/test_turn_flow.py (add tests; existing 8 tests preserved)

Spec: Four turn-flow backlog items.

74.1 — Classifier-based addressee detection (Phase 2.5 backlog #7)

Phase 2 T44's _detect_addressee_id uses a whole-word regex match on bot names. This is brittle. It breaks on bot names that are common English words (e.g., a bot named "Sam"), on names appearing inside a quoted aside ("Did you see what Sam wrote in his letter?" is addressed to the host, not Sam), and on fuzzy references.

Replace with a small classifier call. New module chat/services/addressee.py:

class AddresseeDecision(BaseModel):
    addressee_id: str  # bot id, "you", or "host" as fallback
    confidence: str = "medium"  # "high" | "medium" | "low"
    reason: str = ""

async def detect_addressee(
    client: LLMClient,
    *,
    classifier_model: str,
    user_prose: str,
    host_id: str,
    host_name: str,
    guest_id: str | None,
    guest_name: str | None,
    timeout_s: float = 30.0,
) -> AddresseeDecision:
    """Classify which present bot the user is addressing in this turn.
    Defaults to host on failure or low confidence."""

System prompt: "Given a user's turn prose and the names of present bots, decide which bot the user is addressing. If the user is speaking to no specific bot (descriptive narration, action without dialogue), default to the host. Output strict JSON."

Default fallback (classifier failure) = AddresseeDecision(addressee_id=host_id, confidence="low", reason="fallback").

In chat/web/turns.py, replace _detect_addressee_id with a call to detect_addressee. Keep the substring helper as a low-confidence pre-filter for the no-guest case (no LLM call needed when only one bot is present — preserves throughput).
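The fallback rules (single bot present, classifier failure, low confidence) can be isolated from the LLM call itself. In this sketch a dataclass stands in for the pydantic model, raw is the classifier's JSON text or None on failure, and resolve_addressee is a hypothetical helper. It also assumes that only ids of present bots are accepted, which is stricter than the comment on addressee_id suggests.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AddresseeDecision:  # dataclass stand-in for the pydantic model
    addressee_id: str
    confidence: str = "medium"
    reason: str = ""


def resolve_addressee(raw: Optional[str], *, host_id: str,
                      guest_id: Optional[str]) -> AddresseeDecision:
    """Turn raw classifier output into a decision, defaulting to the host."""
    if guest_id is None:
        # Only one bot present: no classifier call is needed at all.
        return AddresseeDecision(host_id, "high", "only one bot present")
    fallback = AddresseeDecision(host_id, "low", "fallback")
    if raw is None:
        return fallback
    try:
        data = json.loads(raw)
        addressee_id = data["addressee_id"]
        confidence = data.get("confidence", "medium")
        reason = str(data.get("reason", ""))
    except (json.JSONDecodeError, KeyError, TypeError, AttributeError):
        return fallback
    if addressee_id not in (host_id, guest_id):
        return fallback
    if confidence not in ("high", "medium"):
        return fallback  # covers "low" and any unexpected value
    return AddresseeDecision(addressee_id, confidence, reason)
```

Routing the no-guest case through the same function keeps the call site in turns.py to a single branch, while still skipping the LLM round trip when only the host is present.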

Tests:

tests/test_addressee.py (new file): 3 tests — classifier returns guest, classifier returns host, classifier failure falls back to host.
tests/test_turn_flow.py: update test_addressee_detection_routes_to_named_bot from Phase 2 T44 to use the new classifier path. (Existing test should keep passing with the new mock orchestration; canned-response queue may need an extra slot for the addressee decision.)

Commit: feat: classifier-based addressee detection (T74.1).

74.2 — Significance for interjection memories (Phase 2.5 backlog #11)

Phase 2 T44 noted: the interjection branch's memory_written event doesn't enqueue a SignificanceJob. Wire it in: after the interjection memory write (the record_turn_memory_for_present call in the interjection branch), enqueue a SignificanceJob with the interjection's host memory id (mirror the primary turn's enqueue at the end of the primary branch).

If both host and guest memory ids exist for the interjection (as they will when both are present), enqueue once for the host id (the existing pattern for primary turns — the score applies to both POVs since the prose is identical at the time of write).

Test: test_interjection_enqueues_significance_job — mock the worker; trigger an interjection; assert SignificanceJob was enqueued with the interjection memory id.

Commit: fix: enqueue significance for interjection memories (T74.2).

74.3 — Scene close on cancel review (Phase 2.5 backlog #13)

Phase 2 T44 review noted: when a primary turn is cancelled mid-stream, scene close still runs. Behavior may be intentional (close detection looks at user prose, not bot output) or wrong (a cancelled turn is incomplete; closing the scene on it is premature).

Decision for this task: review the call path. If the close detection truly only consults user prose AND the user prose is fully present at the moment of cancel (it is — user prose is appended before the stream starts), the existing behavior is correct: a cancelled turn doesn't invalidate the user's intent to close the scene. Document this in a code comment near the close-detection branch.

If a play-test surfaces a regression (e.g., a user cancels because the bot misread their close intent), revisit. Default: document and close as a no-op.

Test: test_cancelled_turn_still_closes_scene_when_user_prose_signals_close — pin the existing behavior so a future refactor doesn't quietly change it.

Commit: chore: pin scene-close-on-cancel behavior + comment rationale (T74.3).

74.4 — Stale-guest defensive degrade cleanup in turns.py (Phase 2.5 backlog #12, partial)

Same as T73.3 but for chat/web/turns.py: T44's defensive degrade-to-1:1 in post_turn (lines 235-242 per the T44 implementer note) is dead code now that T47 fixed the root cause. Remove it.

Commit: chore: remove defensive stale-guest degrade in turns.py (T74.4).

Verification gates

pytest tests/test_addressee.py -v — 3/3 new tests pass.
pytest tests/test_turn_flow.py -v — existing 8 + new 2-3 all pass.
pytest tests/test_reset.py -v — Phase 2 T47 root-cause cascade still green.
Full suite green.

Wave 5 — Docs sweep (single task)

Task 75: Remove shipped items from CLAUDE.md backlogs

Files:

Modify: CLAUDE.md

Spec: Walk through the 15 backlog items in CLAUDE.md §"Phase 1.5 cleanup backlog" and §"Phase 2.5 / 3 backlog". For each item shipped during Phase 2.5 (T68–T74), remove it from the backlog list. Add a new section "Phase 2.5 status" near the existing "Phase 2 status" section listing what shipped:

open_db refactor (T68).
bot_reset purges orphaned "you" activity rows (T69).
LLM-merged group meta-summary (T70).
Prompt assembly polish: witness role parametric, single ACTIVITIES block, NICE trim documented (T71).
Drawer edits for deferred v1 fields, first-meeting gate, witness flag editing (T72).
Regenerate over SSE + interjection regenerate + stale-guest cleanup (T73).
Classifier-based addressee detection + significance for interjection + scene-close-on-cancel pinned + stale-guest cleanup (T74).

If any task during execution chose NOT to ship a sub-item (e.g., T71.3 left NICE trim unchanged with a documented rationale), keep that sub-item in a "Phase 3.5+ deferred" section with the rationale. The goal is for the backlog list to reflect actual repo state, not aspirational scope.

If any new follow-ups were discovered during T68–T74 reviews, add them to the appropriate backlog section.

Commit: docs: phase 2.5 status, prune shipped backlog items (T75).

Wrap-up

After Wave 5 lands:

Run full suite on phase-2.5: should be ~225+ tests passing (212 from Phase 2 + ~15 new across the 8 tasks).
Manual smoke (recommended before opening the PR):
- Drawer: edit edge_trust on a chat; verify the new value sticks after refresh.
- Drawer: edit edge_summary on a chat; refresh; verify.
- Drawer: toggle a memory's witness flag; refresh; verify.
- Drawer: open Add-guest form for a (host, guest) pair that already shares an edge; verify the gate disables the prose textarea.
- Drawer: open Add-guest form for a fresh pair; verify the textarea is enabled.
- Reset a bot; verify "you" activity rows for that bot's chats are gone (run sqlite3 data/db.sqlite "SELECT * FROM activity WHERE entity_id='you'" before/after).
- Multi-tab: open two tabs on the same chat; click Regenerate on one; verify the other tab sees the new turn live (no refresh).
- Trigger an interjection turn; check the worker queue or significance_jobs table; verify a job was enqueued for the interjection memory.
- Use a bot with a name that's a common word ("Sam"); ask "did you see what Sam wrote?" — verify host gets the floor (classifier addressee detection, not substring).
Push phase-2.5 to gitea.
Open PR phase-2.5 → main.
No new Phase 3+ backlog items expected — if review surfaces any, add to CLAUDE.md.
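The before/after check on "you" activity rows from the manual smoke list can also be scripted. A minimal sketch, assuming only that the `activity` table has an `entity_id` column as in the query above; the `chat_id` column, the in-memory schema, and the `DELETE` standing in for the bot reset are illustrative assumptions.

```python
import sqlite3


def count_you_activity(conn: sqlite3.Connection) -> int:
    """Count activity rows whose entity_id is 'you'."""
    row = conn.execute(
        "SELECT COUNT(*) FROM activity WHERE entity_id = ?", ("you",)
    ).fetchone()
    return row[0]


def demo() -> tuple:
    """Run the before/after check against a throwaway in-memory database."""
    conn = sqlite3.connect(":memory:")
    # Assumed minimal schema; the real table has more columns.
    conn.execute("CREATE TABLE activity (entity_id TEXT, chat_id INTEGER)")
    conn.executemany(
        "INSERT INTO activity VALUES (?, ?)",
        [("you", 1), ("you", 1), ("bot_a", 1)],
    )
    before = count_you_activity(conn)
    # Stand-in for the reset: purge the "you" rows for the reset bot's chat.
    conn.execute("DELETE FROM activity WHERE entity_id = 'you' AND chat_id = 1")
    after = count_you_activity(conn)
    conn.close()
    return before, after
```

Against the real database, open `data/db.sqlite` with `sqlite3.connect` and call `count_you_activity` once before and once after the reset.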

Notes for the controller running this plan

Don't dispatch Wave 4 until Wave 3 is merged AND tested green on phase-2.5. The new addressee service that T74 adds is stand-alone, but the existing tests in tests/test_turn_flow.py may have shifted during Wave 3 if the drawer-test fixtures touch shared state. Verify green before fanning out.
After each parallel wave, run a code-review subagent (subagent-driven-development skill's two-stage review pattern) on each task. For purely mechanical tasks (T68, T69), combined spec+quality is acceptable. For bundled tasks (T71, T72, T74), use separate spec + quality reviewers — the surface area is larger.
If Phase 3 (phase-3 branch) is in flight in parallel, T75 (the docs sweep) should land on phase-2.5 only — Phase 3's docs sweep (T67) is independent. Both will resolve when the two branches merge to main in some order; expect a small CLAUDE.md merge to reconcile any overlapping backlog edits.
If a task's "split commits" guidance proves impractical (e.g., bundling means a test pins 3 fixes at once), one consolidated commit is acceptable. The split is an aid for review bisection, not a hard rule.
Token-spend rough estimate: Phase 2.5 should be ~50% the size of Phase 2 (smaller scope, all reuse). Per-task token spend similar to Phase 2's smaller tasks (T36, T37, T47).
DO NOT break existing v1 / v2 surface contracts. Every test file that was green at the start of Phase 2.5 must stay green at the end. The tests/test_witness_filter_multi.py contracts pinned in Phase 2 T46 are particularly load-bearing for T71.1 — verify them after the witness-role parametric fix lands.
