12 tasks across 7 waves consolidating the 17-item backlog tracked in CLAUDE.md (7 from Phase 2.6/3 + 10 from Phase 3.5/4). Items are grouped by file ownership so each wave stays file-disjoint: - Wave 1 (parallel 4-way): trivial single-line/single-file fixes (timeout_s plumbing, Literal type, docstring, defensive None) - Wave 2 (single): scene_summarize.py polish bundle (5 T58 items) - Wave 3 (single): typed ChatNotFoundError for skip routes - Wave 4 (single): turns.py wiring (consume_pending_meanwhile_digests + natural-language skip runs scene close detection) - Wave 5 (single): regenerate.py polish (cancel hook + DRY + sibling query + lifecycle rollback documentation + ordering) - Wave 6 (parallel 3-way): unified record_turn_memory API + JSON audit + frontend turn_html_replace SSE handler - Wave 7 (single): docs sweep No schema migrations. Bundled tasks split into per-item sub-commits for clean review bisection. Uses task ids T76-T87 to avoid collision with prior phases (Phase 3 used T49-T67, Phase 2.5 used T68-T75).
39 KiB
Roleplay Engine — Phase 3.5 Cleanup Plan
For Claude: REQUIRED SUB-SKILL: Use
superpowers-extended-cc:executing-plansto implement this plan task-by-task. Use the parallel-dispatch pattern documented under "Parallel-Execution Strategy" for waves that fan out to multiple subagents.
Goal: Burn down the combined Phase 2.6/3 + Phase 3.5/4 backlog tracked in CLAUDE.md. 17 follow-up items consolidated into 12 tasks (file-disjoint where possible) across 7 waves so several can run in parallel.
Architecture: No new architecture, no schema migrations, no new event kinds. Every change here is either a polish on an existing service/route, a refactor for DRY/typing/observability, or a UI affordance for state that already exists (frontend turn_html_replace consumer).
Tech Stack: Same as Phase 3. No new dependencies.
Source-of-truth references:
- Backlog list:
CLAUDE.md§"Phase 2.6 / 3 backlog" (7 items, 6 actionable + 1 deferred) + §"Phase 3.5 / 4 backlog" (~13 items grouped by source review) = 17 items net. - Conventions:
CLAUDE.md§"Behavioral defaults" + §"Phase 3 status". - Phase 2.5 cleanup plan (style, file-bundling pattern): 2026-04-26-v2.5-phase2.5-cleanup.md.
When a task says "see §X", that's the requirements doc unless stated otherwise.
Pre-flight
Branch: create phase-3.5 from the latest main after Phase 3 has merged (it has — main is at 753cec3):
git checkout main && git pull && git checkout -b phase-3.5
Schema baseline: Phase 3 leaves the DB at version 11. Phase 3.5 adds no migrations. Schema-version assertion in tests/test_world.py stays at 11.
Pinned non-negotiables (carried forward):
- State changes go through the event log. Use
append_and_apply(conn, kind, payload)for the live path;apply_eventonly after a freshappend_eventreturning the new id. - Witness filter every memory read at SQL level (hard
WHEREconstraint; never a soft signal). - Edges are directed;
botA → botBandbotB → botAare independent records. - Per-POV scene summaries — never write omniscient narration.
- TDD: every task starts with a failing test (or a regression test pinning existing contract before refactor).
- One commit per task minimum. Tasks that bundle multiple backlog items SHOULD split commits within the task — one commit per item — so review can bisect cleanly.
Verification before claiming done: Use superpowers-extended-cc:verification-before-completion — run the test command, paste actual output. Don't assume green.
Backlog item → task mapping
17 items consolidated into 12 tasks by file ownership (so each wave's tasks stay file-disjoint). Bundled tasks split commits internally.
| # | Backlog item | Source | Task |
|---|---|---|---|
| 1 | narrate_skip timeout_s not piped through |
T53 review | T76 |
| 2 | AddresseeDecision.confidence should be Literal[...] |
T74 review (Phase 2.5 carry-over) | T77 |
| 3 | search_memories docstring missing SQL-bias note |
T57 review | T78 |
| 4 | _witness_role_for defensive coding for host_bot_id is None |
T71 review (Phase 2.5 carry-over) | T79 |
| 5 | Scene close re-close suffix bloat risk | T58 review | T80 |
| 6 | Thread detection transcript scoping (chat-wide vs scene-scoped) | T58 review | T80 |
| 7 | Swallowed Exception in detect_threads try/except — log at debug |
T58 review | T80 |
| 8 | Scene close closed_at clock divergence |
T58 review | T80 |
| 9 | T58 test coverage gaps (truncation, update/close, fallback) | T58 review | T80 |
| 10 | Error-message prefix sniff for 404 vs 400 routing in skip routes | T62 review | T81 |
| 11 | consume_pending_meanwhile_digests not wired into post_turn |
T66 integration tests | T82 |
| 12 | Skip command bypasses scene close detection | T62 review | T82 |
| 13 | Cancel/stop hook for in-flight regenerate streams | T73 review (Phase 2.5 carry-over) | T83 |
| 14 | DRY: regenerate vs post_turn (recent-dialogue + prior-edges duplication) | T73 review (Phase 2.5 carry-over) | T83 |
| 15 | Sibling-discovery query optimization in regenerate.py | T73 review (Phase 2.5 carry-over) | T83 |
| 16 | Regenerate doesn't roll back lifecycle transitions from superseded turn | T61 review | T83 |
| 17 | Asymmetry in event-detection ordering (turns.py vs regenerate.py) | T61 review | T83 |
| 18 | record_meanwhile_memory / record_turn_memory_for_present unified API |
T64 review | T84 |
| 19 | participants_json JSON-build audit (other state modules) |
T63 review | T85 |
| 20 | Stop-button cancellation route-level coverage for meanwhile turns | T64 review | T85 |
| 21 | Frontend handler for turn_html_replace SSE event |
T73.1 review (Phase 2.5 carry-over) | T86 |
| — | Docs sweep — remove shipped items, capture residuals | (this plan) | T87 |
Deferred (not in this plan):
- Cross-feature canned-queue brittleness (Wave 6b cross-feature): would require a structured-fixture refactor across 3+ test files. Tracking but not in scope here — open as its own work if it surfaces a real test-maintenance pain.
- Scene-close-on-cancel UX revisit (T74.3): pinned to existing behavior; no action needed unless real play surfaces a regression.
Parallel-Execution Strategy
Same pattern as Phases 2.5 and 3. Seven waves: parallel within each wave (file-disjoint), serial across waves.
How to dispatch a wave in parallel
Use the Agent tool with isolation: "worktree" so each subagent gets its own git worktree. (If the controlling session's working directory is not the chat repo, create worktrees manually with git worktree add .worktrees/<wave>-<task> -b <wave>/<task> phase-3.5 from inside the chat repo and pass the worktree path explicitly into each subagent prompt.)
In a single message, dispatch all tasks in the wave:
Agent({
description: "Wave 1 — T76 narrate_skip timeout_s",
subagent_type: "general-purpose",
isolation: "worktree",
prompt: "<full task text from below>",
})
Agent({ ...T77... })
Agent({ ...T78... })
Agent({ ...T79... })
After a wave completes
- Each subagent returns its worktree path and commit SHA(s).
- Run a spec + code-quality reviewer subagent on each completed task. Combined review acceptable for purely mechanical fixes (T76–T79). Separate spec + quality reviewers for bundled tasks (T80, T82, T83).
- Merge the wave into
phase-3.5in any order (file-disjointness guarantees no conflict). Use--no-ff. - Run the full test suite on the merged
phase-3.5. If red, the wave's mutual-independence assumption was violated — bisect, fix, re-merge. - Push
phase-3.5to gitea. - Optionally clean up worktrees:
git worktree remove .worktrees/<branch>andgit branch -D <branch>.
Conflict prevention checklist
For each parallel wave, verify the Files sections of all tasks have no overlapping paths. Hot files in this plan: chat/web/turns.py (T82 only), chat/services/regenerate.py (T83 only), chat/web/skip.py + chat/web/drawer.py (T81 only). Each is owned by exactly one task.
Failure recovery
If one subagent fails: cancel it, merge the others' successful work, re-dispatch the failed task as a single follow-up. Don't block the wave.
Why each wave is parallel-safe
| Wave | Tasks | Hot files touched | Disjoint? |
|---|---|---|---|
| 1 | T76, T77, T78, T79 | 4 different services/state files; no overlap | ✅ |
| 2 | T80 | chat/services/scene_summarize.py |
(single task) |
| 3 | T81 | chat/web/skip.py + chat/web/drawer.py (typed exception) |
(single task) |
| 4 | T82 | chat/web/turns.py |
(single task) |
| 5 | T83 | chat/services/regenerate.py (+ optional new shared-helpers module) |
(single task) |
| 6 | T84, T85, T86 | chat/services/memory_write.py (T84); audit + tests (T85); frontend HTML/JS (T86) |
✅ |
| 7 | T87 | CLAUDE.md |
(single task) |
Task overview
Wave 1 ─┬─ T76: narrate_skip timeout_s plumbed through (skip_narration.py)
├─ T77: AddresseeDecision.confidence as Literal (addressee.py)
├─ T78: search_memories docstring SQL-bias note (memory.py)
└─ T79: _witness_role_for defensive None handling (prompt.py)
Wave 2 ─── T80: scene_summarize.py polish bundle (5 T58 items)
Wave 3 ─── T81: typed exception for skip controllers (skip.py + drawer.py)
Wave 4 ─── T82: turns.py wiring (consume_pending_meanwhile_digests + skip-bypass scene close)
Wave 5 ─── T83: regenerate.py polish bundle (cancel hook + DRY + sibling query +
lifecycle rollback + ordering asymmetry)
Wave 6 ─┬─ T84: memory_write.py unified record-memory API
├─ T85: JSON-build audit + meanwhile stop-button route-level test
└─ T86: frontend turn_html_replace SSE handler
Wave 7 ─── T87: docs sweep — prune CLAUDE.md backlogs, capture residuals
Critical path: 7 sequential merge points. Total tasks: 12. Wall-clock parallelism advantage: Waves 1 and 6 dispatch concurrently (4-way and 3-way respectively). Waves 2, 3, 4, 5, 7 are single-task by hot-file constraint.
Wave 1 — Independent small fixes (parallel, 4 tasks)
All 4 tasks are tiny (1-line + tests). Fully file-disjoint.
Task 76: narrate_skip timeout_s plumbed through
Files:
- Modify:
chat/services/skip_narration.py - Modify:
tests/test_skip_narration.py(add 1 test)
Spec: narrate_skip(timeout_s=60.0) accepts the parameter but doesn't pass it to client.generate(...). Per T53 review — fix:
# Inside narrate_skip:
result = await client.generate(
[
Message(role="system", content=system),
Message(role="user", content=user),
],
model=narrative_model,
max_tokens=200,
temperature=0.7,
timeout_s=timeout_s, # NEW: pipe through
)
If the underlying client.generate doesn't honor timeout_s (Featherless client signature accepts **params so it'll ride through harmlessly), this is a no-op at runtime but documents intent. Read chat/llm/featherless.py::FeatherlessClient.generate to confirm.
Test added: test_narrate_skip_passes_timeout_through — capture the kwargs passed to client.generate via a custom mock that records them; assert timeout_s is present with the expected value.
Commit: fix: plumb narrate_skip timeout_s through to client.generate (T76).
Task 77: AddresseeDecision.confidence as Literal type
Files:
- Modify:
chat/services/addressee.py - Modify:
tests/test_addressee.py(existing tests should still pass; add 1 test for the typed validation)
Spec: Per T74 review (Phase 2.5 carry-over). AddresseeDecision.confidence is currently str with a comment noting valid values. Tighten to Literal["high", "medium", "low"]:
from typing import Literal
class AddresseeDecision(BaseModel):
addressee_id: str
confidence: Literal["high", "medium", "low"] = "medium"
reason: str = ""
Pydantic will reject classifier output with values outside the literal set, falling back via the default= in classify(...). The existing fallback path returns confidence="low" so no behavior change for legitimate calls.
Test added: test_invalid_confidence_value_falls_back_to_default — feed canned classifier output with "confidence": "VERY_HIGH"; assert the result is the default fallback (Pydantic validation failed → classify falls back).
Commit: fix: AddresseeDecision.confidence as Literal[high|medium|low] (T77).
Task 78: search_memories docstring mentions SQL-bias
Files:
- Modify:
chat/state/memory.py— extend thesearch_memoriesdocstring with a one-liner aboutSIGNIFICANCE_RANK_BIAS. - No test changes needed (docstring-only).
Spec: Per T57 review. The current search_memories docstring describes only the Python-side composite re-rank with _SIGNIFICANCE_WEIGHT. T57 added a SQL-side SIGNIFICANCE_RANK_BIAS constant. Add to the docstring:
... (existing prose) ...
The result ordering applies TWO independent significance boosts:
- SQL-side: ``ORDER BY (rank - significance * SIGNIFICANCE_RANK_BIAS)``
pushes higher-significance memories ahead in the FTS5 candidate set.
- Python-side: a composite re-rank with `_SIGNIFICANCE_WEIGHT` reinforces
the ordering after candidate retrieval.
Commit: docs: search_memories docstring mentions SQL-side significance bias (T78).
Task 79: _witness_role_for defensive None handling
Files:
- Modify:
chat/services/prompt.py - Modify:
tests/test_prompt.py(add 1 test)
Spec: Per T71 review (Phase 2.5 carry-over). Current _witness_role_for(speaker_bot_id, host_bot_id) returns "host" if speaker_bot_id == host_bot_id else "guest". When host_bot_id is None (degenerate case — can happen if a chat row is half-seeded in a test), this returns "guest" even though the speaker is logically the host. Defensive fix:
def _witness_role_for(speaker_bot_id: str, host_bot_id: str | None) -> str:
"""..."""
if host_bot_id is None or speaker_bot_id == host_bot_id:
return "host"
return "guest"
Test added: test_witness_role_for_none_host_returns_host — call helper with host_bot_id=None; assert returns "host".
Commit: fix: _witness_role_for defensive None handling (T79).
Wave 2 — scene_summarize.py polish bundle (single task)
T80 bundles 5 backlog items (4 fixes + 1 test gap) all touching chat/services/scene_summarize.py and its test file. Single task by hot-file constraint; split into 5 commits internally for clean review bisection.
Task 80: scene_summarize.py polish
Files:
- Modify:
chat/services/scene_summarize.py - Modify:
tests/test_per_pov_summary.py
Spec: Five sub-fixes per the T58 review.
80.1 — Re-close suffix bloat guard
Problem: _build_key_quotes_suffix reads from memories.pov_summary. If a scene close runs twice (replay, manual re-close, idempotent retry), the second pass reads the rewritten text plus the previous "Key quotes:" suffix and appends a second one — recursive bloat.
Fix (recommended option): source quotes from event_log assistant_turn/user_turn text instead of memories.pov_summary. The event-log text is immutable per turn, so re-close produces identical key-quote text.
Implementation: in _build_key_quotes_suffix(conn, scene_id), replace the SELECT from memories with a JOIN that pulls turn text from event_log filtered by scene_id (via payload_json JSON extraction), keeping the significance-from-memories filter.
If event-log queries prove too expensive or fragile, fall back to: at the start of _build_key_quotes_suffix, strip any existing "\n\nKey quotes:\n" suffix from each pov_summary before computing the new one. Document the choice in a code comment.
Test added: test_scene_close_re_run_does_not_double_suffix — close a scene with significance ≥ 2; assert each pov_summary contains "Key quotes:" exactly ONCE. Then call apply_scene_close_summary again on the same scene; assert each pov_summary STILL contains "Key quotes:" exactly ONCE (no second append).
Commit: fix: guard scene close key-quote suffix against re-close bloat (T80.1).
80.2 — Thread detection transcript scoping
Problem: _read_recent_dialogue returns chat-wide history with no scene_id filter. Feeding chat-wide history to detect_threads will misattribute threads to the closing scene when a scene boundary falls inside the last 50 turns.
Fix: add a scene_started_at lookup at the top of apply_scene_close_summary. Filter the transcript to turns where the event_log row's created_at (or chat_clock_at) is >= scene_started_at. Read chat/state/world.py::get_scene for the started_at column.
If get_scene doesn't expose started_at, query directly: SELECT started_at FROM scenes WHERE id = ?.
Test added: test_thread_detection_uses_scene_scoped_transcript — seed a scene with 3 turns. Then close the scene. Then start a new scene and seed 3 more turns. Then close the second scene. Mock detect_threads to capture its scene_transcript arg. Assert the second close's transcript contains ONLY the 3 second-scene turns (not the first scene's turns).
Commit: fix: scope thread detection transcript to closing scene (T80.2).
80.3 — Log swallowed exceptions in detect_threads try/except
Problem: T58's try/except Exception: around detect_threads swallows programmer errors silently.
Fix: add a logging.getLogger(__name__).debug("detect_threads failed: %s", exc, exc_info=True) (or appropriate level) inside the except block. Use the standard library logging module (it's already configured by Phase 1's chat.app).
Test added: test_detect_threads_failure_is_logged — patch detect_threads to raise RuntimeError("test"). Use caplog (pytest fixture) to capture log output. Assert the log contains the error message.
Commit: fix: log swallowed exceptions in detect_threads try/except (T80.3).
80.4 — Scene close closed_at chat-clock semantics
Problem: T58 uses datetime.now(timezone.utc).isoformat() for closed_at on emitted events instead of chat-clock time. Diverges from chat-clock semantics elsewhere.
Fix: read chat["time"] at the top of apply_scene_close_summary (already loaded for per-POV writes via chat_clock_at). Pass chat_clock_at through to thread_closed events' closed_at field instead of datetime.now(...).
Test added: test_thread_closed_uses_chat_clock_time — seed a chat with chat["time"] = "2026-04-26T10:00:00+00:00". Mock detect_threads to return one ThreadCandidate(action="close", existing_thread_id="thr_x"). Close the scene. Inspect the emitted thread_closed event payload — assert closed_at == "2026-04-26T10:00:00+00:00", NOT a datetime.now(...) value.
Commit: fix: thread_closed uses chat-clock time, not wall clock (T80.4).
80.5 — T58 test coverage gaps
Spec: Per T58 review, these specific scenarios lack coverage. Add tests:
-
test_key_quote_truncation_at_200_chars— seed a memory with significance 2 and a 500-charpov_summary. Close scene. Assert the key-quote bullet for that memory is exactly 200 chars (or 199 + ellipsis, depending on the truncation impl). -
test_thread_detection_update_candidate_emits_thread_updated— mockdetect_threadsto returnThreadCandidate(action="update", existing_thread_id="thr_x", summary="updated"). Close. Assert athread_updatedevent landed. -
test_thread_detection_close_candidate_emits_thread_closed— same setup withaction="close". Assert athread_closedevent landed.
(The try/except fallback test from 80.3 covers the exception path; no separate test needed.)
Commit: test: T58 coverage gaps (truncation, update/close paths) (T80.5).
Verification gates for T80
- All 5 sub-fix commits cleanly split per
git show. - All 13 existing per_pov_summary tests still pass.
- 5+ new tests pass.
- Full suite green.
git diff phase-3.5 --statshows ONLYchat/services/scene_summarize.pyandtests/test_per_pov_summary.py.
Wave 3 — Typed exception for skip controllers (single task)
Task 81: ChatNotFoundError for skip route routing
Files:
- Modify:
chat/web/skip.py— define a typed exception and raise it instead of genericValueError("chat not found"). - Modify:
chat/web/drawer.py— replace string-prefix sniff withexcept ChatNotFoundError: 404. - Modify:
tests/test_drawer_events_threads_skip.py(add 1 test).
Spec: Per T62 review. Drawer skip routes use str(exc).startswith("chat not found") to distinguish 404 from 400. Fragile if error wording changes. Replace with a typed exception subclass.
# chat/web/skip.py top:
class ChatNotFoundError(Exception):
"""Raised when a chat_id doesn't resolve. Maps to 404 in HTTP routes."""
# Inside process_elision_skip / process_jump_skip:
chat = get_chat(conn, chat_id)
if chat is None:
raise ChatNotFoundError(f"chat {chat_id!r} not found")
# Other validation failures still raise ValueError (mapped to 400).
# chat/web/drawer.py skip routes:
try:
result = await process_elision_skip(...)
except ChatNotFoundError:
raise HTTPException(status_code=404, detail="chat not found")
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc))
Test added: test_skip_route_raises_404_via_typed_exception — POST to /chats/nonexistent/drawer/skip/elision; assert response 404. (Existing tests should still pass — they check 404 for missing chats and 400 for validation errors.)
Commit: fix: typed ChatNotFoundError replaces string-prefix sniff in skip routes (T81).
Wave 4 — turns.py wiring (single task)
Task 82: post_turn meanwhile-digest consumption + skip-command scene close
Files:
- Modify:
chat/web/turns.py - Modify:
tests/test_turn_flow.py(add 2 tests) - Modify:
tests/test_phase3_integration.py(T66 test 4 currently callsconsume_pending_meanwhile_digestsdirectly — update it to assert the helper is now invoked AUTOMATICALLY by post_turn)
Spec: Two related fixes both touching post_turn.
82.1 — Wire consume_pending_meanwhile_digests into post_turn
Problem (T66): chat/services/prompt.py::consume_pending_meanwhile_digests is defined but never called. Meanwhile digests stay pending forever in production.
Fix: call the helper at the END of post_turn (after the assistant_turn lands and all classifier passes finish, BEFORE the response returns). The helper appends meanwhile_digest_consumed events for each pending digest, marking them as surfaced. Idempotent — re-calling produces zero events.
Place the call only when the chat has just-surfaced digests. The simplest pattern: always call after the primary turn — the helper short-circuits when nothing's pending. (Cost: 1 trivial DB query per turn.)
from chat.services.prompt import consume_pending_meanwhile_digests
# Near the end of post_turn, before publish/return:
consume_pending_meanwhile_digests(conn, chat_id)
Test added: test_post_turn_consumes_pending_meanwhile_digests — seed a pending digest. POST a turn. Assert: a meanwhile_digest_consumed event landed in event_log AND list_pending_meanwhile_digests is now empty.
Update existing T66 test 4: the integration test currently calls consume_pending_meanwhile_digests directly to drive the consumption side-effect. After T82, the consumption happens automatically inside post_turn. Remove the explicit call in the test (or assert the count even without calling explicitly).
Commit: fix: post_turn consumes pending meanwhile digests (T82.1).
82.2 — Skip command runs scene close detection
Problem (T62): A user typing "skip an hour" via natural-language skip causes the skip path to bypass scene close detection. The user's prose may have signaled close intent ("fade out, skip an hour") but the skip path returns 204 without checking.
Fix: in the natural-language skip dispatch (T62 intent="skip_elision" branch), call detect_scene_close BEFORE the skip controller runs. If close is detected, append scene_closed first, then run the skip flow.
Order matters: scene close → skip narration → time advance. The narration should reflect the new scene start at the new time.
if parsed.intent == "skip_elision":
# Run scene close detection first - the user may have signaled close.
if scene := active_scene(conn, chat_id):
close_decision = await detect_scene_close(client, classifier_model=...,
prose=parsed.prose, ...)
if close_decision.should_close:
append_and_apply(conn, kind="scene_closed", payload={"scene_id": scene["id"], ...})
await apply_scene_close_summary(conn, client, ...)
# ... existing skip dispatch ...
Test added: test_natural_language_skip_with_close_signal_closes_scene — user prose "fade out, skip to morning". Mock detect_scene_close to return True. Mock narrate_skip. Assert: scene_closed event lands AND time_skip_elision event lands AND ordering is scene_closed BEFORE time_skip_elision.
Commit: fix: natural-language skip runs scene close detection (T82.2).
Verification gates for T82
- 2 new tests pass.
- All existing turn_flow + phase3_integration tests still pass (T66 test 4 may need a small update; document in commit).
- Full suite green.
git diff phase-3.5 --stat: onlychat/web/turns.py,tests/test_turn_flow.py,tests/test_phase3_integration.py.
Wave 5 — regenerate.py polish bundle (single task)
T83 bundles 5 regenerate-related backlog items. All touch chat/services/regenerate.py. Single task by hot-file constraint; 5 internal commits.
Task 83: regenerate.py polish
Files:
- Modify:
chat/services/regenerate.py - Optional: create
chat/services/turn_common.py(new shared-helpers module for 83.2 DRY extraction) - Modify:
chat/web/turns.py(only if 83.2 extracts shared helpers — minimal) - Modify:
tests/test_regenerate.py(add 5+ tests)
Spec: Five sub-fixes per the T73 + T61 reviews.
83.1 — Cancel/stop hook for in-flight regenerate streams
Problem: post_turn registers stream tasks in _in_flight_tasks so /turns/cancel cancels them. Regenerate doesn't.
Fix: mirror the registration pattern from post_turn (Phase 2.5 T74.4 documented this). Wrap the regenerate stream in asyncio.create_task(...), register in _in_flight_tasks[chat_id], await, unregister in finally.
Test added: test_regenerate_registers_task_in_in_flight_tasks — mid-stream snapshot pattern (mirror Phase 3 T64.fix-up's _SnapshotMock). Assert the chat_id is registered during streaming.
Commit: feat: regenerate registers stream task in _in_flight_tasks for cancellation (T83.1).
83.2 — DRY: extract recent-dialogue + prior-edges helpers
Problem: Recent-dialogue assembly + prior-edges block are duplicated between chat/services/regenerate.py and chat/web/turns.py. Diff drift risk.
Fix: create chat/services/turn_common.py with:
def read_recent_dialogue(conn, chat_id: str, limit: int = 50) -> list[dict]:
"""Pull the last N user_turn + assistant_turn rows for the chat as
structured [{speaker, text}, ...]. Filters superseded_by IS NULL
AND hidden = 0. Used by post_turn AND regenerate AND scene_summarize."""
def gather_prior_edges(conn, chat_id: str, present_ids: list[str]) -> dict[tuple, dict]:
"""Build {(src, tgt): {affinity, trust, summary}} dict for all directed
pairs where both endpoints are in present_ids."""
Update chat/web/turns.py::post_turn to use the new helpers. Update chat/services/regenerate.py::regenerate_assistant_turn to use them too. Existing tests should pass unchanged.
Test added: tests/test_turn_common.py (NEW file) with 2-3 tests: dialogue read filters superseded; prior_edges builds correct keys.
Commit: refactor: extract turn_common helpers from regenerate + turns (T83.2).
83.3 — Sibling-discovery query optimization
Problem (T73): regenerate.py's sibling-assistant-turn lookup scans ALL non-superseded assistant_turn rows globally with no chat_id predicate.
Fix: add a json_extract(payload_json, '$.chat_id') = ? predicate in the SQL query, plus LIMIT 50 for safety. Mirror Phase 3 T64's _last_meanwhile_speaker SQL pattern.
Test added: test_regenerate_sibling_lookup_scoped_to_chat — seed two chats both with regenerate-able turn groups. Regenerate in chat A. Assert the sibling lookup didn't return any chat-B rows (verify via SQL query interception or by ensuring cross-chat sibling pollution wouldn't break).
Commit: perf: scope regenerate sibling-lookup to chat_id (T83.3).
83.4 — Lifecycle-transition rollback on regenerate
Problem (T61): Regenerate doesn't roll back lifecycle transitions from the superseded turn. event_started/event_completed rows from a superseded turn remain. May double-emit promotion artifacts.
Fix (Phase 3.5 first cut — minimal): at the START of regenerate, scan the superseded turn's downstream events. For any event_started/event_completed/event_cancelled linked to the superseded turn (via superseded_by chain on the original assistant_turn AND a back-pointer in the lifecycle event payload), revert the projected state by appending compensating events:
- Original was
event_startedfor evt_x → emitevent_started_undone(NEW event kind?) OR mark the original superseded. - Original was
event_completed→ similarly.
This is too invasive for one Phase 3.5 commit. Instead, the simpler Phase 3.5 fix: document the limitation explicitly in the regenerate flow with a TODO + add an assert or warning when regenerate detects prior lifecycle transitions on the superseded turn. Real rollback support is a Phase 4 item.
Test added: test_regenerate_with_prior_lifecycle_logs_warning — seed a turn that emits event_completed. Regenerate. Assert a warning log entry mentions the un-rolled-back transition.
Commit: chore: document regenerate lifecycle-rollback limitation (T83.4).
(If the implementer judges that proper rollback IS feasible in the time available, they can implement it instead — but the safe Phase 3.5 default is documentation + warning. The choice is the implementer's; report which path was taken.)
83.5 — Event-detection ordering symmetry
Problem (T61): post_turn runs lifecycle BETWEEN interjection and scene-close; regenerate.py runs lifecycle at the END. Benign because regenerate has no scene-close path, but worth tidying.
Fix: move the event-detection block in regenerate.py to mirror post_turn's position (between interjection regenerate and the function's tail). Cosmetic — no behavior change.
Test: verify all existing regenerate tests still pass after the move.
Commit: refactor: regenerate event-detection ordering mirrors post_turn (T83.5).
Verification gates for T83
- 5 internal commits visible.
- All 6 existing regenerate tests still pass.
- All 10+ existing turn_flow tests still pass.
- 5+ new tests pass.
- Full suite green.
git diff phase-3.5 --stat:chat/services/regenerate.py,chat/services/turn_common.py(if extracted),chat/web/turns.py(only if helper-call sites updated),tests/test_regenerate.py,tests/test_turn_common.py(if new).
Wave 6 — Final polish (parallel, 3 tasks)
Task 84: Unified record-memory API
Files:
- Modify:
chat/services/memory_write.py - Modify:
tests/test_memory_write.py(add tests) - Modify:
chat/web/meanwhile.py(call site update — minimal)
Spec: Per T64 review. record_meanwhile_memory and record_turn_memory_for_present share private _write_one_memory but expose two separate public APIs. Unify:
def record_turn_memory(
conn,
*,
chat_id: str,
host_bot_id: str,
guest_bot_id: str | None,
narrative_text: str,
scene_id: int | None = None,
chat_clock_at: str | None = None,
source: str = "direct",
significance: int = 1,
you_present: bool = True, # NEW: False for meanwhile turns
) -> dict[str, tuple[int, int | None]]:
"""Single entry-point for memory writes. When you_present is True,
witnesses are [you=1, host=1, guest if present]. When False (meanwhile),
witnesses are [you=0, host=1, guest=1] and host_bot_id + guest_bot_id
are both required."""
record_meanwhile_memory becomes a thin wrapper (kept for backward compat) that calls record_turn_memory(..., you_present=False). record_turn_memory_for_present is renamed/aliased to record_turn_memory for the same reason.
Update chat/web/meanwhile.py to call the unified function with you_present=False. Verify the existing meanwhile turn tests still pass.
Tests added: 2 new tests in tests/test_memory_write.py:
test_record_turn_memory_you_present_false_writes_meanwhile_witness_mask— assert witness[0, 1, 1].test_record_turn_memory_you_present_true_default_writes_normal_witness_mask— assert[1, 1, 0|1].
Commit: refactor: unified record_turn_memory API with you_present kwarg (T84).
Task 85: JSON-build audit + meanwhile stop-button route-level test
Files:
- Audit: read all
chat/state/*.pyfor f-string JSON construction. NO PRODUCTION CHANGES unless an issue is found. - Modify (only if issue found): the affected state module.
- Modify:
tests/test_meanwhile_turn_flow.py(add stop-button route-level test).
Spec: Two unrelated minor follow-ups bundled by being small.
85.1 — JSON-build audit (T63)
T63 originally used f-string interpolation for participants_json (fixed during review). Audit other state modules for the same pattern. Grep:
grep -rn 'f["\']\[' chat/state/ chat/services/ chat/eventlog/
For each f-string that constructs a JSON string from user-controllable data, replace with json.dumps(...). If NONE found (likely — most code already uses json.dumps), add a one-line note in the commit body confirming the audit.
If issues are found, fix them and add tests asserting JSON validity for inputs containing quote/backslash characters.
Commit: chore: audit JSON-string construction sites; replace any f-string usages with json.dumps (T85.1) (or "audit confirms no issues").
85.2 — Meanwhile stop-button route-level test
T64's fix-up registered stream tasks in _in_flight_tasks. There's a unit test pinning registration but no test that drives /turns/cancel against a meanwhile turn.
Test added: test_meanwhile_turn_can_be_cancelled_via_route — start a meanwhile turn (POST /turns), capture the in-flight task, immediately POST /turns/cancel, assert the task transitions to cancelled state and the assistant_turn payload has truncated=True.
This test may be tricky to write reliably (timing). If the existing _SnapshotMock pattern doesn't support mid-stream cancel injection, document the limitation and add a smaller test that just verifies the cancel endpoint works on the chat (asserts no tasks remaining after cancel).
Commit: test: meanwhile turn cancellation via /turns/cancel route (T85.2).
Task 86: Frontend turn_html_replace SSE handler
Files:
- Modify:
chat/templates/chat.html(or wherever the chat page's SSE consumer lives) — add a JS event handler forturn_html_replace. - Optional:
chat/static/app.css(if styling needs adjustment for replaced turns). - Modify:
tests/test_streaming_ux.py(add 1 test).
Spec: Per Phase 2.5 T73.1 (carry-over). Regenerate's backend broadcasts a turn_html_replace SSE event but no live tab swaps the prior turn. Wire the JS:
<!-- In chat.html, near the existing SSE setup: -->
<script>
const sse = new EventSource('/chats/{{ chat.id }}/events');
// Existing: turn_html appends new turn HTML.
sse.addEventListener('turn_html', (ev) => {
const turnsContainer = document.getElementById('turns');
turnsContainer.insertAdjacentHTML('beforeend', ev.data);
});
// NEW: turn_html_replace swaps an existing turn's DOM in place.
sse.addEventListener('turn_html_replace', (ev) => {
const data = JSON.parse(ev.data);
// data: {html, turn_id, supersedes_id}
const oldNode = document.getElementById(`turn-${data.supersedes_id}`);
if (oldNode) {
const tmpl = document.createElement('template');
tmpl.innerHTML = data.html.trim();
oldNode.replaceWith(tmpl.content.firstChild);
} else {
// Fallback: append if the prior turn isn't in the DOM (edge case).
document.getElementById('turns').insertAdjacentHTML('beforeend', data.html);
}
});
</script>
(Adapt the actual template structure — read chat/templates/chat.html first to see the existing SSE setup. The turn-{id} DOM-id pattern needs to be consistent with how the backend renders turn HTML — verify by checking chat/web/render.py.)
Test added: Hard to write a true browser-driven test in pytest. The simplest pin: a unit test that asserts the rendered HTML for a regenerate response contains the turn_html_replace SSE event AND the event payload contains the expected supersedes_id. (Already exists in T73.1 tests — don't duplicate.) Optionally add a Selenium/playwright test if the project has frontend testing infrastructure.
If pytest-only is the available coverage, document the manual smoke step in CLAUDE.md ("multi-tab regenerate" smoke test) and rely on backend pin tests + manual verification.
Commit: feat: frontend turn_html_replace SSE handler (T86).
Wave 7 — Docs sweep (single task)
Task 87: CLAUDE.md backlog burn-down
Files:
- Modify:
CLAUDE.md
Spec: Walk through the Phase 2.6/3 backlog and Phase 3.5/4 backlog sections. For each item shipped in T76–T86, remove from backlog list. Add a new section "Phase 3.5 status" near the existing "Phase 3 status" listing what shipped. Add a new "Phase 3.6 / 4 backlog" section if any new items were discovered during T76–T86 reviews (likely some — the review cycle always surfaces fresh follow-ups).
If any task during T76–T86 chose to defer a sub-item (e.g., T83.4 chose documentation over rollback impl), keep that sub-item in the new "Phase 3.6+ deferred" section with the rationale.
Mark §13 Phase 3 deliverables in the requirements doc as shipped (already done in T67) — verify and don't re-mark.
Commit: docs: phase 3.5 status, prune shipped backlog items (T87).
Wrap-up
After Wave 7 lands:
- Run full suite on
phase-3.5: should be ~330+ tests passing (315 from Phase 3 + ~15-25 new across the wave). - Manual smoke (recommended before opening the PR):
- Multi-tab regenerate: open two tabs on the same chat; click Regenerate on one; verify the other tab swaps the regenerated turn live (T86).
- Trigger a meanwhile scene; close it; resume the main chat; verify the digest renders ONCE on the next you-turn AND is consumed automatically (T82.1).
- Type "fade out, skip an hour" — verify both scene close + skip fire in correct order (T82.2).
- Plan an event, complete it via a turn, then regenerate that turn — observe the warning log mentioning un-rolled-back lifecycle transitions (T83.4).
- Push
phase-3.5to gitea. - Open PR
phase-3.5 → main. - Phase 3.6+ residuals (track in CLAUDE.md): genuine lifecycle-rollback impl, structured fixture builder for canned-queue tests, scene-close-on-cancel revisit if play-testing surfaces a regression.
Notes for the controller running this plan
- Don't dispatch Wave 5 until Wave 4 is merged AND green. T82 (turns.py) + T83 (regenerate.py + optionally turns.py for shared-helper extraction) both touch
turns.py. Sequential ordering prevents merge conflict. - After each parallel wave, run a code-review subagent. Combined spec+quality review is acceptable for trivial tasks (T76, T77, T78, T79). Separate spec + quality reviewers for bundled tasks (T80, T82, T83) — each bundles 4-5 sub-fixes.
- Token-spend rough estimate: Phase 3.5 should be ~50-60% the size of Phase 3 (smaller scope, all reuse, no new schema). Per-task spend similar to Phase 2.5's bundled tasks.
- DO NOT break existing v1/v2/v3 surface contracts. Every test file that was green at the start of Phase 3.5 must stay green at the end. The cross-feature integration tests (
tests/test_phase3_integration.py) are particularly load-bearing — if any of T80/T82/T83 break them, that's a regression to investigate, NOT to suppress.