docs: add Phase 3.5 cleanup plan (Phase 2.6/3 + 3.5/4 backlog burn-down)

12 tasks across 7 waves consolidating the 17-item backlog tracked in
CLAUDE.md (7 from Phase 2.6/3 + 10 from Phase 3.5/4). Items are
grouped by file ownership so each wave stays file-disjoint:

- Wave 1 (parallel 4-way): trivial single-line/single-file fixes
  (timeout_s plumbing, Literal type, docstring, defensive None)
- Wave 2 (single): scene_summarize.py polish bundle (5 T58 items)
- Wave 3 (single): typed ChatNotFoundError for skip routes
- Wave 4 (single): turns.py wiring (consume_pending_meanwhile_digests
  + natural-language skip runs scene close detection)
- Wave 5 (single): regenerate.py polish (cancel hook + DRY +
  sibling query + lifecycle rollback documentation + ordering)
- Wave 6 (parallel 3-way): unified record_turn_memory API + JSON
  audit + frontend turn_html_replace SSE handler
- Wave 7 (single): docs sweep

No schema migrations. Bundled tasks split into per-item sub-commits
for clean review bisection. Uses task ids T76-T87 to avoid collision
with prior phases (Phase 3 used T49-T67, Phase 2.5 used T68-T75).
This commit is contained in:
Joseph Doherty
2026-04-26 21:33:16 -04:00
parent 753cec327f
commit fb7e97260b
2 changed files with 728 additions and 0 deletions
@@ -0,0 +1,709 @@
# Roleplay Engine — Phase 3.5 Cleanup Plan
> **For Claude:** REQUIRED SUB-SKILL: Use `superpowers-extended-cc:executing-plans` to implement this plan task-by-task. Use the parallel-dispatch pattern documented under "Parallel-Execution Strategy" for waves that fan out to multiple subagents.
**Goal:** Burn down the combined Phase 2.6/3 + Phase 3.5/4 backlog tracked in [`CLAUDE.md`](../../CLAUDE.md). 17 follow-up items consolidated into 12 tasks (file-disjoint where possible) across 7 waves so several can run in parallel.
**Architecture:** No new architecture, no schema migrations, no new event kinds. Every change here is either a polish on an existing service/route, a refactor for DRY/typing/observability, or a UI affordance for state that already exists (frontend `turn_html_replace` consumer).
**Tech Stack:** Same as Phase 3. No new dependencies.
**Source-of-truth references:**
- Backlog list: [`CLAUDE.md`](../../CLAUDE.md) §"Phase 2.6 / 3 backlog" (7 items, 6 actionable + 1 deferred) + §"Phase 3.5 / 4 backlog" (~13 items grouped by source review) = 17 items net.
- Conventions: [`CLAUDE.md`](../../CLAUDE.md) §"Behavioral defaults" + §"Phase 3 status".
- Phase 2.5 cleanup plan (style, file-bundling pattern): [2026-04-26-v2.5-phase2.5-cleanup.md](2026-04-26-v2.5-phase2.5-cleanup.md).
When a task says "see §X", that's the requirements doc unless stated otherwise.
---
## Pre-flight
**Branch:** create `phase-3.5` from the latest `main` after Phase 3 has merged (it has — main is at `753cec3`):
```bash
git checkout main && git pull && git checkout -b phase-3.5
```
**Schema baseline:** Phase 3 leaves the DB at version 11. Phase 3.5 adds **no migrations**. Schema-version assertion in `tests/test_world.py` stays at 11.
**Pinned non-negotiables (carried forward):**
- State changes go through the event log. Use `append_and_apply(conn, kind, payload)` for the live path; `apply_event` only after a fresh `append_event` returning the new id.
- Witness filter every memory read at SQL level (hard `WHERE` constraint; never a soft signal).
- Edges are directed; `botA → botB` and `botB → botA` are independent records.
- Per-POV scene summaries — never write omniscient narration.
- TDD: every task starts with a failing test (or a regression test pinning existing contract before refactor).
- One commit per task minimum. Tasks that bundle multiple backlog items SHOULD split commits within the task — one commit per item — so review can bisect cleanly.
**Verification before claiming done:** Use `superpowers-extended-cc:verification-before-completion` — run the test command, paste actual output. Don't assume green.
---
## Backlog item → task mapping
17 items consolidated into 12 tasks by **file ownership** (so each wave's tasks stay file-disjoint). Bundled tasks split commits internally.
| # | Backlog item | Source | Task |
|---|--------------|--------|------|
| 1 | `narrate_skip` `timeout_s` not piped through | T53 review | **T76** |
| 2 | `AddresseeDecision.confidence` should be `Literal[...]` | T74 review (Phase 2.5 carry-over) | **T77** |
| 3 | `search_memories` docstring missing SQL-bias note | T57 review | **T78** |
| 4 | `_witness_role_for` defensive coding for `host_bot_id is None` | T71 review (Phase 2.5 carry-over) | **T79** |
| 5 | Scene close re-close suffix bloat risk | T58 review | **T80** |
| 6 | Thread detection transcript scoping (chat-wide vs scene-scoped) | T58 review | **T80** |
| 7 | Swallowed `Exception` in `detect_threads` try/except — log at debug | T58 review | **T80** |
| 8 | Scene close `closed_at` clock divergence | T58 review | **T80** |
| 9 | T58 test coverage gaps (truncation, update/close, fallback) | T58 review | **T80** |
| 10 | Error-message prefix sniff for 404 vs 400 routing in skip routes | T62 review | **T81** |
| 11 | `consume_pending_meanwhile_digests` not wired into `post_turn` | T66 integration tests | **T82** |
| 12 | Skip command bypasses scene close detection | T62 review | **T82** |
| 13 | Cancel/stop hook for in-flight regenerate streams | T73 review (Phase 2.5 carry-over) | **T83** |
| 14 | DRY: regenerate vs post_turn (recent-dialogue + prior-edges duplication) | T73 review (Phase 2.5 carry-over) | **T83** |
| 15 | Sibling-discovery query optimization in regenerate.py | T73 review (Phase 2.5 carry-over) | **T83** |
| 16 | Regenerate doesn't roll back lifecycle transitions from superseded turn | T61 review | **T83** |
| 17 | Asymmetry in event-detection ordering (turns.py vs regenerate.py) | T61 review | **T83** |
| 18 | `record_meanwhile_memory` / `record_turn_memory_for_present` unified API | T64 review | **T84** |
| 19 | `participants_json` JSON-build audit (other state modules) | T63 review | **T85** |
| 20 | Stop-button cancellation route-level coverage for meanwhile turns | T64 review | **T85** |
| 21 | Frontend handler for `turn_html_replace` SSE event | T73.1 review (Phase 2.5 carry-over) | **T86** |
| — | Docs sweep — remove shipped items, capture residuals | (this plan) | **T87** |
**Deferred (not in this plan):**
- **Cross-feature canned-queue brittleness** (Wave 6b cross-feature): would require a structured-fixture refactor across 3+ test files. Tracking but not in scope here — open as its own work if it surfaces a real test-maintenance pain.
- **Scene-close-on-cancel UX revisit** (T74.3): pinned to existing behavior; no action needed unless real play surfaces a regression.
---
## Parallel-Execution Strategy
Same pattern as Phases 2.5 and 3. Seven waves: parallel within each wave (file-disjoint), serial across waves.
### How to dispatch a wave in parallel
Use the **Agent tool with `isolation: "worktree"`** so each subagent gets its own git worktree. (If the controlling session's working directory is **not** the chat repo, create worktrees manually with `git worktree add .worktrees/<wave>-<task> -b <wave>/<task> phase-3.5` from inside the chat repo and pass the worktree path explicitly into each subagent prompt.)
In a single message, dispatch all tasks in the wave:
```
Agent({
description: "Wave 1 — T76 narrate_skip timeout_s",
subagent_type: "general-purpose",
isolation: "worktree",
prompt: "<full task text from below>",
})
Agent({ ...T77... })
Agent({ ...T78... })
Agent({ ...T79... })
```
### After a wave completes
1. Each subagent returns its worktree path and commit SHA(s).
2. **Run a spec + code-quality reviewer subagent on each completed task.** Combined review acceptable for purely mechanical fixes (T76T79). Separate spec + quality reviewers for bundled tasks (T80, T82, T83).
3. **Merge the wave into `phase-3.5`** in any order (file-disjointness guarantees no conflict). Use `--no-ff`.
4. **Run the full test suite** on the merged `phase-3.5`. If red, the wave's mutual-independence assumption was violated — bisect, fix, re-merge.
5. **Push `phase-3.5`** to gitea.
6. Optionally clean up worktrees: `git worktree remove .worktrees/<branch>` and `git branch -D <branch>`.
### Conflict prevention checklist
For each parallel wave, verify the **Files** sections of all tasks have **no overlapping paths**. Hot files in this plan: `chat/web/turns.py` (T82 only), `chat/services/regenerate.py` (T83 only), `chat/web/skip.py` + `chat/web/drawer.py` (T81 only). Each is owned by exactly one task.
### Failure recovery
If one subagent fails: cancel it, merge the others' successful work, re-dispatch the failed task as a single follow-up. Don't block the wave.
### Why each wave is parallel-safe
| Wave | Tasks | Hot files touched | Disjoint? |
|------|-------|-------------------|-----------|
| 1 | T76, T77, T78, T79 | 4 different services/state files; no overlap | ✅ |
| 2 | T80 | `chat/services/scene_summarize.py` | (single task) |
| 3 | T81 | `chat/web/skip.py` + `chat/web/drawer.py` (typed exception) | (single task) |
| 4 | T82 | `chat/web/turns.py` | (single task) |
| 5 | T83 | `chat/services/regenerate.py` (+ optional new shared-helpers module) | (single task) |
| 6 | T84, T85, T86 | `chat/services/memory_write.py` (T84); audit + tests (T85); frontend HTML/JS (T86) | ✅ |
| 7 | T87 | `CLAUDE.md` | (single task) |
---
## Task overview
```
Wave 1 ─┬─ T76: narrate_skip timeout_s plumbed through (skip_narration.py)
├─ T77: AddresseeDecision.confidence as Literal (addressee.py)
├─ T78: search_memories docstring SQL-bias note (memory.py)
└─ T79: _witness_role_for defensive None handling (prompt.py)
Wave 2 ─── T80: scene_summarize.py polish bundle (5 T58 items)
Wave 3 ─── T81: typed exception for skip controllers (skip.py + drawer.py)
Wave 4 ─── T82: turns.py wiring (consume_pending_meanwhile_digests + skip-bypass scene close)
Wave 5 ─── T83: regenerate.py polish bundle (cancel hook + DRY + sibling query +
lifecycle rollback + ordering asymmetry)
Wave 6 ─┬─ T84: memory_write.py unified record-memory API
├─ T85: JSON-build audit + meanwhile stop-button route-level test
└─ T86: frontend turn_html_replace SSE handler
Wave 7 ─── T87: docs sweep — prune CLAUDE.md backlogs, capture residuals
```
Critical path: 7 sequential merge points. Total tasks: 12. Wall-clock parallelism advantage: Waves 1 and 6 dispatch concurrently (4-way and 3-way respectively). Waves 2, 3, 4, 5, 7 are single-task by hot-file constraint.
---
## Wave 1 — Independent small fixes (parallel, 4 tasks)
All 4 tasks are tiny (1-line + tests). Fully file-disjoint.
### Task 76: `narrate_skip` timeout_s plumbed through
**Files:**
- Modify: `chat/services/skip_narration.py`
- Modify: `tests/test_skip_narration.py` (add 1 test)
**Spec:** `narrate_skip(timeout_s=60.0)` accepts the parameter but doesn't pass it to `client.generate(...)`. Per T53 review — fix:
```python
# Inside narrate_skip:
result = await client.generate(
[
Message(role="system", content=system),
Message(role="user", content=user),
],
model=narrative_model,
max_tokens=200,
temperature=0.7,
timeout_s=timeout_s, # NEW: pipe through
)
```
If the underlying `client.generate` doesn't honor `timeout_s` (Featherless client signature accepts `**params` so it'll ride through harmlessly), this is a no-op at runtime but documents intent. Read `chat/llm/featherless.py::FeatherlessClient.generate` to confirm.
**Test added:** `test_narrate_skip_passes_timeout_through` — capture the kwargs passed to `client.generate` via a custom mock that records them; assert `timeout_s` is present with the expected value.
**Commit:** `fix: plumb narrate_skip timeout_s through to client.generate (T76)`.
---
### Task 77: `AddresseeDecision.confidence` as Literal type
**Files:**
- Modify: `chat/services/addressee.py`
- Modify: `tests/test_addressee.py` (existing tests should still pass; add 1 test for the typed validation)
**Spec:** Per T74 review (Phase 2.5 carry-over). `AddresseeDecision.confidence` is currently `str` with a comment noting valid values. Tighten to `Literal["high", "medium", "low"]`:
```python
from typing import Literal
class AddresseeDecision(BaseModel):
addressee_id: str
confidence: Literal["high", "medium", "low"] = "medium"
reason: str = ""
```
Pydantic will reject classifier output with values outside the literal set, falling back via the `default=` in `classify(...)`. The existing fallback path returns `confidence="low"` so no behavior change for legitimate calls.
**Test added:** `test_invalid_confidence_value_falls_back_to_default` — feed canned classifier output with `"confidence": "VERY_HIGH"`; assert the result is the default fallback (Pydantic validation failed → `classify` falls back).
**Commit:** `fix: AddresseeDecision.confidence as Literal[high|medium|low] (T77)`.
---
### Task 78: `search_memories` docstring mentions SQL-bias
**Files:**
- Modify: `chat/state/memory.py` — extend the `search_memories` docstring with a one-liner about `SIGNIFICANCE_RANK_BIAS`.
- No test changes needed (docstring-only).
**Spec:** Per T57 review. The current `search_memories` docstring describes only the Python-side composite re-rank with `_SIGNIFICANCE_WEIGHT`. T57 added a SQL-side `SIGNIFICANCE_RANK_BIAS` constant. Add to the docstring:
```
... (existing prose) ...
The result ordering applies TWO independent significance boosts:
- SQL-side: ``ORDER BY (rank - significance * SIGNIFICANCE_RANK_BIAS)``
pushes higher-significance memories ahead in the FTS5 candidate set.
- Python-side: a composite re-rank with `_SIGNIFICANCE_WEIGHT` reinforces
the ordering after candidate retrieval.
```
**Commit:** `docs: search_memories docstring mentions SQL-side significance bias (T78)`.
---
### Task 79: `_witness_role_for` defensive None handling
**Files:**
- Modify: `chat/services/prompt.py`
- Modify: `tests/test_prompt.py` (add 1 test)
**Spec:** Per T71 review (Phase 2.5 carry-over). Current `_witness_role_for(speaker_bot_id, host_bot_id)` returns `"host" if speaker_bot_id == host_bot_id else "guest"`. When `host_bot_id` is `None` (degenerate case — can happen if a chat row is half-seeded in a test), this returns `"guest"` even though the speaker is logically the host. Defensive fix:
```python
def _witness_role_for(speaker_bot_id: str, host_bot_id: str | None) -> str:
"""..."""
if host_bot_id is None or speaker_bot_id == host_bot_id:
return "host"
return "guest"
```
**Test added:** `test_witness_role_for_none_host_returns_host` — call helper with `host_bot_id=None`; assert returns `"host"`.
**Commit:** `fix: _witness_role_for defensive None handling (T79)`.
---
## Wave 2 — `scene_summarize.py` polish bundle (single task)
T80 bundles 5 backlog items (4 fixes + 1 test gap) all touching `chat/services/scene_summarize.py` and its test file. Single task by hot-file constraint; split into 5 commits internally for clean review bisection.
### Task 80: scene_summarize.py polish
**Files:**
- Modify: `chat/services/scene_summarize.py`
- Modify: `tests/test_per_pov_summary.py`
**Spec:** Five sub-fixes per the T58 review.
#### 80.1 — Re-close suffix bloat guard
**Problem:** `_build_key_quotes_suffix` reads from `memories.pov_summary`. If a scene close runs twice (replay, manual re-close, idempotent retry), the second pass reads the rewritten text plus the previous "Key quotes:" suffix and appends a second one — recursive bloat.
**Fix (recommended option):** source quotes from `event_log` `assistant_turn`/`user_turn` text instead of `memories.pov_summary`. The event-log text is immutable per turn, so re-close produces identical key-quote text.
Implementation: in `_build_key_quotes_suffix(conn, scene_id)`, replace the SELECT from `memories` with a JOIN that pulls turn text from `event_log` filtered by scene_id (via `payload_json` JSON extraction), keeping the significance-from-memories filter.
If event-log queries prove too expensive or fragile, fall back to: at the start of `_build_key_quotes_suffix`, strip any existing `"\n\nKey quotes:\n"` suffix from each `pov_summary` before computing the new one. Document the choice in a code comment.
**Test added:** `test_scene_close_re_run_does_not_double_suffix` — close a scene with significance ≥ 2; assert each pov_summary contains "Key quotes:" exactly ONCE. Then call `apply_scene_close_summary` again on the same scene; assert each pov_summary STILL contains "Key quotes:" exactly ONCE (no second append).
**Commit:** `fix: guard scene close key-quote suffix against re-close bloat (T80.1)`.
#### 80.2 — Thread detection transcript scoping
**Problem:** `_read_recent_dialogue` returns chat-wide history with no `scene_id` filter. Feeding chat-wide history to `detect_threads` will misattribute threads to the closing scene when a scene boundary falls inside the last 50 turns.
**Fix:** add a `scene_started_at` lookup at the top of `apply_scene_close_summary`. Filter the transcript to turns where the event_log row's `created_at` (or `chat_clock_at`) is `>= scene_started_at`. Read `chat/state/world.py::get_scene` for the started_at column.
If `get_scene` doesn't expose `started_at`, query directly: `SELECT started_at FROM scenes WHERE id = ?`.
**Test added:** `test_thread_detection_uses_scene_scoped_transcript` — seed a scene with 3 turns. Then close the scene. Then start a new scene and seed 3 more turns. Then close the second scene. Mock `detect_threads` to capture its `scene_transcript` arg. Assert the second close's transcript contains ONLY the 3 second-scene turns (not the first scene's turns).
**Commit:** `fix: scope thread detection transcript to closing scene (T80.2)`.
#### 80.3 — Log swallowed exceptions in `detect_threads` try/except
**Problem:** T58's `try/except Exception:` around `detect_threads` swallows programmer errors silently.
**Fix:** add a `logging.getLogger(__name__).debug("detect_threads failed: %s", exc, exc_info=True)` (or appropriate level) inside the except block. Use the standard library `logging` module (it's already configured by Phase 1's `chat.app`).
**Test added:** `test_detect_threads_failure_is_logged` — patch `detect_threads` to raise `RuntimeError("test")`. Use `caplog` (pytest fixture) to capture log output. Assert the log contains the error message.
**Commit:** `fix: log swallowed exceptions in detect_threads try/except (T80.3)`.
#### 80.4 — Scene close `closed_at` chat-clock semantics
**Problem:** T58 uses `datetime.now(timezone.utc).isoformat()` for `closed_at` on emitted events instead of chat-clock time. Diverges from chat-clock semantics elsewhere.
**Fix:** read `chat["time"]` at the top of `apply_scene_close_summary` (already loaded for per-POV writes via `chat_clock_at`). Pass `chat_clock_at` through to `thread_closed` events' `closed_at` field instead of `datetime.now(...)`.
**Test added:** `test_thread_closed_uses_chat_clock_time` — seed a chat with `chat["time"] = "2026-04-26T10:00:00+00:00"`. Mock `detect_threads` to return one `ThreadCandidate(action="close", existing_thread_id="thr_x")`. Close the scene. Inspect the emitted `thread_closed` event payload — assert `closed_at == "2026-04-26T10:00:00+00:00"`, NOT a `datetime.now(...)` value.
**Commit:** `fix: thread_closed uses chat-clock time, not wall clock (T80.4)`.
#### 80.5 — T58 test coverage gaps
**Spec:** Per T58 review, these specific scenarios lack coverage. Add tests:
1. `test_key_quote_truncation_at_200_chars` — seed a memory with significance 2 and a 500-char `pov_summary`. Close scene. Assert the key-quote bullet for that memory is exactly 200 chars (or 199 + ellipsis, depending on the truncation impl).
2. `test_thread_detection_update_candidate_emits_thread_updated` — mock `detect_threads` to return `ThreadCandidate(action="update", existing_thread_id="thr_x", summary="updated")`. Close. Assert a `thread_updated` event landed.
3. `test_thread_detection_close_candidate_emits_thread_closed` — same setup with `action="close"`. Assert a `thread_closed` event landed.
(The `try/except` fallback test from 80.3 covers the exception path; no separate test needed.)
**Commit:** `test: T58 coverage gaps (truncation, update/close paths) (T80.5)`.
#### Verification gates for T80
- All 5 sub-fix commits cleanly split per `git show`.
- All 13 existing per_pov_summary tests still pass.
- 5+ new tests pass.
- Full suite green.
- `git diff phase-3.5 --stat` shows ONLY `chat/services/scene_summarize.py` and `tests/test_per_pov_summary.py`.
---
## Wave 3 — Typed exception for skip controllers (single task)
### Task 81: `ChatNotFoundError` for skip route routing
**Files:**
- Modify: `chat/web/skip.py` — define a typed exception and raise it instead of generic `ValueError("chat not found")`.
- Modify: `chat/web/drawer.py` — replace string-prefix sniff with `except ChatNotFoundError: 404`.
- Modify: `tests/test_drawer_events_threads_skip.py` (add 1 test).
**Spec:** Per T62 review. Drawer skip routes use `str(exc).startswith("chat not found")` to distinguish 404 from 400. Fragile if error wording changes. Replace with a typed exception subclass.
```python
# chat/web/skip.py top:
class ChatNotFoundError(Exception):
"""Raised when a chat_id doesn't resolve. Maps to 404 in HTTP routes."""
# Inside process_elision_skip / process_jump_skip:
chat = get_chat(conn, chat_id)
if chat is None:
raise ChatNotFoundError(f"chat {chat_id!r} not found")
# Other validation failures still raise ValueError (mapped to 400).
```
```python
# chat/web/drawer.py skip routes:
try:
result = await process_elision_skip(...)
except ChatNotFoundError:
raise HTTPException(status_code=404, detail="chat not found")
except ValueError as exc:
raise HTTPException(status_code=400, detail=str(exc))
```
**Test added:** `test_skip_route_raises_404_via_typed_exception` — POST to `/chats/nonexistent/drawer/skip/elision`; assert response 404. (Existing tests should still pass — they check 404 for missing chats and 400 for validation errors.)
**Commit:** `fix: typed ChatNotFoundError replaces string-prefix sniff in skip routes (T81)`.
---
## Wave 4 — `turns.py` wiring (single task)
### Task 82: post_turn meanwhile-digest consumption + skip-command scene close
**Files:**
- Modify: `chat/web/turns.py`
- Modify: `tests/test_turn_flow.py` (add 2 tests)
- Modify: `tests/test_phase3_integration.py` (T66 test 4 currently calls `consume_pending_meanwhile_digests` directly — update it to assert the helper is now invoked AUTOMATICALLY by post_turn)
**Spec:** Two related fixes both touching `post_turn`.
#### 82.1 — Wire `consume_pending_meanwhile_digests` into post_turn
**Problem (T66):** `chat/services/prompt.py::consume_pending_meanwhile_digests` is defined but never called. Meanwhile digests stay pending forever in production.
**Fix:** call the helper at the END of `post_turn` (after the assistant_turn lands and all classifier passes finish, BEFORE the response returns). The helper appends `meanwhile_digest_consumed` events for each pending digest, marking them as surfaced. Idempotent — re-calling produces zero events.
Place the call only when the chat has just-surfaced digests. The simplest pattern: always call after the primary turn — the helper short-circuits when nothing's pending. (Cost: 1 trivial DB query per turn.)
```python
from chat.services.prompt import consume_pending_meanwhile_digests
# Near the end of post_turn, before publish/return:
consume_pending_meanwhile_digests(conn, chat_id)
```
**Test added:** `test_post_turn_consumes_pending_meanwhile_digests` — seed a pending digest. POST a turn. Assert: a `meanwhile_digest_consumed` event landed in event_log AND `list_pending_meanwhile_digests` is now empty.
**Update existing T66 test 4:** the integration test currently calls `consume_pending_meanwhile_digests` directly to drive the consumption side-effect. After T82, the consumption happens automatically inside post_turn. Remove the explicit call in the test (or assert the count even without calling explicitly).
**Commit:** `fix: post_turn consumes pending meanwhile digests (T82.1)`.
#### 82.2 — Skip command runs scene close detection
**Problem (T62):** A user typing "skip an hour" via natural-language skip causes the skip path to bypass scene close detection. The user's prose may have signaled close intent ("fade out, skip an hour") but the skip path returns 204 without checking.
**Fix:** in the natural-language skip dispatch (T62 `intent="skip_elision"` branch), call `detect_scene_close` BEFORE the skip controller runs. If close is detected, append `scene_closed` first, then run the skip flow.
Order matters: scene close → skip narration → time advance. The narration should reflect the new scene start at the new time.
```python
if parsed.intent == "skip_elision":
# Run scene close detection first - the user may have signaled close.
if scene := active_scene(conn, chat_id):
close_decision = await detect_scene_close(client, classifier_model=...,
prose=parsed.prose, ...)
if close_decision.should_close:
append_and_apply(conn, kind="scene_closed", payload={"scene_id": scene["id"], ...})
await apply_scene_close_summary(conn, client, ...)
# ... existing skip dispatch ...
```
**Test added:** `test_natural_language_skip_with_close_signal_closes_scene` — user prose "fade out, skip to morning". Mock `detect_scene_close` to return True. Mock `narrate_skip`. Assert: `scene_closed` event lands AND `time_skip_elision` event lands AND ordering is `scene_closed` BEFORE `time_skip_elision`.
**Commit:** `fix: natural-language skip runs scene close detection (T82.2)`.
#### Verification gates for T82
- 2 new tests pass.
- All existing turn_flow + phase3_integration tests still pass (T66 test 4 may need a small update; document in commit).
- Full suite green.
- `git diff phase-3.5 --stat`: only `chat/web/turns.py`, `tests/test_turn_flow.py`, `tests/test_phase3_integration.py`.
---
## Wave 5 — `regenerate.py` polish bundle (single task)
T83 bundles 5 regenerate-related backlog items. All touch `chat/services/regenerate.py`. Single task by hot-file constraint; 5 internal commits.
### Task 83: regenerate.py polish
**Files:**
- Modify: `chat/services/regenerate.py`
- Optional: create `chat/services/turn_common.py` (new shared-helpers module for 83.2 DRY extraction)
- Modify: `chat/web/turns.py` (only if 83.2 extracts shared helpers — minimal)
- Modify: `tests/test_regenerate.py` (add 5+ tests)
**Spec:** Five sub-fixes per the T73 + T61 reviews.
#### 83.1 — Cancel/stop hook for in-flight regenerate streams
**Problem:** `post_turn` registers stream tasks in `_in_flight_tasks` so `/turns/cancel` cancels them. Regenerate doesn't.
**Fix:** mirror the registration pattern from `post_turn` (Phase 2.5 T74.4 documented this). Wrap the regenerate stream in `asyncio.create_task(...)`, register in `_in_flight_tasks[chat_id]`, await, unregister in `finally`.
**Test added:** `test_regenerate_registers_task_in_in_flight_tasks` — mid-stream snapshot pattern (mirror Phase 3 T64.fix-up's `_SnapshotMock`). Assert the chat_id is registered during streaming.
**Commit:** `feat: regenerate registers stream task in _in_flight_tasks for cancellation (T83.1)`.
#### 83.2 — DRY: extract recent-dialogue + prior-edges helpers
**Problem:** Recent-dialogue assembly + prior-edges block are duplicated between `chat/services/regenerate.py` and `chat/web/turns.py`. Diff drift risk.
**Fix:** create `chat/services/turn_common.py` with:
```python
def read_recent_dialogue(conn, chat_id: str, limit: int = 50) -> list[dict]:
"""Pull the last N user_turn + assistant_turn rows for the chat as
structured [{speaker, text}, ...]. Filters superseded_by IS NULL
AND hidden = 0. Used by post_turn AND regenerate AND scene_summarize."""
def gather_prior_edges(conn, chat_id: str, present_ids: list[str]) -> dict[tuple, dict]:
"""Build {(src, tgt): {affinity, trust, summary}} dict for all directed
pairs where both endpoints are in present_ids."""
```
Update `chat/web/turns.py::post_turn` to use the new helpers. Update `chat/services/regenerate.py::regenerate_assistant_turn` to use them too. Existing tests should pass unchanged.
**Test added:** `tests/test_turn_common.py` (NEW file) with 2-3 tests: dialogue read filters superseded; prior_edges builds correct keys.
**Commit:** `refactor: extract turn_common helpers from regenerate + turns (T83.2)`.
#### 83.3 — Sibling-discovery query optimization
**Problem (T73):** `regenerate.py`'s sibling-assistant-turn lookup scans ALL non-superseded `assistant_turn` rows globally with no `chat_id` predicate.
**Fix:** add a `json_extract(payload_json, '$.chat_id') = ?` predicate in the SQL query, plus `LIMIT 50` for safety. Mirror Phase 3 T64's `_last_meanwhile_speaker` SQL pattern.
**Test added:** `test_regenerate_sibling_lookup_scoped_to_chat` — seed two chats both with regenerate-able turn groups. Regenerate in chat A. Assert the sibling lookup didn't return any chat-B rows (verify via SQL query interception or by ensuring cross-chat sibling pollution wouldn't break).
**Commit:** `perf: scope regenerate sibling-lookup to chat_id (T83.3)`.
#### 83.4 — Lifecycle-transition rollback on regenerate
**Problem (T61):** Regenerate doesn't roll back lifecycle transitions from the superseded turn. `event_started`/`event_completed` rows from a superseded turn remain. May double-emit promotion artifacts.
**Fix (Phase 3.5 first cut — minimal):** at the START of regenerate, scan the superseded turn's downstream events. For any `event_started`/`event_completed`/`event_cancelled` linked to the superseded turn (via `superseded_by` chain on the original assistant_turn AND a back-pointer in the lifecycle event payload), revert the projected state by appending compensating events:
- Original was `event_started` for evt_x → emit `event_started_undone` (NEW event kind?) OR mark the original superseded.
- Original was `event_completed` → similarly.
**This is too invasive for one Phase 3.5 commit.** Instead, the simpler Phase 3.5 fix: **document the limitation explicitly** in the regenerate flow with a TODO + add an `assert` or warning when regenerate detects prior lifecycle transitions on the superseded turn. Real rollback support is a Phase 4 item.
**Test added:** `test_regenerate_with_prior_lifecycle_logs_warning` — seed a turn that emits `event_completed`. Regenerate. Assert a warning log entry mentions the un-rolled-back transition.
**Commit:** `chore: document regenerate lifecycle-rollback limitation (T83.4)`.
(If the implementer judges that proper rollback IS feasible in the time available, they can implement it instead — but the safe Phase 3.5 default is documentation + warning. The choice is the implementer's; report which path was taken.)
#### 83.5 — Event-detection ordering symmetry
**Problem (T61):** `post_turn` runs lifecycle BETWEEN interjection and scene-close; `regenerate.py` runs lifecycle at the END. Benign because regenerate has no scene-close path, but worth tidying.
**Fix:** move the event-detection block in `regenerate.py` to mirror `post_turn`'s position (between interjection regenerate and the function's tail). Cosmetic — no behavior change.
**Test:** verify all existing regenerate tests still pass after the move.
**Commit:** `refactor: regenerate event-detection ordering mirrors post_turn (T83.5)`.
#### Verification gates for T83
- 5 internal commits visible.
- All 6 existing regenerate tests still pass.
- All 10+ existing turn_flow tests still pass.
- 5+ new tests pass.
- Full suite green.
- `git diff phase-3.5 --stat`: `chat/services/regenerate.py`, `chat/services/turn_common.py` (if extracted), `chat/web/turns.py` (only if helper-call sites updated), `tests/test_regenerate.py`, `tests/test_turn_common.py` (if new).
---
## Wave 6 — Final polish (parallel, 3 tasks)
### Task 84: Unified record-memory API
**Files:**
- Modify: `chat/services/memory_write.py`
- Modify: `tests/test_memory_write.py` (add tests)
- Modify: `chat/web/meanwhile.py` (call site update — minimal)
**Spec:** Per T64 review. `record_meanwhile_memory` and `record_turn_memory_for_present` share private `_write_one_memory` but expose two separate public APIs. Unify:
```python
def record_turn_memory(
conn,
*,
chat_id: str,
host_bot_id: str,
guest_bot_id: str | None,
narrative_text: str,
scene_id: int | None = None,
chat_clock_at: str | None = None,
source: str = "direct",
significance: int = 1,
you_present: bool = True, # NEW: False for meanwhile turns
) -> dict[str, tuple[int, int | None]]:
"""Single entry-point for memory writes. When you_present is True,
witnesses are [you=1, host=1, guest if present]. When False (meanwhile),
witnesses are [you=0, host=1, guest=1] and host_bot_id + guest_bot_id
are both required."""
```
`record_meanwhile_memory` becomes a thin wrapper (kept for backward compat) that calls `record_turn_memory(..., you_present=False)`. `record_turn_memory_for_present` is renamed/aliased to `record_turn_memory` for the same reason.
Update `chat/web/meanwhile.py` to call the unified function with `you_present=False`. Verify the existing meanwhile turn tests still pass.
**Tests added:** 2 new tests in `tests/test_memory_write.py`:
- `test_record_turn_memory_you_present_false_writes_meanwhile_witness_mask` — assert witness `[0, 1, 1]`.
- `test_record_turn_memory_you_present_true_default_writes_normal_witness_mask` — assert `[1, 1, 0|1]`.
**Commit:** `refactor: unified record_turn_memory API with you_present kwarg (T84)`.
---
### Task 85: JSON-build audit + meanwhile stop-button route-level test
**Files:**
- Audit: read all `chat/state/*.py` for f-string JSON construction. NO PRODUCTION CHANGES unless an issue is found.
- Modify (only if issue found): the affected state module.
- Modify: `tests/test_meanwhile_turn_flow.py` (add stop-button route-level test).
**Spec:** Two unrelated minor follow-ups bundled by being small.
#### 85.1 — JSON-build audit (T63)
T63 originally used f-string interpolation for `participants_json` (fixed during review). Audit other state modules for the same pattern. Grep:
```bash
grep -rn 'f["\']\[' chat/state/ chat/services/ chat/eventlog/
```
For each f-string that constructs a JSON string from user-controllable data, replace with `json.dumps(...)`. If NONE found (likely — most code already uses `json.dumps`), add a one-line note in the commit body confirming the audit.
If issues are found, fix them and add tests asserting JSON validity for inputs containing quote/backslash characters.
**Commit:** `chore: audit JSON-string construction sites; replace any f-string usages with json.dumps (T85.1)` (or "audit confirms no issues").
#### 85.2 — Meanwhile stop-button route-level test
T64's fix-up registered stream tasks in `_in_flight_tasks`. There's a unit test pinning registration but no test that drives `/turns/cancel` against a meanwhile turn.
**Test added:** `test_meanwhile_turn_can_be_cancelled_via_route` — start a meanwhile turn (POST /turns), capture the in-flight task, immediately POST /turns/cancel, assert the task transitions to cancelled state and the assistant_turn payload has `truncated=True`.
This test may be tricky to write reliably (timing). If the existing `_SnapshotMock` pattern doesn't support mid-stream cancel injection, document the limitation and add a smaller test that just verifies the cancel endpoint works on the chat (asserts no tasks remaining after cancel).
**Commit:** `test: meanwhile turn cancellation via /turns/cancel route (T85.2)`.
---
### Task 86: Frontend `turn_html_replace` SSE handler
**Files:**
- Modify: `chat/templates/chat.html` (or wherever the chat page's SSE consumer lives) — add a JS event handler for `turn_html_replace`.
- Optional: `chat/static/app.css` (if styling needs adjustment for replaced turns).
- Modify: `tests/test_streaming_ux.py` (add 1 test).
**Spec:** Per Phase 2.5 T73.1 (carry-over). Regenerate's backend broadcasts a `turn_html_replace` SSE event but no live tab swaps the prior turn. Wire the JS:
```html
<!-- In chat.html, near the existing SSE setup: -->
<script>
const sse = new EventSource('/chats/{{ chat.id }}/events');
// Existing: turn_html appends new turn HTML.
sse.addEventListener('turn_html', (ev) => {
const turnsContainer = document.getElementById('turns');
turnsContainer.insertAdjacentHTML('beforeend', ev.data);
});
// NEW: turn_html_replace swaps an existing turn's DOM in place.
sse.addEventListener('turn_html_replace', (ev) => {
const data = JSON.parse(ev.data);
// data: {html, turn_id, supersedes_id}
const oldNode = document.getElementById(`turn-${data.supersedes_id}`);
if (oldNode) {
const tmpl = document.createElement('template');
tmpl.innerHTML = data.html.trim();
oldNode.replaceWith(tmpl.content.firstChild);
} else {
// Fallback: append if the prior turn isn't in the DOM (edge case).
document.getElementById('turns').insertAdjacentHTML('beforeend', data.html);
}
});
</script>
```
(Adapt the actual template structure — read `chat/templates/chat.html` first to see the existing SSE setup. The `turn-{id}` DOM-id pattern needs to be consistent with how the backend renders turn HTML — verify by checking `chat/web/render.py`.)
**Test added:** Hard to write a true browser-driven test in pytest. The simplest pin: a unit test that asserts the rendered HTML for a regenerate response contains the `turn_html_replace` SSE event AND the event payload contains the expected `supersedes_id`. (Already exists in T73.1 tests — don't duplicate.) Optionally add a Selenium/playwright test if the project has frontend testing infrastructure.
If pytest-only is the available coverage, document the manual smoke step in CLAUDE.md ("multi-tab regenerate" smoke test) and rely on backend pin tests + manual verification.
**Commit:** `feat: frontend turn_html_replace SSE handler (T86)`.
---
## Wave 7 — Docs sweep (single task)
### Task 87: CLAUDE.md backlog burn-down
**Files:**
- Modify: `CLAUDE.md`
**Spec:** Walk through the Phase 2.6/3 backlog and Phase 3.5/4 backlog sections. For each item shipped in T76T86, remove from backlog list. Add a new section "Phase 3.5 status" near the existing "Phase 3 status" listing what shipped. Add a new "Phase 3.6 / 4 backlog" section if any new items were discovered during T76T86 reviews (likely some — the review cycle always surfaces fresh follow-ups).
If any task during T76T86 chose to defer a sub-item (e.g., T83.4 chose documentation over rollback impl), keep that sub-item in the new "Phase 3.6+ deferred" section with the rationale.
Mark §13 Phase 3 deliverables in the requirements doc as **shipped** (already done in T67) — verify and don't re-mark.
**Commit:** `docs: phase 3.5 status, prune shipped backlog items (T87)`.
---
## Wrap-up
After Wave 7 lands:
1. **Run full suite** on `phase-3.5`: should be ~330+ tests passing (315 from Phase 3 + ~15-25 new across the wave).
2. **Manual smoke** (recommended before opening the PR):
- Multi-tab regenerate: open two tabs on the same chat; click Regenerate on one; verify the other tab swaps the regenerated turn live (T86).
- Trigger a meanwhile scene; close it; resume the main chat; verify the digest renders ONCE on the next you-turn AND is consumed automatically (T82.1).
- Type "fade out, skip an hour" — verify both scene close + skip fire in correct order (T82.2).
- Plan an event, complete it via a turn, then regenerate that turn — observe the warning log mentioning un-rolled-back lifecycle transitions (T83.4).
3. **Push `phase-3.5`** to gitea.
4. **Open PR** `phase-3.5 → main`.
5. **Phase 3.6+ residuals** (track in CLAUDE.md): genuine lifecycle-rollback impl, structured fixture builder for canned-queue tests, scene-close-on-cancel revisit if play-testing surfaces a regression.
---
## Notes for the controller running this plan
- **Don't dispatch Wave 5 until Wave 4 is merged AND green.** T82 (turns.py) + T83 (regenerate.py + optionally turns.py for shared-helper extraction) both touch `turns.py`. Sequential ordering prevents merge conflict.
- **After each parallel wave**, run a code-review subagent. Combined spec+quality review is acceptable for trivial tasks (T76, T77, T78, T79). Separate spec + quality reviewers for bundled tasks (T80, T82, T83) — each bundles 4-5 sub-fixes.
- **Token-spend rough estimate**: Phase 3.5 should be ~50-60% the size of Phase 3 (smaller scope, all reuse, no new schema). Per-task spend similar to Phase 2.5's bundled tasks.
- **DO NOT break existing v1/v2/v3 surface contracts.** Every test file that was green at the start of Phase 3.5 must stay green at the end. The cross-feature integration tests (`tests/test_phase3_integration.py`) are particularly load-bearing — if any of T80/T82/T83 break them, that's a regression to investigate, NOT to suppress.
@@ -0,0 +1,19 @@
{
"planPath": "docs/plans/2026-04-26-v3.5-phase3.5-cleanup.md",
"tasks": [
{"id": 76, "subject": "T76: narrate_skip timeout_s plumbed through", "status": "pending", "wave": 1, "parallelGroup": "wave-1"},
{"id": 77, "subject": "T77: AddresseeDecision.confidence as Literal", "status": "pending", "wave": 1, "parallelGroup": "wave-1"},
{"id": 78, "subject": "T78: search_memories docstring SQL-bias note", "status": "pending", "wave": 1, "parallelGroup": "wave-1"},
{"id": 79, "subject": "T79: _witness_role_for defensive None handling", "status": "pending", "wave": 1, "parallelGroup": "wave-1"},
{"id": 80, "subject": "T80: scene_summarize.py polish bundle (re-close suffix + transcript scoping + log + clock + tests)", "status": "pending", "wave": 2, "parallelGroup": null},
{"id": 81, "subject": "T81: ChatNotFoundError replaces string-prefix sniff in skip routes", "status": "pending", "wave": 3, "parallelGroup": null},
{"id": 82, "subject": "T82: post_turn wiring (consume_pending_meanwhile_digests + skip command scene close)", "status": "pending", "wave": 4, "parallelGroup": null},
{"id": 83, "subject": "T83: regenerate.py polish (cancel hook + DRY + sibling query + lifecycle rollback note + ordering)", "status": "pending", "wave": 5, "parallelGroup": null, "blockedBy": [82]},
{"id": 84, "subject": "T84: unified record_turn_memory API with you_present kwarg", "status": "pending", "wave": 6, "parallelGroup": "wave-6", "blockedBy": [83]},
{"id": 85, "subject": "T85: JSON-build audit + meanwhile stop-button route-level test", "status": "pending", "wave": 6, "parallelGroup": "wave-6", "blockedBy": [83]},
{"id": 86, "subject": "T86: frontend turn_html_replace SSE handler", "status": "pending", "wave": 6, "parallelGroup": "wave-6", "blockedBy": [83]},
{"id": 87, "subject": "T87: docs sweep — prune CLAUDE.md backlogs, capture residuals", "status": "pending", "wave": 7, "parallelGroup": null, "blockedBy": [84, 85, 86]}
],
"lastUpdated": "2026-04-26T00:00:00Z",
"notes": "12 tasks across 7 waves consolidating 17 backlog items (7 from Phase 2.6/3, 10 from Phase 3.5/4). Wave 1 (4-way parallel) and Wave 6 (3-way parallel) are file-disjoint. Waves 2, 3, 4, 5, 7 are single-task by hot-file constraint. Bundled tasks (T80, T82, T83) split into per-item sub-commits for clean review bisection. No schema migrations — schema baseline stays at version 11. Uses task ids T76-T87 to avoid id collision with prior phases (Phase 1: T0-T35, Phase 2: T36-T48, Phase 3: T49-T67, Phase 2.5: T68-T75)."
}