Files

T

Joseph Doherty fb7e97260b docs: add Phase 3.5 cleanup plan (Phase 2.6/3 + 3.5/4 backlog burn-down)

12 tasks across 7 waves consolidating the 17-item backlog tracked in
CLAUDE.md (7 from Phase 2.6/3 + 10 from Phase 3.5/4). Items are
grouped by file ownership so each wave stays file-disjoint:

- Wave 1 (parallel 4-way): trivial single-line/single-file fixes
  (timeout_s plumbing, Literal type, docstring, defensive None)
- Wave 2 (single): scene_summarize.py polish bundle (5 T58 items)
- Wave 3 (single): typed ChatNotFoundError for skip routes
- Wave 4 (single): turns.py wiring (consume_pending_meanwhile_digests
  + natural-language skip runs scene close detection)
- Wave 5 (single): regenerate.py polish (cancel hook + DRY +
  sibling query + lifecycle rollback documentation + ordering)
- Wave 6 (parallel 3-way): unified record_turn_memory API + JSON
  audit + frontend turn_html_replace SSE handler
- Wave 7 (single): docs sweep

No schema migrations. Bundled tasks split into per-item sub-commits
for clean review bisection. Uses task ids T76-T87 to avoid collision
with prior phases (Phase 3 used T49-T67, Phase 2.5 used T68-T75).

2026-04-26 21:33:16 -04:00

39 KiB

Raw Blame History

Roleplay Engine — Phase 3.5 Cleanup Plan

For Claude: REQUIRED SUB-SKILL: Use superpowers-extended-cc:executing-plans to implement this plan task-by-task. Use the parallel-dispatch pattern documented under "Parallel-Execution Strategy" for waves that fan out to multiple subagents.

Goal: Burn down the combined Phase 2.6/3 + Phase 3.5/4 backlog tracked in CLAUDE.md. 17 follow-up items consolidated into 12 tasks (file-disjoint where possible) across 7 waves so several can run in parallel.

Architecture: No new architecture, no schema migrations, no new event kinds. Every change here is either a polish on an existing service/route, a refactor for DRY/typing/observability, or a UI affordance for state that already exists (frontend turn_html_replace consumer).

Tech Stack: Same as Phase 3. No new dependencies.

Source-of-truth references:

Backlog list: CLAUDE.md §"Phase 2.6 / 3 backlog" (7 items, 6 actionable + 1 deferred) + §"Phase 3.5 / 4 backlog" (~13 items grouped by source review) = 17 items net.
Conventions: CLAUDE.md §"Behavioral defaults" + §"Phase 3 status".
Phase 2.5 cleanup plan (style, file-bundling pattern): 2026-04-26-v2.5-phase2.5-cleanup.md.

When a task says "see §X", that's the requirements doc unless stated otherwise.

Pre-flight

Branch: create phase-3.5 from the latest main after Phase 3 has merged (it has — main is at 753cec3):

git checkout main && git pull && git checkout -b phase-3.5

Schema baseline: Phase 3 leaves the DB at version 11. Phase 3.5 adds no migrations. Schema-version assertion in tests/test_world.py stays at 11.

Pinned non-negotiables (carried forward):

State changes go through the event log. Use append_and_apply(conn, kind, payload) for the live path; apply_event only after a fresh append_event returning the new id.
Witness filter every memory read at SQL level (hard WHERE constraint; never a soft signal).
Edges are directed; botA → botB and botB → botA are independent records.
Per-POV scene summaries — never write omniscient narration.
TDD: every task starts with a failing test (or a regression test pinning existing contract before refactor).
One commit per task minimum. Tasks that bundle multiple backlog items SHOULD split commits within the task — one commit per item — so review can bisect cleanly.

Verification before claiming done: Use superpowers-extended-cc:verification-before-completion — run the test command, paste actual output. Don't assume green.

Backlog item → task mapping

17 items consolidated into 12 tasks by file ownership (so each wave's tasks stay file-disjoint). Bundled tasks split commits internally.

#	Backlog item	Source	Task
1	`narrate_skip` `timeout_s` not piped through	T53 review	T76
2	`AddresseeDecision.confidence` should be `Literal[...]`	T74 review (Phase 2.5 carry-over)	T77
3	`search_memories` docstring missing SQL-bias note	T57 review	T78
4	`_witness_role_for` defensive coding for `host_bot_id is None`	T71 review (Phase 2.5 carry-over)	T79
5	Scene close re-close suffix bloat risk	T58 review	T80
6	Thread detection transcript scoping (chat-wide vs scene-scoped)	T58 review	T80
7	Swallowed `Exception` in `detect_threads` try/except — log at debug	T58 review	T80
8	Scene close `closed_at` clock divergence	T58 review	T80
9	T58 test coverage gaps (truncation, update/close, fallback)	T58 review	T80
10	Error-message prefix sniff for 404 vs 400 routing in skip routes	T62 review	T81
11	`consume_pending_meanwhile_digests` not wired into `post_turn`	T66 integration tests	T82
12	Skip command bypasses scene close detection	T62 review	T82
13	Cancel/stop hook for in-flight regenerate streams	T73 review (Phase 2.5 carry-over)	T83
14	DRY: regenerate vs post_turn (recent-dialogue + prior-edges duplication)	T73 review (Phase 2.5 carry-over)	T83
15	Sibling-discovery query optimization in regenerate.py	T73 review (Phase 2.5 carry-over)	T83
16	Regenerate doesn't roll back lifecycle transitions from superseded turn	T61 review	T83
17	Asymmetry in event-detection ordering (turns.py vs regenerate.py)	T61 review	T83
18	`record_meanwhile_memory` / `record_turn_memory_for_present` unified API	T64 review	T84
19	`participants_json` JSON-build audit (other state modules)	T63 review	T85
20	Stop-button cancellation route-level coverage for meanwhile turns	T64 review	T85
21	Frontend handler for `turn_html_replace` SSE event	T73.1 review (Phase 2.5 carry-over)	T86
—	Docs sweep — remove shipped items, capture residuals	(this plan)	T87

Deferred (not in this plan):

Cross-feature canned-queue brittleness (Wave 6b cross-feature): would require a structured-fixture refactor across 3+ test files. Tracking but not in scope here — open as its own work if it surfaces a real test-maintenance pain.
Scene-close-on-cancel UX revisit (T74.3): pinned to existing behavior; no action needed unless real play surfaces a regression.

Parallel-Execution Strategy

Same pattern as Phases 2.5 and 3. Seven waves: parallel within each wave (file-disjoint), serial across waves.

How to dispatch a wave in parallel

Use the Agent tool with isolation: "worktree" so each subagent gets its own git worktree. (If the controlling session's working directory is not the chat repo, create worktrees manually with git worktree add .worktrees/<wave>-<task> -b <wave>/<task> phase-3.5 from inside the chat repo and pass the worktree path explicitly into each subagent prompt.)

In a single message, dispatch all tasks in the wave:

Agent({
  description: "Wave 1 — T76 narrate_skip timeout_s",
  subagent_type: "general-purpose",
  isolation: "worktree",
  prompt: "<full task text from below>",
})
Agent({ ...T77... })
Agent({ ...T78... })
Agent({ ...T79... })

After a wave completes

Each subagent returns its worktree path and commit SHA(s).
Run a spec + code-quality reviewer subagent on each completed task. Combined review acceptable for purely mechanical fixes (T76–T79). Separate spec + quality reviewers for bundled tasks (T80, T82, T83).
Merge the wave into phase-3.5 in any order (file-disjointness guarantees no conflict). Use --no-ff.
Run the full test suite on the merged phase-3.5. If red, the wave's mutual-independence assumption was violated — bisect, fix, re-merge.
Push phase-3.5 to gitea.
Optionally clean up worktrees: git worktree remove .worktrees/<branch> and git branch -D <branch>.

Conflict prevention checklist

For each parallel wave, verify the Files sections of all tasks have no overlapping paths. Hot files in this plan: chat/web/turns.py (T82 only), chat/services/regenerate.py (T83 only), chat/web/skip.py + chat/web/drawer.py (T81 only). Each is owned by exactly one task.

Failure recovery

If one subagent fails: cancel it, merge the others' successful work, re-dispatch the failed task as a single follow-up. Don't block the wave.

Why each wave is parallel-safe

Wave	Tasks	Hot files touched	Disjoint?
1	T76, T77, T78, T79	4 different services/state files; no overlap	✅
2	T80	`chat/services/scene_summarize.py`	(single task)
3	T81	`chat/web/skip.py` + `chat/web/drawer.py` (typed exception)	(single task)
4	T82	`chat/web/turns.py`	(single task)
5	T83	`chat/services/regenerate.py` (+ optional new shared-helpers module)	(single task)
6	T84, T85, T86	`chat/services/memory_write.py` (T84); audit + tests (T85); frontend HTML/JS (T86)	✅
7	T87	`CLAUDE.md`	(single task)

Task overview

Wave 1 ─┬─ T76: narrate_skip timeout_s plumbed through (skip_narration.py)
        ├─ T77: AddresseeDecision.confidence as Literal (addressee.py)
        ├─ T78: search_memories docstring SQL-bias note (memory.py)
        └─ T79: _witness_role_for defensive None handling (prompt.py)

Wave 2 ─── T80: scene_summarize.py polish bundle (5 T58 items)

Wave 3 ─── T81: typed exception for skip controllers (skip.py + drawer.py)

Wave 4 ─── T82: turns.py wiring (consume_pending_meanwhile_digests + skip-bypass scene close)

Wave 5 ─── T83: regenerate.py polish bundle (cancel hook + DRY + sibling query +
                lifecycle rollback + ordering asymmetry)

Wave 6 ─┬─ T84: memory_write.py unified record-memory API
        ├─ T85: JSON-build audit + meanwhile stop-button route-level test
        └─ T86: frontend turn_html_replace SSE handler

Wave 7 ─── T87: docs sweep — prune CLAUDE.md backlogs, capture residuals

Critical path: 7 sequential merge points. Total tasks: 12. Wall-clock parallelism advantage: Waves 1 and 6 dispatch concurrently (4-way and 3-way respectively). Waves 2, 3, 4, 5, 7 are single-task by hot-file constraint.

Wave 1 — Independent small fixes (parallel, 4 tasks)

All 4 tasks are tiny (1-line + tests). Fully file-disjoint.

Task 76: `narrate_skip` timeout_s plumbed through

Files:

Modify: chat/services/skip_narration.py
Modify: tests/test_skip_narration.py (add 1 test)

Spec: narrate_skip(timeout_s=60.0) accepts the parameter but doesn't pass it to client.generate(...). Per T53 review — fix:

# Inside narrate_skip:
result = await client.generate(
    [
        Message(role="system", content=system),
        Message(role="user", content=user),
    ],
    model=narrative_model,
    max_tokens=200,
    temperature=0.7,
    timeout_s=timeout_s,   # NEW: pipe through
)

If the underlying client.generate doesn't honor timeout_s (Featherless client signature accepts **params so it'll ride through harmlessly), this is a no-op at runtime but documents intent. Read chat/llm/featherless.py::FeatherlessClient.generate to confirm.

Test added: test_narrate_skip_passes_timeout_through — capture the kwargs passed to client.generate via a custom mock that records them; assert timeout_s is present with the expected value.

Commit: fix: plumb narrate_skip timeout_s through to client.generate (T76).

Task 77: `AddresseeDecision.confidence` as Literal type

Files:

Modify: chat/services/addressee.py
Modify: tests/test_addressee.py (existing tests should still pass; add 1 test for the typed validation)

Spec: Per T74 review (Phase 2.5 carry-over). AddresseeDecision.confidence is currently str with a comment noting valid values. Tighten to Literal["high", "medium", "low"]:

from typing import Literal

class AddresseeDecision(BaseModel):
    addressee_id: str
    confidence: Literal["high", "medium", "low"] = "medium"
    reason: str = ""

Pydantic will reject classifier output with values outside the literal set, falling back via the default= in classify(...). The existing fallback path returns confidence="low" so no behavior change for legitimate calls.

Test added: test_invalid_confidence_value_falls_back_to_default — feed canned classifier output with "confidence": "VERY_HIGH"; assert the result is the default fallback (Pydantic validation failed → classify falls back).

Commit: fix: AddresseeDecision.confidence as Literal[high|medium|low] (T77).

Task 78: `search_memories` docstring mentions SQL-bias

Files:

Modify: chat/state/memory.py — extend the search_memories docstring with a one-liner about SIGNIFICANCE_RANK_BIAS.
No test changes needed (docstring-only).

Spec: Per T57 review. The current search_memories docstring describes only the Python-side composite re-rank with _SIGNIFICANCE_WEIGHT. T57 added a SQL-side SIGNIFICANCE_RANK_BIAS constant. Add to the docstring:

... (existing prose) ...

The result ordering applies TWO independent significance boosts:
- SQL-side: ``ORDER BY (rank - significance * SIGNIFICANCE_RANK_BIAS)``
  pushes higher-significance memories ahead in the FTS5 candidate set.
- Python-side: a composite re-rank with `_SIGNIFICANCE_WEIGHT` reinforces
  the ordering after candidate retrieval.

Commit: docs: search_memories docstring mentions SQL-side significance bias (T78).

Task 79: `_witness_role_for` defensive None handling

Files:

Modify: chat/services/prompt.py
Modify: tests/test_prompt.py (add 1 test)

Spec: Per T71 review (Phase 2.5 carry-over). Current _witness_role_for(speaker_bot_id, host_bot_id) returns "host" if speaker_bot_id == host_bot_id else "guest". When host_bot_id is None (degenerate case — can happen if a chat row is half-seeded in a test), this returns "guest" even though the speaker is logically the host. Defensive fix:

def _witness_role_for(speaker_bot_id: str, host_bot_id: str | None) -> str:
    """..."""
    if host_bot_id is None or speaker_bot_id == host_bot_id:
        return "host"
    return "guest"

Test added: test_witness_role_for_none_host_returns_host — call helper with host_bot_id=None; assert returns "host".

Commit: fix: _witness_role_for defensive None handling (T79).

Wave 2 — `scene_summarize.py` polish bundle (single task)

T80 bundles 5 backlog items (4 fixes + 1 test gap) all touching chat/services/scene_summarize.py and its test file. Single task by hot-file constraint; split into 5 commits internally for clean review bisection.

Task 80: scene_summarize.py polish

Files:

Modify: chat/services/scene_summarize.py
Modify: tests/test_per_pov_summary.py

Spec: Five sub-fixes per the T58 review.

80.1 — Re-close suffix bloat guard

Problem: _build_key_quotes_suffix reads from memories.pov_summary. If a scene close runs twice (replay, manual re-close, idempotent retry), the second pass reads the rewritten text plus the previous "Key quotes:" suffix and appends a second one — recursive bloat.

Fix (recommended option): source quotes from event_log assistant_turn/user_turn text instead of memories.pov_summary. The event-log text is immutable per turn, so re-close produces identical key-quote text.

Implementation: in _build_key_quotes_suffix(conn, scene_id), replace the SELECT from memories with a JOIN that pulls turn text from event_log filtered by scene_id (via payload_json JSON extraction), keeping the significance-from-memories filter.

If event-log queries prove too expensive or fragile, fall back to: at the start of _build_key_quotes_suffix, strip any existing "\n\nKey quotes:\n" suffix from each pov_summary before computing the new one. Document the choice in a code comment.

Test added: test_scene_close_re_run_does_not_double_suffix — close a scene with significance ≥ 2; assert each pov_summary contains "Key quotes:" exactly ONCE. Then call apply_scene_close_summary again on the same scene; assert each pov_summary STILL contains "Key quotes:" exactly ONCE (no second append).

Commit: fix: guard scene close key-quote suffix against re-close bloat (T80.1).

80.2 — Thread detection transcript scoping

Problem: _read_recent_dialogue returns chat-wide history with no scene_id filter. Feeding chat-wide history to detect_threads will misattribute threads to the closing scene when a scene boundary falls inside the last 50 turns.

Fix: add a scene_started_at lookup at the top of apply_scene_close_summary. Filter the transcript to turns where the event_log row's created_at (or chat_clock_at) is >= scene_started_at. Read chat/state/world.py::get_scene for the started_at column.

If get_scene doesn't expose started_at, query directly: SELECT started_at FROM scenes WHERE id = ?.

Test added: test_thread_detection_uses_scene_scoped_transcript — seed a scene with 3 turns. Then close the scene. Then start a new scene and seed 3 more turns. Then close the second scene. Mock detect_threads to capture its scene_transcript arg. Assert the second close's transcript contains ONLY the 3 second-scene turns (not the first scene's turns).

Commit: fix: scope thread detection transcript to closing scene (T80.2).

80.3 — Log swallowed exceptions in `detect_threads` try/except

Problem: T58's try/except Exception: around detect_threads swallows programmer errors silently.

Fix: add a logging.getLogger(__name__).debug("detect_threads failed: %s", exc, exc_info=True) (or appropriate level) inside the except block. Use the standard library logging module (it's already configured by Phase 1's chat.app).

Test added: test_detect_threads_failure_is_logged — patch detect_threads to raise RuntimeError("test"). Use caplog (pytest fixture) to capture log output. Assert the log contains the error message.

Commit: fix: log swallowed exceptions in detect_threads try/except (T80.3).

80.4 — Scene close `closed_at` chat-clock semantics

Problem: T58 uses datetime.now(timezone.utc).isoformat() for closed_at on emitted events instead of chat-clock time. Diverges from chat-clock semantics elsewhere.

Fix: read chat["time"] at the top of apply_scene_close_summary (already loaded for per-POV writes via chat_clock_at). Pass chat_clock_at through to thread_closed events' closed_at field instead of datetime.now(...).

Test added: test_thread_closed_uses_chat_clock_time — seed a chat with chat["time"] = "2026-04-26T10:00:00+00:00". Mock detect_threads to return one ThreadCandidate(action="close", existing_thread_id="thr_x"). Close the scene. Inspect the emitted thread_closed event payload — assert closed_at == "2026-04-26T10:00:00+00:00", NOT a datetime.now(...) value.

Commit: fix: thread_closed uses chat-clock time, not wall clock (T80.4).

80.5 — T58 test coverage gaps

Spec: Per T58 review, these specific scenarios lack coverage. Add tests:

test_key_quote_truncation_at_200_chars — seed a memory with significance 2 and a 500-char pov_summary. Close scene. Assert the key-quote bullet for that memory is exactly 200 chars (or 199 + ellipsis, depending on the truncation impl).
test_thread_detection_update_candidate_emits_thread_updated — mock detect_threads to return ThreadCandidate(action="update", existing_thread_id="thr_x", summary="updated"). Close. Assert a thread_updated event landed.
test_thread_detection_close_candidate_emits_thread_closed — same setup with action="close". Assert a thread_closed event landed.

(The try/except fallback test from 80.3 covers the exception path; no separate test needed.)

Commit: test: T58 coverage gaps (truncation, update/close paths) (T80.5).

Verification gates for T80

All 5 sub-fix commits cleanly split per git show.
All 13 existing per_pov_summary tests still pass.
5+ new tests pass.
Full suite green.
git diff phase-3.5 --stat shows ONLY chat/services/scene_summarize.py and tests/test_per_pov_summary.py.

Wave 3 — Typed exception for skip controllers (single task)

Task 81: `ChatNotFoundError` for skip route routing

Files:

Modify: chat/web/skip.py — define a typed exception and raise it instead of generic ValueError("chat not found").
Modify: chat/web/drawer.py — replace string-prefix sniff with except ChatNotFoundError: 404.
Modify: tests/test_drawer_events_threads_skip.py (add 1 test).

Spec: Per T62 review. Drawer skip routes use str(exc).startswith("chat not found") to distinguish 404 from 400. Fragile if error wording changes. Replace with a typed exception subclass.

# chat/web/skip.py top:
class ChatNotFoundError(Exception):
    """Raised when a chat_id doesn't resolve. Maps to 404 in HTTP routes."""

# Inside process_elision_skip / process_jump_skip:
chat = get_chat(conn, chat_id)
if chat is None:
    raise ChatNotFoundError(f"chat {chat_id!r} not found")

# Other validation failures still raise ValueError (mapped to 400).

# chat/web/drawer.py skip routes:
try:
    result = await process_elision_skip(...)
except ChatNotFoundError:
    raise HTTPException(status_code=404, detail="chat not found")
except ValueError as exc:
    raise HTTPException(status_code=400, detail=str(exc))

Test added: test_skip_route_raises_404_via_typed_exception — POST to /chats/nonexistent/drawer/skip/elision; assert response 404. (Existing tests should still pass — they check 404 for missing chats and 400 for validation errors.)

Commit: fix: typed ChatNotFoundError replaces string-prefix sniff in skip routes (T81).

Wave 4 — `turns.py` wiring (single task)

Task 82: post_turn meanwhile-digest consumption + skip-command scene close

Files:

Modify: chat/web/turns.py
Modify: tests/test_turn_flow.py (add 2 tests)
Modify: tests/test_phase3_integration.py (T66 test 4 currently calls consume_pending_meanwhile_digests directly — update it to assert the helper is now invoked AUTOMATICALLY by post_turn)

Spec: Two related fixes both touching post_turn.

82.1 — Wire `consume_pending_meanwhile_digests` into post_turn

Problem (T66): chat/services/prompt.py::consume_pending_meanwhile_digests is defined but never called. Meanwhile digests stay pending forever in production.

Fix: call the helper at the END of post_turn (after the assistant_turn lands and all classifier passes finish, BEFORE the response returns). The helper appends meanwhile_digest_consumed events for each pending digest, marking them as surfaced. Idempotent — re-calling produces zero events.

Place the call only when the chat has just-surfaced digests. The simplest pattern: always call after the primary turn — the helper short-circuits when nothing's pending. (Cost: 1 trivial DB query per turn.)

from chat.services.prompt import consume_pending_meanwhile_digests

# Near the end of post_turn, before publish/return:
consume_pending_meanwhile_digests(conn, chat_id)

Test added: test_post_turn_consumes_pending_meanwhile_digests — seed a pending digest. POST a turn. Assert: a meanwhile_digest_consumed event landed in event_log AND list_pending_meanwhile_digests is now empty.

Update existing T66 test 4: the integration test currently calls consume_pending_meanwhile_digests directly to drive the consumption side-effect. After T82, the consumption happens automatically inside post_turn. Remove the explicit call in the test (or assert the count even without calling explicitly).

Commit: fix: post_turn consumes pending meanwhile digests (T82.1).

82.2 — Skip command runs scene close detection

Problem (T62): A user typing "skip an hour" via natural-language skip causes the skip path to bypass scene close detection. The user's prose may have signaled close intent ("fade out, skip an hour") but the skip path returns 204 without checking.

Fix: in the natural-language skip dispatch (T62 intent="skip_elision" branch), call detect_scene_close BEFORE the skip controller runs. If close is detected, append scene_closed first, then run the skip flow.

Order matters: scene close → skip narration → time advance. The narration should reflect the new scene start at the new time.

if parsed.intent == "skip_elision":
    # Run scene close detection first - the user may have signaled close.
    if scene := active_scene(conn, chat_id):
        close_decision = await detect_scene_close(client, classifier_model=..., 
                                                  prose=parsed.prose, ...)
        if close_decision.should_close:
            append_and_apply(conn, kind="scene_closed", payload={"scene_id": scene["id"], ...})
            await apply_scene_close_summary(conn, client, ...)
    # ... existing skip dispatch ...

Test added: test_natural_language_skip_with_close_signal_closes_scene — user prose "fade out, skip to morning". Mock detect_scene_close to return True. Mock narrate_skip. Assert: scene_closed event lands AND time_skip_elision event lands AND ordering is scene_closed BEFORE time_skip_elision.

Commit: fix: natural-language skip runs scene close detection (T82.2).

Verification gates for T82

2 new tests pass.
All existing turn_flow + phase3_integration tests still pass (T66 test 4 may need a small update; document in commit).
Full suite green.
git diff phase-3.5 --stat: only chat/web/turns.py, tests/test_turn_flow.py, tests/test_phase3_integration.py.

Wave 5 — `regenerate.py` polish bundle (single task)

T83 bundles 5 regenerate-related backlog items. All touch chat/services/regenerate.py. Single task by hot-file constraint; 5 internal commits.

Task 83: regenerate.py polish

Files:

Modify: chat/services/regenerate.py
Optional: create chat/services/turn_common.py (new shared-helpers module for 83.2 DRY extraction)
Modify: chat/web/turns.py (only if 83.2 extracts shared helpers — minimal)
Modify: tests/test_regenerate.py (add 5+ tests)

Spec: Five sub-fixes per the T73 + T61 reviews.

83.1 — Cancel/stop hook for in-flight regenerate streams

Problem: post_turn registers stream tasks in _in_flight_tasks so /turns/cancel cancels them. Regenerate doesn't.

Fix: mirror the registration pattern from post_turn (Phase 2.5 T74.4 documented this). Wrap the regenerate stream in asyncio.create_task(...), register in _in_flight_tasks[chat_id], await, unregister in finally.

Test added: test_regenerate_registers_task_in_in_flight_tasks — mid-stream snapshot pattern (mirror Phase 3 T64.fix-up's _SnapshotMock). Assert the chat_id is registered during streaming.

Commit: feat: regenerate registers stream task in _in_flight_tasks for cancellation (T83.1).

83.2 — DRY: extract recent-dialogue + prior-edges helpers

Problem: Recent-dialogue assembly + prior-edges block are duplicated between chat/services/regenerate.py and chat/web/turns.py. Diff drift risk.

Fix: create chat/services/turn_common.py with:

def read_recent_dialogue(conn, chat_id: str, limit: int = 50) -> list[dict]:
    """Pull the last N user_turn + assistant_turn rows for the chat as
    structured [{speaker, text}, ...]. Filters superseded_by IS NULL
    AND hidden = 0. Used by post_turn AND regenerate AND scene_summarize."""

def gather_prior_edges(conn, chat_id: str, present_ids: list[str]) -> dict[tuple, dict]:
    """Build {(src, tgt): {affinity, trust, summary}} dict for all directed
    pairs where both endpoints are in present_ids."""

Update chat/web/turns.py::post_turn to use the new helpers. Update chat/services/regenerate.py::regenerate_assistant_turn to use them too. Existing tests should pass unchanged.

Test added: tests/test_turn_common.py (NEW file) with 2-3 tests: dialogue read filters superseded; prior_edges builds correct keys.

Commit: refactor: extract turn_common helpers from regenerate + turns (T83.2).

83.3 — Sibling-discovery query optimization

Problem (T73): regenerate.py's sibling-assistant-turn lookup scans ALL non-superseded assistant_turn rows globally with no chat_id predicate.

Fix: add a json_extract(payload_json, '$.chat_id') = ? predicate in the SQL query, plus LIMIT 50 for safety. Mirror Phase 3 T64's _last_meanwhile_speaker SQL pattern.

Test added: test_regenerate_sibling_lookup_scoped_to_chat — seed two chats both with regenerate-able turn groups. Regenerate in chat A. Assert the sibling lookup didn't return any chat-B rows (verify via SQL query interception or by ensuring cross-chat sibling pollution wouldn't break).

Commit: perf: scope regenerate sibling-lookup to chat_id (T83.3).

83.4 — Lifecycle-transition rollback on regenerate

Problem (T61): Regenerate doesn't roll back lifecycle transitions from the superseded turn. event_started/event_completed rows from a superseded turn remain. May double-emit promotion artifacts.

Fix (Phase 3.5 first cut — minimal): at the START of regenerate, scan the superseded turn's downstream events. For any event_started/event_completed/event_cancelled linked to the superseded turn (via superseded_by chain on the original assistant_turn AND a back-pointer in the lifecycle event payload), revert the projected state by appending compensating events:

Original was event_started for evt_x → emit event_started_undone (NEW event kind?) OR mark the original superseded.
Original was event_completed → similarly.

This is too invasive for one Phase 3.5 commit. Instead, the simpler Phase 3.5 fix: document the limitation explicitly in the regenerate flow with a TODO + add an assert or warning when regenerate detects prior lifecycle transitions on the superseded turn. Real rollback support is a Phase 4 item.

Test added: test_regenerate_with_prior_lifecycle_logs_warning — seed a turn that emits event_completed. Regenerate. Assert a warning log entry mentions the un-rolled-back transition.

Commit: chore: document regenerate lifecycle-rollback limitation (T83.4).

(If the implementer judges that proper rollback IS feasible in the time available, they can implement it instead — but the safe Phase 3.5 default is documentation + warning. The choice is the implementer's; report which path was taken.)

83.5 — Event-detection ordering symmetry

Problem (T61): post_turn runs lifecycle BETWEEN interjection and scene-close; regenerate.py runs lifecycle at the END. Benign because regenerate has no scene-close path, but worth tidying.

Fix: move the event-detection block in regenerate.py to mirror post_turn's position (between interjection regenerate and the function's tail). Cosmetic — no behavior change.

Test: verify all existing regenerate tests still pass after the move.

Commit: refactor: regenerate event-detection ordering mirrors post_turn (T83.5).

Verification gates for T83

5 internal commits visible.
All 6 existing regenerate tests still pass.
All 10+ existing turn_flow tests still pass.
5+ new tests pass.
Full suite green.
git diff phase-3.5 --stat: chat/services/regenerate.py, chat/services/turn_common.py (if extracted), chat/web/turns.py (only if helper-call sites updated), tests/test_regenerate.py, tests/test_turn_common.py (if new).

Wave 6 — Final polish (parallel, 3 tasks)

Task 84: Unified record-memory API

Files:

Modify: chat/services/memory_write.py
Modify: tests/test_memory_write.py (add tests)
Modify: chat/web/meanwhile.py (call site update — minimal)

Spec: Per T64 review. record_meanwhile_memory and record_turn_memory_for_present share private _write_one_memory but expose two separate public APIs. Unify:

def record_turn_memory(
    conn,
    *,
    chat_id: str,
    host_bot_id: str,
    guest_bot_id: str | None,
    narrative_text: str,
    scene_id: int | None = None,
    chat_clock_at: str | None = None,
    source: str = "direct",
    significance: int = 1,
    you_present: bool = True,   # NEW: False for meanwhile turns
) -> dict[str, tuple[int, int | None]]:
    """Single entry-point for memory writes. When you_present is True,
    witnesses are [you=1, host=1, guest if present]. When False (meanwhile),
    witnesses are [you=0, host=1, guest=1] and host_bot_id + guest_bot_id
    are both required."""

record_meanwhile_memory becomes a thin wrapper (kept for backward compat) that calls record_turn_memory(..., you_present=False). record_turn_memory_for_present is renamed/aliased to record_turn_memory for the same reason.

Update chat/web/meanwhile.py to call the unified function with you_present=False. Verify the existing meanwhile turn tests still pass.

Tests added: 2 new tests in tests/test_memory_write.py:

test_record_turn_memory_you_present_false_writes_meanwhile_witness_mask — assert witness [0, 1, 1].
test_record_turn_memory_you_present_true_default_writes_normal_witness_mask — assert [1, 1, 0|1].

Commit: refactor: unified record_turn_memory API with you_present kwarg (T84).

Task 85: JSON-build audit + meanwhile stop-button route-level test

Files:

Audit: read all chat/state/*.py for f-string JSON construction. NO PRODUCTION CHANGES unless an issue is found.
Modify (only if issue found): the affected state module.
Modify: tests/test_meanwhile_turn_flow.py (add stop-button route-level test).

Spec: Two unrelated minor follow-ups bundled by being small.

85.1 — JSON-build audit (T63)

T63 originally used f-string interpolation for participants_json (fixed during review). Audit other state modules for the same pattern. Grep:

grep -rn 'f["\']\[' chat/state/ chat/services/ chat/eventlog/

For each f-string that constructs a JSON string from user-controllable data, replace with json.dumps(...). If NONE found (likely — most code already uses json.dumps), add a one-line note in the commit body confirming the audit.

If issues are found, fix them and add tests asserting JSON validity for inputs containing quote/backslash characters.

Commit: chore: audit JSON-string construction sites; replace any f-string usages with json.dumps (T85.1) (or "audit confirms no issues").

85.2 — Meanwhile stop-button route-level test

T64's fix-up registered stream tasks in _in_flight_tasks. There's a unit test pinning registration but no test that drives /turns/cancel against a meanwhile turn.

Test added: test_meanwhile_turn_can_be_cancelled_via_route — start a meanwhile turn (POST /turns), capture the in-flight task, immediately POST /turns/cancel, assert the task transitions to cancelled state and the assistant_turn payload has truncated=True.

This test may be tricky to write reliably (timing). If the existing _SnapshotMock pattern doesn't support mid-stream cancel injection, document the limitation and add a smaller test that just verifies the cancel endpoint works on the chat (asserts no tasks remaining after cancel).

Commit: test: meanwhile turn cancellation via /turns/cancel route (T85.2).

Task 86: Frontend `turn_html_replace` SSE handler

Files:

Modify: chat/templates/chat.html (or wherever the chat page's SSE consumer lives) — add a JS event handler for turn_html_replace.
Optional: chat/static/app.css (if styling needs adjustment for replaced turns).
Modify: tests/test_streaming_ux.py (add 1 test).

Spec: Per Phase 2.5 T73.1 (carry-over). Regenerate's backend broadcasts a turn_html_replace SSE event but no live tab swaps the prior turn. Wire the JS:

<!-- In chat.html, near the existing SSE setup: -->
<script>
const sse = new EventSource('/chats/{{ chat.id }}/events');

// Existing: turn_html appends new turn HTML.
sse.addEventListener('turn_html', (ev) => {
  const turnsContainer = document.getElementById('turns');
  turnsContainer.insertAdjacentHTML('beforeend', ev.data);
});

// NEW: turn_html_replace swaps an existing turn's DOM in place.
sse.addEventListener('turn_html_replace', (ev) => {
  const data = JSON.parse(ev.data);
  // data: {html, turn_id, supersedes_id}
  const oldNode = document.getElementById(`turn-${data.supersedes_id}`);
  if (oldNode) {
    const tmpl = document.createElement('template');
    tmpl.innerHTML = data.html.trim();
    oldNode.replaceWith(tmpl.content.firstChild);
  } else {
    // Fallback: append if the prior turn isn't in the DOM (edge case).
    document.getElementById('turns').insertAdjacentHTML('beforeend', data.html);
  }
});
</script>

(Adapt the actual template structure — read chat/templates/chat.html first to see the existing SSE setup. The turn-{id} DOM-id pattern needs to be consistent with how the backend renders turn HTML — verify by checking chat/web/render.py.)

Test added: Hard to write a true browser-driven test in pytest. The simplest pin: a unit test that asserts the rendered HTML for a regenerate response contains the turn_html_replace SSE event AND the event payload contains the expected supersedes_id. (Already exists in T73.1 tests — don't duplicate.) Optionally add a Selenium/playwright test if the project has frontend testing infrastructure.

If pytest-only is the available coverage, document the manual smoke step in CLAUDE.md ("multi-tab regenerate" smoke test) and rely on backend pin tests + manual verification.

Commit: feat: frontend turn_html_replace SSE handler (T86).

Wave 7 — Docs sweep (single task)

Task 87: CLAUDE.md backlog burn-down

Files:

Modify: CLAUDE.md

Spec: Walk through the Phase 2.6/3 backlog and Phase 3.5/4 backlog sections. For each item shipped in T76–T86, remove from backlog list. Add a new section "Phase 3.5 status" near the existing "Phase 3 status" listing what shipped. Add a new "Phase 3.6 / 4 backlog" section if any new items were discovered during T76–T86 reviews (likely some — the review cycle always surfaces fresh follow-ups).

If any task during T76–T86 chose to defer a sub-item (e.g., T83.4 chose documentation over rollback impl), keep that sub-item in the new "Phase 3.6+ deferred" section with the rationale.

Mark §13 Phase 3 deliverables in the requirements doc as shipped (already done in T67) — verify and don't re-mark.

Commit: docs: phase 3.5 status, prune shipped backlog items (T87).

Wrap-up

After Wave 7 lands:

Run full suite on phase-3.5: should be ~330+ tests passing (315 from Phase 3 + ~15-25 new across the wave).
Manual smoke (recommended before opening the PR):
- Multi-tab regenerate: open two tabs on the same chat; click Regenerate on one; verify the other tab swaps the regenerated turn live (T86).
- Trigger a meanwhile scene; close it; resume the main chat; verify the digest renders ONCE on the next you-turn AND is consumed automatically (T82.1).
- Type "fade out, skip an hour" — verify both scene close + skip fire in correct order (T82.2).
- Plan an event, complete it via a turn, then regenerate that turn — observe the warning log mentioning un-rolled-back lifecycle transitions (T83.4).
Push phase-3.5 to gitea.
Open PR phase-3.5 → main.
Phase 3.6+ residuals (track in CLAUDE.md): genuine lifecycle-rollback impl, structured fixture builder for canned-queue tests, scene-close-on-cancel revisit if play-testing surfaces a regression.

Notes for the controller running this plan

Don't dispatch Wave 5 until Wave 4 is merged AND green. T82 (turns.py) + T83 (regenerate.py + optionally turns.py for shared-helper extraction) both touch turns.py. Sequential ordering prevents merge conflict.
After each parallel wave, run a code-review subagent. Combined spec+quality review is acceptable for trivial tasks (T76, T77, T78, T79). Separate spec + quality reviewers for bundled tasks (T80, T82, T83) — each bundles 4-5 sub-fixes.
Token-spend rough estimate: Phase 3.5 should be ~50-60% the size of Phase 3 (smaller scope, all reuse, no new schema). Per-task spend similar to Phase 2.5's bundled tasks.
DO NOT break existing v1/v2/v3 surface contracts. Every test file that was green at the start of Phase 3.5 must stay green at the end. The cross-feature integration tests (tests/test_phase3_integration.py) are particularly load-bearing — if any of T80/T82/T83 break them, that's a regression to investigate, NOT to suppress.

39 KiB Raw Blame History Unescape Escape

Roleplay Engine — Phase 3.5 Cleanup Plan

Pre-flight

Backlog item → task mapping

Parallel-Execution Strategy

How to dispatch a wave in parallel

After a wave completes

Conflict prevention checklist

Failure recovery

Why each wave is parallel-safe

Task overview

Wave 1 — Independent small fixes (parallel, 4 tasks)

Task 76: narrate_skip timeout_s plumbed through

Task 77: AddresseeDecision.confidence as Literal type

Task 78: search_memories docstring mentions SQL-bias

Task 79: _witness_role_for defensive None handling

Wave 2 — scene_summarize.py polish bundle (single task)

Task 80: scene_summarize.py polish

80.1 — Re-close suffix bloat guard

80.2 — Thread detection transcript scoping

80.3 — Log swallowed exceptions in detect_threads try/except

80.4 — Scene close closed_at chat-clock semantics

80.5 — T58 test coverage gaps

Verification gates for T80

Wave 3 — Typed exception for skip controllers (single task)

Task 81: ChatNotFoundError for skip route routing

Wave 4 — turns.py wiring (single task)

Task 82: post_turn meanwhile-digest consumption + skip-command scene close

82.1 — Wire consume_pending_meanwhile_digests into post_turn

82.2 — Skip command runs scene close detection

Verification gates for T82

Wave 5 — regenerate.py polish bundle (single task)

Task 83: regenerate.py polish

83.1 — Cancel/stop hook for in-flight regenerate streams

83.2 — DRY: extract recent-dialogue + prior-edges helpers

83.3 — Sibling-discovery query optimization

83.4 — Lifecycle-transition rollback on regenerate

83.5 — Event-detection ordering symmetry

Verification gates for T83

Wave 6 — Final polish (parallel, 3 tasks)

Task 84: Unified record-memory API

Task 85: JSON-build audit + meanwhile stop-button route-level test

85.1 — JSON-build audit (T63)

85.2 — Meanwhile stop-button route-level test

Task 86: Frontend turn_html_replace SSE handler

Wave 7 — Docs sweep (single task)

Task 87: CLAUDE.md backlog burn-down

Wrap-up

Notes for the controller running this plan

39 KiB

Raw Blame History

Task 76: `narrate_skip` timeout_s plumbed through

Task 77: `AddresseeDecision.confidence` as Literal type

Task 78: `search_memories` docstring mentions SQL-bias

Task 79: `_witness_role_for` defensive None handling

Wave 2 — `scene_summarize.py` polish bundle (single task)

80.3 — Log swallowed exceptions in `detect_threads` try/except

80.4 — Scene close `closed_at` chat-clock semantics

Task 81: `ChatNotFoundError` for skip route routing

Wave 4 — `turns.py` wiring (single task)

82.1 — Wire `consume_pending_meanwhile_digests` into post_turn

Wave 5 — `regenerate.py` polish bundle (single task)

Task 86: Frontend `turn_html_replace` SSE handler