Three gaps left by T58's initial test coverage:
* test_key_quote_truncation_at_200_chars — exercises the 200-char hard
slice in _build_key_quotes_suffix so any future change to the
truncation strategy (ellipsis, word boundary, etc) trips the test.
* test_thread_detection_update_candidate_emits_thread_updated —
pins the ``update`` action emission shape (thread_id, summary,
last_referenced_scene_id).
* test_thread_detection_close_candidate_emits_thread_closed — pins
the ``close`` action emission shape (thread_id, closed_at).
No production change; pure coverage add.
T58 stamped emitted ``thread_closed`` events with
``datetime.now(timezone.utc).isoformat()``. The rest of the close
pipeline (memories.chat_clock_at, scene_closed.ended_at, edge writes)
uses the chat's in-world clock. Threads must agree so timeline
reconstruction stays consistent under time skips and replay.
Read ``chat["time"]`` (already loaded for the per-POV path) and pass
it through as ``closed_at``. Falls back to UTC now only when chat_state
has no clock yet — defensive; chat_created always seeds it.
Adds test_thread_closed_uses_chat_clock_time.
The broad ``except Exception`` around detect_threads silently dropped
programmer errors (wrong kwargs, import-time failures, etc), making
diagnostics painful. Log at DEBUG with full exc_info so the failure
surfaces in local logs without breaking the close pipeline's
failure-tolerant contract.
Adds test_detect_threads_failure_is_logged using caplog.
apply_scene_close_summary fed detect_threads the chat-wide last-50
turns. When a chat has accumulated multiple scenes' worth of dialogue,
that bleeds prior-scene turns into the second close's classifier prompt
and risks mis-attributing threads (closing one that opened earlier,
re-opening one that already closed).
Add an optional ``since_event_id`` kwarg to ``_read_recent_dialogue``
that lower-bounds by event_log id, plus a ``_scene_opened_event_id``
helper that resolves the scene-open event for a given scene_id. Wire
both into the thread-detection call site so its scene_transcript
holds only the closing scene's turns. The per-POV summarizer keeps the
chat-wide approximation it had before — that's intentional.
Adds test_thread_detection_uses_scene_scoped_transcript.
Re-running apply_scene_close_summary on the same scene previously caused
recursive bloat: _build_key_quotes_suffix sourced quote text from
memories.pov_summary, which after the first close already carried a
"Key quotes:" suffix. The next close would then quote the quotes,
nesting deeper each time.
Strip any existing suffix from candidate text before truncating to
200 chars in the suffix builder, and from the fresh classifier output
before composing the new value in _summarize_and_apply_for_witness so
the rewrite replaces rather than stacks.
Adds test_scene_close_re_run_does_not_double_suffix.
The kickoff parse-and-confirm route was 500-ing intermittently because
Hermes-3 + Featherless's response_format={"type":"json_object"} only
guarantees JSON output, NOT a particular schema. The model was inventing
its own field names (sceneTime, entities, settingDetails) instead of
the KickoffParse fields, causing Pydantic validation to fail on both
classify() retries.
Three changes:
1. Include the Pydantic JSON schema in the system prompt so the model
knows exactly which keys to produce. Affects every classify() call
(kickoff parse, turn parse, scene-close detect, significance,
state-update, scene summarize). Strip ```json fences if the model
wraps its output. Bump retries 2 → 3 (model is stochastic; one extra
attempt closes most of the remaining gap).
2. parse_kickoff() now passes a default empty KickoffParse so the
route degrades to a fillable form instead of 500 when the classifier
ultimately fails. The confirm form is the human-in-the-loop; an
empty form is strictly better UX than a stack trace.
3. Tests updated: bumped canned-failure arrays from 2 → 3 entries to
match the new attempt count; renamed kickoff test from
"raises_when_classifier_fails_twice" to
"falls_back_to_empty_when_classifier_fails" reflecting the new
degraded-but-usable behavior.
Verified live with all 3 sample bots (maya/eli/sam) — kickoff route
returns 200 across multiple attempts. Full suite: 168 passed.