Phase 2: multi-entity scene support (you + host + guest) #2

Merged
dohertj2 merged 29 commits from phase-2 into main 2026-04-26 20:00:17 -04:00

Summary

Phase 2 adds 3-entity scene support to the roleplay engine: a guest bot can be added to a host's chat, with up to 3 entities present (you + host + guest). Built on top of Phase 1's event-sourced architecture — most schema changes are additive.

Note: this PR currently includes all of Phase 1 (PR #1). Once #1 merges, this PR auto-narrows to just Phase 2 deltas (~4.2k lines, 13 commits).

What shipped (13 tasks, 6 waves)

  • Wave 1 — schema & seed: group_node table + projector handlers; guest_added / guest_removed events; relationship-seed classifier ("have they met?")
  • Wave 2 — services: interjection classifier, multi-pair state-update coordinator (6 directed pairs with 3 entities), multi-witness memory write helper
  • Wave 3 — drawer: add/remove guest from drawer; renders guest section, group node summary, "Add guest" form with relationship prose
  • Wave 4a — prompt + close: assemble_narrative_prompt accepts guest_id (auto-detected from chat.guest_bot_id), adds guest activity + group dynamic blocks; per-POV scene close summaries for each present witness; group_node updated on close
  • Wave 4b — turn flow: post_turn rewrite — addressee detection (substring match), narrative for addressee, multi-witness memory writes, 6-pair state updates, conditional interjection beat from silent witness; regenerate.py mirrors changes (interjection regenerate deferred to 2.5)
  • Wave 5 — polish: witness filter cross-entity test coverage, bot_reset cascades to clear chats.guest_bot_id references in other chats, Phase 2 status + 2.5 backlog in CLAUDE.md

Architecture notes

  • Event-sourced: state changes flow through append_and_apply for the new event kinds. No projector replay regressions.
  • Backward compatible: when chat.guest_bot_id is None, all flows reduce exactly to Phase 1 behavior. All 168 Phase 1 tests pass unmodified.
  • Featherless 2-connection cap respected: state updates run sequentially within a turn (no asyncio.gather).
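
A minimal sketch of where the "6 directed pairs" figure comes from and what running the updates sequentially looks like. `directed_pairs`, `update_edge`, and `run_state_updates` are hypothetical names for illustration, not the coordinator's actual API.

```python
import asyncio
from itertools import permutations


def directed_pairs(entities):
    """Return every ordered (source, target) pair of distinct entities."""
    return list(permutations(entities, 2))


async def update_edge(source, target):
    # Hypothetical stand-in for one state-update classifier call.
    await asyncio.sleep(0)
    return f"{source}->{target}"


async def run_state_updates(entities):
    results = []
    # Await one pair at a time instead of using asyncio.gather, so the
    # state-update pass never holds more than one connection at once.
    for source, target in directed_pairs(entities):
        results.append(await update_edge(source, target))
    return results


updates = asyncio.run(run_state_updates(["you", "host", "guest"]))
print(len(updates), updates[0])
```

With two entities the same helper yields 2 pairs, which matches the Phase 1 case where no guest is present.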

Test plan

  • [x] Full pytest suite: 212 passing (168 Phase 1 + 44 Phase 2). 0 failures, 0 skips.
  • [x] Each task verified in its own worktree before merge to phase-2.
  • [x] Each task reviewed for spec compliance + code quality before merge.
  • [ ] Manual smoke test (recommended before merging this PR):
    • [ ] Add a guest to one of the seeded bots' chats via the drawer
    • [ ] Verify "have they met?" prose seeds inter-bot edges (host->guest summary populated)
    • [ ] Play a few turns; verify host responds normally; verify guest occasionally interjects (should be rare)
    • [ ] Close the scene; check drawer for two distinct per-POV summaries
    • [ ] Remove guest mid-scene; verify scene_closed fires
    • [ ] Reset a guest bot from another chat; verify guest_bot_id reference clears

Phase 2.5 backlog (tracked in CLAUDE.md)

  • Interjection regenerate UI
  • Classifier-based addressee detection
  • LLM-merged group meta-summary (currently naive concat)
  • "First-meeting gate" — drawer's "have they met?" textarea fires every time
  • Witness flag editing in drawer (currently read-only)
  • Significance for interjection memories
  • Stale guest reference defensive degrade in post_turn (T44 added; T47 fixes root cause — degrade can be removed)
  • Scene close on cancel review
  • Dual ACTIVITIES: block consolidation in prompt
  • chat/services/prompt.py:436 hardcodes witness_role="host" regardless of speaker

Plan

docs/plans/2026-04-26-v2-phase2-implementation.md (committed in b833589).

dohertj2 added 72 commits 2026-04-26 16:44:20 -04:00
- .gitignore: add *.egg-info/ so editable installs don't show in git status.
- pyproject.toml: add [build-system] and [tool.setuptools.packages.find]
  scoped to chat*, fixing pip install -e . which was failing on data/
  auto-discovery.
- CLAUDE.md: add Phase 1.5 cleanup backlog section under Phase 1 status,
  capturing the small follow-ups surfaced in implementer reviews
  (open_db refactor, regenerate SSE broadcast, you-activity purge,
  drawer edits for deferred fields, NICE trim order).
Idempotent seeder for three sample bots (Maya — coworker slow-burn,
Eli — live-in partner, Sam — bartender / new connection). Each is a
distinct relational archetype to exercise the system from different
angles. Run from repo root:

    .venv/bin/python scripts/seed_sample_bots.py

Re-running skips ids that already exist. After seeding, walk each bot
through kickoff parse-and-confirm at /bots/<id>/kickoff.
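
A rough sketch of the skip-if-present pattern described above, using an in-memory SQLite table. The `bots` table, its columns, and `seed_bots` are assumptions for illustration; the real script may write through the event log rather than inserting rows directly.

```python
import sqlite3

# Ids and names taken from the commit message; everything else is hypothetical.
SAMPLE_BOTS = [("maya", "Maya"), ("eli", "Eli"), ("sam", "Sam")]


def seed_bots(conn, bots):
    """Insert bots whose id is not present yet; return the ids inserted."""
    conn.execute("CREATE TABLE IF NOT EXISTS bots (id TEXT PRIMARY KEY, name TEXT)")
    inserted = []
    for bot_id, name in bots:
        exists = conn.execute("SELECT 1 FROM bots WHERE id = ?", (bot_id,)).fetchone()
        if exists:
            continue  # re-running skips ids that already exist
        conn.execute("INSERT INTO bots (id, name) VALUES (?, ?)", (bot_id, name))
        inserted.append(bot_id)
    conn.commit()
    return inserted


conn = sqlite3.connect(":memory:")
first_run = seed_bots(conn, SAMPLE_BOTS)
second_run = seed_bots(conn, SAMPLE_BOTS)
print(first_run, second_run)
```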
The kickoff parse-and-confirm route was 500-ing intermittently because
Hermes-3 + Featherless's response_format={"type":"json_object"} only
guarantees JSON output, NOT a particular schema. The model was inventing
its own field names (sceneTime, entities, settingDetails) instead of
the KickoffParse fields, causing Pydantic validation to fail on both
classify() retries.

Three changes:

1. Include the Pydantic JSON schema in the system prompt so the model
   knows exactly which keys to produce. Affects every classify() call
   (kickoff parse, turn parse, scene-close detect, significance,
   state-update, scene summarize). Strip ```json fences if the model
   wraps its output. Bump retries 2 → 3 (model is stochastic; one extra
   attempt closes most of the remaining gap).

2. parse_kickoff() now passes a default empty KickoffParse so the
   route degrades to a fillable form instead of 500 when the classifier
   ultimately fails. The confirm form is the human-in-the-loop; an
   empty form is strictly better UX than a stack trace.

3. Tests updated: bumped canned-failure arrays from 2 → 3 entries to
   match the new attempt count; renamed kickoff test from
   "raises_when_classifier_fails_twice" to
   "falls_back_to_empty_when_classifier_fails" reflecting the new
   degraded-but-usable behavior.

Verified live with all 3 sample bots (maya/eli/sam) — kickoff route
returns 200 across multiple attempts. Full suite: 168 passed.
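
The three changes can be sketched roughly as follows. This is a stdlib-only illustration: a dataclass stands in for the Pydantic model, `KickoffSketch` and its fields are hypothetical, and `call_model` is a placeholder for the real LLM call.

```python
import json
from dataclasses import dataclass, field, fields

FENCE = "`" * 3  # a Markdown code fence, built this way to keep the sketch readable


@dataclass
class KickoffSketch:
    # Hypothetical fields; the real KickoffParse schema is not shown here.
    setting: str = ""
    characters: list = field(default_factory=list)


def strip_json_fences(text):
    """Remove a surrounding Markdown code fence if the model added one."""
    text = text.strip()
    if text.startswith(FENCE):
        _, _, text = text.partition("\n")  # drop the opening fence line
        text = text.rstrip()
        if text.endswith(FENCE):
            text = text[: -len(FENCE)]
    return text.strip()


def classify(call_model, schema_cls, default=None, attempts=3):
    expected = {f.name for f in fields(schema_cls)}
    # Spell out the expected keys so the model does not invent its own.
    system_prompt = "Reply with one JSON object using exactly these keys: " + ", ".join(sorted(expected))
    for _ in range(attempts):
        raw = call_model(system_prompt)
        try:
            data = json.loads(strip_json_fences(raw))
            if not isinstance(data, dict) or set(data) != expected:
                raise ValueError("unexpected keys")
            return schema_cls(**data)
        except (ValueError, TypeError):
            continue  # stochastic model: try again
    if default is not None:
        return default  # degrade to an empty, fillable result
    raise RuntimeError("classifier failed on every attempt")


replies = iter([
    '{"sceneTime": "night"}',  # wrong keys
    "not json",                # not parseable
    FENCE + 'json\n{"setting": "bar", "characters": ["Sam"]}\n' + FENCE,
])
result = classify(lambda prompt: next(replies), KickoffSketch, default=KickoffSketch())
print(result)
```

Here the third attempt succeeds; if all three had failed, the caller would get the empty default instead of an exception.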
Two related issues blocking real-world use of the kickoff parse:

1. Classifier calls take ~12s end-to-end on Featherless for the
   complex KickoffParse schema (Hermes-3-8B generating ~1.3KB of
   structured JSON). The 10s timeout was firing on most attempts,
   causing all 3 retries to time out and the empty-fallback to render
   with blank form values. Bumping the default
   classifier_timeout_s 10 → 30s gives generous headroom; measured
   p99 is ~13s, so 30s is comfortable.

2. Featherless caps concurrent connections per account (2 on free /
   lower paid tiers). Each turn flow can fire 4–5 calls (parse,
   scene-close detect, narrative stream, two state-update passes)
   plus the background significance worker. Without a gate, we'd
   exceed the cap and fail.

   Added a class-level ``asyncio.Semaphore`` to FeatherlessClient,
   shared across all instances, configured once in lifespan from
   ``Settings.featherless_max_concurrent`` (default 2). Both
   ``generate`` and ``stream`` acquire the semaphore for the duration
   of the call; the stream holds it until the async generator
   completes, so token streaming is correctly accounted for.

Verified live: 4/4 sequential kickoff parses for the same bot all
succeed with real parsed values (previously ~50% blank-fallback).
Full suite: 168 passed.
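
A minimal sketch of the shared-semaphore idea, assuming a `configure` classmethod called once at startup. `GatedClient` and its fake network calls are hypothetical stand-ins for FeatherlessClient; the `in_flight` and `peak` counters exist only to make the cap observable.

```python
import asyncio


class GatedClient:
    """Hypothetical stand-in for FeatherlessClient; only the gating is sketched."""

    _gate = None   # one semaphore shared by every instance
    in_flight = 0
    peak = 0

    @classmethod
    def configure(cls, max_concurrent):
        cls._gate = asyncio.Semaphore(max_concurrent)

    async def generate(self, prompt):
        async with self._gate:
            GatedClient.in_flight += 1
            GatedClient.peak = max(GatedClient.peak, GatedClient.in_flight)
            try:
                await asyncio.sleep(0.01)  # pretend network call
                return prompt.upper()
            finally:
                GatedClient.in_flight -= 1

    async def stream(self, prompt):
        # The semaphore stays held until the generator is exhausted or closed,
        # so a long token stream counts against the cap for its whole duration.
        async with self._gate:
            GatedClient.in_flight += 1
            GatedClient.peak = max(GatedClient.peak, GatedClient.in_flight)
            try:
                for token in prompt.split():
                    await asyncio.sleep(0.01)
                    yield token
            finally:
                GatedClient.in_flight -= 1


async def consume(client, prompt):
    return [token async for token in client.stream(prompt)]


async def main():
    GatedClient.configure(2)
    return await asyncio.gather(
        GatedClient().generate("a"),
        GatedClient().generate("b"),
        consume(GatedClient(), "one two three"),
        consume(GatedClient(), "four five"),
    )


results = asyncio.run(main())
print(results, GatedClient.peak)
```

Four calls are launched together, but no more than two are ever in flight; the other two wait for a slot rather than failing.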
Empty submission was producing a blank user_turn event in the log and
firing the LLM stream anyway — the bot would invent a response from the
kickoff context alone, producing a monologue with no user input. Two-
layer fix:

- Browser: add `required` to the prose textarea in chat.html so the
  form refuses to submit empty.
- Server: 400 in post_turn when prose.strip() is empty. Defense in
  depth — if a client bypasses the textarea attribute (custom UI,
  curl, etc.), the server still rejects.

Verified live: POST with empty body returns 400; POST with whitespace-
only returns 400; chat shell renders the textarea with required.
Full suite: 168 passed.
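
The server-side half of the fix amounts to a check along these lines. This is a framework-free sketch; `EmptyProseError` and `validate_prose` are hypothetical names, and the real route presumably raises its web framework's own 400 response.

```python
class EmptyProseError(ValueError):
    """Raised when a turn is submitted without any visible text."""

    status_code = 400


def validate_prose(prose):
    # Whitespace-only input counts as empty, matching prose.strip() in the fix.
    if not (prose or "").strip():
        raise EmptyProseError("prose cannot be empty")
    return prose


print(validate_prose("hello"))
```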
The form-submit handler in chat.html was setting
``textarea.disabled = true`` synchronously before the browser actually
serialized the form. Disabled form fields are excluded from
submission, so the request body contained ``prose=""`` even when the
user had typed text — which the server (correctly) rejected with the
new empty-prose 400. Net effect: typing "hello" + Send gave a "prose
cannot be empty" error.

Switched to ``readOnly``: same UX (user can't edit while streaming)
but the field IS submitted. The unlock path now also clears the
textarea and refocuses for the next turn.
Bot replies were running long (4 paragraphs of action+dialogue beats
per turn) because we never set max_tokens on the narrative call. Three
changes: two tunable knobs in Settings (set in data/config.toml to
override) and a prompt nudge:

- narrative_max_tokens: int = 400
  Hard cap on each generated response. ~400 tokens ≈ 1–2 short
  paragraphs. Drop to 200 for terse banter, bump to 800+ for longer
  scenes.

- narrative_temperature: float = 0.85
  Sampling temperature. 0.7 = grounded/consistent (slightly stiff),
  0.85 = creative-but-in-character (default), 1.0 = wide variety,
  >1.0 = often off-the-rails.

- prompt closing instruction now nudges: "Keep your response to a
  single beat — one or two short paragraphs at most. Don't monologue;
  leave room for the other person to react."

Both turns.py (post_turn) and regenerate.py forward the params to
client.stream(). FeatherlessClient already passes **params through to
the OpenAI-compat endpoint.

Note: temperature doesn't control length; that is a common
misconception. max_tokens is the actual length cap. Lower temperature
makes word choice more predictable (slightly stiffer voice), not
shorter. Both knobs are useful for different goals.
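
As a rough illustration of how the two knobs might travel from Settings to the stream call: the field names and defaults come from this commit message, while `NarrativeSettings` and `narrative_params` are hypothetical names for the sketch.

```python
from dataclasses import dataclass, replace


@dataclass
class NarrativeSettings:
    # Defaults quoted from the commit message.
    narrative_max_tokens: int = 400
    narrative_temperature: float = 0.85


def narrative_params(settings):
    """Build the keyword arguments forwarded to an OpenAI-compatible stream call."""
    return {
        "max_tokens": settings.narrative_max_tokens,    # hard cap on reply length
        "temperature": settings.narrative_temperature,  # word-choice variety, not length
    }


defaults = narrative_params(NarrativeSettings())
terse = narrative_params(replace(NarrativeSettings(), narrative_max_tokens=200))
print(defaults, terse)
```

A caller would then pass these along as something like `client.stream(prompt, **params)`, assuming the client forwards extra keyword arguments as described.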
13 tasks across 6 waves (1, 2, 3, 4a, 4b, 5). Designed for parallel
subagent execution where file-disjointness allows.

Waves 1, 2, 4a, and 5 each contain 2-3 tasks that touch disjoint files
and can be dispatched concurrently via the Agent tool with
isolation: "worktree". Waves 3 (drawer guest support) and 4b (multi-
entity turn flow) are single-task because they touch hot files
(_drawer.html, turns.py) that cannot be safely co-modified.

Plan covers:
- T36: group_node schema + handlers (new migration 0008)
- T37: guest_added / guest_removed event handlers (modifies world.py)
- T38: relationship-seed service ("have they met?")
- T39: interjection classifier service
- T40: multi-entity state-update coordinator (6 directed pairs)
- T41: multi-witness memory write helper
- T42: drawer guest add/remove UI + render
- T43: multi-entity prompt assembly (extends T18)
- T44: multi-entity turn flow (rewrites post_turn)
- T45: multi-entity per-POV summaries on scene close
- T46: witness filter cross-coverage tests
- T47: bot_reset cascades to guest references
- T48: Phase 2 documentation update

Plan also documents:
- Worktree-per-subagent dispatch pattern using Agent isolation flag
- Merge ordering per wave (file-disjointness = conflict-free merges)
- Failure recovery (cancel failed parallel task, re-dispatch as solo)
- Conflict prevention checklist (verify Files sections disjoint per wave)

Tasks file (.tasks.json) carries dependency DAG with `blockedBy` and
`parallelGroup` so a future executing-plans run can dispatch correctly.

NOT EXECUTING. Plan only.
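
One way a `blockedBy` DAG could be turned into dispatch waves is sketched below. The task records and ids are placeholders, not the real contents of .tasks.json, and `parallelGroup` is left out for brevity.

```python
def dispatch_waves(tasks):
    """Group task ids into waves; a task is ready once all its blockers are done."""
    remaining = {task["id"]: set(task.get("blockedBy", [])) for task in tasks}
    done, waves = set(), []
    while remaining:
        ready = sorted(tid for tid, blockers in remaining.items() if blockers <= done)
        if not ready:
            raise ValueError("dependency cycle or unknown blocker")
        waves.append(ready)
        done.update(ready)
        for tid in ready:
            del remaining[tid]
    return waves


# Placeholder tasks: "a" and "b" are independent, "c" needs both, "d" needs "c".
example_tasks = [
    {"id": "a"},
    {"id": "b"},
    {"id": "c", "blockedBy": ["a", "b"]},
    {"id": "d", "blockedBy": ["c"]},
]
waves = dispatch_waves(example_tasks)
print(waves)
```

Tasks inside one wave are only safe to run concurrently if their files are also disjoint, which is what the plan's per-wave checklist is meant to verify.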
Rewrites post_turn for the multi-entity world:

- Addressee detection via case-insensitive whole-word match against the
  guest name; defaults to host on no-match or both-match.
- Multi-entity prompt assembly: forwards guest_id so the prompt sees
  the third party's activity / edges / group-node.
- Multi-witness memory write: record_turn_memory_for_present writes one
  memory per present bot witness when a guest is in the room.
- Multi-pair state-update: compute_state_updates_for_present emits one
  edge_update per directed pair (6 with a guest, 2 without).
- Interjection branch (T39): when a guest is present and the primary
  beat completes, the silent witness may follow on. detect_interjection
  decides; on True we stream a second narrative as the witness, append a
  second assistant_turn linked to the same user_turn_id, and re-run the
  multi-pair state update + memory write for the follow-on beat. Cancel
  collapses both halves; a cancelled interjection skips its downstream
  passes so we don't classifier-spam against a half-formed beat.
- Scene-close runs after both beats so apply_scene_close_summary sees
  the full closing scene; T45's guest-aware summarizer handles per-POV
  rewrites for each present witness.

regenerate.py mirrors the prompt / memory / state-update changes for
1:1 and multi-entity scenes. Per the Phase 2 spec, interjection
regeneration is deferred to Phase 2.5 — regenerate only re-streams the
addressee turn for v2.

Tests: adds 5 cases to tests/test_turn_flow.py covering the no-guest
regression, multi-bot without interjection, multi-bot with interjection,
scene-close per-POV rewrites, and addressee routing when the prose names
a bot. Each test pins its own canned MockLLMClient queue with the call
shape documented in the docstring.
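
The addressee rule described above (case-insensitive whole-word match, defaulting to the host on no match or when both names match) could look roughly like this. `mentions` and `detect_addressee` are hypothetical names, and the real function's signature may differ.

```python
import re


def mentions(name, prose):
    """True if name appears in prose as a whole word, ignoring case."""
    pattern = r"\b" + re.escape(name) + r"\b"
    return re.search(pattern, prose, flags=re.IGNORECASE) is not None


def detect_addressee(prose, host_name, guest_name):
    guest_hit = mentions(guest_name, prose)
    host_hit = mentions(host_name, prose)
    # Only an unambiguous guest mention routes the turn to the guest;
    # no match and both-match fall back to the host.
    if guest_hit and not host_hit:
        return "guest"
    return "host"


print(detect_addressee("sam, what do you think?", "Maya", "Sam"))
```

The whole-word boundary keeps a prose like "Same as usual" from being routed to a guest named Sam, which a plain substring check would get wrong.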
dohertj2 added 1 commit 2026-04-26 16:55:55 -04:00
19 tasks across 8 waves covering events with lifecycles, time skips
(elision + jump), active threads, significance/retrieval refinements,
and meanwhile scenes (host+guest with no 'you'). Mirrors the Phase 2
plan structure: pre-flight, parallel-execution strategy with worktree
isolation, file-disjointness analysis per wave, and per-task TDD spec
with commit messages.

Phase 3 schema: adds 0009_events.sql, 0010_threads.sql,
0011_meanwhile_scenes.sql (final version 11). Builds on Phase 2's
3-entity scene support and event-sourced architecture.
dohertj2 added 1 commit 2026-04-26 17:02:49 -04:00
8 tasks across 5 waves consolidating the 15-item backlog tracked in
CLAUDE.md (5 from Phase 1.5 cleanup + 10 from Phase 2.5/3). Items are
grouped by file ownership so each wave stays file-disjoint:

- Wave 1 (parallel): open_db refactor, bot_reset orphan cleanup,
  LLM-merged group meta-summary
- Wave 2 (single): prompt.py polish — witness role parametric, single
  ACTIVITIES block, NICE trim documented
- Wave 3 (single): drawer polish — deferred v1 edits, first-meeting
  gate, witness flag editing
- Wave 4 (parallel): regenerate.py polish (SSE + interjection
  regenerate + stale-guest cleanup); turn-flow polish + new addressee
  service (classifier addressee + significance for interjection +
  scene-close-on-cancel pinned + stale-guest cleanup)
- Wave 5 (single): docs sweep

No schema migrations. Bundled tasks split into per-item sub-commits
for clean review bisection. Uses task ids T68-T75 to avoid collision
with Phase 3 plan (T49-T67) regardless of merge order.
dohertj2 merged commit 079774dce5 into main 2026-04-26 20:00:17 -04:00

Reference: dohertj2/chat#2