Files
chat/docs/audits/2026-04-27-project-callers.md
Joseph Doherty 3a81e540a1 chore: audit project() callers and non-idempotent handlers
Same defect class as 0f8bf94: routes that ``append_event`` then
``project(conn)`` 500 once any prior event makes the full-log replay
hit a raw-INSERT handler (chat_created, container_created,
scene_opened, memory_written, meanwhile_scene_started, etc.).

Fixes the two remaining live-path callers:
- chat/web/bots.py (bot_create) — bot_authored
- chat/web/settings.py (settings_post) — you_authored

Both swap ``append_event`` + ``project`` → ``append_and_apply`` so only
the new event is applied through its registered handler. Unused
imports of ``append_event`` and ``project`` removed from each file.

The rewind path (chat/services/rewind.py) intentionally calls
``project()`` after wiping every projected table — that's the
canonical "rebuild from log against an empty DB" entry point and is
left unchanged.

Inventory of every projector handler that uses raw INSERT
(chat_created, container_created, scene_opened, memory_written,
meanwhile_scene_started, meanwhile_digest_created, edge_update) is
documented with the trade-offs of why we don't blindly switch them to
INSERT OR REPLACE — for autoincrement-id rows there is no key to match
on, and for chat_created a lossy overwrite would silently clobber
chat_state mutations from later events. The handler layer stays
correctly non-idempotent under event-sourcing semantics; the rule is
enforced at the call site.

Adds a regression test (tests/test_chat_created_non_idempotent.py)
that pins the contract: appending two chat_created events for the same
id and then ``project()``ing a second time MUST raise
``IntegrityError`` on chats.id. Any future "make it idempotent" change
must update the test, forcing a deliberate review.

Suite: 471 passed in 11.82s (was 470 + this regression test).

Report: docs/audits/2026-04-27-project-callers.md
2026-04-27 14:51:49 -04:00

12 KiB
Raw Permalink Blame History

Audit: project() callers and non-idempotent projector handlers

Date: 2026-04-27 Triggering incident: commit 0f8bf94kickoff_post 500'd with sqlite3.IntegrityError: UNIQUE constraint failed: chats.id after a second bot's kickoff. Root cause: the route appended events with append_event() and then called project(conn), which replays the entire event log. The chat_created handler in chat/state/world.py uses raw INSERT INTO chats ... (no OR REPLACE/OR IGNORE), so on a DB that already had a first bot's chat row, the replay re-hit that row and raised.

This audit walks the rest of the live request paths to make sure no other route has the same shape, and inventories every projector handler that uses raw INSERT so the trade-offs are documented for future hardening passes.


Step 1 — project() callers

grep -rn "project(" chat/ --include="*.py" (excluding the definition itself in chat/eventlog/projector.py:17 and the local project_id type variables that the regex doesn't actually catch):

File:line Caller Classification
chat/web/bots.py:113 bot_create route — append bot_authored then project(conn) Unsafe (live path) — fixed
chat/web/settings.py:44 settings_post route — append you_authored then project(conn) Unsafe (live path) — fixed
chat/services/rewind.py:110 execute_rewind — clears every projected table then re-projects from the truncated log Safe (replay-only)
chat/eventlog/projector.py:17 Definition site, not a call n/a
tests/test_*.py (~50 tests) Test setup pattern: append a sequence of synthetic events into a fresh DB, then project(conn) to materialise Safe (replay-only) — projects against an empty/fresh DB; not a live request path

Safe (replay-only)

  • chat/services/rewind.py:110execute_rewind is the canonical "rebuild the projection" entry point. Lines 95104 explicitly DELETE FROM every projected table (memories, activity, scenes, containers, chat_state, chats, edges, bots, you_entity, classifier_failures) before calling project(conn). The handler registry then walks the truncated log against empty tables, so even the raw-INSERT handlers run safely on a clean slate. The module docstring (lines 121) calls out exactly why a full replay (rather than a "revert delta") is the right move here: the edge_update handler is a delta accumulator with no clean inverse. Do not change.

  • Test suite — every from chat.eventlog.projector import project in tests/ is a setup helper. They open a fresh in-memory or tmp-path DB, append a hand-crafted sequence of events, and call project(conn) once. There is no second-replay risk because the DB starts empty. These are not live paths.

Unsafe (live-path) — fixed in this audit

Both fixes follow the pattern established by 0f8bf94: drop the append_event + project pair in favour of append_and_apply (defined in chat/eventlog/log.py:32), which appends and runs only the brand-new event through its registered handler.

  • chat/web/bots.py:113bot_create Was: append_event(conn, kind="bot_authored", ...); project(conn). Now: append_and_apply(conn, kind="bot_authored", ...). In isolation, _apply_bot_authored is itself idempotent (INSERT OR REPLACE INTO bots), so the route didn't fail today. The bug is latent: as soon as any kickoff ran first (which produces chat_created events), the next call to bot_create would replay that prior chat_created and trip the same UNIQUE constraint. We saw this happen in 0f8bf94 — fixing the symmetric route prevents the next variant of the same incident. Removed unused imports: append_event, project.

  • chat/web/settings.py:44settings_post Was: append_event(conn, kind="you_authored", ...); project(conn). Now: append_and_apply(conn, kind="you_authored", ...). Same shape as bot_create. _apply_you_authored is idempotent on its own (INSERT OR REPLACE INTO you_entity), but project() walks the whole log, including any chat_created / container_created / scene_opened events that have accumulated. Editing the user's own settings on a non-empty DB would 500 with the same UNIQUE constraint error — not because the new event is unsafe, but because the replay is. Fixed by per-event apply. Removed unused imports: append_event, project.

Unsafe — still to fix

None. The two unsafe live-path call sites identified above were both fixed in this commit. Future hardening: a CI lint that flags project( outside chat/services/rewind.py and tests/ would catch a regression, but that's out of scope here.


Step 2 — non-idempotent projector handler inventory

Output of grep -n "INSERT INTO\|INSERT OR REPLACE\|INSERT OR IGNORE" chat/state/*.py, classified.

Replay-safe handlers

These either use INSERT OR REPLACE / INSERT OR IGNORE (so a second apply is a no-op or an overwrite of identical data), or are pure UPDATE against rows the prior event created.

Handler File Statement Why safe
_apply_bot_authored chat/state/entities.py:12 INSERT OR REPLACE INTO bots id is the natural PK; replay overwrites with identical payload.
_apply_you_authored chat/state/entities.py:29 INSERT OR REPLACE INTO you_entity Singleton row keyed on id=1.
_apply_activity_change chat/state/world.py:98 INSERT OR REPLACE INTO activity Activity is keyed on entity_id — last write wins is exactly the intended semantics.
_apply_thread_opened chat/state/threads.py:12 INSERT OR IGNORE INTO threads thread_id is the natural PK.
_apply_event_planned chat/state/events.py:16 INSERT OR IGNORE INTO events event_id is the natural PK.
_apply_branch_created chat/state/branches.py:27 INSERT OR IGNORE INTO branches Branch name is unique.
_apply_group_node_initialized chat/state/group_node.py:12 INSERT OR REPLACE INTO group_node One row per chat_id.
_apply_embedding_indexed chat/state/embeddings.py:28 INSERT OR REPLACE INTO embeddings One vector per memory_id.
Pure-UPDATE handlers various — _apply_time_skip_*, _apply_guest_added/_removed, _apply_scene_closed, _apply_memory_significance_set, _apply_memory_pin_changed, _apply_meanwhile_scene_closed, _apply_meanwhile_digest_consumed, _apply_thread_updated, _apply_event_started/_completed/_cancelled (etc.), _apply_group_node_updated n/a Idempotent: re-applying the same UPDATE produces the same row state.

Unsafe-on-replay handlers (raw INSERT)

Handler File Statement Failure mode on replay
_apply_chat_created chat/state/world.py:14 INSERT INTO chats, INSERT INTO chat_state chats.id is PK — second insert raises IntegrityError: UNIQUE constraint failed: chats.id. This is the 0f8bf94 bug. chat_state.chat_id is also unique; would raise too.
_apply_container_created chat/state/world.py:78 INSERT INTO containers containers.id is INTEGER PRIMARY KEY AUTOINCREMENT — replay does NOT raise (a new id is assigned), but it silently creates a duplicate row, fragmenting downstream lookups by (chat_id, name). Silent corruption, not a crash.
_apply_scene_opened chat/state/world.py:115 INSERT INTO scenes Same shape: autoincrement id. Replay creates a duplicate scene row and re-points chat_state.active_scene_id to the new copy. Silent corruption.
_apply_memory_written chat/state/memory.py:14 INSERT INTO memories Autoincrement id. Replay duplicates the memory; FTS5 trigger then double-indexes the same pov_summary. Silent corruption + double-counting in retrieval.
_apply_meanwhile_scene_started chat/state/meanwhile.py:29 INSERT INTO scenes (with explicit scene_id) Caller supplies scene_id (deterministic). Replay raises IntegrityError: UNIQUE constraint failed: scenes.id. Hard crash, like chat_created.
_apply_meanwhile_digest_created chat/state/meanwhile.py:67 INSERT INTO meanwhile_digest_pending Autoincrement id. Replay creates a duplicate pending digest, surfacing the same summary twice in the next you-scene's prompt. Silent corruption.
_apply_edge_update chat/state/edges.py:12 INSERT OR IGNORE INTO edges followed by UPDATE … SET affinity = ? + delta The INSERT OR IGNORE is fine, but the handler is delta-shaped — each replay re-adds affinity_delta and trust_delta, and re-extends knowledge_json. Silent corruption: scores drift up; knowledge facts duplicate. Already called out in chat/eventlog/log.py:39-46 as the canonical reason append_and_apply exists.

Trade-offs — why we are NOT switching every handler to INSERT OR REPLACE

This is the part the audit is here to nail down before someone "fixes it" with a one-line s/INSERT INTO/INSERT OR REPLACE INTO/.

  1. Autoincrement-id handlers (containers, scenes, memories, meanwhile_digest_pending)INSERT OR REPLACE doesn't help. Each event's payload doesn't carry the row's eventual id — the id comes from lastrowid at projection time. There is no key for OR REPLACE to match on. The fix here is either (a) make the event carry a deterministic id derived from the event's own id (large refactor — payload schemas, downstream FK lookups, FTS rowid alignment), or (b) keep the handler raw-INSERT and ensure every live path uses append_and_apply (the path we're on). We are on path (b), and this audit makes it explicit.

  2. chat_createdchats.id IS keyed on the natural PK, so INSERT OR REPLACE INTO chats ... would technically work for the chat row. But it would silently overwrite chat_state columns that other events legitimately mutate later: chat_state.time is bumped by time_skip_elision, active_scene_id is set/cleared by scene_opened/scene_closed. On replay the chat_created overwrite would clobber those subsequent updates, then later events would re-set them — if the events themselves appear in order (they do today). It would work in practice, but it would erase the invariant that "each handler is responsible for one table-shape change" and make the projector's correctness depend on strict event-order replay through chat_state. Not worth the subtle coupling; keep the raw INSERT and treat replay as an explicit "wipe + replay" operation (the rewind path does exactly that).

  3. meanwhile_scene_started — could be made idempotent (the payload supplies scene_id), but it shares the scenes table with _apply_scene_opened (autoincrement) — making one half of the table writers OR REPLACE and the other half raw-INSERT is asking for a future bug. Keep both raw, lean on append_and_apply.

  4. edge_update — fundamentally cannot be made idempotent under replay without either changing the event schema (carry absolute values, not deltas) or recording per-event-id "already applied" flags. Either is a multi-week project. The current contract is "edge_update is a delta event; never apply it twice"; the append_and_apply rule enforces that contract from the call site.

Conclusion: the handler layer is correctly non-idempotent for event-sourcing semantics. The defect class lives in the caller layer (routes that mistakenly call project() instead of append_and_apply). This audit fixes the two known offenders and pins the contract with a regression test (see Step 3).


Step 3 — regression test

Added tests/test_chat_created_non_idempotent.py. The test:

  1. Opens a fresh DB and runs the migration chain.
  2. Appends one chat_created event and projects — first projection succeeds.
  3. Appends a second chat_created for the same chat id and projects again — asserts that the second projection raises sqlite3.IntegrityError.

The point isn't that the test catches a future "make it idempotent" change automatically; it's that any such change MUST update this test, forcing a deliberate review of all the trade-offs documented above.


Files changed

  • chat/web/bots.py — swap append_event+projectappend_and_apply, drop unused imports.
  • chat/web/settings.py — same swap.
  • tests/test_chat_created_non_idempotent.py — new regression test.
  • docs/audits/2026-04-27-project-callers.md — this file.