Files

T

Joseph Doherty 3a81e540a1 chore: audit project() callers and non-idempotent handlers

Same defect class as 0f8bf94: routes that ``append_event`` then
``project(conn)`` 500 once any prior event makes the full-log replay
hit a raw-INSERT handler (chat_created, container_created,
scene_opened, memory_written, meanwhile_scene_started, etc.).

Fixes the two remaining live-path callers:
- chat/web/bots.py (bot_create) — bot_authored
- chat/web/settings.py (settings_post) — you_authored

Both swap ``append_event`` + ``project`` → ``append_and_apply`` so only
the new event is applied through its registered handler. Unused
imports of ``append_event`` and ``project`` removed from each file.

The rewind path (chat/services/rewind.py) intentionally calls
``project()`` after wiping every projected table — that's the
canonical "rebuild from log against an empty DB" entry point and is
left unchanged.

Inventory of every projector handler that uses raw INSERT
(chat_created, container_created, scene_opened, memory_written,
meanwhile_scene_started, meanwhile_digest_created, edge_update) is
documented with the trade-offs of why we don't blindly switch them to
INSERT OR REPLACE — for autoincrement-id rows there is no key to match
on, and for chat_created a lossy overwrite would silently clobber
chat_state mutations from later events. The handler layer stays
correctly non-idempotent under event-sourcing semantics; the rule is
enforced at the call site.

Adds a regression test (tests/test_chat_created_non_idempotent.py)
that pins the contract: appending two chat_created events for the same
id and then ``project()``ing a second time MUST raise
``IntegrityError`` on chats.id. Any future "make it idempotent" change
must update the test, forcing a deliberate review.

Suite: 471 passed in 11.82s (was 470 + this regression test).

Report: docs/audits/2026-04-27-project-callers.md

2026-04-27 14:51:49 -04:00

12 KiB

Raw Permalink Blame History

Audit: `project()` callers and non-idempotent projector handlers

Date: 2026-04-27 Triggering incident: commit 0f8bf94 — kickoff_post 500'd with sqlite3.IntegrityError: UNIQUE constraint failed: chats.id after a second bot's kickoff. Root cause: the route appended events with append_event() and then called project(conn), which replays the entire event log. The chat_created handler in chat/state/world.py uses raw INSERT INTO chats ... (no OR REPLACE/OR IGNORE), so on a DB that already had a first bot's chat row, the replay re-hit that row and raised.

This audit walks the rest of the live request paths to make sure no other route has the same shape, and inventories every projector handler that uses raw INSERT so the trade-offs are documented for future hardening passes.

Step 1 — `project()` callers

grep -rn "project(" chat/ --include="*.py" (excluding the definition itself in chat/eventlog/projector.py:17 and the local project_id type variables that the regex doesn't actually catch):

File:line	Caller	Classification
`chat/web/bots.py:113`	`bot_create` route — append `bot_authored` then `project(conn)`	Unsafe (live path) — fixed
`chat/web/settings.py:44`	`settings_post` route — append `you_authored` then `project(conn)`	Unsafe (live path) — fixed
`chat/services/rewind.py:110`	`execute_rewind` — clears every projected table then re-projects from the truncated log	Safe (replay-only)
`chat/eventlog/projector.py:17`	Definition site, not a call	n/a
`tests/test_*.py` (~50 tests)	Test setup pattern: append a sequence of synthetic events into a fresh DB, then `project(conn)` to materialise	Safe (replay-only) — projects against an empty/fresh DB; not a live request path

Safe (replay-only)

chat/services/rewind.py:110 — execute_rewind is the canonical "rebuild the projection" entry point. Lines 95–104 explicitly DELETE FROM every projected table (memories, activity, scenes, containers, chat_state, chats, edges, bots, you_entity, classifier_failures) before calling project(conn). The handler registry then walks the truncated log against empty tables, so even the raw-INSERT handlers run safely on a clean slate. The module docstring (lines 1–21) calls out exactly why a full replay (rather than a "revert delta") is the right move here: the edge_update handler is a delta accumulator with no clean inverse. Do not change.
Test suite — every from chat.eventlog.projector import project in tests/ is a setup helper. They open a fresh in-memory or tmp-path DB, append a hand-crafted sequence of events, and call project(conn) once. There is no second-replay risk because the DB starts empty. These are not live paths.

Unsafe (live-path) — fixed in this audit

Both fixes follow the pattern established by 0f8bf94: drop the append_event + project pair in favour of append_and_apply (defined in chat/eventlog/log.py:32), which appends and runs only the brand-new event through its registered handler.

chat/web/bots.py:113 — bot_create Was: append_event(conn, kind="bot_authored", ...); project(conn). Now: append_and_apply(conn, kind="bot_authored", ...). In isolation, _apply_bot_authored is itself idempotent (INSERT OR REPLACE INTO bots), so the route didn't fail today. The bug is latent: as soon as any kickoff ran first (which produces chat_created events), the next call to bot_create would replay that prior chat_created and trip the same UNIQUE constraint. We saw this happen in 0f8bf94 — fixing the symmetric route prevents the next variant of the same incident. Removed unused imports: append_event, project.
chat/web/settings.py:44 — settings_post Was: append_event(conn, kind="you_authored", ...); project(conn). Now: append_and_apply(conn, kind="you_authored", ...). Same shape as bot_create. _apply_you_authored is idempotent on its own (INSERT OR REPLACE INTO you_entity), but project() walks the whole log, including any chat_created / container_created / scene_opened events that have accumulated. Editing the user's own settings on a non-empty DB would 500 with the same UNIQUE constraint error — not because the new event is unsafe, but because the replay is. Fixed by per-event apply. Removed unused imports: append_event, project.

Unsafe — still to fix

None. The two unsafe live-path call sites identified above were both fixed in this commit. Future hardening: a CI lint that flags project( outside chat/services/rewind.py and tests/ would catch a regression, but that's out of scope here.

Step 2 — non-idempotent projector handler inventory

Output of grep -n "INSERT INTO\|INSERT OR REPLACE\|INSERT OR IGNORE" chat/state/*.py, classified.

Replay-safe handlers

These either use INSERT OR REPLACE / INSERT OR IGNORE (so a second apply is a no-op or an overwrite of identical data), or are pure UPDATE against rows the prior event created.

Handler	File	Statement	Why safe
`_apply_bot_authored`	`chat/state/entities.py:12`	`INSERT OR REPLACE INTO bots`	`id` is the natural PK; replay overwrites with identical payload.
`_apply_you_authored`	`chat/state/entities.py:29`	`INSERT OR REPLACE INTO you_entity`	Singleton row keyed on `id=1`.
`_apply_activity_change`	`chat/state/world.py:98`	`INSERT OR REPLACE INTO activity`	Activity is keyed on `entity_id` — last write wins is exactly the intended semantics.
`_apply_thread_opened`	`chat/state/threads.py:12`	`INSERT OR IGNORE INTO threads`	`thread_id` is the natural PK.
`_apply_event_planned`	`chat/state/events.py:16`	`INSERT OR IGNORE INTO events`	`event_id` is the natural PK.
`_apply_branch_created`	`chat/state/branches.py:27`	`INSERT OR IGNORE INTO branches`	Branch `name` is unique.
`_apply_group_node_initialized`	`chat/state/group_node.py:12`	`INSERT OR REPLACE INTO group_node`	One row per `chat_id`.
`_apply_embedding_indexed`	`chat/state/embeddings.py:28`	`INSERT OR REPLACE INTO embeddings`	One vector per `memory_id`.
Pure-`UPDATE` handlers	various — `_apply_time_skip_*`, `_apply_guest_added`/`_removed`, `_apply_scene_closed`, `_apply_memory_significance_set`, `_apply_memory_pin_changed`, `_apply_meanwhile_scene_closed`, `_apply_meanwhile_digest_consumed`, `_apply_thread_updated`, `_apply_event_started`/`_completed`/`_cancelled` (etc.), `_apply_group_node_updated`	n/a	Idempotent: re-applying the same UPDATE produces the same row state.

Unsafe-on-replay handlers (raw `INSERT`)

Handler	File	Statement	Failure mode on replay
`_apply_chat_created`	`chat/state/world.py:14`	`INSERT INTO chats`, `INSERT INTO chat_state`	`chats.id` is PK — second insert raises `IntegrityError: UNIQUE constraint failed: chats.id`. This is the `0f8bf94` bug. `chat_state.chat_id` is also unique; would raise too.
`_apply_container_created`	`chat/state/world.py:78`	`INSERT INTO containers`	`containers.id` is `INTEGER PRIMARY KEY AUTOINCREMENT` — replay does NOT raise (a new id is assigned), but it silently creates a duplicate row, fragmenting downstream lookups by `(chat_id, name)`. Silent corruption, not a crash.
`_apply_scene_opened`	`chat/state/world.py:115`	`INSERT INTO scenes`	Same shape: autoincrement `id`. Replay creates a duplicate scene row and re-points `chat_state.active_scene_id` to the new copy. Silent corruption.
`_apply_memory_written`	`chat/state/memory.py:14`	`INSERT INTO memories`	Autoincrement `id`. Replay duplicates the memory; FTS5 trigger then double-indexes the same `pov_summary`. Silent corruption + double-counting in retrieval.
`_apply_meanwhile_scene_started`	`chat/state/meanwhile.py:29`	`INSERT INTO scenes` (with explicit `scene_id`)	Caller supplies `scene_id` (deterministic). Replay raises `IntegrityError: UNIQUE constraint failed: scenes.id`. Hard crash, like `chat_created`.
`_apply_meanwhile_digest_created`	`chat/state/meanwhile.py:67`	`INSERT INTO meanwhile_digest_pending`	Autoincrement `id`. Replay creates a duplicate pending digest, surfacing the same summary twice in the next you-scene's prompt. Silent corruption.
`_apply_edge_update`	`chat/state/edges.py:12`	`INSERT OR IGNORE INTO edges` followed by `UPDATE … SET affinity = ? + delta`	The `INSERT OR IGNORE` is fine, but the handler is delta-shaped — each replay re-adds `affinity_delta` and `trust_delta`, and re-extends `knowledge_json`. Silent corruption: scores drift up; knowledge facts duplicate. Already called out in `chat/eventlog/log.py:39-46` as the canonical reason `append_and_apply` exists.

Trade-offs — why we are NOT switching every handler to `INSERT OR REPLACE`

This is the part the audit is here to nail down before someone "fixes it" with a one-line s/INSERT INTO/INSERT OR REPLACE INTO/.

Autoincrement-id handlers (containers, scenes, memories, meanwhile_digest_pending) — INSERT OR REPLACE doesn't help. Each event's payload doesn't carry the row's eventual id — the id comes from lastrowid at projection time. There is no key for OR REPLACE to match on. The fix here is either (a) make the event carry a deterministic id derived from the event's own id (large refactor — payload schemas, downstream FK lookups, FTS rowid alignment), or (b) keep the handler raw-INSERT and ensure every live path uses append_and_apply (the path we're on). We are on path (b), and this audit makes it explicit.
chat_created — chats.id IS keyed on the natural PK, so INSERT OR REPLACE INTO chats ... would technically work for the chat row. But it would silently overwrite chat_state columns that other events legitimately mutate later: chat_state.time is bumped by time_skip_elision, active_scene_id is set/cleared by scene_opened/scene_closed. On replay the chat_created overwrite would clobber those subsequent updates, then later events would re-set them — if the events themselves appear in order (they do today). It would work in practice, but it would erase the invariant that "each handler is responsible for one table-shape change" and make the projector's correctness depend on strict event-order replay through chat_state. Not worth the subtle coupling; keep the raw INSERT and treat replay as an explicit "wipe + replay" operation (the rewind path does exactly that).
meanwhile_scene_started — could be made idempotent (the payload supplies scene_id), but it shares the scenes table with _apply_scene_opened (autoincrement) — making one half of the table writers OR REPLACE and the other half raw-INSERT is asking for a future bug. Keep both raw, lean on append_and_apply.
edge_update — fundamentally cannot be made idempotent under replay without either changing the event schema (carry absolute values, not deltas) or recording per-event-id "already applied" flags. Either is a multi-week project. The current contract is "edge_update is a delta event; never apply it twice"; the append_and_apply rule enforces that contract from the call site.

Conclusion: the handler layer is correctly non-idempotent for event-sourcing semantics. The defect class lives in the caller layer (routes that mistakenly call project() instead of append_and_apply). This audit fixes the two known offenders and pins the contract with a regression test (see Step 3).

Step 3 — regression test

Added tests/test_chat_created_non_idempotent.py. The test:

Opens a fresh DB and runs the migration chain.
Appends one chat_created event and projects — first projection succeeds.
Appends a second chat_created for the same chat id and projects again — asserts that the second projection raises sqlite3.IntegrityError.

The point isn't that the test catches a future "make it idempotent" change automatically; it's that any such change MUST update this test, forcing a deliberate review of all the trade-offs documented above.

Files changed

chat/web/bots.py — swap append_event+project → append_and_apply, drop unused imports.
chat/web/settings.py — same swap.
tests/test_chat_created_non_idempotent.py — new regression test.
docs/audits/2026-04-27-project-callers.md — this file.

12 KiB Raw Permalink Blame History Unescape Escape

Audit: project() callers and non-idempotent projector handlers

Step 1 — project() callers