chore: audit project() callers and non-idempotent handlers
Same defect class as 0f8bf94: routes that ``append_event`` then
``project(conn)`` 500 once any prior event makes the full-log replay
hit a raw-INSERT handler (chat_created, container_created,
scene_opened, memory_written, meanwhile_scene_started, etc.).
Fixes the two remaining live-path callers:
- chat/web/bots.py (bot_create) — bot_authored
- chat/web/settings.py (settings_post) — you_authored
Both swap ``append_event`` + ``project`` → ``append_and_apply`` so only
the new event is applied through its registered handler. Unused
imports of ``append_event`` and ``project`` removed from each file.
The rewind path (chat/services/rewind.py) intentionally calls
``project()`` after wiping every projected table — that's the
canonical "rebuild from log against an empty DB" entry point and is
left unchanged.
Inventory of every projector handler that uses raw INSERT
(chat_created, container_created, scene_opened, memory_written,
meanwhile_scene_started, meanwhile_digest_created, edge_update) is
documented with the trade-offs of why we don't blindly switch them to
INSERT OR REPLACE — for autoincrement-id rows there is no key to match
on, and for chat_created a lossy overwrite would silently clobber
chat_state mutations from later events. The handler layer stays
correctly non-idempotent under event-sourcing semantics; the rule is
enforced at the call site.
Adds a regression test (tests/test_chat_created_non_idempotent.py)
that pins the contract: appending two chat_created events for the same
id and then ``project()``ing a second time MUST raise
``IntegrityError`` on chats.id. Any future "make it idempotent" change
must update the test, forcing a deliberate review.
Suite: 471 passed in 11.82s (was 470 + this regression test).
Report: docs/audits/2026-04-27-project-callers.md
This commit is contained in:
@@ -0,0 +1,205 @@
|
||||
# Audit: `project()` callers and non-idempotent projector handlers
|
||||
|
||||
**Date:** 2026-04-27
|
||||
**Triggering incident:** commit `0f8bf94` — `kickoff_post` 500'd with
|
||||
`sqlite3.IntegrityError: UNIQUE constraint failed: chats.id` after a
|
||||
second bot's kickoff. Root cause: the route appended events with
|
||||
`append_event()` and then called `project(conn)`, which replays the
|
||||
*entire* event log. The `chat_created` handler in `chat/state/world.py`
|
||||
uses raw `INSERT INTO chats ...` (no `OR REPLACE`/`OR IGNORE`), so on a
|
||||
DB that already had a first bot's chat row, the replay re-hit that row
|
||||
and raised.
|
||||
|
||||
This audit walks the rest of the live request paths to make sure no
|
||||
other route has the same shape, and inventories every projector handler
|
||||
that uses raw `INSERT` so the trade-offs are documented for future
|
||||
hardening passes.
|
||||
|
||||
---
|
||||
|
||||
## Step 1 — `project()` callers
|
||||
|
||||
`grep -rn "project(" chat/ --include="*.py"` (excluding the definition
|
||||
itself in `chat/eventlog/projector.py:17` and the local `project_id`
|
||||
type variables that the regex doesn't actually catch):
|
||||
|
||||
| File:line | Caller | Classification |
|
||||
|---|---|---|
|
||||
| `chat/web/bots.py:113` | `bot_create` route — append `bot_authored` then `project(conn)` | **Unsafe (live path) — fixed** |
|
||||
| `chat/web/settings.py:44` | `settings_post` route — append `you_authored` then `project(conn)` | **Unsafe (live path) — fixed** |
|
||||
| `chat/services/rewind.py:110` | `execute_rewind` — clears every projected table then re-projects from the truncated log | **Safe (replay-only)** |
|
||||
| `chat/eventlog/projector.py:17` | Definition site, not a call | n/a |
|
||||
| `tests/test_*.py` (~50 tests) | Test setup pattern: append a sequence of synthetic events into a fresh DB, then `project(conn)` to materialise | **Safe (replay-only)** — projects against an empty/fresh DB; not a live request path |
|
||||
|
||||
### Safe (replay-only)
|
||||
|
||||
- **`chat/services/rewind.py:110`** — `execute_rewind` is the canonical
|
||||
"rebuild the projection" entry point. Lines 95–104 explicitly
|
||||
`DELETE FROM` every projected table (`memories`, `activity`, `scenes`,
|
||||
`containers`, `chat_state`, `chats`, `edges`, `bots`, `you_entity`,
|
||||
`classifier_failures`) before calling `project(conn)`. The handler
|
||||
registry then walks the truncated log against empty tables, so even
|
||||
the raw-INSERT handlers run safely on a clean slate. The module
|
||||
docstring (lines 1–21) calls out exactly why a full replay (rather
|
||||
than a "revert delta") is the right move here: the `edge_update`
|
||||
handler is a delta accumulator with no clean inverse. **Do not
|
||||
change.**
|
||||
|
||||
- **Test suite** — every `from chat.eventlog.projector import project`
|
||||
in `tests/` is a setup helper. They open a fresh in-memory or
|
||||
tmp-path DB, append a hand-crafted sequence of events, and call
|
||||
`project(conn)` once. There is no second-replay risk because the DB
|
||||
starts empty. These are not live paths.
|
||||
|
||||
### Unsafe (live-path) — fixed in this audit
|
||||
|
||||
Both fixes follow the pattern established by `0f8bf94`: drop the
|
||||
`append_event` + `project` pair in favour of `append_and_apply` (defined
|
||||
in `chat/eventlog/log.py:32`), which appends and runs *only the
|
||||
brand-new event* through its registered handler.
|
||||
|
||||
- **`chat/web/bots.py:113` — `bot_create`**
|
||||
Was: `append_event(conn, kind="bot_authored", ...); project(conn)`.
|
||||
Now: `append_and_apply(conn, kind="bot_authored", ...)`.
|
||||
In isolation, `_apply_bot_authored` is itself idempotent (`INSERT OR
|
||||
REPLACE INTO bots`), so the *route* didn't fail today. The bug is
|
||||
latent: as soon as any kickoff ran first (which produces
|
||||
`chat_created` events), the next call to `bot_create` would replay
|
||||
that prior `chat_created` and trip the same UNIQUE constraint. We
|
||||
saw this happen in `0f8bf94` — fixing the symmetric route prevents
|
||||
the next variant of the same incident.
|
||||
Removed unused imports: `append_event`, `project`.
|
||||
|
||||
- **`chat/web/settings.py:44` — `settings_post`**
|
||||
Was: `append_event(conn, kind="you_authored", ...); project(conn)`.
|
||||
Now: `append_and_apply(conn, kind="you_authored", ...)`.
|
||||
Same shape as `bot_create`. `_apply_you_authored` is idempotent on
|
||||
its own (`INSERT OR REPLACE INTO you_entity`), but `project()` walks
|
||||
the *whole* log, including any `chat_created` / `container_created`
|
||||
/ `scene_opened` events that have accumulated. Editing the user's
|
||||
own settings on a non-empty DB would 500 with the same UNIQUE
|
||||
constraint error — not because the new event is unsafe, but because
|
||||
the replay is. Fixed by per-event apply.
|
||||
Removed unused imports: `append_event`, `project`.
|
||||
|
||||
### Unsafe — still to fix
|
||||
|
||||
None. The two unsafe live-path call sites identified above were both
|
||||
fixed in this commit. Future hardening: a CI lint that flags
|
||||
`project(` outside `chat/services/rewind.py` and `tests/` would catch a
|
||||
regression, but that's out of scope here.
|
||||
|
||||
---
|
||||
|
||||
## Step 2 — non-idempotent projector handler inventory
|
||||
|
||||
Output of `grep -n "INSERT INTO\|INSERT OR REPLACE\|INSERT OR IGNORE"
|
||||
chat/state/*.py`, classified.
|
||||
|
||||
### Replay-safe handlers
|
||||
|
||||
These either use `INSERT OR REPLACE` / `INSERT OR IGNORE` (so a second
|
||||
apply is a no-op or an overwrite of identical data), or are pure
|
||||
`UPDATE` against rows the prior event created.
|
||||
|
||||
| Handler | File | Statement | Why safe |
|
||||
|---|---|---|---|
|
||||
| `_apply_bot_authored` | `chat/state/entities.py:12` | `INSERT OR REPLACE INTO bots` | `id` is the natural PK; replay overwrites with identical payload. |
|
||||
| `_apply_you_authored` | `chat/state/entities.py:29` | `INSERT OR REPLACE INTO you_entity` | Singleton row keyed on `id=1`. |
|
||||
| `_apply_activity_change` | `chat/state/world.py:98` | `INSERT OR REPLACE INTO activity` | Activity is keyed on `entity_id` — last write wins is exactly the intended semantics. |
|
||||
| `_apply_thread_opened` | `chat/state/threads.py:12` | `INSERT OR IGNORE INTO threads` | `thread_id` is the natural PK. |
|
||||
| `_apply_event_planned` | `chat/state/events.py:16` | `INSERT OR IGNORE INTO events` | `event_id` is the natural PK. |
|
||||
| `_apply_branch_created` | `chat/state/branches.py:27` | `INSERT OR IGNORE INTO branches` | Branch `name` is unique. |
|
||||
| `_apply_group_node_initialized` | `chat/state/group_node.py:12` | `INSERT OR REPLACE INTO group_node` | One row per `chat_id`. |
|
||||
| `_apply_embedding_indexed` | `chat/state/embeddings.py:28` | `INSERT OR REPLACE INTO embeddings` | One vector per `memory_id`. |
|
||||
| Pure-`UPDATE` handlers | various — `_apply_time_skip_*`, `_apply_guest_added`/`_removed`, `_apply_scene_closed`, `_apply_memory_significance_set`, `_apply_memory_pin_changed`, `_apply_meanwhile_scene_closed`, `_apply_meanwhile_digest_consumed`, `_apply_thread_updated`, `_apply_event_started`/`_completed`/`_cancelled` (etc.), `_apply_group_node_updated` | n/a | Idempotent: re-applying the same UPDATE produces the same row state. |
|
||||
|
||||
### Unsafe-on-replay handlers (raw `INSERT`)
|
||||
|
||||
| Handler | File | Statement | Failure mode on replay |
|
||||
|---|---|---|---|
|
||||
| `_apply_chat_created` | `chat/state/world.py:14` | `INSERT INTO chats`, `INSERT INTO chat_state` | `chats.id` is PK — second insert raises `IntegrityError: UNIQUE constraint failed: chats.id`. **This is the `0f8bf94` bug.** `chat_state.chat_id` is also unique; would raise too. |
|
||||
| `_apply_container_created` | `chat/state/world.py:78` | `INSERT INTO containers` | `containers.id` is `INTEGER PRIMARY KEY AUTOINCREMENT` — replay does NOT raise (a new id is assigned), but it silently creates a duplicate row, fragmenting downstream lookups by `(chat_id, name)`. **Silent corruption, not a crash.** |
|
||||
| `_apply_scene_opened` | `chat/state/world.py:115` | `INSERT INTO scenes` | Same shape: autoincrement `id`. Replay creates a duplicate scene row and re-points `chat_state.active_scene_id` to the new copy. **Silent corruption.** |
|
||||
| `_apply_memory_written` | `chat/state/memory.py:14` | `INSERT INTO memories` | Autoincrement `id`. Replay duplicates the memory; FTS5 trigger then double-indexes the same `pov_summary`. **Silent corruption + double-counting in retrieval.** |
|
||||
| `_apply_meanwhile_scene_started` | `chat/state/meanwhile.py:29` | `INSERT INTO scenes` (with explicit `scene_id`) | Caller supplies `scene_id` (deterministic). Replay raises `IntegrityError: UNIQUE constraint failed: scenes.id`. **Hard crash, like `chat_created`.** |
|
||||
| `_apply_meanwhile_digest_created` | `chat/state/meanwhile.py:67` | `INSERT INTO meanwhile_digest_pending` | Autoincrement `id`. Replay creates a duplicate pending digest, surfacing the same summary twice in the next you-scene's prompt. **Silent corruption.** |
|
||||
| `_apply_edge_update` | `chat/state/edges.py:12` | `INSERT OR IGNORE INTO edges` followed by `UPDATE … SET affinity = ? + delta` | The `INSERT OR IGNORE` is fine, but the handler is *delta-shaped* — each replay re-adds `affinity_delta` and `trust_delta`, and re-extends `knowledge_json`. **Silent corruption: scores drift up; knowledge facts duplicate.** Already called out in `chat/eventlog/log.py:39-46` as the canonical reason `append_and_apply` exists. |
|
||||
|
||||
### Trade-offs — why we are NOT switching every handler to `INSERT OR REPLACE`
|
||||
|
||||
This is the part the audit is here to nail down before someone "fixes
|
||||
it" with a one-line s/`INSERT INTO`/`INSERT OR REPLACE INTO`/.
|
||||
|
||||
1. **Autoincrement-id handlers (`containers`, `scenes`, `memories`,
|
||||
`meanwhile_digest_pending`)** — `INSERT OR REPLACE` doesn't help.
|
||||
Each event's payload doesn't carry the row's eventual id — the id
|
||||
comes from `lastrowid` *at projection time*. There is no key for
|
||||
`OR REPLACE` to match on. The fix here is either (a) make the event
|
||||
carry a deterministic id derived from the event's own id (large
|
||||
refactor — payload schemas, downstream FK lookups, FTS rowid
|
||||
alignment), or (b) keep the handler raw-INSERT and ensure every
|
||||
live path uses `append_and_apply` (the path we're on). We are on
|
||||
path (b), and this audit makes it explicit.
|
||||
|
||||
2. **`chat_created`** — `chats.id` IS keyed on the natural PK, so
|
||||
`INSERT OR REPLACE INTO chats ...` would technically work for the
|
||||
chat row. *But* it would silently overwrite `chat_state` columns
|
||||
that other events legitimately mutate later: `chat_state.time` is
|
||||
bumped by `time_skip_elision`, `active_scene_id` is set/cleared by
|
||||
`scene_opened`/`scene_closed`. On replay the
|
||||
`chat_created` overwrite would clobber those subsequent updates,
|
||||
then later events would re-set them — *if* the events themselves
|
||||
appear in order (they do today). It would work in practice, but it
|
||||
would erase the invariant that "each handler is responsible for one
|
||||
table-shape change" and make the projector's correctness depend on
|
||||
strict event-order replay through `chat_state`. Not worth the
|
||||
subtle coupling; keep the raw INSERT and treat replay as an
|
||||
explicit "wipe + replay" operation (the rewind path does exactly
|
||||
that).
|
||||
|
||||
3. **`meanwhile_scene_started`** — could be made idempotent (the
|
||||
payload supplies `scene_id`), but it shares the `scenes` table with
|
||||
`_apply_scene_opened` (autoincrement) — making one half of the
|
||||
table writers `OR REPLACE` and the other half raw-INSERT is asking
|
||||
for a future bug. Keep both raw, lean on `append_and_apply`.
|
||||
|
||||
4. **`edge_update`** — fundamentally cannot be made idempotent under
|
||||
replay without either changing the event schema (carry absolute
|
||||
values, not deltas) or recording per-event-id "already applied"
|
||||
flags. Either is a multi-week project. The current contract is
|
||||
"edge_update is a delta event; never apply it twice"; the
|
||||
`append_and_apply` rule enforces that contract from the call site.
|
||||
|
||||
**Conclusion:** the handler layer is *correctly* non-idempotent for
|
||||
event-sourcing semantics. The defect class lives in the *caller* layer
|
||||
(routes that mistakenly call `project()` instead of `append_and_apply`).
|
||||
This audit fixes the two known offenders and pins the contract with a
|
||||
regression test (see Step 3).
|
||||
|
||||
---
|
||||
|
||||
## Step 3 — regression test
|
||||
|
||||
Added `tests/test_chat_created_non_idempotent.py`. The test:
|
||||
|
||||
1. Opens a fresh DB and runs the migration chain.
|
||||
2. Appends one `chat_created` event and projects — first projection
|
||||
succeeds.
|
||||
3. Appends a *second* `chat_created` for the same chat id and projects
|
||||
again — asserts that the second projection raises
|
||||
`sqlite3.IntegrityError`.
|
||||
|
||||
The point isn't that the test catches a future "make it idempotent"
|
||||
change automatically; it's that any such change MUST update this test,
|
||||
forcing a deliberate review of all the trade-offs documented above.
|
||||
|
||||
---
|
||||
|
||||
## Files changed
|
||||
|
||||
- `chat/web/bots.py` — swap `append_event`+`project` → `append_and_apply`,
|
||||
drop unused imports.
|
||||
- `chat/web/settings.py` — same swap.
|
||||
- `tests/test_chat_created_non_idempotent.py` — new regression test.
|
||||
- `docs/audits/2026-04-27-project-callers.md` — this file.
|
||||
Reference in New Issue
Block a user