Bot replies were running long (4 paragraphs of action+dialogue beats
per turn) because we never set max_tokens on the narrative call. Three
changes: two tunable knobs in Settings (set in data/config.toml to
override) and a prompt nudge:
- narrative_max_tokens: int = 400
Hard cap on each generated response. ~400 tokens ≈ 1–2 short
paragraphs. Drop to 200 for terse banter, bump to 800+ for longer
scenes.
- narrative_temperature: float = 0.85
Sampling temperature. 0.7 = grounded/consistent (slightly stiff),
0.85 = creative-but-in-character (default), 1.0 = wide variety,
>1.0 = often off-the-rails.
- prompt closing instruction now nudges: "Keep your response to a
single beat — one or two short paragraphs at most. Don't monologue;
leave room for the other person to react."
Both turns.py (post_turn) and regenerate.py forward the params to
client.stream(). FeatherlessClient already passes **params through to
the OpenAI-compat endpoint.
Note: temperature does not control length (a common misconception);
max_tokens is the actual length cap. Lowering temperature makes word
choice more predictable (a slightly stiffer voice) but does not
shorten replies. The two knobs serve different goals.
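As a rough illustration of how the two knobs reach the API call, the sketch below models Settings as a dataclass and builds the keyword arguments that would be forwarded to the streaming call. The helpers `apply_overrides` and `narrative_params` are hypothetical names, and the real Settings class and config loader may be structured differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Settings:
    # Defaults taken from the commit message.
    narrative_max_tokens: int = 400
    narrative_temperature: float = 0.85


def apply_overrides(settings: Settings, overrides: dict) -> Settings:
    """Return a copy with values from a parsed config applied (hypothetical helper)."""
    return replace(settings, **overrides)


def narrative_params(settings: Settings) -> dict:
    """Build the sampling params that would be passed as client.stream(..., **params)."""
    return {
        "max_tokens": settings.narrative_max_tokens,    # length cap
        "temperature": settings.narrative_temperature,  # variety of word choice, not length
    }


# Example: a config that asks for terse banter only touches the length cap.
terse = apply_overrides(Settings(), {"narrative_max_tokens": 200})
print(narrative_params(Settings()))
print(narrative_params(terse))
```

Keeping the two values separate in the params dict mirrors the point of the note: shortening replies means changing `max_tokens`, while `temperature` stays where the voice sounds right.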
The form-submit handler in chat.html was setting
``textarea.disabled = true`` synchronously before the browser actually
serialized the form. Disabled form fields are excluded from
submission, so the request body contained ``prose=""`` even when the
user had typed text — which the server (correctly) rejected with the
new empty-prose 400. Net effect: typing "hello" + Send gave a "prose
cannot be empty" error.
Switched to ``readOnly``: same UX (user can't edit while streaming)
but the field IS submitted. The unlock path now also clears the
textarea and refocuses for the next turn.
Empty submission was producing a blank user_turn event in the log and
firing the LLM stream anyway — the bot would invent a response from the
kickoff context alone, producing a monologue with no user input. Two-
layer fix:
- Browser: add `required` to the prose textarea in chat.html so the
form refuses to submit empty.
- Server: 400 in post_turn when prose.strip() is empty. Defense in
depth — if a client bypasses the textarea attribute (custom UI,
curl, etc.), the server still rejects.
Verified live: POST with empty body returns 400; POST with whitespace-
only returns 400; chat shell renders the textarea with required.
Full suite: 168 passed.
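The server-side half of the fix can be sketched as a small validation step. This is a minimal stand-alone version: `EmptyProseError` and `validate_prose` are hypothetical names, and in the actual route the rejection would be expressed as an HTTP 400 response rather than a plain exception.

```python
class EmptyProseError(ValueError):
    """Hypothetical error that a route handler would translate into an HTTP 400."""

    status_code = 400


def validate_prose(prose: str) -> str:
    """Reject empty or whitespace-only prose; otherwise return it unchanged."""
    if not prose.strip():
        raise EmptyProseError("prose cannot be empty")
    return prose


# Example: the two inputs checked in the live verification, plus a normal turn.
for candidate in ["", "   \n\t", "hello"]:
    try:
        validate_prose(candidate)
        print(repr(candidate), "accepted")
    except EmptyProseError as exc:
        print(repr(candidate), "rejected with", exc.status_code)
```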
Two related issues blocking real-world use of the kickoff parse:
1. Classifier calls take ~12s end-to-end on Featherless for the
complex KickoffParse schema (Hermes-3-8B generating ~1.3KB of
structured JSON). The 10s timeout was firing on most attempts,
causing all 3 retries to time out and the empty-fallback to render
with blank form values. Bumped the default classifier_timeout_s
10s → 30s; measured p99 is ~13s, so 30s leaves comfortable
headroom.
2. Featherless caps concurrent connections per account (2 on free /
lower paid tiers). Each turn flow can fire 4–5 calls (parse,
scene-close detect, narrative stream, two state-update passes)
plus the background significance worker. Without a gate, a single
turn can exceed the cap, and the calls over the limit fail.
Added a class-level ``asyncio.Semaphore`` to FeatherlessClient,
shared across all instances, configured once in lifespan from
``Settings.featherless_max_concurrent`` (default 2). Both
``generate`` and ``stream`` acquire the semaphore for the duration
of the call; the stream holds it until the async generator
completes, so token streaming is correctly accounted for.
Verified live: 4/4 sequential kickoff parses for the same bot all
succeed with real parsed values (previously ~50% blank-fallback).
Full suite: 168 passed.
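The concurrency gate described in this commit can be sketched as follows. `GatedClient` is a stand-in for FeatherlessClient with network calls simulated by short sleeps; the `in_flight` and `peak` counters exist only so the demo can show the cap being respected and are not part of the described design.

```python
import asyncio
from typing import AsyncIterator, ClassVar, Optional


class GatedClient:
    """Stand-in client; every instance shares one class-level semaphore."""

    _semaphore: ClassVar[Optional[asyncio.Semaphore]] = None
    in_flight: ClassVar[int] = 0  # demo instrumentation only
    peak: ClassVar[int] = 0       # demo instrumentation only

    @classmethod
    def configure(cls, max_concurrent: int) -> None:
        """Call once at startup, e.g. from the app lifespan hook."""
        cls._semaphore = asyncio.Semaphore(max_concurrent)

    @classmethod
    def _enter(cls) -> None:
        cls.in_flight += 1
        cls.peak = max(cls.peak, cls.in_flight)

    @classmethod
    def _exit(cls) -> None:
        cls.in_flight -= 1

    async def generate(self, prompt: str) -> str:
        assert self._semaphore is not None, "configure() must run first"
        async with self._semaphore:
            self._enter()
            try:
                await asyncio.sleep(0.01)  # simulated request
                return f"reply:{prompt}"
            finally:
                self._exit()

    async def stream(self, prompt: str) -> AsyncIterator[str]:
        assert self._semaphore is not None, "configure() must run first"
        # The slot is held until the generator finishes, not just until the first token.
        async with self._semaphore:
            self._enter()
            try:
                for token in prompt.split():
                    await asyncio.sleep(0.01)  # simulated token arrival
                    yield token
            finally:
                self._exit()


async def demo() -> int:
    GatedClient.configure(max_concurrent=2)

    async def drain(client: GatedClient, prompt: str) -> list:
        return [tok async for tok in client.stream(prompt)]

    # Five calls from separate instances, like one turn flow firing 4-5 requests.
    await asyncio.gather(
        GatedClient().generate("parse"),
        GatedClient().generate("scene-close"),
        drain(GatedClient(), "a b c"),
        GatedClient().generate("state-1"),
        GatedClient().generate("state-2"),
    )
    return GatedClient.peak


peak = asyncio.run(demo())
print("peak concurrent calls:", peak)
```

One property of this pattern worth keeping in mind: because the semaphore is released when the async generator exits, a consumer that abandons a stream early only frees its slot once the generator is closed.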
The kickoff parse-and-confirm route was 500-ing intermittently because
Hermes-3 + Featherless's response_format={"type":"json_object"} only
guarantees JSON output, NOT a particular schema. The model was inventing
its own field names (sceneTime, entities, settingDetails) instead of
the KickoffParse fields, causing Pydantic validation to fail on both
classify() attempts.
Three changes:
1. Include the Pydantic JSON schema in the system prompt so the model
knows exactly which keys to produce. Affects every classify() call
(kickoff parse, turn parse, scene-close detect, significance,
state-update, scene summarize). Strip ```json fences if the model
wraps its output. Bump attempts 2 → 3 (the model is stochastic; one
extra attempt closes most of the remaining gap).
2. parse_kickoff() now passes a default empty KickoffParse so the
route degrades to a fillable form instead of 500 when the classifier
ultimately fails. The confirm form is the human-in-the-loop; an
empty form is strictly better UX than a stack trace.
3. Tests updated: bumped canned-failure arrays from 2 → 3 entries to
match the new attempt count; renamed kickoff test from
"raises_when_classifier_fails_twice" to
"falls_back_to_empty_when_classifier_fails" reflecting the new
degraded-but-usable behavior.
Verified live with all 3 sample bots (maya/eli/sam) — kickoff route
returns 200 across multiple attempts. Full suite: 168 passed.
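The three changes above fit together roughly as in the sketch below. It uses only the standard library, so a dataclass stands in for the Pydantic model, and the two fields on `KickoffParse` are hypothetical placeholders for the real schema. The helper names (`strip_json_fences`, `system_prompt`, `parse_strict`) are also illustrative assumptions.

```python
import json
from dataclasses import dataclass, fields
from typing import Callable, Optional

FENCE = "`" * 3  # Markdown code fence marker


@dataclass
class KickoffParse:
    # Hypothetical fields; the real KickoffParse schema is larger.
    location: str = ""
    time_of_day: str = ""


def strip_json_fences(text: str) -> str:
    """Remove a wrapping Markdown fence (optionally tagged json) from model output."""
    body = text.strip()
    if len(body) >= 2 * len(FENCE) and body.startswith(FENCE) and body.endswith(FENCE):
        body = body[len(FENCE):-len(FENCE)].strip()
        if body.lower().startswith("json"):
            body = body[len("json"):]
    return body.strip()


def system_prompt() -> str:
    """Spell out the expected keys so the model does not invent its own."""
    schema = {f.name: "string" for f in fields(KickoffParse)}
    return "Reply with one JSON object matching this schema: " + json.dumps(schema)


def parse_strict(raw: str) -> KickoffParse:
    """Parse model output and reject any key set other than the expected one."""
    data = json.loads(strip_json_fences(raw))
    expected = {f.name for f in fields(KickoffParse)}
    if not isinstance(data, dict) or set(data) != expected:
        raise ValueError(f"unexpected payload: {data!r}")
    return KickoffParse(**data)


def classify(call_model: Callable[[str], str], max_attempts: int = 3,
             default: Optional[KickoffParse] = None) -> KickoffParse:
    """Try up to max_attempts times; fall back to default if one is given."""
    prompt = system_prompt()
    for _ in range(max_attempts):
        try:
            return parse_strict(call_model(prompt))
        except ValueError:  # also covers json.JSONDecodeError
            continue
    if default is not None:
        return default
    raise RuntimeError(f"classifier failed after {max_attempts} attempts")


# Example: first reply invents its own keys, second is fenced but valid.
canned = iter([
    '{"sceneTime": "dusk"}',
    FENCE + 'json\n{"location": "bar", "time_of_day": "dusk"}\n' + FENCE,
])
result = classify(lambda prompt: next(canned))
fallback = classify(lambda prompt: "not json", default=KickoffParse())
print(result, fallback)
```

With an actual Pydantic v2 model, the schema text would more likely come from `model_json_schema()` than from a hand-built dict.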
Idempotent seeder for three sample bots (Maya — coworker slow-burn,
Eli — live-in partner, Sam — bartender / new connection). Each is a
distinct relational archetype to exercise the system from different
angles. Run from repo root:
.venv/bin/python scripts/seed_sample_bots.py
Re-running skips ids that already exist. After seeding, walk each bot
through kickoff parse-and-confirm at /bots/<id>/kickoff.
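The skip-if-exists behavior can be illustrated with a small sketch. It assumes a SQLite store and a two-column `bots` table purely for demonstration; the real script's storage layer and bot fields are not described here, and only the three ids come from the sample bots themselves.

```python
import sqlite3

SAMPLE_BOTS = [
    ("maya", "Maya"),
    ("eli", "Eli"),
    ("sam", "Sam"),
]


def seed(conn: sqlite3.Connection) -> list:
    """Insert any sample bot whose id is missing; return the ids actually created."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bots (id TEXT PRIMARY KEY, name TEXT NOT NULL)"
    )
    created = []
    for bot_id, name in SAMPLE_BOTS:
        exists = conn.execute(
            "SELECT 1 FROM bots WHERE id = ?", (bot_id,)
        ).fetchone()
        if exists:
            continue  # re-running skips ids that already exist
        conn.execute("INSERT INTO bots (id, name) VALUES (?, ?)", (bot_id, name))
        created.append(bot_id)
    conn.commit()
    return created


conn = sqlite3.connect(":memory:")
first = seed(conn)   # creates all three
second = seed(conn)  # creates nothing
print(first, second)
```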
- .gitignore: add *.egg-info/ so editable installs don't show in git status.
- pyproject.toml: add [build-system] and [tool.setuptools.packages.find]
scoped to chat*, fixing pip install -e . which was failing on data/
auto-discovery.
- CLAUDE.md: add Phase 1.5 cleanup backlog section under Phase 1 status,
capturing the small follow-ups surfaced in implementer reviews
(open_db refactor, regenerate SSE broadcast, you-activity purge,
drawer edits for deferred fields, NICE trim order).
Resolves the open/deferred decisions from the v1 requirements brainstorm:
runtime stack, classifier model, token budgets, OOC marker, data layout.
- Runtime: FastAPI + HTMX + SSE (multi-tab sync is a Phase 1 requirement,
not a polish item). 127.0.0.1 only, no auth in v1.
- Classifier model: NousResearch/Hermes-3-Llama-3.1-8B with documented
fallback chain (dolphin-2.9.4-llama3-8b, Meta-Llama-3.1-8B-abliterated).
- Token budgets: 8K hard / 6K soft for narrative, 4K hard for classifier;
Must/Should/Nice trimming tiers spelled out in §3.2.
- OOC marker locked to ((double parens)), configurable.
- All runtime data lives under <repo>/data/ (DB, backups, snapshots,
exports, config). Tree is gitignored. CHAT_DB_PATH env var honored.
CLAUDE.md and the requirements doc updated to match. Decisions log in
the requirements doc appendix extended with the new locks (#17–21).
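To make the OOC marker decision concrete, here is a minimal sketch of separating out-of-character asides from in-character prose with configurable delimiters. The function name `split_ooc` and the choice to return the asides as a list are assumptions for illustration; the requirements doc defines what the system actually does with OOC text.

```python
import re
from typing import List, Tuple


def split_ooc(text: str, open_marker: str = "((",
              close_marker: str = "))") -> Tuple[str, List[str]]:
    """Return (in-character prose, list of out-of-character asides)."""
    pattern = re.compile(
        re.escape(open_marker) + r"(.*?)" + re.escape(close_marker),
        re.DOTALL,
    )
    asides = [aside.strip() for aside in pattern.findall(text)]
    # Replace each aside with a space, then collapse runs of whitespace.
    in_character = " ".join(pattern.sub(" ", text).split())
    return in_character, asides


prose, asides = split_ooc("She waves. ((brb, phone)) Then she sits.")
print(prose)
print(asides)
```

Escaping the markers with `re.escape` is what keeps the delimiter configurable, since the default double parentheses are regex metacharacters.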
- docs/plans/2026-04-26-v1-requirements-design.md captures the v1
product requirements and behavioral spec from the initial brainstorm
(use case, scope, data model, authoring, play loop, memory, time,
rollback, phase cut, non-negotiable rules).
- README.md introduces the project for the Gitea repo.
- CLAUDE.md updated to reference the requirements doc.
- .gitignore added for macOS metadata.