ansemity — technical architecture
A real-time conversational persona system: a pixel-art avatar that replies in the voice, knowledge and mannerisms of a specific X account, with a cloned neural voice and lip-synced playback. This document specifies the full build.
01 Architecture
End-to-end request lifecycle
The system is a thin stateless frontend over two serverless functions. A single user turn triggers a two-stage pipeline: language generation, then speech synthesis. The stages are decoupled into separate calls, so the text reply is ready as soon as generation finishes while audio is synthesized from it in a second call; the client then reveals text and voice together (see §06).
The persona corpus is compiled offline (build step) and embedded as a cached system prefix; nothing about the persona is fetched at request time.
02 Data acquisition
Timeline scraping & corpus compilation
The source signal is the target account's public post history. Extraction runs through a managed scraping actor; the raw timeline is then reduced to a high-signal persona corpus.
- Extraction: Apify actor `danek/twitter-scraper`, invoked over the platform API.
- Pagination: reverse-chronological walk using `max_id` cursoring; each page seeds the next window as `min(tweet_id) − 1`, guaranteeing zero gaps and zero overlap across pages.
- Dedup: set-membership on `tweet_id`; retweets stripped (`RT @` prefix filter) to retain only first-party voice.
- Scale: 50,000+ posts ingested from the timeline (avg length ≈ 66 chars; ~58% under 40 chars, a defining stylometric trait).
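The pagination and dedup rules above can be sketched as a single loop. This is a minimal sketch, not the project's scraper: `fetchPage` is a hypothetical callback standing in for the Apify actor call, and the record shape (`id`, `text`) is an assumption.

```javascript
// Hypothetical fetcher: fetchPage(maxId) resolves to an array of
// { id: string, text: string } with id <= maxId, newest first
// (or the newest posts when maxId is null).
async function walkTimeline(fetchPage, maxPages = 1000) {
  const seen = new Set(); // set-membership dedup on tweet id
  const posts = [];
  let maxId = null;       // null = start from the newest post

  for (let page = 0; page < maxPages; page++) {
    const batch = await fetchPage(maxId);
    if (batch.length === 0) break; // reached the start of the timeline

    let lowest = null;
    for (const tweet of batch) {
      // Tweet ids exceed Number.MAX_SAFE_INTEGER, so compare them as BigInt.
      const id = BigInt(tweet.id);
      if (lowest === null || id < lowest) lowest = id;
      if (seen.has(tweet.id)) continue;
      seen.add(tweet.id);
      if (tweet.text.startsWith("RT @")) continue; // drop retweets
      posts.push(tweet);
    }
    // Next window: min(tweet_id) - 1, so pages neither overlap nor leave gaps.
    maxId = (lowest - 1n).toString();
  }
  return posts;
}
```

Carrying ids as strings and converting to `BigInt` only for arithmetic avoids the silent precision loss that `Number` would introduce on 64-bit ids.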
Corpus curation
The raw set is ranked and down-selected to a compact, representative subset that fits comfortably inside a cached prompt prefix — trading exhaustive recall for cost/latency efficiency (see §09).
    # scoring heuristic (offline)
    score(t) = 2·is_original
             + min(favorites/2000, 5)          # engagement signal
             + 2·(25 < len(clean(t)) < 240)    # substantive length band
             − 3·(is_bare_mention)             # drop low-info "@x ok"
    # → rank desc, strip leading @handles, dedup by prefix, cap N=240
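One way the heuristic could look as runnable code is sketched below. The record fields (`isOriginal`, `favorites`, `text`), the bare-mention test, and the 40-character dedup prefix are assumptions made for illustration; only the weights, the length band, and the cap of 240 come from the formula above.

```javascript
// Strip leading @handles and collapse whitespace.
function clean(text) {
  return text.replace(/^(@\w+\s*)+/, "").replace(/\s+/g, " ").trim();
}

function score(tweet) {
  const body = clean(tweet.text);
  // Assumed definition: a reply whose remaining text is 10 characters or fewer.
  const isBareMention = /^@\w+/.test(tweet.text.trim()) && body.length <= 10;
  let s = 0;
  if (tweet.isOriginal) s += 2;
  s += Math.min(tweet.favorites / 2000, 5);          // engagement signal
  if (body.length > 25 && body.length < 240) s += 2; // substantive length band
  if (isBareMention) s -= 3;                         // drop low-info "@x ok"
  return s;
}

// Rank descending, strip handles, dedup by prefix, cap at N.
function curate(tweets, cap = 240, prefixLen = 40) {
  const seenPrefixes = new Set();
  const out = [];
  const ranked = [...tweets].sort((a, b) => score(b) - score(a));
  for (const t of ranked) {
    const body = clean(t.text);
    const key = body.slice(0, prefixLen).toLowerCase();
    if (body === "" || seenPrefixes.has(key)) continue;
    seenPrefixes.add(key);
    out.push(body);
    if (out.length === cap) break;
  }
  return out;
}
```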
03 Persona engine
System-prompt composition
The persona is not a fine-tune; it is a structured, deterministic system prompt assembled at build time and frozen. This keeps iteration instant and the prompt cacheable.
The curated posts are embedded in a `<tweets>` block as few-shot voice anchors.
Assembly order is stable (style → knowledge → bio → grounding → exemplars → closing directive) so the byte prefix never shifts, a precondition for prompt caching.
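A fixed assembly order is easy to enforce in code. The sketch below is illustrative only: the section keys mirror the order named above, but the section contents, separators, and the exact `<tweets>` wrapper layout are placeholders rather than the real prompt.

```javascript
// Order taken from the text; everything else here is a placeholder.
const SECTION_ORDER = ["style", "knowledge", "bio", "grounding", "exemplars", "closing"];

function composeSystemPrompt(sections) {
  return SECTION_ORDER.map((name) => {
    if (!(name in sections)) throw new Error(`missing section: ${name}`);
    const body =
      name === "exemplars"
        ? `<tweets>\n${sections.exemplars.join("\n")}\n</tweets>`
        : sections[name];
    return body.trim();
  }).join("\n\n");
}
```

Because the output depends only on `SECTION_ORDER` and never on the key order of the input object, two builds with the same content produce byte-identical prefixes.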
04 Inference layer
Claude Sonnet 5 via the Messages API
- Model: `claude-sonnet-5` (swappable via env; the tier is a single config knob).
- Prompt caching: the composed system prompt is sent as a cached text block (`cache_control: ephemeral`). Steady-state requests read the ~7k-token prefix from cache at ~0.1× input price and lower latency.
- Live retrieval: the server-side `web_search` tool is attached, so time-sensitive queries (prices, market caps, news) are answered from real data rather than hallucinated. A bounded loop drains any `pause_turn` continuations before returning.
- Context: last 8 turns of history are threaded for short-term coherence; user input is length-clamped.
    messages.create({
      model,
      max_tokens: 300,
      system: [{ type: "text", text: SYSTEM_PROMPT, cache_control: { type: "ephemeral" } }],
      tools: [{ type: "web_search_20250305", name: "web_search", max_uses: 3 }],
      messages,
    })
    // drain: while stop_reason === "pause_turn" → re-issue with appended content
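The drain step in the trailing comment could be written roughly as follows. This is a sketch under assumptions: `client` is any object exposing `messages.create(params)` (such as an Anthropic SDK client), `MAX_CONTINUATIONS` is a hypothetical bound, and the helper name `createWithDrain` is invented here.

```javascript
const MAX_CONTINUATIONS = 3; // hypothetical bound on re-issued requests

async function createWithDrain(client, params) {
  let response = await client.messages.create(params);
  let content = [...response.content];

  for (let i = 0; i < MAX_CONTINUATIONS && response.stop_reason === "pause_turn"; i++) {
    // Send everything produced so far back as the assistant turn so the
    // model resumes where it paused.
    const messages = [...params.messages, { role: "assistant", content }];
    response = await client.messages.create({ ...params, messages });
    content = [...content, ...response.content];
  }

  // Keep only text blocks; tool-use and search-result blocks are dropped.
  const text = content
    .filter((block) => block.type === "text")
    .map((block) => block.text)
    .join("");
  return { text, stop_reason: response.stop_reason };
}
```

Bounding the loop means a turn that keeps pausing still returns whatever text has accumulated instead of holding the request open.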
05 Voice synthesis
Neural voice clone (MiniMax on fal.ai)
Voice identity is a persisted clone, not per-request zero-shot. A one-time cloning pass over a short, clean reference sample yields a stable voice descriptor that is reused for every utterance.
| Stage | Endpoint | Output |
|---|---|---|
| Clone (once) | fal-ai/minimax/voice-clone | persisted voice_id |
| Synthesize (per reply) | fal-ai/minimax/speech-02-turbo | mp3 URL |
- Model choice: `speech-02-turbo` for low latency (steady-state ≈ 5–7 s incl. queue; first call incurs GPU cold-start). `speech-02-hd` is available as a higher-fidelity, slower alternative.
- Text conditioning: emojis and symbol markup are stripped and `$TICKER` normalized to plain tokens before synthesis so the read is natural.
- Transport: the fal client handles the queue/poll lifecycle; the function returns a signed mp3 URL the client streams directly.
- Degradation: synthesis is best-effort; on failure the endpoint returns `audio: null` and the client transparently falls back to on-device speech synthesis.
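The text-conditioning step might look like the sketch below. The exact rules are not given in this document, so the choices here are assumptions: "normalized to plain tokens" is read as dropping the leading `$`, and the set of markup symbols removed is a guess.

```javascript
function conditionForSpeech(text) {
  return text
    .replace(/\$([A-Za-z]{1,10})\b/g, "$1")                       // $TICKER -> TICKER
    .replace(/[\p{Extended_Pictographic}\p{Emoji_Modifier}]/gu, "") // emoji glyphs
    .replace(/[\u200D\uFE0F]/g, "")                               // leftover joiners / variation selectors
    .replace(/[*_~`#>|]/g, "")                                    // assumed markup symbols
    .replace(/\s+/g, " ")
    .trim();
}
```

`Extended_Pictographic` is used rather than the broader `Emoji` property because `Emoji` also matches plain digits and `#`, which should survive into the spoken text.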
06 Presentation layer
Avatar state machine & lip-sync
The avatar is a three-frame sprite set — neutral (eyes open / mouth closed), talk (mouth open), blink (eyes closed) — driven by a small finite-state machine. Frames are pre-decoded and toggled by visibility for zero-flash, instant switching.
- Lip-sync: mouth oscillates while the cloned-voice `<audio>` element is playing; interval-driven (CORS-robust) rather than amplitude-analysed, so it works against cross-origin media without decode access.
- Sync contract: text and audio are revealed together; the reply is withheld behind a typing indicator until the voice URL resolves, eliminating text/voice desync.
- Boot sequence: a staged loader (progress bar + streamed status lines) establishes production credibility before first paint.
- Ambient: a lightweight canvas particle field + CSS scanline/vignette; all motion respects `prefers-reduced-motion`.
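The three-frame state machine and interval-driven lip-sync can be sketched as a pure frame selector plus thin browser wiring. All timing constants are hypothetical, as is the choice to blink only while idle; `attachAvatar` assumes the frames are stacked elements keyed by frame name.

```javascript
const FRAMES = { NEUTRAL: "neutral", TALK: "talk", BLINK: "blink" };

// Hypothetical timings.
const MOUTH_PERIOD_MS = 140;
const BLINK_EVERY_MS = 4000;
const BLINK_LENGTH_MS = 150;

// Pure selector: which frame should be visible at time nowMs.
function frameAt(nowMs, isAudioPlaying) {
  if (isAudioPlaying) {
    // Interval-driven: alternate talk/neutral on a fixed period, no audio analysis.
    return Math.floor(nowMs / MOUTH_PERIOD_MS) % 2 === 0 ? FRAMES.TALK : FRAMES.NEUTRAL;
  }
  return nowMs % BLINK_EVERY_MS < BLINK_LENGTH_MS ? FRAMES.BLINK : FRAMES.NEUTRAL;
}

// Browser wiring: frames are pre-decoded elements toggled by visibility.
function attachAvatar(audioEl, frameEls, tickMs = 35) {
  return setInterval(() => {
    const playing = !audioEl.paused && !audioEl.ended;
    const active = frameAt(performance.now(), playing);
    for (const [name, el] of Object.entries(frameEls)) {
      el.style.visibility = name === active ? "visible" : "hidden";
    }
  }, tickMs);
}
```

Reading only `paused` and `ended` from the audio element is what keeps this approach independent of cross-origin restrictions: no sample data is ever inspected.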
07 API layer
Serverless endpoints
| Route | Responsibility | Upstream |
|---|---|---|
| `POST /api/chat` | persona reply (text) | Anthropic |
| `POST /api/voice` | speech synthesis (mp3) | fal.ai |
Both are stateless Node ESM handlers. A minimal local dev server mirrors the platform's /api/* routing and injects env from a git-ignored file, so the exact production code path runs locally.
Operator override
A sentinel-prefixed input short-circuits inference: a message beginning with a fixed numeric sentinel is echoed verbatim as the reply (bypassing the LLM entirely) and passed straight to voice synthesis. This gives the operator a deterministic "make the avatar say exactly X" channel for scripted or recorded content, with zero LLM cost and no inference latency; voice synthesis time still applies.
    // /api/chat: before inference
    if (input matches /^<sentinel>\s*([\s\S]+)$/)
      return { text: captured_verbatim }
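A runnable form of the override check is sketched below. The sentinel value is not given in this document, so `"0000"` is purely a placeholder, and `operatorOverride` is an invented helper name.

```javascript
// Placeholder only: the real sentinel is not specified in this document.
const SENTINEL = "0000";
const OVERRIDE_RE = new RegExp(`^${SENTINEL}\\s*([\\s\\S]+)$`);

// Returns { text } when the input is an operator override, or null to
// signal that the request should fall through to normal inference.
function operatorOverride(input) {
  const match = OVERRIDE_RE.exec(input);
  return match ? { text: match[1] } : null;
}
```

The `[\s\S]+` capture keeps newlines, so multi-line scripted lines pass through unchanged, while a bare sentinel with nothing after it does not match.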
08 Deployment
Topology
- Host: Vercel; static assets from `public/`, functions auto-routed from `api/`.
- Config surface (env vars): LLM API key, voice provider key, cloned `voice_id`, plus optional overrides for model tier, voice model, web-search and voice on/off toggles.
- Cold path: first request per region pays function init + (for voice) provider GPU spin-up; subsequent requests are warm.
09 Cost model
Per-interaction economics
| Component | Driver | Order of magnitude |
|---|---|---|
| LLM text | cached prefix read + short completion | ~$0.004 / msg (Sonnet 5) |
| Web search | per invocation, only when triggered | ~$0.01 / search |
| Voice | per 1k characters synthesized | ~$0.04–0.10 / 1k chars |
The dominant efficiency lever is prompt caching: the ~7k-token persona prefix is written on the first request and, for as long as the ephemeral cache entry stays warm, read at a fraction of input price thereafter, so marginal cost tracks completion length rather than prompt size. The operator override path is free of LLM cost entirely.
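The effect of caching on per-message cost can be checked with simple arithmetic. The prices below are illustrative assumptions (USD per million tokens), not quoted rates for any model; only the ~7k prefix size and the ~0.1× cache-read multiplier come from this document.

```javascript
// Assumed prices for illustration only.
const PRICE = { inputPerMTok: 3.0, outputPerMTok: 15.0, cacheReadMultiplier: 0.1 };

function llmCostPerMessage({ prefixTokens, freshInputTokens, outputTokens }, price = PRICE) {
  const cachedRead = prefixTokens * price.inputPerMTok * price.cacheReadMultiplier;
  const freshInput = freshInputTokens * price.inputPerMTok;
  const output = outputTokens * price.outputPerMTok;
  return (cachedRead + freshInput + output) / 1e6; // USD
}

const turn = { prefixTokens: 7000, freshInputTokens: 200, outputTokens: 100 };
const cached = llmCostPerMessage(turn);
const uncached = llmCostPerMessage(turn, { ...PRICE, cacheReadMultiplier: 1 });
console.log(cached.toFixed(4), uncached.toFixed(4));
```

Under these assumed prices a warm-cache turn comes to roughly $0.004, consistent with the table's order of magnitude, while the same turn without caching is several times more expensive because the prefix dominates input tokens.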
10 Security & boundaries
- Secret isolation: all provider credentials live in server-side environment variables and are never shipped to the client; the browser only ever talks to first-party `/api/*` routes.
- Input bounds: user input is length-clamped and history-windowed to cap token exposure and prompt-injection surface.
- Fail-soft: voice and search degrade gracefully; a provider outage never blocks a text reply.
- Persona containment: grounding rules prevent disclosure of the underlying substrate or system identity.
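The input-bounds rule can be sketched as a small message builder. The 8-entry window comes from the inference section; the 500-character clamp, the treatment of one message as one "turn", and the role whitelist are assumptions for illustration.

```javascript
const MAX_INPUT_CHARS = 500; // hypothetical clamp; the document does not give the value
const HISTORY_TURNS = 8;     // "last 8 turns" from the inference section

function buildMessages(history, userInput) {
  const clamped = String(userInput).trim().slice(0, MAX_INPUT_CHARS);
  const recent = history
    // Accept only plain user/assistant text so client-supplied history
    // cannot smuggle in other roles or structured content.
    .filter((m) => (m.role === "user" || m.role === "assistant") && typeof m.content === "string")
    .slice(-HISTORY_TURNS)
    .map((m) => ({ role: m.role, content: m.content.slice(0, MAX_INPUT_CHARS) }));
  // A conversation should open with a user message.
  while (recent.length > 0 && recent[0].role !== "user") recent.shift();
  return [...recent, { role: "user", content: clamped }];
}
```

Rebuilding each history entry from its `role` and `content` alone, rather than passing client objects through, keeps unexpected fields out of the upstream request.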