ansemity — technical architecture

A real-time conversational persona system: a pixel-art avatar that replies in the voice, knowledge and mannerisms of a specific X account, with a cloned neural voice and lip-synced playback. This document specifies the full build.

LLM: Claude Sonnet 5
Voice: MiniMax neural clone
Corpus: tweet-derived persona
Runtime: serverless (Vercel)
Search: live web tool

01 Architecture

End-to-end request lifecycle

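The system is a thin stateless frontend over two serverless functions. A single user turn triggers a two-stage pipeline: language generation, then speech synthesis. The stages are separate endpoints, so the text reply is available as soon as /api/chat returns; the client then requests audio for that text and holds the reply behind a typing indicator until the voice URL resolves, so text and voice are revealed in sync (see §06).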

┌─────────────── CLIENT (static SPA) ───────────────┐
│  chat input ──▶ POST /api/chat ──▶ text           │
│            │                                      │
│            └───────▶ POST /api/voice ─▶ mp3 url   │
│                                           │       │
│  avatar FSM ◀── lip-sync ◀── <audio> playback     │
└───────────────────────────────────────────────────┘
         │                                 │
         ▼                                 ▼
┌─────────────────┐               ┌─────────────────┐
│  Anthropic API  │               │     fal.ai      │
│ Claude Sonnet 5 │               │   MiniMax TTS   │
│  + web_search   │               │  cloned voice   │
└─────────────────┘               └─────────────────┘
         ▲
┌─────────────────┐
│ system prompt = │
│ persona + bio + │
│ 240-tweet corpus│  (prompt-cached)
└─────────────────┘

The persona corpus is compiled offline (build step) and embedded as a cached system prefix; nothing about the persona is fetched at request time.
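
The lifecycle above can be sketched from the client's side. This is a minimal sketch under stated assumptions: `chat` and `voice` are hypothetical injected wrappers around POST /api/chat and POST /api/voice, and the response fields `text` and `audio` follow the names used elsewhere in this document. It is not the project's actual client code.

```javascript
// Hypothetical client-side turn runner. `chat` and `voice` are injected
// wrappers around POST /api/chat and POST /api/voice (assumed names).
async function runTurn(input, { chat, voice }) {
  // Stage 1: language generation.
  const { text } = await chat(input);

  // Stage 2: speech synthesis for the generated text. Voice is best-effort,
  // so any failure collapses to audio: null instead of failing the turn.
  let audio = null;
  try {
    const res = await voice(text);
    audio = res && res.audio ? res.audio : null;
  } catch {
    audio = null;
  }

  // Both parts are returned together so the UI can reveal them in sync.
  return { text, audio };
}
```

Injecting the two calls keeps the sequencing logic testable without a network; in a browser they would most likely be thin `fetch` wrappers.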

02 Data acquisition

Timeline scraping & corpus compilation

The source signal is the target account's public post history. Extraction runs through a managed scraping actor; the raw timeline is then reduced to a high-signal persona corpus.

Extraction: Apify actor danek/twitter-scraper, invoked over the platform API.
Pagination: reverse-chronological walk using max_id cursoring — each page seeds the next window as min(tweet_id) − 1, guaranteeing zero gaps and zero overlap across pages.
Dedup: set-membership on tweet_id; retweets stripped (RT @ prefix filter) to retain only first-party voice.
Scale: 50,000+ posts ingested from the timeline (avg length ≈ 66 chars; ~58% under 40 chars — a defining stylometric trait).
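
The cursoring, dedup and retweet filter described above can be sketched as follows. This is an illustration only: `fetchPage` is a hypothetical function standing in for the scraping actor call, and the record shape (`id` as a decimal string, `text`) is an assumption. IDs are compared as BigInt because tweet IDs exceed the range JavaScript numbers represent exactly.

```javascript
// Hypothetical page fetcher: fetchPage(maxId) resolves to an array of
// { id: string, text: string }, newest first, with id <= maxId when set.
async function walkTimeline(fetchPage, maxPages = 1000) {
  const seen = new Set(); // dedup by tweet id
  const kept = [];
  let maxId = null; // null means "start from the newest post"

  for (let page = 0; page < maxPages; page++) {
    const tweets = await fetchPage(maxId);
    if (tweets.length === 0) break;

    let minId = null;
    for (const t of tweets) {
      const id = BigInt(t.id); // ids exceed Number.MAX_SAFE_INTEGER
      if (minId === null || id < minId) minId = id;
      if (seen.has(t.id)) continue; // set-membership dedup
      seen.add(t.id);
      if (t.text.startsWith("RT @")) continue; // strip retweets
      kept.push(t);
    }
    // The next window ends just below the oldest id seen on this page.
    maxId = (minId - 1n).toString();
  }
  return kept;
}
```

Because the next `maxId` is derived from the smallest ID on the page rather than from the kept posts, filtered retweets still advance the cursor.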

Corpus curation

The raw set is ranked and down-selected to a compact, representative subset that fits comfortably inside a cached prompt prefix — trading exhaustive recall for cost/latency efficiency (see §09).

# scoring heuristic (offline)
score(t) = 2·is_original
         + min(favorites/2000, 5)      # engagement signal
         + 2·(25 < len(clean(t)) < 240) # substantive length band
         − 3·(is_bare_mention)          # drop low-info "@x ok"
# → rank desc, strip leading @handles, dedup by prefix, cap N=240
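
A runnable reading of the heuristic above, as a sketch under stated assumptions. The record fields (`text`, `favorites`, `isOriginal`), the bare-mention test and the 40-character dedup prefix are all hypothetical; the source gives only the formula and the cap.

```javascript
// Field names (text, favorites, isOriginal) are assumptions about the record shape.
function clean(text) {
  // Strip leading @handles, as in the ranking note above.
  return text.replace(/^(?:@\w+\s*)+/, "").trim();
}

function score(t) {
  const body = clean(t.text);
  // Assumption: a "bare mention" is a reply whose remaining text is very short.
  const isBareMention = t.text.trim().startsWith("@") && body.length <= 25;
  const inBand = body.length > 25 && body.length < 240;
  return (
    2 * (t.isOriginal ? 1 : 0) +
    Math.min(t.favorites / 2000, 5) + // engagement signal, capped at 5
    2 * (inBand ? 1 : 0) -            // substantive length band
    3 * (isBareMention ? 1 : 0)       // penalize low-info replies
  );
}

function curate(tweets, cap = 240, prefixLen = 40) {
  const seenPrefixes = new Set();
  const out = [];
  const ranked = [...tweets].sort((a, b) => score(b) - score(a));
  for (const t of ranked) {
    const body = clean(t.text);
    const key = body.slice(0, prefixLen).toLowerCase(); // dedup by prefix
    if (body.length === 0 || seenPrefixes.has(key)) continue;
    seenPrefixes.add(key);
    out.push(body);
    if (out.length === cap) break;
  }
  return out;
}
```

Sorting a copy keeps the input untouched, so the same raw set can be re-ranked with different weights during tuning.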

03 Persona engine

System-prompt composition

The persona is not a fine-tune; it is a structured, deterministic system prompt assembled at build time and frozen. This keeps iteration instant and the prompt cacheable.

Style layer

Knowledge layer

Projects, market positions, trading & life philosophy — distilled into declarative statements.

Bio layer

Verifiable biographical facts sourced from public web research, phrased first-person.

Grounding layer

Behavioral guardrails: reason freely on any topic, never reveal the tweet substrate, never break character, no fabricated hard facts.

Exemplar layer

The curated N=240 real posts, wrapped in a <tweets> block as few-shot voice anchors.

Assembly order is stable (style → knowledge → bio → grounding → exemplars → closing directive) so the byte prefix never shifts — a precondition for prompt caching.
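
One way to keep the prefix byte-stable is to fix the layer order in code rather than rely on object key order. The sketch below assumes the layers arrive as plain strings; the function and parameter names are hypothetical and the layer contents are placeholders.

```javascript
// Layer names follow the text; the section contents passed in are placeholders.
const LAYER_ORDER = ["style", "knowledge", "bio", "grounding"];

function buildSystemPrompt(layers, tweets, closingDirective) {
  const parts = LAYER_ORDER.map((name) => {
    if (typeof layers[name] !== "string") {
      throw new Error(`missing layer: ${name}`);
    }
    return layers[name].trim();
  });
  // Exemplars are wrapped in a <tweets> block, one post per line.
  parts.push(`<tweets>\n${tweets.join("\n")}\n</tweets>`);
  parts.push(closingDirective.trim());
  return parts.join("\n\n");
}
```

Given the same inputs, the output is identical regardless of how the `layers` object was constructed, which is the property prompt caching depends on.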

04 Inference layer

Claude Sonnet 5 via the Messages API

Model: claude-sonnet-5 (swappable via env; the tier is a single config knob).
Prompt caching: the composed system prompt is sent as a cached text block (cache_control: ephemeral). Steady-state requests read the ~7k-token prefix from cache at ~0.1× input price and lower latency.
Live retrieval: the server-side web_search tool is attached, so time-sensitive queries (prices, market caps, news) are answered from real data rather than hallucinated. A bounded loop drains any pause_turn continuations before returning.
Context: last 8 turns of history are threaded for short-term coherence; user input is length-clamped.

messages.create({
  model, max_tokens: 300,
  system: [{ type:"text", text: SYSTEM_PROMPT,
             cache_control:{ type:"ephemeral" } }],
  tools:  [{ type:"web_search_20250305", name:"web_search", max_uses:3 }],
  messages,
})
// drain: while stop_reason === "pause_turn" → re-issue with appended content
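
The bounded drain loop mentioned in the comment above might look like the following. It is a sketch, not the production handler: `create` stands in for the SDK's `messages.create` call and is injected so the loop can run without network access, and `maxContinuations` is a hypothetical bound that the source does not specify.

```javascript
// `create` stands in for the SDK's messages.create call (injected for testing).
// maxContinuations is a hypothetical bound, not a value taken from the source.
async function createWithDrain(create, request, maxContinuations = 3) {
  let messages = request.messages;
  let response = await create({ ...request, messages });

  for (let i = 0; i < maxContinuations && response.stop_reason === "pause_turn"; i++) {
    // Append the paused assistant turn as-is and let the model continue it.
    messages = [...messages, { role: "assistant", content: response.content }];
    response = await create({ ...request, messages });
  }

  // Concatenate the text blocks of the final response into the reply string.
  const text = response.content
    .filter((block) => block.type === "text")
    .map((block) => block.text)
    .join("");
  return { text, stopReason: response.stop_reason };
}
```

If the bound is hit while the model is still paused, the function returns whatever text exists along with `stopReason: "pause_turn"`, so the caller can decide how to degrade.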

05 Voice synthesis

Neural voice clone (MiniMax on fal.ai)

Voice identity is a persisted clone, not per-request zero-shot. A one-time cloning pass over a short, clean reference sample yields a stable voice descriptor that is reused for every utterance.

Stage	Endpoint	Output
Clone (once)	`fal-ai/minimax/voice-clone`	persisted `voice_id`
Synthesize (per reply)	`fal-ai/minimax/speech-02-turbo`	mp3 URL

Model choice: speech-02-turbo for low latency (steady-state ≈ 5–7 s incl. queue; first call incurs GPU cold-start). speech-02-hd available as a higher-fidelity, slower alternative.
Text conditioning: emojis and symbol markup are stripped and $TICKER normalized to plain tokens before synthesis so the read is natural.
Transport: the fal client handles the queue/poll lifecycle; the function returns a signed mp3 URL the client streams directly.
Degradation: synthesis is best-effort — on failure the endpoint returns audio:null and the client transparently falls back to on-device speech synthesis.
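
The text conditioning and fail-soft behaviour above can be illustrated together. This is a sketch under assumptions: "symbol markup" is read as markdown-style markers, and `synthesize` is a hypothetical async function wrapping the provider call that resolves to an mp3 URL. The exact character set stripped by the real pipeline is not given in the source.

```javascript
// Assumption: "symbol markup" is read here as markdown-style markers.
function conditionForTts(text) {
  return text
    .replace(/\$([A-Za-z][A-Za-z0-9]*)/g, "$1") // $TICKER becomes TICKER
    .replace(/[\p{Extended_Pictographic}\u200D\uFE0F]/gu, "") // emojis and joiners
    .replace(/[*_`#~>|]/g, "") // markup symbols
    .replace(/\s+/g, " ") // collapse whitespace left behind
    .trim();
}

// `synthesize` is a hypothetical wrapper around the provider call; it is
// expected to resolve to an mp3 URL string.
async function voiceBestEffort(text, synthesize) {
  const spoken = conditionForTts(text);
  if (spoken.length === 0) return { audio: null };
  try {
    return { audio: await synthesize(spoken) };
  } catch {
    return { audio: null }; // the client falls back to on-device speech
  }
}
```

The ticker rule only fires when a letter follows the dollar sign, so amounts such as "$100" pass through unchanged.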

06 Presentation layer

Avatar state machine & lip-sync

The avatar is a three-frame sprite set — neutral (eyes open / mouth closed), talk (mouth open), blink (eyes closed) — driven by a small finite-state machine. Frames are pre-decoded and toggled by visibility for zero-flash, instant switching.

idle ──(random 2.6–6.2s)──▶ blink(140ms) ──▶ idle
  │ audio.play()
  ▼
SPEAKING ──(mouth oscillator ~8Hz: neutral⇄talk)──▶ audio.ended ──▶ idle

Lip-sync: mouth oscillates while the cloned-voice <audio> element is playing; interval-driven (CORS-robust) rather than amplitude-analysed, so it works against cross-origin media without decode access.
Sync contract: text and audio are revealed together — the reply is withheld behind a typing indicator until the voice URL resolves, eliminating text/voice desync.
Boot sequence: a staged loader (progress bar + streamed status lines) is shown before the chat interface is revealed.
Ambient: a lightweight canvas particle field + CSS scanline/vignette; all motion respects prefers-reduced-motion.
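
The state machine and interval-driven mouth oscillator can be expressed as two pure functions, which keeps timing logic separate from DOM updates. This is a sketch: the event names are hypothetical, "~8Hz" is read as eight frame toggles per second, and the blink-to-speaking transition is an assumption for audio that starts mid-blink.

```javascript
// States and frames are the ones named in the text; event names are hypothetical.
const MOUTH_TOGGLE_MS = 125; // assumption: "~8Hz" read as 8 frame toggles per second

function nextState(state, event) {
  switch (state) {
    case "idle":
      if (event === "blinkTimer") return "blink";
      if (event === "audioPlay") return "speaking";
      return state;
    case "blink":
      if (event === "blinkDone") return "idle"; // fired ~140 ms after entering blink
      if (event === "audioPlay") return "speaking"; // assumption: audio may start mid-blink
      return state;
    case "speaking":
      if (event === "audioEnded") return "idle";
      return state;
    default:
      return "idle";
  }
}

// Which sprite to show, given the state and the time spent in it.
function frameFor(state, msInState) {
  if (state === "blink") return "blink";
  if (state === "speaking") {
    return Math.floor(msInState / MOUTH_TOGGLE_MS) % 2 === 0 ? "talk" : "neutral";
  }
  return "neutral";
}

// Random idle interval before the next blink, in the 2.6 to 6.2 s range.
function nextBlinkDelayMs(random = Math.random) {
  return 2600 + random() * (6200 - 2600);
}
```

A caller would typically run `frameFor` from a timer and toggle sprite visibility only when the returned frame changes.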

07 API layer

Serverless endpoints

Route	Responsibility	Upstream
`POST /api/chat`	persona reply (text)	Anthropic
`POST /api/voice`	speech synthesis (mp3)	fal.ai

Both are stateless Node ESM handlers. A minimal local dev server mirrors the platform's /api/* routing and injects env from a git-ignored file, so the exact production code path runs locally.

Operator override

A sentinel-prefixed input short-circuits inference: a message beginning with a fixed numeric sentinel is echoed verbatim as the reply (bypassing the LLM entirely) and passed straight to voice synthesis. This gives the operator a deterministic "make the avatar say exactly X" channel for scripted or recorded content, with zero LLM cost and no inference latency (voice synthesis time still applies).

// /api/chat — before inference
if (input matches /^<sentinel>\s*([\s\S]+)$/)
     return { text: captured_verbatim }
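
A runnable version of the check above, for illustration. The real sentinel is redacted in this document, so the value `"0000"` below is a placeholder and not the actual prefix; the function name is also hypothetical.

```javascript
// PLACEHOLDER sentinel: the real prefix is redacted in the source, so "0000"
// is purely illustrative.
const SENTINEL = "0000";
const OVERRIDE_RE = new RegExp(`^${SENTINEL}\\s*([\\s\\S]+)$`);

// Returns { text } when the input is an operator override, otherwise null.
function operatorOverride(input) {
  const match = OVERRIDE_RE.exec(input);
  return match ? { text: match[1] } : null;
}
```

A handler would call this before building the model request and return the result immediately when it is non-null. The `[\s\S]+` capture preserves line breaks in the scripted text.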

08 Deployment

Topology

Host: Vercel — static assets from public/, functions auto-routed from api/.
Config surface (env vars): LLM API key, voice provider key, cloned voice_id, plus optional overrides for model tier, voice model, web-search and voice on/off toggles.
Cold path: first request per region pays function init + (for voice) provider GPU spin-up; subsequent requests are warm.

frontend: static SPA, single document + 3 sprite assets
functions: 2 × Node ESM serverless handlers
build step: offline corpus compile → cached system prompt
secrets: provider keys, env only, never client-exposed

09 Cost model

Per-interaction economics

Component	Driver	Order of magnitude
LLM text	cached prefix read + short completion	~$0.004 / msg (Sonnet 5)
Web search	per invocation, only when triggered	~$0.01 / search
Voice	per 1k characters synthesized	~$0.04–0.10 / 1k chars

The dominant efficiency lever is prompt caching: the ~7k-token persona prefix is written once and read at a fraction of input price thereafter, so marginal cost tracks completion length rather than prompt size. The operator override path is free of LLM cost entirely.
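
The effect of caching on the text cost can be worked through with a small calculator. The per-token prices below are placeholder assumptions (USD per million tokens) and should be replaced with the current list prices for the configured model tier; only the ~0.1× cache-read multiplier and the ~7k-token prefix come from this document.

```javascript
// All prices are placeholder assumptions in USD per million tokens; substitute
// the current list prices for whichever model tier is configured.
function estimateTextCost({
  cachedPrefixTokens,
  freshInputTokens,
  outputTokens,
  inputPerMTok = 3,
  outputPerMTok = 15,
  cacheReadMultiplier = 0.1,
}) {
  const perToken = (perMTok) => perMTok / 1e6;
  const cached = cachedPrefixTokens * perToken(inputPerMTok) * cacheReadMultiplier;
  const fresh = freshInputTokens * perToken(inputPerMTok);
  const output = outputTokens * perToken(outputPerMTok);
  return { cached, fresh, output, total: cached + fresh + output };
}
```

Under these assumed prices, a 7,000-token cached prefix with 100 fresh input tokens and a 100-token completion comes to about $0.0039, in line with the order of magnitude in the table. Setting `cacheReadMultiplier` to 1 shows the uncached cost for comparison.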

10 Security & boundaries

Secret isolation: all provider credentials live in server-side environment variables and are never shipped to the client; the browser only ever talks to first-party /api/* routes.
Input bounds: user input is length-clamped and history-windowed to cap token exposure and prompt-injection surface.
Fail-soft: voice and search degrade gracefully; a provider outage never blocks a text reply.
Persona containment: grounding rules prevent disclosure of the underlying substrate or system identity.
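
The input bounds above can be sketched as a single message-building step. Assumptions are labeled in the code: `MAX_INPUT_CHARS` is a hypothetical limit, and one history entry is treated as one turn; the 8-turn window comes from §04.

```javascript
// MAX_INPUT_CHARS is a hypothetical limit; the 8-turn window comes from §04.
const MAX_INPUT_CHARS = 500;
const HISTORY_TURNS = 8; // assumption: one history entry counts as one turn

function buildMessages(history, userInput) {
  const clamped = String(userInput).trim().slice(0, MAX_INPUT_CHARS);

  // Keep only well-formed entries from the most recent turns, so a client
  // cannot smuggle in other roles or non-string content.
  const recent = history
    .filter(
      (m) =>
        (m.role === "user" || m.role === "assistant") &&
        typeof m.content === "string"
    )
    .slice(-HISTORY_TURNS);

  // The conversation sent upstream should open with a user message.
  while (recent.length > 0 && recent[0].role !== "user") recent.shift();

  return [...recent, { role: "user", content: clamped }];
}
```

Filtering by role and content type on the server matters because the history is supplied by the client and cannot be trusted as-is.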

Disclosure: this is a fan/parody persona system — an AI approximation, not the real person. All secrets referenced in this document are redacted; identifiers and endpoints are described generically.