Case Study: Kaori on TrickBook
Kaori is an AI snowboard companion on TrickBook, a skateboarding and snowboarding platform. She was the first production integration of Kith — here's how it works.
The Setup
Kaori is a character with a distinct personality: an 18-year-old Japanese freestyle snowboarder from Sapporo who mixes English with Japanese expressions, uses Gen Z slang, and gets genuinely hype about tricks.
Before Kith: Kaori's voice was generated by calling the ElevenLabs API directly, saving MP3 files, and polling for them from the browser. The result: delayed playback, duplicate audio triggers, no barge-in, and the TTS read "fr" as "eff are" and "lol" as "ell oh ell."
After Kith: Voice streams through a WebSocket in real time. Slang is expanded, Japanese words are pronounced correctly, emojis trigger avatar emotion tints, and "haha" renders as actual laughter.
Architecture
Browser (kaori-live.js)
  |
  |-- WebSocket --> Kith Voice Sidecar (Bun :3040)
  |       |
  |       |-- PipecatRuntime --> Python sidecar --> ElevenLabs
  |       |-- VoiceRouter (slang, pronunciation, emoji, transforms)
  |
  |-- HTTP POST --> Express Backend (:9000)
          |
          |-- AI response (OpenRouter / Grok)
          |-- POST /speak/:sessionId --> Kith sidecar
The browser opens a Kith WebSocket on page load. When the user sends a message, the backend generates Kaori's text response and fires it to the Kith sidecar. The sidecar processes it through VoiceRouter (slang expansion, pronunciation, emoji stripping) and streams audio chunks back to the browser.
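The browser side of this flow can be sketched as a small message handler. The message shapes below (`audio`, `emotion`, `done`) and the `SessionPlayer` name are illustrative assumptions, not the actual kaori-live.js protocol:

```typescript
// Hypothetical sketch of browser-side handling of Kith sidecar messages.
// Message type names and fields are assumed for illustration.
type KithMessage =
  | { type: "audio"; chunk: string }   // one streamed audio chunk
  | { type: "emotion"; tint: string }  // emoji-derived avatar emotion tint
  | { type: "done" };                  // end of utterance

class SessionPlayer {
  readonly chunks: string[] = [];
  tint: string | null = null;
  finished = false;

  handle(msg: KithMessage): void {
    switch (msg.type) {
      case "audio":
        // In a real browser, this would decode and enqueue for playback.
        this.chunks.push(msg.chunk);
        break;
      case "emotion":
        this.tint = msg.tint;
        break;
      case "done":
        this.finished = true;
        break;
    }
  }
}
```

Because chunks are handled as they arrive rather than after a whole MP3 is written, playback can begin as soon as the first chunk lands.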
Character Profile
{
  "voice": {
    "stability": 0.55,
    "similarityBoost": 0.82,
    "style": 0.45,
    "useSpeakerBoost": true,
    "speed": 1.05
  },
  "slang": {
    "omg": "oh my god",
    "fr": "for real",
    "ngl": "not gonna lie",
    "lol": "[laughs]",
    "haha": "[laughs]",
    "hehe": "[giggles]",
    "sooo": "so",
    "nooo": "no"
  },
  "pronunciation": {
    "Kaori": "kah-oh-ree",
    "sugoi": "soo-goy",
    "ganbare": "gahn-bah-ray",
    "Hokkaido": "hoh-kai-doh",
    "SSX": "S S X"
  },
  "personaMode": "hype"
}
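One way VoiceRouter could apply the profile's slang and pronunciation maps is a word-boundary token pass. This is a minimal sketch under that assumption; `applyMaps` is a hypothetical name, and the real router's matching rules may differ:

```typescript
// Illustrative token-level application of the profile maps above.
// Slang is matched case-insensitively; pronunciation keys (proper nouns
// like "Kaori") are matched case-sensitively -- an assumed design choice.
const slang: Record<string, string> = {
  omg: "oh my god",
  fr: "for real",
  ngl: "not gonna lie",
  lol: "[laughs]",
};

const pronunciation: Record<string, string> = {
  Kaori: "kah-oh-ree",
  sugoi: "soo-goy",
  SSX: "S S X",
};

const applyMaps = (text: string): string =>
  text
    .split(/\b/) // split on word boundaries, keeping whitespace tokens
    .map((w) => slang[w.toLowerCase()] ?? pronunciation[w] ?? w)
    .join("");
```

For example, `applyMaps("omg that was sugoi fr")` yields `"oh my god that was soo-goy for real"`, which the TTS can then pronounce naturally.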
Text Transform
A custom cleanForTTS transform strips AI-generated artifacts before synthesis:
const cleanForTTS = (text: string): string => {
  let t = text;
  t = t.replace(/\*{1,3}([^*]+)\*{1,3}/g, '$1');   // strip markdown emphasis
  t = t.replace(/\[([^\]]+)\]\([^)]+\)/g, '$1');   // strip links, keep link text
  t = t.replace(/([!?.]){2,}/g, '$1');             // collapse !!!! -> !
  t = t.replace(/([a-z])\1{3,}/gi, '$1$1');        // collapse soooooo -> soo (runs of 4+ letters)
  t = t.replace(/:[a-z_]+:/g, '');                 // strip :emoji_codes:
  return t;
};
Results
- Voice latency dropped from ~6s (poll for MP3) to ~500ms (streaming chunks)
- No more duplicate audio playback
- "lol" renders as laughter, not "ell oh ell"
- Japanese words pronounced correctly
- Barge-in works (Stop button clears the audio queue)