ElevenLabs TTS (Text-to-Speech) with emotional audio tags for expressive voice synthesis. WhatsApp-compatible voice messages with Opus conversion. Supports 7...
ElevenLabs TTS (Text-to-Speech)
Generate expressive voice messages using ElevenLabs v3 with audio tags.
Prerequisites
- ElevenLabs API Key (ELEVENLABS_API_KEY): Required. Get one at [elevenlabs.io](https://elevenlabs.io) → Profile → API Keys. Configure in openclaw.json under messages.tts.elevenlabs.apiKey.
- ffmpeg: Required for audio format conversion (MP3 → Opus for WhatsApp compatibility). Must be installed and available on PATH.
Quick Start Examples
Storytelling (emotional journey):
[soft] It started like any other day... [pause] But something felt different. [nervous] My hands were shaking as I opened the envelope. [gasps] I got in! [excited] I actually got in! [laughs] [happy] This changes everything!
Horror/Suspense (building dread):
[whispers] The house has been empty for years... [pause] At least, that's what they told me. [nervous] But I keep hearing footsteps. [scared] They're getting closer. [gasps] [panicking] The door— it's opening by itself!
Conversation with reactions:
[curious] So what happened at the meeting? [pause] [surprised] Wait, they fired him?! [gasps] [sad] That's terrible... [sighs] He had a family. [thoughtful] I wonder what he'll do now.
Hebrew (romantic moment):
[soft] היא עמדה שם, מול השקיעה... [pause] הלב שלי פעם כל כך חזק. [nervous] לא ידעתי מה להגיד. [hesitates] אני... [breathes] [tender] את יודעת שאני אוהב אותך, נכון?
Spanish (celebration to reflection):
[excited] ¡Lo logramos! [laughs] [happy] No puedo creerlo... [pause] [thoughtful] Fueron tantos años de trabajo. [emotional] [soft] Gracias a todos los que creyeron en mí. [sighs] [content] Valió la pena cada momento.
Configuration (OpenClaw)
In openclaw.json, configure TTS under messages.tts:
{
"messages": {
"tts": {
"provider": "elevenlabs",
"elevenlabs": {
"apiKey": "sk_your_api_key_here",
"voiceId": "pNInz6obpgDQGcFmaJgB",
"modelId": "eleven_v3",
"languageCode": "en",
"voiceSettings": {
"stability": 0.5,
"similarityBoost": 0.75,
"style": 0,
"useSpeakerBoost": true,
"speed": 1
}
}
}
}
}
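For readers who want to see how these settings relate to a direct API call, here is a minimal sketch that maps the config above onto a request for the public ElevenLabs text-to-speech endpoint. The function name `build_tts_request` is hypothetical, and the camelCase-to-snake_case mapping is an assumption about how the config corresponds to the REST field names; how OpenClaw builds its request internally is not stated here.

```python
# Sketch only: translate the openclaw.json TTS block into an HTTP request
# description. Nothing is sent; the caller decides how to POST it.

# Assumed mapping from config keys (camelCase) to REST field names (snake_case).
VOICE_SETTING_KEYS = {
    "stability": "stability",
    "similarityBoost": "similarity_boost",
    "style": "style",
    "useSpeakerBoost": "use_speaker_boost",
    "speed": "speed",
}


def build_tts_request(config: dict, text: str) -> tuple[str, dict, dict]:
    """Return (url, headers, json_body) for one text-to-speech call."""
    el = config["messages"]["tts"]["elevenlabs"]
    settings = el.get("voiceSettings", {})

    url = f"https://api.elevenlabs.io/v1/text-to-speech/{el['voiceId']}"
    headers = {"xi-api-key": el["apiKey"], "Content-Type": "application/json"}
    body = {
        "text": text,
        "model_id": el["modelId"],
        # Only forward the settings that are actually present in the config.
        "voice_settings": {
            api_name: settings[cfg_name]
            for cfg_name, api_name in VOICE_SETTING_KEYS.items()
            if cfg_name in settings
        },
    }
    if "languageCode" in el:
        body["language_code"] = el["languageCode"]
    return url, headers, body
```

Keeping the request construction separate from the network call makes it easy to inspect exactly what would be sent before spending credits.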
Getting your API Key:
1. Go to https://elevenlabs.io
2. Sign up/login
3. Click profile → API Keys
4. Copy your key
Recommended Voices for v3
These premade voices are optimized for v3 and work well with audio tags:
| Voice | ID | Gender | Accent | Best For |
|-------|-----|--------|--------|----------|
| Adam | pNInz6obpgDQGcFmaJgB | Male | American | Deep narration, general use |
| Rachel | 21m00Tcm4TlvDq8ikWAM | Female | American | Calm narration, conversational |
| Brian | nPczCjzI2devNBz1zQrb | Male | American | Deep narration, podcasts |
| Charlotte | XB0fDUnXU5powFXDhCwa | Female | English-Swedish | Expressive, video games |
| George | JBFqnCBsd6RMkjVDRZzb | Male | British | Raspy narration, storytelling |
- Browse: https://elevenlabs.io/voice-library
- v3-optimized collection: https://elevenlabs.io/app/voice-library/collections/aF6JALq9R6tXwCczjhKH
- API: GET https://api.elevenlabs.io/v1/voices
- Use IVC (Instant Voice Clone) or premade voices; PVC (Professional Voice Clone) is not optimized for v3 yet
- Match voice character to your use case (a whispering voice won't shout well)
- For expressive IVCs, include varied emotional tones in training samples
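The voices endpoint listed above can be queried with nothing but the standard library. The sketch below assumes the response is a JSON object with a `voices` array whose entries carry `name` and `voice_id` fields, and that the key is sent in an `xi-api-key` header; the helper names are hypothetical.

```python
import json
import urllib.request

VOICES_URL = "https://api.elevenlabs.io/v1/voices"


def build_voices_request(api_key: str) -> urllib.request.Request:
    """Prepare (but do not send) the GET request for the voice list."""
    return urllib.request.Request(VOICES_URL, headers={"xi-api-key": api_key})


def voice_ids_by_name(payload: dict) -> dict:
    """Reduce the response to a {name: voice_id} lookup table."""
    return {v["name"]: v["voice_id"] for v in payload.get("voices", [])}


def fetch_voices(api_key: str) -> dict:
    """Perform the network call and return the lookup table."""
    with urllib.request.urlopen(build_voices_request(api_key), timeout=30) as resp:
        return voice_ids_by_name(json.load(resp))
```

Calling `fetch_voices(os.environ["ELEVENLABS_API_KEY"])` would then give a mapping to copy a voiceId from.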
Model Settings
- Model: eleven_v3 (alpha) - the ONLY model supporting audio tags
- Languages: 70+ supported with full audio tag control
Stability Modes
| Mode | Stability | Description |
|------|-----------|-------------|
| Creative | 0.3-0.5 | More emotional/expressive, may hallucinate |
| Natural | 0.5-0.7 | Balanced, closest to original voice |
| Robust | 0.7-1.0 | Highly stable, less responsive to tags |
For audio tags, use Creative (0.5) or Natural. Higher stability reduces tag responsiveness.
Speed Control
Range: 0.7 (slow) to 1.2 (fast), default 1.0
Extreme values affect quality. For pacing, prefer audio tags like [rushed] or [drawn out].
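A small helper can make the stability and speed ranges explicit before a request is sent. This is a sketch under stated assumptions: the table's ranges overlap at 0.5 and 0.7, so the code arbitrarily assigns 0.5 to Creative and 0.7 to Natural, and values below 0.3 are reported as unnamed because the table does not cover them. Both function names are hypothetical.

```python
def stability_mode(stability: float) -> str:
    """Name the stability mode for a value in [0.0, 1.0]."""
    if not 0.0 <= stability <= 1.0:
        raise ValueError("stability must be between 0.0 and 1.0")
    if stability < 0.3:
        return "unnamed (below Creative range)"
    if stability <= 0.5:   # assumption: the shared boundary 0.5 counts as Creative
        return "Creative"
    if stability <= 0.7:   # assumption: the shared boundary 0.7 counts as Natural
        return "Natural"
    return "Robust"


def clamp_speed(speed: float, lowest: float = 0.7, highest: float = 1.2) -> float:
    """Pull a requested speed back into the documented 0.7 to 1.2 range."""
    return max(lowest, min(highest, speed))
```

Clamping silently is a design choice; raising an error instead may be preferable when an out-of-range value indicates a configuration mistake.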
Critical Rules
Length Limits
- Optimal: <800 characters per segment (best quality)
- Maximum: 10,000 characters (API hard limit)
- Quality degrades with longer text - voice becomes inconsistent
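These limits are easy to check before generating. The sketch below assumes the count is taken over the raw string, audio tags included; whether the API counts tags toward the limit is not stated here. The function name and return labels are hypothetical.

```python
OPTIMAL_LIMIT = 800      # segments shorter than this give the best quality
HARD_LIMIT = 10_000      # API hard limit


def length_status(text: str) -> str:
    """Classify a text by length: 'ok', 'split recommended', or 'over limit'."""
    n = len(text)
    if n > HARD_LIMIT:
        return "over limit"
    if n >= OPTIMAL_LIMIT:
        return "split recommended"
    return "ok"
```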
Audio Tags - Best Practices for Natural Sound
How many tags to use:
- 1-2 tags per sentence or phrase (not more!)
- Tags persist until the next tag - no need to repeat
- Overusing tags sounds unnatural and robotic
Where to place tags:
- At emotional transition points
- Before key dramatic moments
- When energy/pace changes
Context matters:
- Write text that *matches* the tag emotion
- Longer text with context = better interpretation
- Example: "[nervous] I... I'm not sure about this. What if it doesn't work?" works better than "[nervous] Hello."
Combining tags:
- [nervously][whispers] = nervous whispering
- [excited][laughs] = excited laughter
- Keep combinations to 2 tags max
Regenerate for best results:
- v3 is non-deterministic - same text = different outputs
- Generate 3+ versions, pick the best
- Small text tweaks can improve results
Match tags to the voice:
- Don't use [shouts] on a whispering voice
- Don't use [whispers] on a loud/energetic voice
- Test tags with your chosen voice
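The counting rules above (at most 2 tags per sentence, at most 2 tags stacked together) can be checked mechanically before generating. The following is a rough sketch: `lint_tags` is a hypothetical helper, and its sentence splitting is deliberately naive (it breaks after `.`, `!` or `?` followed by whitespace), so treat the warnings as hints, not verdicts.

```python
import re

TAG = re.compile(r"\[[^\[\]]+\]")
# Three or more tags in a row, separated by nothing but whitespace.
STACKED = re.compile(r"(?:\[[^\[\]]+\]\s*){3,}")


def lint_tags(text: str, max_per_sentence: int = 2) -> list[str]:
    """Return human-readable warnings about tag overuse (empty list = fine)."""
    warnings = []
    if STACKED.search(text):
        warnings.append("more than 2 tags stacked together")
    # Naive sentence split: break after ., ! or ? when whitespace follows.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        count = len(TAG.findall(sentence))
        if count > max_per_sentence:
            warnings.append(f"{count} tags in one sentence: {sentence[:40]!r}")
    return warnings
```

For example, `lint_tags("[sad] He left, [angry] and I stayed, [tired] and nothing changed.")` reports one warning for the three tags in a single sentence.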
SSML Not Supported
v3 does NOT support SSML break tags. Use audio tags and punctuation instead.
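When migrating text that already contains SSML break tags, one option is to rewrite them as audio tags before generation. The sketch below assumes `[pause]` is an acceptable stand-in for any `<break>` element; the original duration is discarded, since no duration-aware tag is described here. The function name is hypothetical.

```python
import re

# Matches <break/>, <break time="1s"/>, <break time="500ms"> and similar.
BREAK_TAG = re.compile(r"<break\b[^>]*>", re.IGNORECASE)


def ssml_breaks_to_tags(text: str) -> str:
    """Replace every SSML break element with the [pause] audio tag."""
    return BREAK_TAG.sub("[pause]", text)
```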
Punctuation Effects (use with tags!)
Punctuation enhances audio tags:
- Ellipses (...) → dramatic pauses: [nervous] I... I don't know...
- CAPS → emphasis: [excited] That's AMAZING!
- Dashes (—) → interruptions: [explaining] So what you do is— [interrupting] Wait!
- Question marks → uncertainty: [nervous] Are you sure about this?
- Exclamation marks → energy boost: [happy] We did it!
Combine tags + punctuation for maximum effect:
[tired] It was a long day... [sighs] Nobody listens anymore.
WhatsApp Voice Messages
Complete Workflow
1. Generate with tts tool (returns MP3)
2. Convert to Opus (required for Android!)
3. Send with message tool
Step-by-Step
1. Generate TTS (add [pause] at end to prevent cutoff):
tts text="[excited] This is amazing! [pause]" channel=whatsapp
Returns: MEDIA:/tmp/tts-xxx/voice-123.mp3
2. Convert to Opus:
ffmpeg -i /tmp/tts-xxx/voice-123.mp3 -c:a libopus -b:a 64k -vbr on -application voip /tmp/tts-xxx/voice-123.ogg
3. Send the Opus file:
> Note: The message field below contains a Unicode Left-to-Right Mark (U+200E) between the quotes.
> This is intentional — WhatsApp requires a non-empty message body to send voice notes.
> The LTR mark is invisible but satisfies this requirement without displaying any text.
message action=send channel=whatsapp target="+972..." filePath="/tmp/tts-xxx/voice-123.ogg" asVoice=true message=""
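If the conversion step is scripted, it can be wrapped so that the output path is derived from the input. This sketch reuses exactly the ffmpeg flags from the workflow; the function names are hypothetical, and it assumes ffmpeg is on PATH as listed in the prerequisites.

```python
import os
import subprocess


def opus_command(mp3_path: str) -> list[str]:
    """Build the ffmpeg argument list that converts an MP3 to Opus in .ogg."""
    ogg_path = os.path.splitext(mp3_path)[0] + ".ogg"
    return [
        "ffmpeg", "-i", mp3_path,
        "-c:a", "libopus", "-b:a", "64k",
        "-vbr", "on", "-application", "voip",
        ogg_path,
    ]


def convert_to_opus(mp3_path: str) -> str:
    """Run the conversion and return the path of the .ogg file."""
    cmd = opus_command(mp3_path)
    subprocess.run(cmd, check=True)  # raises CalledProcessError if ffmpeg fails
    return cmd[-1]
```

Passing the arguments as a list avoids shell quoting problems with unusual file names.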
Why Opus?
| Format | iOS | Android | Transcribe |
|--------|-----|---------|------------|
| MP3 | ✅ Works | ❌ May fail | ❌ No |
| Opus (.ogg) | ✅ Works | ✅ Works | ✅ Yes |
Always convert to Opus - it's the only format that:
- Works on all devices (iOS + Android)
- Supports WhatsApp's transcribe button
Audio Cutoff Fix
ElevenLabs sometimes cuts off the last word. Always add [pause] or ... at the end:
[excited] This is amazing! [pause]
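The cutoff fix is simple to automate. A minimal sketch with a hypothetical helper name, which leaves text alone when it already ends in [pause] or an ellipsis:

```python
def ensure_trailing_pause(text: str) -> str:
    """Append [pause] unless the text already ends with [pause] or an ellipsis."""
    stripped = text.rstrip()
    if stripped.endswith(("[pause]", "...", "\u2026")):  # \u2026 is the single-character ellipsis
        return stripped
    return stripped + " [pause]"
```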
Long-Form Audio (Podcasts)
For content >800 chars:
1. Split into short segments (<800 chars each)
2. Generate each with tts tool
3. Concatenate with ffmpeg:
cat > list.txt << EOF
file '/path/file1.mp3'
file '/path/file2.mp3'
EOF
ffmpeg -f concat -safe 0 -i list.txt -c copy final.mp3
4. Convert to Opus for WhatsApp
5. Send as single voice message
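Steps 1 and 3 can be sketched as two small helpers: one that splits text at sentence boundaries into segments under 800 characters, and one that writes the list file contents for ffmpeg's concat demuxer. Both names are hypothetical. A single sentence longer than the limit is kept whole and not cut mid-sentence, and file paths containing a single quote would need escaping that this sketch does not do.

```python
import re


def split_segments(text: str, limit: int = 800) -> list[str]:
    """Greedily pack whole sentences into segments shorter than `limit`."""
    sentences = re.split(r"(?<=[.!?\u2026])\s+", text.strip())
    segments, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) < limit or not current:
            current = candidate          # still fits (or is an oversized lone sentence)
        else:
            segments.append(current)     # close the segment, start a new one
            current = sentence
    if current:
        segments.append(current)
    return segments


def concat_list(paths: list[str]) -> str:
    """Produce the contents of list.txt for `ffmpeg -f concat`."""
    return "".join(f"file '{p}'\n" for p in paths)
```

Because each segment is generated separately, a tag set in one segment does not carry over to the next; it may help to start every segment with the tag that should apply.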
Important: Don't mention "part 2" or "chapter" - keep it seamless.
Multi-Speaker Dialogue
v3 can handle multiple characters in one generation:
Jessica: [whispers] Did you hear that?
Chris: [interrupting] —I heard it too!
Jessica: [panicking] We need to hide!
Dialogue tags: [interrupting], [overlapping], [cuts in], [interjecting]
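When dialogue comes from structured data, the "Speaker: [tag] line" layout shown above can be produced with a small formatter. This is only a sketch of the text layout; the function name and the (speaker, tag, line) tuple shape are hypothetical.

```python
def format_dialogue(turns: list[tuple[str, str, str]]) -> str:
    """Render (speaker, tag, line) turns as one multi-speaker script.

    Pass an empty string as the tag for a turn that needs no audio tag.
    """
    rendered = []
    for speaker, tag, line in turns:
        prefix = f"[{tag}] " if tag else ""
        rendered.append(f"{speaker}: {prefix}{line}")
    return "\n".join(rendered)
```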
Audio Tags Quick Reference
| Category | Tags | When to Use |
|----------|------|-------------|
| Emotions | [excited], [happy], [sad], [angry], [nervous], [curious] | Main emotional state - use 1 per section |
| Delivery | [whispers], [shouts], [soft], [rushed], [drawn out] | Volume/speed changes |
| Reactions | [laughs], [sighs], [gasps], [clears throat], [gulps] | Natural human moments - sprinkle sparingly |
| Pacing | [pause], [hesitates], [stammers], [breathes] | Dramatic timing |
| Character | [French accent], [British accent], [robotic tone] | Character voice shifts |
| Dialogue | [interrupting], [overlapping], [cuts in] | Multi-speaker conversations |
Most effective tags (reliable results):
- Emotions: [excited], [nervous], [sad], [happy]
- Reactions: [laughs], [sighs], [whispers]
- Pacing: [pause]
Less reliable tags (test before relying on them):
- Sound effects: [explosion], [gunshot]
- Accents: results vary by voice
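To spot tags that fall outside the "most effective" list before generating, a script can compare the tags in a text against that list. A minimal sketch with a hypothetical helper name; a tag being flagged does not mean it will fail, only that it is worth a test generation with the chosen voice.

```python
import re

# The tags named as most effective in the quick reference.
RELIABLE_TAGS = {
    "excited", "nervous", "sad", "happy",   # emotions
    "laughs", "sighs", "whispers",          # reactions
    "pause",                                # pacing
}


def unproven_tags(text: str) -> list[str]:
    """Return the sorted, de-duplicated tags that are not on the reliable list."""
    found = re.findall(r"\[([^\[\]]+)\]", text)
    return sorted({tag for tag in found if tag.lower() not in RELIABLE_TAGS})
```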
Troubleshooting
Tags read aloud?
- Verify you are using the eleven_v3 model
- Use IVC/premade voices, not PVC
- Simplify tags (no "tone" suffix)
- Increase text length (250+ chars)
Voice inconsistent?
- Segment is too long - split at <800 chars
- Regenerate (v3 is non-deterministic)
- Try lower stability setting
WhatsApp won't play the audio?
- Convert to Opus format (see above)
Tags have no audible effect?
- Voice may not match tag style
- Try Creative stability mode (0.5)
- Add more context around the tag