Using Audio Tags for Google Gemini and ElevenLabs Voices – Help Center

Audio tags help shape the performance, emotion, and pacing of your text-to-speech audio. This article explains how to add directive cues to your text-to-speech script using audio tags.

What are audio tags?

Audio tags are text-based commands wrapped in square brackets (e.g., [whispers], [angry], [pause]) that you insert directly into your text-to-speech (TTS) script for ElevenLabs and Gemini voices. Instead of reading the text flatly, the AI model interprets these tags as directive cues that control the audible action and emotional tone. For ElevenLabs only, you can also use audio tags to add non-verbal sounds such as [clapping].

How to use audio tags?

Both ElevenLabs and Gemini voices use the same formatting. Place the directive cue inside square brackets immediately before the text you want it to apply to. For example: [excited] I can't believe we finally launched!
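If you build scripts programmatically, the bracket format described above can be sketched in a few lines of Python. The function name with_cue is hypothetical and only illustrates the formatting. It does not call ElevenLabs or Gemini; you would still paste or send the resulting text to your TTS tool yourself.

```python
def with_cue(cue: str, text: str) -> str:
    """Return text prefixed with a directive cue in square brackets.

    Hypothetical helper for illustration only; it formats a string
    and does not talk to any TTS service.
    """
    # Accept "excited" or "[excited]" and normalize to the bare cue.
    cue = cue.strip().strip("[]").strip()
    if not cue:
        raise ValueError("cue must not be empty")
    return f"[{cue}] {text.strip()}"


if __name__ == "__main__":
    print(with_cue("excited", "I can't believe we finally launched!"))
```

The cue goes in front of the text it should affect, so the helper simply places the bracketed tag before the sentence.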

What kinds of directive cues can be applied to ElevenLabs and Gemini TTS?

1. Emotional Directives & Tone Shifts
Both ElevenLabs and Gemini excel at changing the emotional state of the speaker on the fly. You can dictate the exact feeling behind the words.

What works: [angry], [sad], [playful], [menacing], [excited], [thoughtful]

2. Pacing, Pauses, and Cadence
Both ElevenLabs and Gemini allow you to control the rhythm and timing of the delivery, ensuring the AI doesn't just rush through the text.

What works: [pause], [slow], [rushed], [measured]

3. Vocal Delivery & Volume Control
You can change how the character is speaking, adjusting their volume or the physical texture of their voice to fit the scene.

What works: [whispering] / [whispers], [shouting], [quiet], [low]

4. Human Reactions & Vocalizations
Both ElevenLabs and Gemini have moved beyond just reading words and can now generate natural, non-verbal human sounds to make the dialogue feel authentic.

What works: [laughs] / [laughing], [sighs]

5. Mid-Sentence Shifts
Both platforms allow you to place tags anywhere in the script. You don't have to generate separate audio files for different emotions; you can shift the tone, pacing, or delivery right in the middle of a sentence.

What works: For example, "I've been trying to fix this code for three hours and [sighs] I am just completely out of ideas." You can swap [sighs] for [laughs] or [cries] to change the reaction.
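Because tags can appear anywhere in a script, it can help to list the cues a script contains before generating audio. The following is a minimal sketch, assuming you only want to compare against the cues named in this article. ARTICLE_CUES, find_cues, and unlisted_cues are hypothetical names, and the set is not an official or complete list of supported tags. A cue flagged here may still work; it just is not one of the examples above.

```python
import re

# Cues named in this article. This is NOT an exhaustive or official list,
# and [clapping] is described above as ElevenLabs only.
ARTICLE_CUES = {
    "angry", "sad", "playful", "menacing", "excited", "thoughtful",
    "pause", "slow", "rushed", "measured",
    "whispering", "whispers", "shouting", "quiet", "low",
    "laughs", "laughing", "sighs", "cries", "clapping",
}

# Matches any text wrapped in a single pair of square brackets.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_cues(script: str) -> list[str]:
    """Return every bracketed cue in the script, in order, lowercased."""
    return [cue.strip().lower() for cue in TAG_PATTERN.findall(script)]


def unlisted_cues(script: str) -> list[str]:
    """Return cues in the script that are not among this article's examples."""
    return [cue for cue in find_cues(script) if cue not in ARTICLE_CUES]


if __name__ == "__main__":
    script = (
        "I've been trying to fix this code for three hours and "
        "[sighs] I am just completely out of ideas."
    )
    print(find_cues(script))
    print(unlisted_cues(script))
```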
