
Zero-shot voice cloning text-to-speech (TTS) with explicit emotion class conditioning built on F5-TTS


RaduBolbo/F5-TTS-Emotional-CFG



F5-TTS-Emotion-CFG

This repository is based on F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

F5-TTS-Emotion-CFG introduces explicit emotion conditioning into the F5-TTS zero-shot voice cloning model by fine-tuning it on the ESD dataset.

The following emotions are supported: Neutral, Happy, Sad, Angry and Surprise.

📄 Paper & Citation

Our paper “Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance” was presented at SpeD 2025.

Read Paper


Cite the paper as:

  @inproceedings{bolborici2025emotion,
    author       = {Radu-George Bolborici and Ana Antonia Nicolae},
    title        = {Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance},
    booktitle    = {2025 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)},
    pages        = {86--91},
    doi          = {10.1109/SpeD67700.2025.11253714},
    publisher    = {IEEE}
  }

🚀 Demo

🎧 Explore audio samples generated with the F5-TTS-Emotion-CFG model:

Open Demo

How to install

Step 1:

conda create --name f5-tts-emo python==3.10.0

conda activate f5-tts-emo

Step 2:

pip install -e .

Download models

Execute the script:

python download_models.py

Alternatively, you can manually download the models from:

and copy them into the ckpts directory.
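After either download route, a quick check that the ckpts directory actually contains files can save a failed inference run. The snippet below is a minimal sketch, assuming it is run from the repository root; `list_checkpoints` is a hypothetical helper, not part of this repository, and because the checkpoint file names are not listed here it only verifies that the directory exists and is not empty.

```python
from pathlib import Path


def list_checkpoints(ckpt_dir="ckpts"):
    """Return every file found under ckpt_dir, sorted by path.

    Hypothetical helper: it does not know which files the model needs,
    it only confirms that something was downloaded or copied.
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        raise FileNotFoundError(
            f"{root} does not exist; run download_models.py or copy the models manually"
        )
    files = sorted(p for p in root.rglob("*") if p.is_file())
    if not files:
        raise FileNotFoundError(f"{root} is empty; no model files found")
    return files


if __name__ == "__main__":
    for path in list_checkpoints():
        print(path)
```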

How to use (inference)

⚠️ First, install the requirements and download the models.

Use the CLI defined in src/f5_tts/infer/infer_cli_emotion.py:

python src/f5_tts/infer/infer_cli_emotion.py \
    --ref-audio-path "data/0011_angry.wav" \
    --ref-text "The nine, the eggs, I keep." \
    --inference-text "Hello, this is a text to check emotion." \
    --inference-emotion Surprise \
    --cfg-strength2 10 \
    --output-path "data/output.wav"
  • --ref-audio-path: Path to the reference audio file for voice cloning. Provides the speaker’s voice.

  • --ref-text: Transcription of the reference audio (the text that is being said).

  • --ref-emotion: Emotion of the reference audio clip. If it is unknown, Neutral usually works as a default, although voice similarity may occasionally be lower.

  • --inference-text: The new text you want the model to synthesize.

  • --inference-emotion: Target emotion for synthesis. Options: Angry, Happy, Sad, Neutral, Surprise.

  • --cfg-strength2: Classifier-free guidance strength for emotion control. Higher values give a stronger emotion, but values that are too high may reduce naturalness. Typical values range from 2 to 20.

  • --output-path: Path where the generated audio will be saved.
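
To try several emotions or guidance strengths from a script, the command above can be assembled in Python. The following is a minimal sketch, not part of the repository: `build_command` and `synthesize` are hypothetical helpers, only the flags documented above are used, and it assumes that --ref-emotion accepts the same names as --inference-emotion. The default of 10 for `cfg_strength2` simply mirrors the example command; it is not claimed to be the CLI's own default.

```python
import shlex
import subprocess
import sys

# Emotion names as listed for --inference-emotion above.
EMOTIONS = ("Angry", "Happy", "Sad", "Neutral", "Surprise")

# Path of the CLI script, relative to the repository root.
CLI_SCRIPT = "src/f5_tts/infer/infer_cli_emotion.py"


def build_command(ref_audio_path, ref_text, inference_text, inference_emotion,
                  output_path, ref_emotion=None, cfg_strength2=10):
    """Return the argument list for one call to the emotion CLI."""
    if inference_emotion not in EMOTIONS:
        raise ValueError(f"unsupported inference emotion: {inference_emotion!r}")
    # Assumption: --ref-emotion takes the same names as --inference-emotion.
    if ref_emotion is not None and ref_emotion not in EMOTIONS:
        raise ValueError(f"unsupported reference emotion: {ref_emotion!r}")
    cmd = [
        sys.executable, CLI_SCRIPT,
        "--ref-audio-path", ref_audio_path,
        "--ref-text", ref_text,
        "--inference-text", inference_text,
        "--inference-emotion", inference_emotion,
        "--cfg-strength2", str(cfg_strength2),
        "--output-path", output_path,
    ]
    if ref_emotion is not None:
        cmd += ["--ref-emotion", ref_emotion]
    return cmd


def synthesize(**kwargs):
    """Run the CLI from the repository root; raises if it exits non-zero."""
    subprocess.run(build_command(**kwargs), check=True)


if __name__ == "__main__":
    # Print (without running) one command per guidance strength.
    for strength in (2, 5, 10, 20):
        cmd = build_command(
            ref_audio_path="data/0011_angry.wav",
            ref_text="The nine, the eggs, I keep.",
            inference_text="Hello, this is a text to check emotion.",
            inference_emotion="Surprise",
            output_path=f"data/output_cfg{strength}.wav",
            cfg_strength2=strength,
        )
        print(shlex.join(cmd))
```

Writing one output file per strength makes it easy to listen to the results side by side and pick the highest value that still sounds natural for a given speaker.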