This repository is based on F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching.
F5-TTS-Emotion-CFG introduces explicit emotion conditioning into the F5-TTS zero-shot voice cloning model by fine-tuning on the ESD dataset.
The following emotions are supported: Neutral, Happy, Sad, Angry and Surprised.
Our paper, “Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance”, was presented at SpeD 2025.
Cite the paper as:
@inproceedings{bolborici2025emotion,
author = {Radu-George Bolborici and Ana Antonia Nicolae},
title = {Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance},
booktitle = {2025 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)},
pages = {86--91},
doi = {10.1109/SpeD67700.2025.11253714},
publisher = {IEEE}
}

🎧 Explore audio samples generated with the F5-TTS-Emotion-CFG model:
Step 1:
conda create --name f5-tts-emo python==3.10.0
conda activate f5-tts-emo
Step 2:
pip install -e .
Execute the script:
python download_models.py
Alternatively, you can manually download the models from:
and copy them into the ckpts directory.
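After either download route, it can help to confirm that files actually landed in the ckpts directory before running inference. The following is a minimal sketch, assuming ckpts is resolved relative to the current working directory (the repository root); the helper name `list_checkpoints` is hypothetical, and the sketch makes no assumption about what the checkpoint files are called.

```python
from pathlib import Path


def list_checkpoints(ckpt_dir="ckpts"):
    """Return the sorted relative paths of all files found under ckpt_dir.

    Hypothetical helper: it only lists files and does not validate them.
    """
    root = Path(ckpt_dir)
    if not root.is_dir():
        return []
    return sorted(
        p.relative_to(root).as_posix() for p in root.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    files = list_checkpoints()
    if files:
        print("\n".join(files))
    else:
        print("No files found under ckpts; run download_models.py or copy the models manually.")
```

An empty result means the directory is missing or empty, in which case the download step should be repeated.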
Use the CLI defined in src/f5_tts/infer/infer_cli_emotion.py:
python src/f5_tts/infer/infer_cli_emotion.py \
--ref-audio-path "data/0011_angry.wav" \
--ref-text "The nine, the eggs, I keep." \
--inference-text "Hello, this is a text to check emotion." \
--inference-emotion Surprise \
--cfg-strength2 10 \
--output-path "data/output.wav"
- --ref-audio-path: Path to the reference audio file for voice cloning. Provides the speaker’s voice.
- --ref-text: Transcription of the reference audio (the text that is being said).
- --ref-emotion: The emotion in the reference audio clip (if you don't know the reference emotion, in most practical cases it can work with Neutral as default, but sometimes it can cause lower voice similarity).
- --inference-text: The new text you want the model to synthesize.
- --inference-emotion: Target emotion for synthesis. Options: Angry, Happy, Sad, Neutral, Surprise.
- --cfg-strength2: Classifier-free guidance strength for emotion control. Higher = stronger emotion, but too high may reduce naturalness. Typical values range from 2 up to 20.
- --output-path: Path where the generated audio will be saved.
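To compare emotions and guidance strengths side by side, the options above can be combined into a small sweep. The sketch below only builds and prints the command lines (a dry run) using the flags documented in this README. The helper names, the output file naming scheme, the chosen strengths (2, 10, 20), and the `Angry` reference emotion (guessed from the sample file name) are illustrative assumptions, not part of the repository.

```python
import shlex

# The script path and flags come from this README; everything else
# (helper names, output naming, default strengths) is an assumption.
SCRIPT = "src/f5_tts/infer/infer_cli_emotion.py"
EMOTIONS = ["Angry", "Happy", "Sad", "Neutral", "Surprise"]


def build_command(emotion, cfg_strength2, output_path,
                  ref_audio="data/0011_angry.wav",
                  ref_text="The nine, the eggs, I keep.",
                  ref_emotion="Angry",  # assumed from the reference file name
                  text="Hello, this is a text to check emotion."):
    """Return the argument list for a single synthesis run."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion!r}")
    return [
        "python", SCRIPT,
        "--ref-audio-path", ref_audio,
        "--ref-text", ref_text,
        "--ref-emotion", ref_emotion,
        "--inference-text", text,
        "--inference-emotion", emotion,
        "--cfg-strength2", str(cfg_strength2),
        "--output-path", output_path,
    ]


def sweep(strengths=(2, 10, 20)):
    """Build one command per (emotion, strength) pair."""
    commands = []
    for emotion in EMOTIONS:
        for strength in strengths:
            out = f"data/output_{emotion.lower()}_cfg{strength}.wav"
            commands.append(build_command(emotion, strength, out))
    return commands


if __name__ == "__main__":
    # Dry run: print shell-quoted commands instead of executing them.
    for cmd in sweep():
        print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to `subprocess.run` once the commands look right. Listening to the same sentence at several strengths is a practical way to find where stronger emotion starts to cost naturalness for a given reference voice.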
