F5-TTS-Emotion-CFG

This repository is based on F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching

F5-TTS-Emotion-CFG introduces explicit emotion conditioning into the F5-TTS zero-shot voice cloning model by fine-tuning it on the ESD dataset.

The following emotions are supported: Neutral, Happy, Sad, Angry, and Surprise.

📄 Paper & Citation

Our paper, “Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance”,
was presented at SpeD 2025.

Cite the paper as:

  @inproceedings{bolborici2025emotion,
    author       = {Radu-George Bolborici and Ana Antonia Nicolae},
    title        = {Adding Emotion Conditioning in Speech Synthesis via Multi-Term Classifier-Free Guidance},
    booktitle    = {2025 International Conference on Speech Technology and Human-Computer Dialogue (SpeD)},
    pages        = {86--91},
    doi          = {10.1109/SpeD67700.2025.11253714},
    publisher    = {IEEE},
    year         = {2025}
  }

🚀 Demo

🎧 Explore audio samples generated with the F5-TTS-Emotion-CFG model:

How to install

Step 1:

conda create --name f5-tts-emo python==3.10.0

conda activate f5-tts-emo

Step 2:

pip install -e .
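
After the editable install, a quick check that the package can be found by the active interpreter may save time later. The sketch below is not part of the repository; it assumes the installed package is importable as `f5_tts` (inferred from the `src/f5_tts` directory), and the helper name `is_importable` is made up for illustration.

```python
import importlib.util


def is_importable(module_name):
    """Return True if a top-level module can be located by this interpreter."""
    # find_spec returns None when no top-level module with this name is found.
    return importlib.util.find_spec(module_name) is not None


# Assumption: the package name is "f5_tts", inferred from src/f5_tts.
print("f5_tts importable:", is_importable("f5_tts"))
```

If this reports that the package is not importable, the conda environment may not be active or the install step may not have completed.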

Download models

Execute the script:

python download_models.py

Alternatively, you can manually download the models from:

and copy them into the ckpts directory.
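
Whichever route is used, the models are expected to end up under the ckpts directory. A minimal sketch for listing what is there before running inference follows; the helper `list_checkpoint_files` is hypothetical, and it makes no assumption about checkpoint file names or extensions.

```python
from pathlib import Path


def list_checkpoint_files(ckpt_dir="ckpts"):
    """Return sorted relative paths of files under ckpt_dir, skipping dotfiles."""
    root = Path(ckpt_dir)
    if not root.is_dir():
        # Directory missing: nothing has been downloaded or copied yet.
        return []
    return sorted(
        p.relative_to(root).as_posix()
        for p in root.rglob("*")
        if p.is_file() and not p.name.startswith(".")
    )


# Run from the repository root; an empty list suggests the download step was skipped.
print(list_checkpoint_files())
```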

How to use (inference)

⚠️ First, install the requirements and download the models.

Use the CLI defined in src/f5_tts/infer/infer_cli_emotion.py:

python src/f5_tts/infer/infer_cli_emotion.py \
    --ref-audio-path "data/0011_angry.wav" \
    --ref-text "The nine, the eggs, I keep." \
    --inference-text "Hello, this is a text to check emotion." \
    --inference-emotion Surprise \
    --cfg-strength2 10 \
    --output-path "data/output.wav"

--ref-audio-path: Path to the reference audio file for voice cloning. Provides the speaker’s voice.
--ref-text: Transcription of the reference audio (the text that is being said).
--ref-emotion: Emotion expressed in the reference audio clip. If it is unknown, Neutral usually works as a default, although it can sometimes lower voice similarity.
--inference-text: The new text you want the model to synthesize.
--inference-emotion: Target emotion for synthesis. Options: Angry, Happy, Sad, Neutral, Surprise.
--cfg-strength2: Classifier-free guidance strength for emotion control. Higher values give stronger emotion, but values that are too high may reduce naturalness. Typical values range from 2 to 20.
--output-path: Path where the generated audio will be saved.
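
To compare emotions and guidance strengths side by side, the CLI call above can be wrapped in a small script. The following is a sketch, not part of the repository: it reuses only the flags documented above with the same reference audio and text as the example, while the helper names (`build_command`, `run_sweep`), the `data/sweep` output directory, and the output file naming are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Values listed above for --inference-emotion.
EMOTIONS = ["Neutral", "Happy", "Sad", "Angry", "Surprise"]


def build_command(emotion, cfg_strength2, output_dir="data/sweep"):
    """Build the argument list for one run of infer_cli_emotion.py."""
    if emotion not in EMOTIONS:
        raise ValueError(f"unsupported emotion: {emotion}")
    # Hypothetical naming scheme: one file per (emotion, strength) pair.
    output_path = Path(output_dir) / f"{emotion.lower()}_cfg{cfg_strength2}.wav"
    return [
        sys.executable,
        "src/f5_tts/infer/infer_cli_emotion.py",
        "--ref-audio-path", "data/0011_angry.wav",
        "--ref-text", "The nine, the eggs, I keep.",
        "--inference-text", "Hello, this is a text to check emotion.",
        "--inference-emotion", emotion,
        "--cfg-strength2", str(cfg_strength2),
        "--output-path", output_path.as_posix(),
    ]


def run_sweep(strengths=(2, 10, 20), output_dir="data/sweep"):
    """Run the CLI once per emotion and guidance strength."""
    Path(output_dir).mkdir(parents=True, exist_ok=True)
    for emotion in EMOTIONS:
        for strength in strengths:
            # check=True stops the sweep at the first failing run.
            subprocess.run(build_command(emotion, strength, output_dir), check=True)
```

Calling `run_sweep()` from the repository root, with the environment active and the models downloaded, would write one file per combination. The strengths 2, 10 and 20 are picked from the typical range given above and are only a starting point.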
