Description
Hello,
I am encountering a ValueError: [broadcast_shapes] when attempting to use the zero-shot voice cloning feature (--ref-audio and --ref-text), specifically with Chinese input, even after ensuring all audio and text requirements are met.
The base TTS functionality works fine (tested without reference audio).
🐛 Bug Description
When running the synthesis command with reference audio and text, the process consistently fails inside the duration predictor, at the mx.where call in duration.py (line 241 in the traceback below).
🔬 Steps to Reproduce
- Preparation (Confirmed Working):
Base TTS functionality works (English synthesis): python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog." --output ./test_default.wav (Success)
Reference audio is a mono, 24kHz WAV file (confirmed correct format).
- Failing Command (Chinese Input):
python -m f5_tts_mlx.generate \
  --text "我现在明白了。" \
  --ref-audio ./demo_wav/zh/output_24k.wav \
  --ref-text "我当然知道了"
(Note: --text and --ref-text are short and in the same language.)
❌ Error Traceback
Got reference audio with duration: 1.44 seconds
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/generate.py", line 347, in <module>
    generate(
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/generate.py", line 171, in generate
    wave, _ = f5tts.sample(
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/cfm.py", line 308, in sample
    duration = self.predict_duration(cond, text, speed)
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/cfm.py", line 259, in predict_duration
    duration_in_sec = self._duration_predictor(cond, text)
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/duration.py", line 241, in __call__
    inp = mx.where(
ValueError: [broadcast_shapes] Shapes (1,34551,100) and (1,34551,2) cannot be broadcast.
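To make the error message easier to read, here is a minimal pure-Python sketch of NumPy-style broadcasting, on the assumption that MLX applies the same rule (trailing axes are compared pairwise and must be equal or 1). The helper name `broadcast_shapes` is only illustrative and is not the MLX function. It also checks one observation from the traceback: 34551 frames at 24 kHz is about 1.44 seconds, which matches the reported reference duration, so axis 1 appears to be counted in audio samples rather than mel frames.

```python
def broadcast_shapes(a, b):
    """Return the broadcast shape of a and b, or raise ValueError (NumPy-style rule)."""
    result = []
    # Walk both shapes from the trailing axis; missing leading axes count as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            raise ValueError(f"Shapes {a} and {b} cannot be broadcast.")
        result.append(db if da == 1 else da)
    return tuple(reversed(result))


# The two shapes from the traceback: the last axes (100 vs 2) are the conflict.
try:
    broadcast_shapes((1, 34551, 100), (1, 34551, 2))
except ValueError as err:
    print(err)

# A trailing axis of size 1 would have been accepted.
print(broadcast_shapes((1, 34551, 100), (1, 34551, 1)))

# Axis 1 expressed in seconds, assuming a 24 kHz sample rate.
print(round(34551 / 24_000, 2))  # 1.44
```

This does not identify the root cause. It only shows that the mismatch is confined to the last axis, and that a trailing size of 2 next to a sample-length axis would be consistent with a two-channel waveform reaching the duration predictor.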
ℹ️ Environment Details
- Operating System: macOS (Apple Silicon)
- Python Version: 3.11
- Virtual Environment: Yes (f5-tts-311-venv)
- Package Versions:
  - f5-tts-mlx:
    Name: f5-tts-mlx
    Version: 0.2.6
    Summary: F5-TTS - MLX
    Home-page: https://github.com/lucasnewman/f5-tts-mlx
    Author:
    Author-email: Lucas Newman lucasnewman@me.com
    License: MIT
    Location: /Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages
    Requires: einops, einx, huggingface_hub, jieba, mlx, numpy, pypinyin, setuptools, sounddevice, soundfile, tqdm, vocos-mlx
    Required-by:
  - mlx:
    Name: mlx
    Version: 0.29.3
    Summary: A framework for machine learning on Apple silicon.
    Home-page: https://github.com/ml-explore/mlx
    Author: MLX Contributors
    Author-email: mlx@group.apple.com
    License: MIT
    Location: /Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages
    Requires: mlx-metal
    Required-by: f5-tts-mlx, vocos-mlx
💡 Troubleshooting Steps Taken
1. Confirmed reference audio is mono, 24 kHz WAV.
2. Ensured --text and --ref-text are in the same language (Chinese).
3. Attempted to use short text.
4. Cleared Hugging Face cache to force re-download of model weights (duration_v2.safetensors).
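For anyone reproducing this, the audio format check in the first step can be repeated independently of any audio tool with Python's standard `wave` module. This is a minimal sketch: `describe_wav` is a hypothetical helper name, and the `wave` module reads PCM WAV headers only, so a float-encoded WAV would need a different reader. Because the second shape in the error ends in 2, the channel count is the value most worth double-checking.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate_hz, duration_s) read from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()   # 1 = mono, 2 = stereo
        rate = wf.getframerate()       # samples per second, per channel
        frames = wf.getnframes()       # number of sample frames in the file
    return channels, rate, frames / rate


# Example call with the reference file from the failing command:
# print(describe_wav("./demo_wav/zh/output_24k.wav"))
```

A mono, 24 kHz file of the reported length should come back as roughly `(1, 24000, 1.44)`. A first value of 2 would mean the file is still stereo.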
Thank you for your help in resolving this.