
ValueError: [broadcast_shapes] when using --ref-audio (Chinese/Multilingual input) #41

@cavalierforgive


Hello,

I am encountering a ValueError: [broadcast_shapes] when attempting to use the zero-shot voice cloning feature (--ref-audio and --ref-text), specifically with Chinese input, even after ensuring all audio and text requirements are met.

The base TTS functionality works fine (tested without reference audio).

🐛 Bug Description
When running the synthesis command with reference audio/text, the process consistently fails at the Duration Predictor step.

🔬 Steps to Reproduce

  1. Preparation (Confirmed Working):

Base TTS functionality works (English synthesis): python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog." --output ./test_default.wav (Success)

Reference audio is a mono, 24kHz WAV file (confirmed correct format).

  2. Failing Command (Chinese Input):

Bash

python -m f5_tts_mlx.generate \
  --text "我现在明白了。" \
  --ref-audio ./demo_wav/zh/output_24k.wav \
  --ref-text "我当然知道了"

(Note: --text and --ref-text are both short and in the same language.)

❌ Error Traceback
Got reference audio with duration: 1.44 seconds
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/generate.py", line 347, in <module>
    generate(
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/generate.py", line 171, in generate
    wave, _ = f5tts.sample(
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/cfm.py", line 308, in sample
    duration = self.predict_duration(cond, text, speed)
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/cfm.py", line 259, in predict_duration
    duration_in_sec = self._duration_predictor(cond, text)
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/duration.py", line 241, in __call__
    inp = mx.where(
ValueError: [broadcast_shapes] Shapes (1,34551,100) and (1,34551,2) cannot be broadcast.

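For anyone reading the shapes in that last line, here is a minimal sketch of the NumPy-style broadcasting rule that MLX follows, plus some arithmetic on the numbers involved. The helper names are hypothetical and the hop length of 256 is an assumption about the mel configuration, not something taken from this traceback. One observation that may or may not be relevant: 34551 is close to the number of raw samples in 1.44 s of 24 kHz audio, not to the number of mel frames, and a trailing dimension of 2 is what a two-channel buffer would look like.

```python
def can_broadcast(a, b):
    """Return True if shapes a and b are compatible under NumPy-style broadcasting."""
    # Compare dimensions from the trailing end; missing leading dims count as 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da != db and da != 1 and db != 1:
            return False
    return True


SAMPLE_RATE = 24_000  # from the issue: 24 kHz reference audio
DURATION_S = 1.44     # from the log line "Got reference audio with duration"
HOP_LENGTH = 256      # assumption: mel hop length, not confirmed by the traceback

approx_samples = round(DURATION_S * SAMPLE_RATE)  # raw audio samples
approx_frames = approx_samples // HOP_LENGTH      # mel frames under the assumed hop

# The two shapes from the error message differ only in the last dimension.
mel_like = (1, 34551, 100)
other = (1, 34551, 2)
print(can_broadcast(mel_like, other))          # 100 vs 2: neither is 1
print(can_broadcast(mel_like, (1, 34551, 1)))  # a trailing 1 would broadcast
print(approx_samples, approx_frames)
```

If the assumed hop length is right, a 1.44 s clip would be expected to produce on the order of a hundred-odd frames, so a middle dimension of 34551 suggests the array reaching the duration predictor may still be raw audio.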
ℹ️ Environment Details
Operating System: macOS (Apple Silicon)

  • Python Version: 3.11

  • Virtual Environment: Yes (f5-tts-311-venv)

  • Package Versions:

    - f5-tts-mlx:
      Name: f5-tts-mlx
      Version: 0.2.6
      Summary: F5-TTS - MLX
      Home-page: https://github.com/lucasnewman/f5-tts-mlx
      Author:
      Author-email: Lucas Newman lucasnewman@me.com
      License: MIT
      Location: /Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages
      Requires: einops, einx, huggingface_hub, jieba, mlx, numpy, pypinyin, setuptools, sounddevice, soundfile, tqdm, vocos-mlx
      Required-by:

    - mlx:
      Name: mlx
      Version: 0.29.3
      Summary: A framework for machine learning on Apple silicon.
      Home-page: https://github.com/ml-explore/mlx
      Author: MLX Contributors
      Author-email: mlx@group.apple.com
      License: MIT
      Location: /Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages
      Requires: mlx-metal
      Required-by: f5-tts-mlx, vocos-mlx

💡 Troubleshooting Steps Taken
  1. Confirmed reference audio is mono, 24kHz WAV.

  2. Ensured --text and --ref-text are in the same language (Chinese).

  3. Attempted to use short text.

  4. Cleared Hugging Face cache to force re-download of model weights (duration_v2.safetensors).
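Since the error shape ends in 2, it may be worth re-checking the first step directly from the WAV header. Below is a minimal sketch using only Python's standard `wave` module. The function names `describe_wav` and `is_mono_24k` are hypothetical helpers for illustration, not part of f5-tts-mlx. Note that `wave` only reads integer PCM files and raises `wave.Error` for other encodings such as 32-bit float.

```python
import wave


def describe_wav(path):
    """Return (channels, sample_rate, duration_seconds) from a PCM WAV header."""
    with wave.open(path, "rb") as wf:
        channels = wf.getnchannels()
        rate = wf.getframerate()
        frames = wf.getnframes()
    return channels, rate, frames / rate


def is_mono_24k(path):
    """True only if the file header reports one channel at 24 kHz."""
    channels, rate, _ = describe_wav(path)
    return channels == 1 and rate == 24_000
```

For the file in this report the call would be `describe_wav("./demo_wav/zh/output_24k.wav")`. A first value of 2 would mean the file is stereo even if the sample rate is already 24 kHz.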

Thank you for your help in resolving this.
