ValueError: [broadcast_shapes] when using --ref-audio (Chinese/Multilingual input) #41

Hello,

I am encountering a ValueError: [broadcast_shapes] when attempting to use the zero-shot voice cloning feature (--ref-audio and --ref-text), specifically with Chinese input, even after ensuring all audio and text requirements are met.

The base TTS functionality works fine (tested without reference audio).

🐛 Bug Description
When running the synthesis command with reference audio/text, the process consistently fails at the Duration Predictor step.

🔬 Steps to Reproduce

1. Preparation (Confirmed Working):

Base TTS functionality works (English synthesis): python -m f5_tts_mlx.generate --text "The quick brown fox jumped over the lazy dog." --output ./test_default.wav (Success)

Reference audio is a mono, 24kHz WAV file (confirmed correct format).
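For anyone reproducing this, the format claim above can be re-checked independently of the audio tool used for conversion. The snippet below is a minimal sketch using only Python's standard `wave` module; the helper names `describe_wav` and `is_mono_24k` are hypothetical and not part of f5-tts-mlx. It assumes the file is plain PCM WAV (the `wave` module raises `wave.Error` for formats it does not support, such as float-encoded WAV on some Python versions).

```python
import wave


def describe_wav(path):
    """Read channel count, sample rate, and duration from a PCM WAV header."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        frames = wav.getnframes()
    return {
        "channels": channels,
        "sample_rate": sample_rate,
        "duration_sec": frames / sample_rate,
    }


def is_mono_24k(info):
    """True when the header reports one channel at 24000 Hz."""
    return info["channels"] == 1 and info["sample_rate"] == 24000


# Example usage (path taken from the failing command in this report):
# info = describe_wav("./demo_wav/zh/output_24k.wav")
# print(info, is_mono_24k(info))
```

If the header reports two channels, the file is stereo even if a player or converter labeled it otherwise.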

2. Failing Command (Chinese Input):

python -m f5_tts_mlx.generate \
--text "我现在明白了。" \
--ref-audio ./demo_wav/zh/output_24k.wav \
--ref-text "我当然知道了"
(Note: --text and --ref-text are short and in the same language.)

❌ Error Traceback
`Got reference audio with duration: 1.44 seconds
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/generate.py", line 347, in <module>
    generate(
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/generate.py", line 171, in generate
    wave, _ = f5tts.sample(
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/cfm.py", line 308, in sample
    duration = self.predict_duration(cond, text, speed)
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/cfm.py", line 259, in predict_duration
    duration_in_sec = self._duration_predictor(cond, text)
  File "/Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages/f5_tts_mlx/duration.py", line 241, in __call__
    inp = mx.where(
ValueError: [broadcast_shapes] Shapes (1,34551,100) and (1,34551,2) cannot be broadcast.`
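For context on the error message: as far as I understand, `mx.where` follows NumPy-style broadcasting, where shapes are compared from the trailing dimension and each pair of sizes must be equal or contain a 1. The sketch below re-implements that rule in plain Python to show why the two shapes in the traceback clash on the last axis (100 vs. 2). The function `broadcast_shapes` here is an illustrative stand-in, not the MLX implementation.

```python
def broadcast_shapes(a, b):
    """Return the broadcast shape of a and b, or raise ValueError."""
    result = []
    # Walk both shapes from the trailing dimension; a missing leading
    # dimension is treated as size 1.
    for i in range(1, max(len(a), len(b)) + 1):
        da = a[-i] if i <= len(a) else 1
        db = b[-i] if i <= len(b) else 1
        if da == db or da == 1:
            result.append(db)
        elif db == 1:
            result.append(da)
        else:
            raise ValueError(f"Shapes {a} and {b} cannot be broadcast.")
    return tuple(reversed(result))


# A trailing dimension of 1 broadcasts against 100 without error:
ok = broadcast_shapes((1, 34551, 100), (1, 34551, 1))

# The shapes from the traceback do not, because 100 != 2 and neither is 1:
try:
    broadcast_shapes((1, 34551, 100), (1, 34551, 2))
    failed = False
except ValueError:
    failed = True
```

Two observations that may help narrow this down, both unconfirmed: 1.44 s at 24 kHz is 34560 samples, which is close to the 34551 on the second axis, so that axis may be counting audio samples rather than mel frames; and a trailing dimension of 2 is what a two-channel array would produce.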

ℹ️ Environment Details
- Operating System: macOS (Apple Silicon)

- Python Version: 3.11

- Virtual Environment: Yes (f5-tts-311-venv)

- Package Versions:

      - f5-tts-mlx: `[Name: f5-tts-mlx
Version: 0.2.6
Summary: F5-TTS - MLX
Home-page: https://github.com/lucasnewman/f5-tts-mlx
Author:
Author-email: Lucas Newman <lucasnewman@me.com>
License: MIT
Location: /Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages
Requires: einops, einx, huggingface_hub, jieba, mlx, numpy, pypinyin, setuptools, sounddevice, soundfile, tqdm, vocos-mlx
Required-by:]`

      - mlx: `[Name: mlx
Version: 0.29.3
Summary: A framework for machine learning on Apple silicon.
Home-page: https://github.com/ml-explore/mlx
Author: MLX Contributors
Author-email: mlx@group.apple.com
License: MIT
Location: /Users/yijiner/ai-era/f5-tts-311-venv/lib/python3.11/site-packages
Requires: mlx-metal
Required-by: f5-tts-mlx, vocos-mlx]`

💡 Troubleshooting Steps Taken
1. Confirmed reference audio is mono, 24kHz WAV.

2. Ensured --text and --ref-text are in the same language (Chinese).

3. Attempted to use short text.

4. Cleared Hugging Face cache to force re-download of model weights (duration_v2.safetensors).

Thank you for your help in resolving this.
