v0.2.0 - Multimodal: Text, Image, Video & Audio

Released by @waybarrios on 06 Jan 20:58 · 96 commits to main since this release · 64952fa

🚀 What's New

vLLM-MLX now supports Text, Image, Video & Audio - all GPU-accelerated on Apple Silicon.

🎙️ Audio Support (NEW)

  • STT (Speech-to-Text): Whisper, Parakeet
  • TTS (Text-to-Speech): Kokoro with native multilingual voices
  • Native voices: English, Spanish, French, Chinese, Japanese, Italian, Portuguese, Hindi
  • Includes a bug fix for multilingual support in mlx-audio 0.2.9

📦 Modular Architecture

| Modality | Library   | Install                        |
|----------|-----------|--------------------------------|
| Text     | mlx-lm    | `pip install vllm-mlx`         |
| Image    | mlx-vlm   | `pip install vllm-mlx`         |
| Video    | mlx-vlm   | `pip install vllm-mlx`         |
| Audio    | mlx-audio | `pip install vllm-mlx[audio]`  |
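
Because each modality is backed by its own MLX library, a quick way to see what your environment can serve is to probe for the backends listed above. A minimal sketch using only the standard library, with module names taken from the table:

```python
import importlib.util

# Modality -> backing library, per the table above.
BACKENDS = {
    "text": "mlx_lm",
    "image/video": "mlx_vlm",
    "audio": "mlx_audio",
}

for modality, module in BACKENDS.items():
    status = "available" if importlib.util.find_spec(module) else "missing"
    print(f"{modality:12s} {module:10s} {status}")
```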

🗣️ Native TTS Voices

| Language | Voices                                          |
|----------|-------------------------------------------------|
| English  | af_heart, am_adam, bf_emma, bm_george + 24 more |
| Spanish  | ef_dora, em_alex, em_santa                      |
| French   | ff_siwis                                        |
| Chinese  | zf_xiaobei, zm_yunjian + 6 more                 |
| Japanese | jf_alpha, jm_kumo + 3 more                      |

Voice IDs encode language and gender in their prefix (e.g. ef_ = Spanish female, am_ = American English male).

📋 Examples

Text (LLM Inference)

```python
from openai import OpenAI

# Point the client at the local vllm-mlx server; no real API key is required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```
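
Since the server speaks the standard OpenAI Chat Completions protocol, token streaming works the usual way. A sketch assuming the server honors `stream=True`, as OpenAI-compatible servers typically do:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```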

Image Understanding

```python
# Mix text and an image URL in a single user message.
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```
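
For local files, the OpenAI content format also accepts base64 data URLs, so an image can be sent without hosting it anywhere. A sketch with a hypothetical file name:

```python
import base64

# Read a local image and embed it as a data URL (placeholder file name).
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```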

Video Understanding

```python
# Local videos are referenced with a file:// URL; the path must be
# visible to the server process.
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video"},
            {"type": "video_url", "video_url": {"url": "file://video.mp4"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Text-to-Speech (Native Spanish)

```bash
python -m mlx_audio.tts.generate \
  --model mlx-community/Kokoro-82M-bf16 \
  --text "Hola, bienvenido" \
  --voice ef_dora --lang_code e
```
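
If the server also exposes the OpenAI-compatible `/v1/audio/speech` endpoint, the same client can drive TTS. This is an assumption, not a documented path (the CLI above is the documented route); the model and voice names are reused from this release's tables:

```python
# Hypothetical: assumes vllm-mlx serves the OpenAI speech endpoint.
speech = client.audio.speech.create(
    model="mlx-community/Kokoro-82M-bf16",  # model from the CLI example above
    voice="ef_dora",                        # native Spanish voice from the table
    input="Hola, bienvenido",
)
with open("hola.wav", "wb") as f:
    f.write(speech.content)  # raw audio bytes returned by the endpoint
```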

Speech-to-Text

```python
# Transcribe a local audio file via the OpenAI-compatible endpoint.
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
    )
print(transcript.text)
```

🔧 Requirements

  • Apple Silicon (M1, M2, M3, M4, M5+)
  • Python 3.10+
  • macOS

Full Changelog: v0.1.0...v0.2.0