v0.2.0 - Multimodal: Text, Image, Video & Audio

Released by @waybarrios on 06 Jan 20:58 · 96 commits to main since this release · 64952fa

🚀 What's New

vLLM-MLX now supports Text, Image, Video & Audio - all GPU-accelerated on Apple Silicon.

🎙️ Audio Support (NEW)

  • STT (Speech-to-Text): Whisper, Parakeet
  • TTS (Text-to-Speech): Kokoro with native multilingual voices
  • Native voices: English, Spanish, French, Chinese, Japanese, Italian, Portuguese, Hindi
  • Includes a bug fix for multilingual support in mlx-audio 0.2.9

📦 Modular Architecture

| Modality | Library   | Install                        |
|----------|-----------|--------------------------------|
| Text     | mlx-lm    | `pip install vllm-mlx`         |
| Image    | mlx-vlm   | `pip install vllm-mlx`         |
| Video    | mlx-vlm   | `pip install vllm-mlx`         |
| Audio    | mlx-audio | `pip install vllm-mlx[audio]`  |
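
Because each modality is backed by its own MLX library, a quick way to see what your environment can serve is to probe for the backends listed above. A minimal sketch using only the standard library, with module names taken from the table:

```python
import importlib.util

# Modality -> backing library, per the table above.
BACKENDS = {
    "text": "mlx_lm",
    "image/video": "mlx_vlm",
    "audio": "mlx_audio",
}

for modality, module in BACKENDS.items():
    status = "available" if importlib.util.find_spec(module) else "missing"
    print(f"{modality:12s} {module:10s} {status}")
```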

🗣️ Native TTS Voices

| Language | Voices                                          |
|----------|-------------------------------------------------|
| English  | af_heart, am_adam, bf_emma, bm_george + 24 more |
| Spanish  | ef_dora, em_alex, em_santa                      |
| French   | ff_siwis                                        |
| Chinese  | zf_xiaobei, zm_yunjian + 6 more                 |
| Japanese | jf_alpha, jm_kumo + 3 more                      |

Voice IDs encode language and gender in their prefix (e.g. ef_ = Spanish female, am_ = American English male).

📋 Examples

Text (LLM Inference)

```python
from openai import OpenAI

# Point the client at the local vllm-mlx server; no real API key is required.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
)
print(response.choices[0].message.content)
```
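
Since the server speaks the standard OpenAI Chat Completions protocol, token streaming works the usual way. A sketch assuming the server honors `stream=True`, as OpenAI-compatible servers typically do:

```python
# Stream tokens as they are generated instead of waiting for the full reply.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum computing"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```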

Image Understanding

```python
# Mix text and an image URL in a single user message.
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```
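
For local files, the OpenAI content format also accepts base64 data URLs, so an image can be sent without hosting it anywhere. A sketch with a hypothetical file name:

```python
import base64

# Read a local image and embed it as a data URL (placeholder file name).
with open("photo.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```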

Video Understanding

```python
# Local videos are referenced with a file:// URL; the path must be
# visible to the server process.
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video"},
            {"type": "video_url", "video_url": {"url": "file://video.mp4"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

Text-to-Speech (Native Spanish)

```bash
python -m mlx_audio.tts.generate \
  --model mlx-community/Kokoro-82M-bf16 \
  --text "Hola, bienvenido" \
  --voice ef_dora --lang_code e
```
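
If the server also exposes the OpenAI-compatible `/v1/audio/speech` endpoint, the same client can drive TTS. This is an assumption, not a documented path (the CLI above is the documented route); the model and voice names are reused from this release's tables:

```python
# Hypothetical: assumes vllm-mlx serves the OpenAI speech endpoint.
speech = client.audio.speech.create(
    model="mlx-community/Kokoro-82M-bf16",  # model from the CLI example above
    voice="ef_dora",                        # native Spanish voice from the table
    input="Hola, bienvenido",
)
with open("hola.wav", "wb") as f:
    f.write(speech.content)  # raw audio bytes returned by the endpoint
```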

Speech-to-Text

```python
# Transcribe a local audio file via the OpenAI-compatible endpoint.
with open("audio.mp3", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3",
        file=f,
    )
print(transcript.text)
```

🔧 Requirements

  • Apple Silicon (M1, M2, M3, M4, M5+)
  • Python 3.10+
  • macOS

Full Changelog: v0.1.0...v0.2.0