Releases: waybarrios/vllm-mlx
v0.2.6
This is a major feature release with tool calling, embeddings, reasoning support, Anthropic Messages API, and numerous fixes.
New Features
Tool Calling Support
Full tool calling / function calling with 12 parsers covering major model families: Mistral, DeepSeek, Granite, Nemotron, GLM-4.7, Harmony (GPT-OSS), and more. Includes native format support and streaming tool call parsing for Qwen3-Coder. (#28, #31, #42, #50, #55, #64)
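A minimal sketch of exercising tool calling through the OpenAI-compatible endpoint; the get_weather tool and its parameters are illustrative, not part of the release:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Illustrative tool definition; any OpenAI-style function schema should work.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```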
Embeddings API
OpenAI-compatible /v1/embeddings endpoint powered by mlx-embeddings. (#48)
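A minimal sketch of calling the new endpoint with the OpenAI client; model="default" follows the other examples in these notes and assumes the server was launched with an embedding model:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

result = client.embeddings.create(
    model="default",
    input=["Apple Silicon inference", "vllm-mlx embeddings"],
)
print(len(result.data), len(result.data[0].embedding))
```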
Reasoning Parser Support
Chain-of-thought reasoning output parsing for supported models. (#33)
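A hedged sketch of reading parsed reasoning output. It assumes the server follows vLLM's convention of returning a separate reasoning field (accessed here as reasoning_content) alongside the final answer; the exact field name in this project may differ:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
)
message = response.choices[0].message
print(getattr(message, "reasoning_content", None))  # parsed chain-of-thought, if exposed
print(message.content)                              # final answer
```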
Anthropic Messages API
Native /v1/messages endpoint compatible with the Anthropic API format, enabling agentic workflows and multi-turn conversations. (#46)
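A minimal sketch using the official anthropic client pointed at the local server; the model name and API key are placeholders:
```python
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize what MLX is in one sentence."}],
)
print(message.content[0].text)
```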
Prefix Cache Improvements
Improved prefix caching reliability with fixes for QuantizedKVCache offset trimming and agentic multi-turn scenarios. (#46, #69)
CLI Enhancements
- --default-temperature and --default-top-p arguments for server defaults (#41)
- uv installation instructions added to docs (#32)
Bug Fixes
- Fix MLLM broadcast error on concurrent requests (#49)
- Fix compatibility with mlx-lm 0.30.6 — MambaCache removal (#40)
- Fix MLLM continuous batching: system prompt routing and KV cache handling (#76)
- Fix streaming tool call parsing for Qwen3-Coder models (#55)
- Fix benchmark test scripts for reproducible runs (#74)
- Fix _trim_cache_offset for QuantizedKVCache layers (#69)
- Add MedGemma to MLLM detection patterns and fix --mllm flag (#22)
- Add missing pytz dependency for vllm-mlx-chat (#24)
- Bump mlx-lm minimum to 0.30.5 for GLM-4 model support (#79)
- Bump transformers minimum to 5.0.0 (required by mlx-lm 0.30.5+)
CI / Testing
- Expanded CI to 315 tests across Python 3.10–3.12
- Added Apple Silicon (M1) test job on macos-14
Thanks
Thanks to @janhilgard, @Chida82, @camerhann, @selimrecep, and @karmawastakenalready for their contributions!
v0.2.5 - MLLM Prefix Cache & Continuous Batching
This release brings significant performance improvements for multimodal models through prefix caching and continuous batching support.
What's New
Prefix Cache for Multimodal Models
When you send the same image multiple times (like in a multi-turn conversation), the vision encoder normally has to process it again each time. Now, vllm-mlx caches the vision embeddings and KV states, so subsequent requests with the same image skip the encoder entirely.
Real-world impact: On a Qwen3-VL-30B model, response time drops from 21 seconds to under 1 second after the first request with an image. That's more than a 20x speedup for follow-up questions about the same image.
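A sketch of the pattern that benefits: two requests reusing the same image, so the second one is served from the cache (URL and model name are illustrative):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

image = {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}}

# First request: cold, pays the full vision-encoding cost.
client.chat.completions.create(
    model="default",
    messages=[{"role": "user",
               "content": [{"type": "text", "text": "What's in this image?"}, image]}],
)

# Follow-up with the same image: vision embeddings and KV states come from the cache.
client.chat.completions.create(
    model="default",
    messages=[{"role": "user",
               "content": [{"type": "text", "text": "What colors stand out?"}, image]}],
)
```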
Continuous Batching for Vision Models
Vision-language models like Qwen3-VL and Gemma 3 now work with the continuous batching engine, allowing better throughput when handling multiple concurrent requests.
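A sketch of issuing several requests concurrently so the batching engine can interleave them; prompts are illustrative, and the same pattern applies to image requests:
```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

async def ask(prompt: str) -> str:
    response = await client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

async def main():
    # Five concurrent requests; the server batches them rather than queuing serially.
    answers = await asyncio.gather(*(ask(f"Describe scene {i}") for i in range(5)))
    for answer in answers:
        print(answer[:80])

asyncio.run(main())
```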
Gemma 3 Support
Gemma 3 is now automatically detected as a multimodal model and works out of the box.
Real Token Streaming
The stream_chat() method now provides true token-by-token streaming instead of waiting for chunks, giving a more responsive experience.
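The same behavior seen from the client side of the OpenAI-compatible endpoint (a sketch; stream_chat() itself is the server-side Python method and its signature isn't shown here):
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```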
Benchmarks (M4 Max, 128GB)
| Model | Resolution | Speed | Memory |
|---|---|---|---|
| Qwen3-VL-4B-3bit | 224x224 | 143 tok/s | 2.6 GB |
| Qwen3-VL-8B-4bit | 224x224 | 73 tok/s | 5.6 GB |
| Gemma 3 4B-4bit | 224x224 | 32 tok/s | 5.2 GB |
Cache Performance
| Request | Without Cache | With Cache | Speedup |
|---|---|---|---|
| 1st (cold) | 21.7s | 21.7s | - |
| 2nd+ (cached) | 21.7s | ~0.8s | 28x |
Usage
# Enable prefix cache
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --enable-mllm-cache
# With custom cache size
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit \
--enable-mllm-cache \
  --mllm-cache-max-mb 1024
Thanks
Thanks to @lubauss for the original continuous batching implementation that this release builds upon.
v0.2.4 - Dependency Fix
What's New
Dependency conflict resolved
Fixed pip installation error when mlx-audio conflicted with mlx-lm version requirements.
Changes:
- mlx-audio moved to optional dependencies (install with
pip install vllm-mlx[audio]) - Removed transformers<5.0.0 constraint for mlx-lm 0.30.2+ compatibility
Install
pip install vllm-mlx==0.2.4
# With audio support
pip install vllm-mlx[audio]==0.2.4
Fixes #19
v0.2.3
What's New
Stability improvements for continuous batching
This release fixes two issues reported in #16 when running with --continuous-batching:
Metal crash fix - Added proper synchronization for MLX operations. Previously, concurrent requests could cause Metal command buffer conflicts resulting in crashes. Now requests are serialized internally to prevent this.
Memory management - The prefix cache now tracks actual memory usage instead of just counting entries. For large models like GLM-4.7-Flash, the old behavior could quickly exhaust RAM since each cached entry can be hundreds of MB.
The new memory-aware cache (a conceptual sketch follows the list below):
- Auto-detects available system RAM (uses 20% by default)
- Evicts least-recently-used entries when memory limit is reached
- Avoids unnecessary copies since MLX arrays are immutable
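A conceptual sketch of the idea only, not the project's actual implementation: an LRU cache that evicts by total byte size rather than entry count:
```python
from collections import OrderedDict

class MemoryAwareLRUCache:
    def __init__(self, max_bytes: int):
        self.max_bytes = max_bytes
        self.current_bytes = 0
        self._entries = OrderedDict()  # key -> (value, nbytes)

    def put(self, key, value, nbytes: int) -> None:
        if key in self._entries:
            self.current_bytes -= self._entries.pop(key)[1]
        self._entries[key] = (value, nbytes)
        self.current_bytes += nbytes
        # Evict least-recently-used entries until we fit the memory budget.
        while self.current_bytes > self.max_bytes and self._entries:
            _, (_, evicted_bytes) = self._entries.popitem(last=False)
            self.current_bytes -= evicted_bytes

    def get(self, key):
        if key not in self._entries:
            return None
        self._entries.move_to_end(key)  # mark as most recently used
        return self._entries[key][0]
```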
New CLI options
# Set explicit memory limit
vllm-mlx serve model --continuous-batching --cache-memory-mb 2048
# Use percentage of available RAM
vllm-mlx serve model --continuous-batching --cache-memory-percent 0.10
# Disable memory-aware cache (use legacy behavior)
vllm-mlx serve model --continuous-batching --no-memory-aware-cache
Other changes
- Updated mlx-lm requirement to >=0.30.2 for GLM-4 model support
- Cleaner test output
v0.2.1 - Security and Reliability
What's New
Security
- Timing attack prevention with secrets.compare_digest() for API key verification (see the sketch after this list)
- Rate limiting support with a sliding window algorithm (--rate-limit flag)
- Request timeout to prevent resource exhaustion (--timeout flag)
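A conceptual sketch of the two mechanisms (not the project's code): a constant-time key check and a minimal sliding-window limiter:
```python
import secrets
import time
from collections import deque

EXPECTED_KEY = "replace-me"  # illustrative placeholder

def verify_api_key(provided: str) -> bool:
    # compare_digest runs in constant time, so it doesn't leak how many
    # leading characters of the key were correct.
    return secrets.compare_digest(provided, EXPECTED_KEY)

class SlidingWindowLimiter:
    def __init__(self, max_requests: int, window_seconds: float):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def allow(self) -> bool:
        now = time.monotonic()
        # Drop timestamps that have fallen out of the window.
        while self._timestamps and now - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        if len(self._timestamps) >= self.max_requests:
            return False
        self._timestamps.append(now)
        return True
```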
Reliability
- TempFileManager for automatic cleanup of temporary files (images/videos)
- Thread-safe _waiting_consumers counter in RequestOutputCollector
- Fixed asyncio timeout by running model calls in a thread pool (see the sketch after this list)
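A conceptual sketch of the thread-pool fix (not the project's code): moving the blocking model call off the event loop so asyncio can actually enforce a timeout:
```python
import asyncio

def blocking_generate(prompt: str) -> str:
    # Placeholder for a synchronous, GPU-bound model call.
    return f"echo: {prompt}"

async def generate_with_timeout(prompt: str, timeout: float) -> str:
    # asyncio.to_thread keeps the event loop free; wait_for applies the timeout.
    return await asyncio.wait_for(
        asyncio.to_thread(blocking_generate, prompt),
        timeout=timeout,
    )

print(asyncio.run(generate_with_timeout("hello", timeout=30.0)))
```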
API Changes
- New timeout parameter in ChatCompletionRequest and CompletionRequest
- New CLI args: --timeout and --rate-limit
Install
pip install vllm-mlx==0.2.1
v0.2.0 - Multimodal: Text, Image, Video & Audio
🚀 What's New
vLLM-MLX now supports Text, Image, Video & Audio - all GPU-accelerated on Apple Silicon.
🎙️ Audio Support (NEW)
- STT (Speech-to-Text): Whisper, Parakeet
- TTS (Text-to-Speech): Kokoro with native multilingual voices
- Native voices: English, Spanish, French, Chinese, Japanese, Italian, Portuguese, Hindi
- Bug fix included for mlx-audio 0.2.9 multilingual support
📦 Modular Architecture
| Modality | Library | Install |
|---|---|---|
| Text | mlx-lm | pip install vllm-mlx |
| Image | mlx-vlm | pip install vllm-mlx |
| Video | mlx-vlm | pip install vllm-mlx |
| Audio | mlx-audio | pip install vllm-mlx[audio] |
🗣️ Native TTS Voices
| Language | Voices |
|---|---|
| English | af_heart, am_adam, bf_emma, bm_george + 24 more |
| Spanish | ef_dora, em_alex, em_santa |
| French | ff_siwis |
| Chinese | zf_xiaobei, zm_yunjian + 6 more |
| Japanese | jf_alpha, jm_kumo + 3 more |
📋 Examples
Text (LLM Inference)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)
Image Understanding
response = client.chat.completions.create(
model="default",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
}]
)
Video Understanding
response = client.chat.completions.create(
model="default",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe this video"},
{"type": "video_url", "video_url": {"url": "file://video.mp4"}}
]
}]
)
Text-to-Speech (Native Spanish)
python -m mlx_audio.tts.generate \
--model mlx-community/Kokoro-82M-bf16 \
--text "Hola, bienvenido" \
  --voice ef_dora --lang_code e
Speech-to-Text
transcript = client.audio.transcriptions.create(
model="whisper-large-v3",
file=open("audio.mp3", "rb")
)
🔧 Requirements
- Apple Silicon (M1, M2, M3, M4, M5+)
- Python 3.10+
- macOS
Full Changelog: v0.1.0...v0.2.0
v0.1.0 - Text, Image & Video Support
🚀 Initial Release
vLLM-MLX brings vLLM-like inference to Apple Silicon with native GPU acceleration.
✨ Features
- Text (LLM) - Fast inference with mlx-lm
- Image Understanding - Vision-language models with mlx-vlm
- Video Understanding - Multi-frame video analysis
- Continuous Batching - High throughput for multiple concurrent users
- Paged KV Cache - Memory-efficient caching with prefix sharing
- MCP Tool Calling - Integrate external tools via Model Context Protocol
- Structured Output - JSON mode and JSON Schema validation (see the example after this list)
- OpenAI API Compatible - Drop-in replacement for OpenAI client
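A minimal sketch of JSON-mode structured output, assuming the server honors the standard OpenAI response_format parameter:
```python
import json
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user",
               "content": "List three MLX features as JSON with a 'features' array."}],
    response_format={"type": "json_object"},
)
print(json.loads(response.choices[0].message.content))
```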
📦 Architecture
| Component | Library |
|---|---|
| LLM Inference | mlx-lm |
| Vision Models | mlx-vlm |
| Framework | MLX |
📋 Examples
Text (LLM)
from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Hello!"}]
)
Image Understanding
response = client.chat.completions.create(
model="default",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "image.jpg"}}
]
}]
)
Video Understanding
response = client.chat.completions.create(
model="default",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "Describe this video"},
{"type": "video_url", "video_url": {"url": "video.mp4"}}
]
}]
)
⚡ Performance (M4 Max, 128GB)
| Model | Speed | Memory |
|---|---|---|
| Qwen3-0.6B-8bit | 402 tok/s | 0.7 GB |
| Llama-3.2-1B-4bit | 464 tok/s | 0.7 GB |
| Llama-3.2-3B-4bit | 200 tok/s | 1.8 GB |
Continuous Batching (5 concurrent requests):
| Model | Single | Batched | Speedup |
|---|---|---|---|
| Qwen3-0.6B-8bit | 328 tok/s | 1112 tok/s | 3.4x |
🔧 Requirements
- Apple Silicon (M1, M2, M3, M4, M5+)
- Python 3.10+
- macOS
📥 Install
pip install vllm-mlx