
Releases: waybarrios/vllm-mlx

v0.2.6

13 Feb 05:05
0945e63


This is a major feature release with tool calling, embeddings, reasoning support, Anthropic Messages API, and numerous fixes.

New Features

Tool Calling Support

Full tool calling / function calling with 12 parsers covering major model families: Mistral, DeepSeek, Granite, Nemotron, GLM-4.7, Harmony (GPT-OSS), and more. Includes native format support and streaming tool call parsing for Qwen3-Coder. (#28, #31, #42, #50, #55, #64)
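
Below is a minimal sketch of a tool-calling request through the OpenAI-compatible endpoint. The get_weather tool, the localhost URL, and the "default" model alias are illustrative assumptions, not part of this release.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical tool described in the standard OpenAI function-calling schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# If the model chooses to call the tool, the parsed call(s) appear here
print(response.choices[0].message.tool_calls)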

Embeddings API

OpenAI-compatible /v1/embeddings endpoint powered by mlx-embeddings. (#48)
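
As a quick sketch, the endpoint can be exercised with the OpenAI client; the "default" model alias and the input strings are illustrative assumptions.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Embed a small batch of strings with whatever embedding model the server was started with
result = client.embeddings.create(
    model="default",
    input=["hello world", "vllm-mlx runs on Apple Silicon"],
)
print(len(result.data), len(result.data[0].embedding))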

Reasoning Parser Support

Chain-of-thought reasoning output parsing for supported models. (#33)
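
A rough sketch, assuming the parsed reasoning is exposed in a vLLM-style reasoning_content field alongside the regular content (an assumption about the response shape, not something stated in this release):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
)

message = response.choices[0].message
# The reasoning parser separates chain-of-thought from the final answer
print(getattr(message, "reasoning_content", None))
print(message.content)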

Anthropic Messages API

Native /v1/messages endpoint compatible with the Anthropic API format, enabling agentic workflows and multi-turn conversations. (#46)
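
A minimal sketch using the official anthropic Python SDK pointed at the local server; the base URL and the "default" model alias are assumptions.

import anthropic

# The SDK appends /v1/messages to the base URL
client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
)
print(message.content[0].text)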

Prefix Cache Improvements

Improved prefix caching reliability with fixes for QuantizedKVCache offset trimming and agentic multi-turn scenarios. (#46, #69)

CLI Enhancements

  • --default-temperature and --default-top-p arguments for server defaults (#41); see the example after this list
  • uv installation instructions added to docs (#32)
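
For example (the model path is illustrative):

vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
    --default-temperature 0.2 \
    --default-top-p 0.9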

Bug Fixes

  • Fix MLLM broadcast error on concurrent requests (#49)
  • Fix compatibility with mlx-lm 0.30.6 — MambaCache removal (#40)
  • Fix MLLM continuous batching: system prompt routing and KV cache handling (#76)
  • Fix streaming tool call parsing for Qwen3-Coder models (#55)
  • Fix benchmark test scripts for reproducible runs (#74)
  • Fix _trim_cache_offset for QuantizedKVCache layers (#69)
  • Add MedGemma to MLLM detection patterns and fix --mllm flag (#22)
  • Add missing pytz dependency for vllm-mlx-chat (#24)
  • Bump mlx-lm minimum to 0.30.5 for GLM-4 model support (#79)
  • Bump transformers minimum to 5.0.0 (required by mlx-lm 0.30.5+)

CI / Testing

  • Expanded CI to 315 tests across Python 3.10–3.12
  • Added Apple Silicon (M1) test job on macos-14

Thanks

Thanks to @janhilgard, @Chida82, @camerhann, @selimrecep, and @karmawastakenalready for their contributions!

v0.2.5 - MLLM Prefix Cache & Continuous Batching

26 Jan 20:08
651cb0d


This release brings significant performance improvements for multimodal models through prefix caching and continuous batching support.

What's New

Prefix Cache for Multimodal Models

When you send the same image multiple times (like in a multi-turn conversation), the vision encoder normally has to process it again each time. Now, vllm-mlx caches the vision embeddings and KV states, so subsequent requests with the same image skip the encoder entirely.

Real-world impact: On a Qwen3-VL-30B model, response time drops from 21 seconds to under 1 second after the first request with an image. That's more than a 20x speedup for follow-up questions about the same image.

Continuous Batching for Vision Models

Vision-language models like Qwen3-VL and Gemma 3 now work with the continuous batching engine, allowing better throughput when handling multiple concurrent requests.

Gemma 3 Support

Gemma 3 is now automatically detected as a multimodal model and works out of the box.

Real Token Streaming

The stream_chat() method now provides true token-by-token streaming instead of waiting for chunks, giving a more responsive experience.
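
Over the OpenAI-compatible endpoint this is the usual streaming loop; the sketch below goes through the HTTP API rather than calling stream_chat() directly, and the "default" model alias is an assumption.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)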

Benchmarks (M4 Max, 128GB)

Model              Resolution   Speed       Memory
Qwen3-VL-4B-3bit   224x224      143 tok/s   2.6 GB
Qwen3-VL-8B-4bit   224x224      73 tok/s    5.6 GB
Gemma 3 4B-4bit    224x224      32 tok/s    5.2 GB

Cache Performance

Request         Without Cache   With Cache   Speedup
1st (cold)      21.7s           21.7s        -
2nd+ (cached)   21.7s           ~0.8s        28x

Usage

# Enable prefix cache
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --enable-mllm-cache

# With custom cache size
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit \
    --enable-mllm-cache \
    --mllm-cache-max-mb 1024

Thanks

Thanks to @lubauss for the original continuous batching implementation that this release builds upon.

v0.2.4 - Dependency Fix

23 Jan 17:27
d5bb45d


What's New

Dependency conflict resolved

Fixed pip installation error when mlx-audio conflicted with mlx-lm version requirements.

Changes:

  • mlx-audio moved to optional dependencies (install with pip install vllm-mlx[audio])
  • Removed transformers<5.0.0 constraint for mlx-lm 0.30.2+ compatibility

Install

pip install vllm-mlx==0.2.4

# With audio support
pip install vllm-mlx[audio]==0.2.4

Fixes #19

v0.2.3

22 Jan 23:24
d340cc7


What's New

Stability improvements for continuous batching

This release fixes two issues reported in #16 when running with --continuous-batching:

Metal crash fix - Added proper synchronization for MLX operations. Previously, concurrent requests could cause Metal command buffer conflicts resulting in crashes. Now requests are serialized internally to prevent this.

Memory management - The prefix cache now tracks actual memory usage instead of just counting entries. For large models like GLM-4.7-Flash, the old behavior could quickly exhaust RAM since each cached entry can be hundreds of MB.

The new memory-aware cache:

  • Auto-detects available system RAM (uses 20% by default)
  • Evicts least-recently-used entries when memory limit is reached
  • Avoids unnecessary copies since MLX arrays are immutable

New CLI options

# Set explicit memory limit
vllm-mlx serve model --continuous-batching --cache-memory-mb 2048

# Use percentage of available RAM
vllm-mlx serve model --continuous-batching --cache-memory-percent 0.10

# Disable memory-aware cache (use legacy behavior)
vllm-mlx serve model --continuous-batching --no-memory-aware-cache

Other changes

  • Updated mlx-lm requirement to >=0.30.2 for GLM-4 model support
  • Cleaner test output

v0.2.1 - Security and Reliability

16 Jan 04:05
a613b6e


What's New

Security

  • Timing attack prevention with secrets.compare_digest() for API key verification
  • Rate limiting support with sliding window algorithm (--rate-limit flag)
  • Request timeout to prevent resource exhaustion (--timeout flag)

Reliability

  • TempFileManager for automatic cleanup of temporary files (images/videos)
  • Thread-safe _waiting_consumers counter in RequestOutputCollector
  • Fixed asyncio timeout by running model calls in thread pool

API Changes

  • New timeout parameter in ChatCompletionRequest and CompletionRequest
  • New CLI args: --timeout and --rate-limit
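
A sketch of enabling both from the command line; the model path and values are illustrative, and the exact argument forms should be confirmed with the CLI help output.

# Request timeout plus rate limiting
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
    --timeout 120 \
    --rate-limit 60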

Install

pip install vllm-mlx==0.2.1

v0.2.0 - Multimodal: Text, Image, Video & Audio

06 Jan 20:58
64952fa


🚀 What's New

vLLM-MLX now supports Text, Image, Video & Audio - all GPU-accelerated on Apple Silicon.

🎙️ Audio Support (NEW)

  • STT (Speech-to-Text): Whisper, Parakeet
  • TTS (Text-to-Speech): Kokoro with native multilingual voices
  • Native voices: English, Spanish, French, Chinese, Japanese, Italian, Portuguese, Hindi
  • Bug fix included for mlx-audio 0.2.9 multilingual support

📦 Modular Architecture

Modality   Library     Install
Text       mlx-lm      pip install vllm-mlx
Image      mlx-vlm     pip install vllm-mlx
Video      mlx-vlm     pip install vllm-mlx
Audio      mlx-audio   pip install vllm-mlx[audio]

🗣️ Native TTS Voices

Language   Voices
English    af_heart, am_adam, bf_emma, bm_george + 24 more
Spanish    ef_dora, em_alex, em_santa
French     ff_siwis
Chinese    zf_xiaobei, zm_yunjian + 6 more
Japanese   jf_alpha, jm_kumo + 3 more

📋 Examples

Text (LLM Inference)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)

Image Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)

Video Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video"},
            {"type": "video_url", "video_url": {"url": "file://video.mp4"}}
        ]
    }]
)

Text-to-Speech (Native Spanish)

python -m mlx_audio.tts.generate \
  --model mlx-community/Kokoro-82M-bf16 \
  --text "Hola, bienvenido" \
  --voice ef_dora --lang_code e

Speech-to-Text

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=open("audio.mp3", "rb")
)

🔧 Requirements

  • Apple Silicon (M1, M2, M3, M4, M5+)
  • Python 3.10+
  • macOS

Full Changelog: v0.1.0...v0.2.0

v0.1.0 - Text, Image & Video Support

06 Jan 20:56
4e01e67


🚀 Initial Release

vLLM-MLX brings vLLM-like inference to Apple Silicon with native GPU acceleration.

✨ Features

  • Text (LLM) - Fast inference with mlx-lm
  • Image Understanding - Vision-language models with mlx-vlm
  • Video Understanding - Multi-frame video analysis
  • Continuous Batching - High throughput for multiple concurrent users
  • Paged KV Cache - Memory-efficient caching with prefix sharing
  • MCP Tool Calling - Integrate external tools via Model Context Protocol
  • Structured Output - JSON mode and JSON Schema validation (JSON-mode example after this list)
  • OpenAI API Compatible - Drop-in replacement for OpenAI client
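
As a sketch of JSON mode, assuming the server accepts the standard OpenAI response_format parameter (the model alias and prompt are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List the three primary colors as a JSON object."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)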

📦 Architecture

Component       Library
LLM Inference   mlx-lm
Vision Models   mlx-vlm
Framework       MLX

📋 Examples

Text (LLM)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}]
)

Image Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "image.jpg"}}
        ]
    }]
)

Video Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video"},
            {"type": "video_url", "video_url": {"url": "video.mp4"}}
        ]
    }]
)

⚡ Performance (M4 Max, 128GB)

Model               Speed       Memory
Qwen3-0.6B-8bit     402 tok/s   0.7 GB
Llama-3.2-1B-4bit   464 tok/s   0.7 GB
Llama-3.2-3B-4bit   200 tok/s   1.8 GB

Continuous Batching (5 concurrent requests):

Model             Single      Batched      Speedup
Qwen3-0.6B-8bit   328 tok/s   1112 tok/s   3.4x

🔧 Requirements

  • Apple Silicon (M1, M2, M3, M4, M5+)
  • Python 3.10+
  • macOS

📥 Install

pip install vllm-mlx