
Releases: waybarrios/vllm-mlx

v0.2.6

13 Feb 05:05
0945e63


This is a major feature release with tool calling, embeddings, reasoning support, Anthropic Messages API, and numerous fixes.

New Features

Tool Calling Support

Full tool calling / function calling with 12 parsers covering major model families: Mistral, DeepSeek, Granite, Nemotron, GLM-4.7, Harmony (GPT-OSS), and more. Includes native format support and streaming tool call parsing for Qwen3-Coder. (#28, #31, #42, #50, #55, #64)
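
Below is a minimal sketch of a tool-calling request through the OpenAI-compatible endpoint. The get_weather tool, the localhost URL, and the "default" model alias are illustrative assumptions, not part of this release.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Hypothetical tool described in the standard OpenAI function-calling schema
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What's the weather in Lisbon?"}],
    tools=tools,
)

# If the model chooses to call the tool, the parsed call(s) appear here
print(response.choices[0].message.tool_calls)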

Embeddings API

OpenAI-compatible /v1/embeddings endpoint powered by mlx-embeddings. (#48)
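
As a quick sketch, the endpoint can be exercised with the OpenAI client; the "default" model alias and the input strings are illustrative assumptions.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Embed a small batch of strings with whatever embedding model the server was started with
result = client.embeddings.create(
    model="default",
    input=["hello world", "vllm-mlx runs on Apple Silicon"],
)
print(len(result.data), len(result.data[0].embedding))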

Reasoning Parser Support

Chain-of-thought reasoning output parsing for supported models. (#33)
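
A rough sketch, assuming the parsed reasoning is exposed in a vLLM-style reasoning_content field alongside the regular content (an assumption about the response shape, not something stated in this release):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Is 9.11 larger than 9.9?"}],
)

message = response.choices[0].message
# The reasoning parser separates chain-of-thought from the final answer
print(getattr(message, "reasoning_content", None))
print(message.content)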

Anthropic Messages API

Native /v1/messages endpoint compatible with the Anthropic API format, enabling agentic workflows and multi-turn conversations. (#46)
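
A minimal sketch using the official anthropic Python SDK pointed at the local server; the base URL and the "default" model alias are assumptions.

import anthropic

# The SDK appends /v1/messages to the base URL
client = anthropic.Anthropic(base_url="http://localhost:8000", api_key="not-needed")

message = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Summarize the plot of Hamlet in two sentences."}],
)
print(message.content[0].text)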

Prefix Cache Improvements

Improved prefix caching reliability with fixes for QuantizedKVCache offset trimming and agentic multi-turn scenarios. (#46, #69)

CLI Enhancements

  • --default-temperature and --default-top-p arguments for server defaults (#41); see the example after this list
  • uv installation instructions added to docs (#32)
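
For example (the model path is illustrative):

vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
    --default-temperature 0.2 \
    --default-top-p 0.9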

Bug Fixes

  • Fix MLLM broadcast error on concurrent requests (#49)
  • Fix compatibility with mlx-lm 0.30.6 — MambaCache removal (#40)
  • Fix MLLM continuous batching: system prompt routing and KV cache handling (#76)
  • Fix streaming tool call parsing for Qwen3-Coder models (#55)
  • Fix benchmark test scripts for reproducible runs (#74)
  • Fix _trim_cache_offset for QuantizedKVCache layers (#69)
  • Add MedGemma to MLLM detection patterns and fix --mllm flag (#22)
  • Add missing pytz dependency for vllm-mlx-chat (#24)
  • Bump mlx-lm minimum to 0.30.5 for GLM-4 model support (#79)
  • Bump transformers minimum to 5.0.0 (required by mlx-lm 0.30.5+)

CI / Testing

  • Expanded CI to 315 tests across Python 3.10–3.12
  • Added Apple Silicon (M1) test job on macos-14

Thanks

Thanks to @janhilgard, @Chida82, @camerhann, @selimrecep, and @karmawastakenalready for their contributions!

v0.2.5 - MLLM Prefix Cache & Continuous Batching

26 Jan 20:08
651cb0d


This release brings significant performance improvements for multimodal models through prefix caching and continuous batching support.

What's New

Prefix Cache for Multimodal Models

When you send the same image multiple times (like in a multi-turn conversation), the vision encoder normally has to process it again each time. Now, vllm-mlx caches the vision embeddings and KV states, so subsequent requests with the same image skip the encoder entirely.

Real-world impact: On a Qwen3-VL-30B model, response time drops from 21 seconds to under 1 second after the first request with an image. That's more than a 20x speedup for follow-up questions about the same image.

Continuous Batching for Vision Models

Vision-language models like Qwen3-VL and Gemma 3 now work with the continuous batching engine, allowing better throughput when handling multiple concurrent requests.

Gemma 3 Support

Gemma 3 is now automatically detected as a multimodal model and works out of the box.

Real Token Streaming

The stream_chat() method now provides true token-by-token streaming instead of waiting for chunks, giving a more responsive experience.
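
Over the OpenAI-compatible endpoint this is the usual streaming loop; the sketch below goes through the HTTP API rather than calling stream_chat() directly, and the "default" model alias is an assumption.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon"}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)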

Benchmarks (M4 Max, 128GB)

Model              Resolution   Speed       Memory
Qwen3-VL-4B-3bit   224x224      143 tok/s   2.6 GB
Qwen3-VL-8B-4bit   224x224      73 tok/s    5.6 GB
Gemma 3 4B-4bit    224x224      32 tok/s    5.2 GB

Cache Performance

Request         Without Cache   With Cache   Speedup
1st (cold)      21.7s           21.7s        -
2nd+ (cached)   21.7s           ~0.8s        28x

Usage

# Enable prefix cache
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --enable-mllm-cache

# With custom cache size
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit \
    --enable-mllm-cache \
    --mllm-cache-max-mb 1024

Thanks

Thanks to @lubauss for the original continuous batching implementation that this release builds upon.

v0.2.4 - Dependency Fix

23 Jan 17:27
d5bb45d


What's New

Dependency conflict resolved

Fixed pip installation error when mlx-audio conflicted with mlx-lm version requirements.

Changes:

  • mlx-audio moved to optional dependencies (install with pip install vllm-mlx[audio])
  • Removed transformers<5.0.0 constraint for mlx-lm 0.30.2+ compatibility

Install

pip install vllm-mlx==0.2.4

# With audio support
pip install vllm-mlx[audio]==0.2.4

Fixes #19

v0.2.3

22 Jan 23:24
d340cc7


What's New

Stability improvements for continuous batching

This release fixes two issues reported in #16 when running with --continuous-batching:

Metal crash fix - Added proper synchronization for MLX operations. Previously, concurrent requests could cause Metal command buffer conflicts resulting in crashes. Now requests are serialized internally to prevent this.

Memory management - The prefix cache now tracks actual memory usage instead of just counting entries. For large models like GLM-4.7-Flash, the old behavior could quickly exhaust RAM since each cached entry can be hundreds of MB.

The new memory-aware cache:

  • Auto-detects available system RAM (uses 20% by default)
  • Evicts least-recently-used entries when memory limit is reached
  • Avoids unnecessary copies since MLX arrays are immutable

New CLI options

# Set explicit memory limit
vllm-mlx serve model --continuous-batching --cache-memory-mb 2048

# Use percentage of available RAM
vllm-mlx serve model --continuous-batching --cache-memory-percent 0.10

# Disable memory-aware cache (use legacy behavior)
vllm-mlx serve model --continuous-batching --no-memory-aware-cache

Other changes

  • Updated mlx-lm requirement to >=0.30.2 for GLM-4 model support
  • Cleaner test output

v0.2.1 - Security and Reliability

16 Jan 04:05
a613b6e


What's New

Security

  • Timing attack prevention with secrets.compare_digest() for API key verification
  • Rate limiting support with sliding window algorithm (--rate-limit flag)
  • Request timeout to prevent resource exhaustion (--timeout flag)

Reliability

  • TempFileManager for automatic cleanup of temporary files (images/videos)
  • Thread-safe _waiting_consumers counter in RequestOutputCollector
  • Fixed asyncio timeout by running model calls in thread pool

API Changes

  • New timeout parameter in ChatCompletionRequest and CompletionRequest
  • New CLI args: --timeout and --rate-limit
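
A sketch of enabling both from the command line; the model path and values are illustrative, and the exact argument forms should be confirmed with the CLI help output.

# Request timeout plus rate limiting
vllm-mlx serve mlx-community/Qwen3-0.6B-8bit \
    --timeout 120 \
    --rate-limit 60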

Install

pip install vllm-mlx==0.2.1

v0.2.0 - Multimodal: Text, Image, Video & Audio

06 Jan 20:58
64952fa


🚀 What's New

vLLM-MLX now supports Text, Image, Video & Audio - all GPU-accelerated on Apple Silicon.

🎙️ Audio Support (NEW)

  • STT (Speech-to-Text): Whisper, Parakeet
  • TTS (Text-to-Speech): Kokoro with native multilingual voices
  • Native voices: English, Spanish, French, Chinese, Japanese, Italian, Portuguese, Hindi
  • Bug fix included for mlx-audio 0.2.9 multilingual support

📦 Modular Architecture

Modality   Library     Install
Text       mlx-lm      pip install vllm-mlx
Image      mlx-vlm     pip install vllm-mlx
Video      mlx-vlm     pip install vllm-mlx
Audio      mlx-audio   pip install vllm-mlx[audio]

🗣️ Native TTS Voices

Language   Voices
English    af_heart, am_adam, bf_emma, bm_george + 24 more
Spanish    ef_dora, em_alex, em_santa
French     ff_siwis
Chinese    zf_xiaobei, zm_yunjian + 6 more
Japanese   jf_alpha, jm_kumo + 3 more

📋 Examples

Text (LLM Inference)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Explain quantum computing"}]
)
print(response.choices[0].message.content)

Image Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)

Video Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video"},
            {"type": "video_url", "video_url": {"url": "file://video.mp4"}}
        ]
    }]
)

Text-to-Speech (Native Spanish)

python -m mlx_audio.tts.generate \
  --model mlx-community/Kokoro-82M-bf16 \
  --text "Hola, bienvenido" \
  --voice ef_dora --lang_code e

Speech-to-Text

transcript = client.audio.transcriptions.create(
    model="whisper-large-v3",
    file=open("audio.mp3", "rb")
)

🔧 Requirements

  • Apple Silicon (M1, M2, M3, M4, M5+)
  • Python 3.10+
  • macOS

Full Changelog: v0.1.0...v0.2.0

v0.1.0 - Text, Image & Video Support

06 Jan 20:56
4e01e67


🚀 Initial Release

vLLM-MLX brings vLLM-like inference to Apple Silicon with native GPU acceleration.

✨ Features

  • Text (LLM) - Fast inference with mlx-lm
  • Image Understanding - Vision-language models with mlx-vlm
  • Video Understanding - Multi-frame video analysis
  • Continuous Batching - High throughput for multiple concurrent users
  • Paged KV Cache - Memory-efficient caching with prefix sharing
  • MCP Tool Calling - Integrate external tools via Model Context Protocol
  • Structured Output - JSON mode and JSON Schema validation (JSON-mode example after this list)
  • OpenAI API Compatible - Drop-in replacement for OpenAI client
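
As a sketch of JSON mode, assuming the server accepts the standard OpenAI response_format parameter (the model alias and prompt are illustrative):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "List the three primary colors as a JSON object."}],
    response_format={"type": "json_object"},
)
print(response.choices[0].message.content)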

📦 Architecture

Component       Library
LLM Inference   mlx-lm
Vision Models   mlx-vlm
Framework       MLX

📋 Examples

Text (LLM)

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}]
)

Image Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "image.jpg"}}
        ]
    }]
)

Video Understanding

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this video"},
            {"type": "video_url", "video_url": {"url": "video.mp4"}}
        ]
    }]
)

⚡ Performance (M4 Max, 128GB)

Model               Speed       Memory
Qwen3-0.6B-8bit     402 tok/s   0.7 GB
Llama-3.2-1B-4bit   464 tok/s   0.7 GB
Llama-3.2-3B-4bit   200 tok/s   1.8 GB

Continuous Batching (5 concurrent requests):

Model             Single      Batched      Speedup
Qwen3-0.6B-8bit   328 tok/s   1112 tok/s   3.4x

🔧 Requirements

  • Apple Silicon (M1, M2, M3, M4, M5+)
  • Python 3.10+
  • macOS

📥 Install

pip install vllm-mlx