vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac
vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:
- MLX: Apple's ML framework with unified memory and Metal kernels
- mlx-lm: Optimized LLM inference with KV cache and quantization
- mlx-vlm: Vision-language models for multimodal inference
- mlx-audio: Speech-to-Text and Text-to-Speech with native voices
- mlx-embeddings: Text embeddings for semantic search and RAG
- Multimodal - Text, Image, Video & Audio in one platform
- Native GPU acceleration on Apple Silicon (M1, M2, M3, M4)
- Native TTS voices - Spanish, French, Chinese, Japanese + 5 more languages
- OpenAI API compatible - drop-in replacement for OpenAI client
- Anthropic Messages API - native /v1/messages endpoint for Claude Code and OpenCode
- Embeddings - OpenAI-compatible /v1/embeddings endpoint with mlx-embeddings
- Reasoning Models - extract the thinking process from Qwen3, DeepSeek-R1
- MCP Tool Calling - integrate external tools via Model Context Protocol
- Paged KV Cache - memory-efficient caching with prefix sharing
- Continuous Batching - high throughput for multiple concurrent users
Using uv (recommended):
# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git
# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git
Using pip:
# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git
# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .
# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000
# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching
# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
from openai import OpenAI
# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
# With API key (production)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
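Streaming works through the same client; a minimal sketch (the prompt is illustrative) that prints tokens as they arrive:
# Stream tokens as they are generated
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Tell me a short story."}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)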
vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.
from anthropic import Anthropic
client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")
response = client.messages.create(
model="default",
max_tokens=256,
messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)
To use with Claude Code:
export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude
See the Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
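As one example from that list, streaming with the official Anthropic SDK might look like this sketch (the prompt is illustrative; see the docs above for tool calling and token counting):
# Stream a response token by token via the Anthropic SDK
from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")
with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)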
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000
response = client.chat.completions.create(
model="default",
messages=[{
"role": "user",
"content": [
{"type": "text", "text": "What's in this image?"},
{"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
]
}]
)
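For a local file, the OpenAI-style base64 data URL form should work as well; a sketch reusing the client from above (photo.jpg is a placeholder, and data-URL support is assumed to mirror the OpenAI API):
# Send a local image as a base64 data URL
import base64

with open("photo.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this photo."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)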
# Install audio dependencies
pip install vllm-mlx[audio]
python -m spacy download en_core_web_sm
brew install espeak-ng  # macOS, for non-English languages
# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play
# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play
# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages
Supported TTS Models:
| Model | Languages | Description |
|---|---|---|
| Kokoro | EN, ES, FR, JA, ZH, IT, PT, HI | Fast, 82M params, 11 voices |
| Chatterbox | 15+ languages | Expressive, voice cloning |
| VibeVoice | EN | Realtime, low latency |
| VoxCPM | ZH, EN | High quality Chinese/English |
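Speech-to-Text runs through mlx-audio as well (see the Whisper benchmarks below). Purely as a sketch: if the server exposes an OpenAI-compatible /v1/audio/transcriptions endpoint, transcription from the OpenAI SDK would look roughly like this; the endpoint, model name, and speech.wav are assumptions, so check the docs for the actual interface:
# Hypothetical transcription call against an OpenAI-compatible audio endpoint
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
with open("speech.wav", "rb") as audio_file:
    transcript = client.audio.transcriptions.create(
        model="whisper-large-v3-turbo",  # placeholder model name
        file=audio_file,
    )
print(transcript.text)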
Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:
# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3
response = client.chat.completions.create(
model="default",
messages=[{"role": "user", "content": "What is 17 × 23?"}]
)
# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)Supported Parsers:
| Parser | Models | Description |
|---|---|---|
| qwen3 | Qwen3 series | Requires both <think> and </think> tags |
| deepseek_r1 | DeepSeek-R1 | Handles implicit <think> tag |
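If the server is started without --reasoning-parser, the thinking block stays inline in message.content; a minimal client-side fallback (assuming the model emits both tags, as Qwen3 does) could split the response above like this:
# Client-side fallback: split <think>...</think> out of the raw content
import re

raw = response.choices[0].message.content
match = re.search(r"<think>(.*?)</think>", raw, flags=re.DOTALL)
thinking = match.group(1).strip() if match else ""
answer = re.sub(r"<think>.*?</think>", "", raw, flags=re.DOTALL).strip()
print("Thinking:", thinking)
print("Answer:", answer)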
Generate text embeddings for semantic search, RAG, and similarity:
# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit
# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
model="mlx-community/all-MiniLM-L6-v2-4bit",
input=["Hello world", "How are you?"]
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")See Embeddings Guide for details on supported models and lazy loading.
For full documentation, see the docs directory:
- Getting Started
- User Guides
- Reference
- Benchmarks
┌─────────────────────────────────────────────────────────────────────────┐
│ vLLM API Layer │
│ (OpenAI-compatible interface) │
└─────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MLXPlatform │
│ (vLLM platform plugin for Apple Silicon) │
└─────────────────────────────────────────────────────────────────────────┘
│
┌─────────────┬────────────┴────────────┬─────────────┐
▼ ▼ ▼ ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│ mlx-lm │ │ mlx-vlm │ │ mlx-audio │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM) │ │ (TTS + STT) │ │ (Embeddings) │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
│ │ │ │
└─────────────┴─────────────────────────┴─────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ MLX │
│ (Apple ML Framework - Metal kernels) │
└─────────────────────────────────────────────────────────────────────────┘
LLM Performance (M4 Max, 128GB):
| Model | Speed | Memory |
|---|---|---|
| Qwen3-0.6B-8bit | 402 tok/s | 0.7 GB |
| Llama-3.2-1B-4bit | 464 tok/s | 0.7 GB |
| Llama-3.2-3B-4bit | 200 tok/s | 1.8 GB |
Continuous Batching (5 concurrent requests):
| Model | Single | Batched | Speedup |
|---|---|---|---|
| Qwen3-0.6B-8bit | 328 tok/s | 1112 tok/s | 3.4x |
| Llama-3.2-1B-4bit | 299 tok/s | 613 tok/s | 2.0x |
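A rough client-side way to exercise continuous batching (not the benchmark harness itself) is to fire several requests concurrently and let the server interleave them:
# Fire concurrent chat requests so continuous batching can interleave them
from concurrent.futures import ThreadPoolExecutor
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

def ask(prompt: str) -> str:
    resp = client.chat.completions.create(
        model="default",
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

prompts = [f"Write one sentence about the number {i}." for i in range(5)]
with ThreadPoolExecutor(max_workers=5) as pool:
    for answer in pool.map(ask, prompts):
        print(answer)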
Audio - Speech-to-Text (M4 Max, 128GB):
| Model | RTF* | Use Case |
|---|---|---|
| whisper-tiny | 197x | Real-time, low latency |
| whisper-large-v3-turbo | 55x | Best quality/speed balance |
| whisper-large-v3 | 24x | Highest accuracy |
*RTF = Real-Time Factor. An RTF of 100x means 1 minute of audio transcribes in ~0.6 seconds.
See benchmarks for detailed results.
vllm-mlx includes native support for Gemma 3 vision models. Gemma 3 is automatically detected as an MLLM (multimodal LLM).
# Start server with Gemma 3
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
# Verify it loaded as MLLM (not LLM)
curl http://localhost:8000/health
# Should show: "model_type": "mllm"
Gemma 3's default sliding_window=1024 limits context to ~10K tokens on Apple Silicon (the Metal GPU times out at larger contexts). To enable longer context (up to ~50K tokens), patch mlx-vlm:
Location: ~/.../site-packages/mlx_vlm/models/gemma3/language.py
Find the make_cache method and replace it with:
def make_cache(self):
    import os
    # Set GEMMA3_SLIDING_WINDOW=8192 for ~40K context
    # Set GEMMA3_SLIDING_WINDOW=0 for ~50K context (full KVCache)
    sliding_window = int(os.environ.get('GEMMA3_SLIDING_WINDOW', self.config.sliding_window))
    caches = []
    for i in range(self.config.num_hidden_layers):
        if (
            i % self.config.sliding_window_pattern
            == self.config.sliding_window_pattern - 1
        ):
            caches.append(KVCache())
        elif sliding_window == 0:
            caches.append(KVCache())  # Full context for all layers
        else:
            caches.append(RotatingKVCache(max_size=sliding_window, keep=0))
    return caches
Usage:
# Default (~10K max context)
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
# Extended context (~40K max)
GEMMA3_SLIDING_WINDOW=8192 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
# Maximum context (~50K max)
GEMMA3_SLIDING_WINDOW=0 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000
Benchmark Results (M4 Max 128GB):
| Setting | Max Context | Memory |
|---|---|---|
| Default (1024) | ~10K tokens | ~16GB |
| GEMMA3_SLIDING_WINDOW=8192 | ~40K tokens | ~25GB |
| GEMMA3_SLIDING_WINDOW=0 | ~50K tokens | ~35GB |
We welcome contributions! See the Contributing Guide for details. Areas where contributions are especially welcome:
- Bug fixes and improvements
- Performance optimizations
- Documentation improvements
- Benchmarks on different Apple Silicon chips
Submit PRs to: https://github.com/waybarrios/vllm-mlx
Apache 2.0 - see LICENSE for details.
If you use vLLM-MLX in your research or project, please cite:
@software{vllm_mlx2025,
author = {Barrios, Wayner},
title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
year = {2025},
url = {https://github.com/waybarrios/vllm-mlx},
note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}