vLLM-MLX

vLLM-like inference for Apple Silicon - GPU-accelerated Text, Image, Video & Audio on Mac

Python 3.10+ · Apple Silicon · Apache 2.0 License

Overview

vllm-mlx brings native Apple Silicon GPU acceleration to vLLM by integrating:

  • MLX: Apple's ML framework with unified memory and Metal kernels
  • mlx-lm: Optimized LLM inference with KV cache and quantization
  • mlx-vlm: Vision-language models for multimodal inference
  • mlx-audio: Speech-to-Text and Text-to-Speech with native voices
  • mlx-embeddings: Text embeddings for semantic search and RAG

Features

  • Multimodal - Text, Image, Video & Audio in one platform
  • Native GPU acceleration on Apple Silicon (M1, M2, M3, M4)
  • Native TTS voices - Spanish, French, Chinese, Japanese + 5 more languages
  • OpenAI API compatible - drop-in replacement for OpenAI client
  • Anthropic Messages API - native /v1/messages endpoint for Claude Code and OpenCode
  • Embeddings - OpenAI-compatible /v1/embeddings endpoint with mlx-embeddings
  • Reasoning Models - extract thinking process from Qwen3, DeepSeek-R1
  • MCP Tool Calling - integrate external tools via Model Context Protocol
  • Paged KV Cache - memory-efficient caching with prefix sharing
  • Continuous Batching - high throughput for multiple concurrent users

Quick Start

Installation

Using uv (recommended):

# Install as CLI tool (system-wide)
uv tool install git+https://github.com/waybarrios/vllm-mlx.git

# Or install in a project/virtual environment
uv pip install git+https://github.com/waybarrios/vllm-mlx.git

Using pip:

# Install from GitHub
pip install git+https://github.com/waybarrios/vllm-mlx.git

# Or clone and install in development mode
git clone https://github.com/waybarrios/vllm-mlx.git
cd vllm-mlx
pip install -e .

Start Server

# Simple mode (single user, max throughput)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000

# Continuous batching (multiple users)
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --continuous-batching

# With API key authentication
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --port 8000 --api-key your-secret-key
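Once the server is running, you can sanity-check it from a terminal. The /health endpoint is the same one used in the Gemma 3 section below; the chat request assumes the standard OpenAI-compatible /v1/chat/completions route.

# Check that the server is up
curl http://localhost:8000/health

# Minimal chat request (OpenAI-compatible route)
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello!"}]}'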

Use with OpenAI SDK

from openai import OpenAI

# Without API key (local development)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# With API key (production)
client = OpenAI(base_url="http://localhost:8000/v1", api_key="your-secret-key")

response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response.choices[0].message.content)
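Streaming works through the same client call; the following is a minimal sketch assuming the server honors the OpenAI stream=True option.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Print tokens as they arrive (assumes server-side streaming support)
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Write a haiku about Apple Silicon."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()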

Use with Anthropic SDK

vllm-mlx exposes an Anthropic-compatible /v1/messages endpoint, so tools like Claude Code and OpenCode can connect directly.

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

response = client.messages.create(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}]
)
print(response.content[0].text)

To use with Claude Code:

export ANTHROPIC_BASE_URL=http://localhost:8000
export ANTHROPIC_API_KEY=not-needed
claude

See Anthropic Messages API docs for streaming, tool calling, system messages, and token counting.
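As a quick illustration of the streaming mentioned above, here is a minimal sketch using the Anthropic SDK's streaming helper against the same endpoint.

from anthropic import Anthropic

client = Anthropic(base_url="http://localhost:8000", api_key="not-needed")

# Stream text deltas from /v1/messages
with client.messages.stream(
    model="default",
    max_tokens=256,
    messages=[{"role": "user", "content": "Hello!"}],
) as stream:
    for text in stream.text_stream:
        print(text, end="", flush=True)
print()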

Multimodal (Images & Video)

# Start a vision-language model server
vllm-mlx serve mlx-community/Qwen3-VL-4B-Instruct-3bit --port 8000

# Describe an image with the OpenAI client
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
        ]
    }]
)
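For local files, a common OpenAI-compatible pattern is to inline the image as a base64 data URL. The sketch below assumes the server accepts data URLs the same way it accepts remote URLs; image.jpg is a placeholder path.

import base64
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Encode a local image as a data URL (path is a placeholder)
with open("image.jpg", "rb") as f:
    b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)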

Audio (TTS/STT)

# Install audio dependencies
pip install "vllm-mlx[audio]"   # quotes avoid zsh glob expansion
python -m spacy download en_core_web_sm
brew install espeak-ng  # needed for non-English TTS languages

# Text-to-Speech (English)
python examples/tts_example.py "Hello, how are you?" --play

# Text-to-Speech (Spanish)
python examples/tts_multilingual.py "Hola mundo" --lang es --play

# List available models and languages
python examples/tts_multilingual.py --list-models
python examples/tts_multilingual.py --list-languages

Supported TTS Models:

Model        Languages                         Description
Kokoro       EN, ES, FR, JA, ZH, IT, PT, HI    Fast, 82M params, 11 voices
Chatterbox   15+ languages                     Expressive, voice cloning
VibeVoice    EN                                Realtime, low latency
VoxCPM       ZH, EN                            High-quality Chinese/English

Reasoning Models

Extract the thinking process from reasoning models like Qwen3 and DeepSeek-R1:

# Start server with reasoning parser
vllm-mlx serve mlx-community/Qwen3-8B-4bit --reasoning-parser qwen3

# Ask a question; the parser separates the thinking from the answer
response = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "What is 17 × 23?"}]
)

# Access reasoning separately from the answer
print("Thinking:", response.choices[0].message.reasoning)
print("Answer:", response.choices[0].message.content)
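Depending on the model and prompt, the parser may find no <think> block, in which case the reasoning field can be empty or missing; a defensive sketch reusing the response above:

# Fall back gracefully when no reasoning was extracted
message = response.choices[0].message
reasoning = getattr(message, "reasoning", None)
if reasoning:
    print("Thinking:", reasoning)
else:
    print("(no reasoning extracted)")
print("Answer:", message.content)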

Supported Parsers:

Parser        Models          Description
qwen3         Qwen3 series    Requires both <think> and </think> tags
deepseek_r1   DeepSeek-R1     Handles implicit <think> tag

Embeddings

Generate text embeddings for semantic search, RAG, and similarity:

# Start server with an embedding model pre-loaded
vllm-mlx serve mlx-community/Llama-3.2-3B-Instruct-4bit --embedding-model mlx-community/all-MiniLM-L6-v2-4bit

# Generate embeddings using the OpenAI SDK
embeddings = client.embeddings.create(
    model="mlx-community/all-MiniLM-L6-v2-4bit",
    input=["Hello world", "How are you?"]
)
print(f"Dimensions: {len(embeddings.data[0].embedding)}")

See Embeddings Guide for details on supported models and lazy loading.
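For semantic search or RAG, the returned vectors can be compared directly; a minimal NumPy sketch reusing the two embeddings from the request above:

import numpy as np

# Cosine similarity between the two embeddings returned above
a = np.array(embeddings.data[0].embedding)
b = np.array(embeddings.data[1].embedding)
similarity = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
print(f"cosine('Hello world', 'How are you?') = {similarity:.3f}")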

Documentation

For full documentation, see the docs directory.

Architecture

┌─────────────────────────────────────────────────────────────────────────┐
│                           vLLM API Layer                                │
│                    (OpenAI-compatible interface)                         │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                            MLXPlatform                                  │
│               (vLLM platform plugin for Apple Silicon)                  │
└─────────────────────────────────────────────────────────────────────────┘
                                   │
        ┌─────────────┬────────────┴────────────┬─────────────┐
        ▼             ▼                         ▼             ▼
┌───────────────┐ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐
│    mlx-lm     │ │   mlx-vlm     │ │   mlx-audio   │ │mlx-embeddings │
│(LLM inference)│ │ (Vision+LLM)  │ │  (TTS + STT)  │ │ (Embeddings)  │
└───────────────┘ └───────────────┘ └───────────────┘ └───────────────┘
        │             │                         │             │
        └─────────────┴─────────────────────────┴─────────────┘
                                   │
                                   ▼
┌─────────────────────────────────────────────────────────────────────────┐
│                              MLX                                        │
│                (Apple ML Framework - Metal kernels)                      │
└─────────────────────────────────────────────────────────────────────────┘

Performance

LLM Performance (M4 Max, 128GB):

Model               Speed      Memory
Qwen3-0.6B-8bit     402 tok/s  0.7 GB
Llama-3.2-1B-4bit   464 tok/s  0.7 GB
Llama-3.2-3B-4bit   200 tok/s  1.8 GB

Continuous Batching (5 concurrent requests):

Model               Single     Batched     Speedup
Qwen3-0.6B-8bit     328 tok/s  1112 tok/s  3.4x
Llama-3.2-1B-4bit   299 tok/s  613 tok/s   2.0x

Audio - Speech-to-Text (M4 Max, 128GB):

Model                    RTF*   Use Case
whisper-tiny             197x   Real-time, low latency
whisper-large-v3-turbo   55x    Best quality/speed balance
whisper-large-v3         24x    Highest accuracy

*RTF = Real-Time Factor: an RTF of 100x means 1 minute of audio is transcribed in ~0.6 seconds.
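To read the table, estimated transcription time is audio duration divided by RTF; for example, for a 10-minute clip:

# Transcription time ≈ audio duration / RTF (figures from the table above)
audio_seconds = 10 * 60
for model, rtf in [("whisper-tiny", 197), ("whisper-large-v3-turbo", 55), ("whisper-large-v3", 24)]:
    print(f"{model}: ~{audio_seconds / rtf:.1f} s")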

See benchmarks for detailed results.

Gemma 3 Support

vllm-mlx includes native support for Gemma 3 vision models; they are detected automatically and loaded as multimodal (MLLM) rather than text-only (LLM) models.

Usage

# Start server with Gemma 3
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Verify it loaded as MLLM (not LLM)
curl http://localhost:8000/health
# Should show: "model_type": "mllm"
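Once the health check reports mllm, image requests use the same OpenAI-compatible format as the Multimodal section above (the image URL is a placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

# Same multimodal request format as the Qwen3-VL example above
response = client.chat.completions.create(
    model="default",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "What's in this image?"},
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)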

Long Context Patch (mlx-vlm)

Gemma 3's default sliding_window=1024 limits usable context to ~10K tokens on Apple Silicon (larger contexts trigger a Metal GPU timeout). To enable longer context (up to ~50K tokens), patch mlx-vlm:

Location: ~/.../site-packages/mlx_vlm/models/gemma3/language.py

Find the make_cache method and replace it with:

def make_cache(self):
    import os
    # Set GEMMA3_SLIDING_WINDOW=8192 for ~40K context
    # Set GEMMA3_SLIDING_WINDOW=0 for ~50K context (full KVCache)
    sliding_window = int(os.environ.get('GEMMA3_SLIDING_WINDOW', self.config.sliding_window))

    caches = []
    for i in range(self.config.num_hidden_layers):
        if (
            i % self.config.sliding_window_pattern
            == self.config.sliding_window_pattern - 1
        ):
            caches.append(KVCache())
        elif sliding_window == 0:
            caches.append(KVCache())  # Full context for all layers
        else:
            caches.append(RotatingKVCache(max_size=sliding_window, keep=0))
    return caches

Usage:

# Default (~10K max context)
vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Extended context (~40K max)
GEMMA3_SLIDING_WINDOW=8192 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

# Maximum context (~50K max)
GEMMA3_SLIDING_WINDOW=0 vllm-mlx serve mlx-community/gemma-3-27b-it-4bit --port 8000

Benchmark Results (M4 Max 128GB):

Setting                      Max Context   Memory
Default (1024)               ~10K tokens   ~16GB
GEMMA3_SLIDING_WINDOW=8192   ~40K tokens   ~25GB
GEMMA3_SLIDING_WINDOW=0      ~50K tokens   ~35GB

Contributing

We welcome contributions! See Contributing Guide for details.

  • Bug fixes and improvements
  • Performance optimizations
  • Documentation improvements
  • Benchmarks on different Apple Silicon chips

Submit PRs to: https://github.com/waybarrios/vllm-mlx

License

Apache 2.0 - see LICENSE for details.

Citation

If you use vLLM-MLX in your research or project, please cite:

@software{vllm_mlx2025,
  author = {Barrios, Wayner},
  title = {vLLM-MLX: Apple Silicon MLX Backend for vLLM},
  year = {2025},
  url = {https://github.com/waybarrios/vllm-mlx},
  note = {Native GPU-accelerated LLM and vision-language model inference on Apple Silicon}
}

Acknowledgments
