 ---
 title: Inference engines
-description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner.
+description: Learn about the llama.cpp, vLLM, and Diffusers inference engines in Docker Model Runner.
 weight: 50
-keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu
+keywords: Docker, ai, model runner, llama.cpp, vllm, diffusers, inference, gguf, safetensors, cuda, gpu, image generation, stable diffusion
 ---
 
-Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**.
+Docker Model Runner supports three inference engines: **llama.cpp**, **vLLM**, and **Diffusers**.
 Each engine has different strengths, supported platforms, and model format
 requirements. This guide helps you choose the right engine and configure it for
 your use case.
 
 ## Engine comparison
 
-| Feature | llama.cpp | vLLM |
-|---------|-----------|------|
-| **Model formats** | GGUF | Safetensors, HuggingFace |
-| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only |
-| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only |
-| **CPU inference** | Yes | No |
-| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited |
-| **Memory efficiency** | High (with quantization) | Moderate |
-| **Throughput** | Good | High (with batching) |
-| **Best for** | Local development, resource-constrained environments | Production, high throughput |
+| Feature | llama.cpp | vLLM | Diffusers |
+|---------|-----------|------|-----------|
+| **Model formats** | GGUF | Safetensors, HuggingFace | DDUF |
+| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only | Linux (x86_64, ARM64) |
+| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | NVIDIA CUDA only |
+| **CPU inference** | Yes | No | No |
+| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited | Limited |
+| **Memory efficiency** | High (with quantization) | Moderate | Moderate |
+| **Throughput** | Good | High (with batching) | Good |
+| **Best for** | Local development, resource-constrained environments | Production, high throughput | Image generation |
+| **Use case** | Text generation (LLMs) | Text generation (LLMs) | Image generation (Stable Diffusion) |
 
 ## llama.cpp
 
@@ -205,9 +206,95 @@ $ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
 | Apple Silicon Mac | llama.cpp |
 | Production deployment | vLLM (if hardware supports it) |
 
-## Running both engines
+## Diffusers
 
-You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes
+[Diffusers](https://github.com/huggingface/diffusers) is an inference engine
+for image generation models, including Stable Diffusion. Unlike llama.cpp and
+vLLM, which focus on text generation with LLMs, Diffusers generates images
+from text prompts.
+
+### Platform support
+
+| Platform | GPU | Support status |
+|----------|-----|----------------|
+| Linux x86_64 | NVIDIA CUDA | Supported |
+| Linux ARM64 | NVIDIA CUDA | Supported |
+| Windows | - | Not supported |
+| macOS | - | Not supported |
+
+> [!IMPORTANT]
+> Diffusers requires an NVIDIA GPU with CUDA support. It does not support
+> CPU-only inference.
+
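+Before installing, you can confirm that Docker can see a CUDA-capable GPU. A
+common smoke test (the CUDA image tag below is illustrative; adjust it to your
+setup):
+
+```console
+$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
+```
+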
+### Setting up Diffusers
+
+Install the Model Runner with the Diffusers backend:
+
+```console
+$ docker model reinstall-runner --backend diffusers --gpu cuda
+```
+
+Verify the installation:
+
+```console
+$ docker model status
+Docker Model Runner is running
+
+Status:
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
+vllm: vLLM binary not found
+diffusers: running diffusers version: 0.36.0
+```
+
+### Pulling Diffusers models
+
+Pull a Stable Diffusion model:
+
+```console
+$ docker model pull stable-diffusion:Q4
+```
+
+### Generating images with Diffusers
+
+Diffusers exposes an image generation endpoint rather than a chat completions
+endpoint. To generate an image:
+
+```console
+$ curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "stable-diffusion:Q4",
+    "prompt": "A picture of a nice cat",
+    "size": "512x512"
+  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
+```
+
+This command:
+
+1. Sends a POST request to the Diffusers image generation endpoint
+2. Specifies the model, prompt, and output image size
+3. Extracts the base64-encoded image from the response
+4. Decodes it and saves it as `image.png`
+
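+If the request succeeds, `image.png` is a decodable PNG. A quick sanity check
+(assuming the `file` utility is available; exact output varies with the
+requested size and model):
+
+```console
+$ file image.png
+image.png: PNG image data, 512 x 512, 8-bit/color RGB, non-interlaced
+```
+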
+### Diffusers API endpoint
+
+When using Diffusers, specify the engine in the API path:
+
+```text
+POST /engines/diffusers/v1/images/generations
+```
+
+### Supported parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
+| `prompt` | string | Required. The text description of the image to generate. |
+| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |
+
+## Running multiple engines
+
+You can run llama.cpp, vLLM, and Diffusers simultaneously. Docker Model Runner routes
 requests to the appropriate engine based on the model or explicit engine selection.
 
 Check which engines are running:
@@ -217,17 +304,21 @@ $ docker model status
 Docker Model Runner is running
 
 Status:
-llama.cpp: running llama.cpp version: c22473b
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
 vllm: running vllm version: 0.11.0
+diffusers: running diffusers version: 0.36.0
 ```
 
 ### Engine-specific API paths
 
-| Engine | API path |
-|--------|----------|
-| llama.cpp | `/engines/llama.cpp/v1/...` |
-| vLLM | `/engines/vllm/v1/...` |
-| Auto-select | `/engines/v1/...` |
+| Engine | API path | Use case |
+|--------|----------|----------|
+| llama.cpp | `/engines/llama.cpp/v1/chat/completions` | Text generation |
+| vLLM | `/engines/vllm/v1/chat/completions` | Text generation |
+| Diffusers | `/engines/diffusers/v1/images/generations` | Image generation |
+| Auto-select | `/engines/v1/chat/completions` | Text generation (auto-selects engine) |
 
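+For example, a text generation request against the auto-select path (a minimal
+sketch: `ai/smollm2` is an illustrative model name, and the payload assumes the
+OpenAI-compatible chat completions format these endpoints expose):
+
+```console
+$ curl -s -X POST http://localhost:12434/engines/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "ai/smollm2",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }' | jq -r '.choices[0].message.content'
+```
+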
 ## Managing inference engines
 
@@ -238,7 +329,7 @@ $ docker model install-runner --backend <engine> [--gpu <type>]
 ```
 
 Options:
-- `--backend`: `llama.cpp` or `vllm`
+- `--backend`: `llama.cpp`, `vllm`, or `diffusers`
 - `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)
 
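+For example, to install the vLLM backend with CUDA support (combining the
+options above):
+
+```console
+$ docker model install-runner --backend vllm --gpu cuda
+```
+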
 ### Reinstall an engine