Commit c1ecf14

Merge pull request #24017 from ilopezluna/add-diffusers-support
feat: add support for Diffusers inference engine and image generation API
2 parents: b9a10ed + c0e77f2

3 files changed: +179 −28 lines changed

content/manuals/ai/model-runner/_index.md

Lines changed: 7 additions & 5 deletions

````diff
@@ -6,7 +6,7 @@ params:
 group: AI
 weight: 30
 description: Learn how to use Docker Model Runner to manage and run AI models.
-keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor
+keywords: Docker, ai, model runner, docker desktop, docker engine, llm, openai, ollama, llama.cpp, vllm, diffusers, cpu, nvidia, cuda, amd, rocm, vulkan, cline, continue, cursor, image generation, stable diffusion
 aliases:
 - /desktop/features/model-runner/
 - /model-runner/
@@ -34,7 +34,8 @@ with AI models locally.

 - [Pull and push models to and from Docker Hub](https://hub.docker.com/u/ai)
 - Serve models on [OpenAI and Ollama-compatible APIs](api-reference.md) for easy integration with existing apps
-- Support for both [llama.cpp and vLLM inference engines](inference-engines.md) (vLLM on Linux x86_64/amd64 and Windows WSL2 with NVIDIA GPUs)
+- Support for [llama.cpp, vLLM, and Diffusers inference engines](inference-engines.md) (vLLM and Diffusers on Linux with NVIDIA GPUs)
+- [Generate images from text prompts](inference-engines.md#diffusers) using Stable Diffusion models with the Diffusers backend
 - Package GGUF and Safetensors files as OCI Artifacts and publish them to any Container Registry
 - Run and interact with AI models directly from the command line or from the Docker Desktop GUI
 - [Connect to AI coding tools](ide-integrations.md) like Cline, Continue, Cursor, and Aider
@@ -89,14 +90,15 @@ access. You can interact with the model using

 ### Inference engines

-Docker Model Runner supports two inference engines:
+Docker Model Runner supports three inference engines:

 | Engine | Best for | Model format |
 |--------|----------|--------------|
 | [llama.cpp](inference-engines.md#llamacpp) | Local development, resource efficiency | GGUF (quantized) |
 | [vLLM](inference-engines.md#vllm) | Production, high throughput | Safetensors |
+| [Diffusers](inference-engines.md#diffusers) | Image generation (Stable Diffusion) | Safetensors |

-llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. See [Inference engines](inference-engines.md) for detailed comparison and setup.
+llama.cpp is the default engine and works on all platforms. vLLM requires NVIDIA GPUs and is supported on Linux x86_64 and Windows with WSL2. Diffusers enables image generation and requires NVIDIA GPUs on Linux (x86_64 or ARM64). See [Inference engines](inference-engines.md) for detailed comparison and setup.

 ### Context size

@@ -159,6 +161,6 @@ Thanks for trying out Docker Model Runner. To report bugs or request features, [
 - [Get started with DMR](get-started.md) - Enable DMR and run your first model
 - [API reference](api-reference.md) - OpenAI and Ollama-compatible API documentation
 - [Configuration options](configuration.md) - Context size and runtime parameters
-- [Inference engines](inference-engines.md) - llama.cpp and vLLM details
+- [Inference engines](inference-engines.md) - llama.cpp, vLLM, and Diffusers details
 - [IDE integrations](ide-integrations.md) - Connect Cline, Continue, Cursor, and more
 - [Open WebUI integration](openwebui-integration.md) - Set up a web chat interface
````

content/manuals/ai/model-runner/api-reference.md

Lines changed: 59 additions & 1 deletion

````diff
@@ -68,6 +68,7 @@ Docker Model Runner supports multiple API formats:
 | [OpenAI API](#openai-compatible-api) | OpenAI-compatible chat completions, embeddings | Most AI frameworks and tools |
 | [Anthropic API](#anthropic-compatible-api) | Anthropic-compatible messages endpoint | Tools built for Claude |
 | [Ollama API](#ollama-compatible-api) | Ollama-compatible endpoints | Tools built for Ollama |
+| [Image Generation API](#image-generation-api-diffusers) | Diffusers-based image generation | Generating images from text prompts |
 | [DMR API](#dmr-native-endpoints) | Native Docker Model Runner endpoints | Model management |

 ## OpenAI-compatible API
@@ -223,6 +224,63 @@ curl http://localhost:12434/api/chat \
 curl http://localhost:12434/api/tags
 ```

+## Image generation API (Diffusers)
+
+DMR supports image generation through the Diffusers backend, enabling you to generate
+images from text prompts using models like Stable Diffusion.
+
+> [!NOTE]
+> The Diffusers backend requires an NVIDIA GPU with CUDA support and is only
+> available on Linux (x86_64 and ARM64). See [Inference engines](inference-engines.md#diffusers)
+> for setup instructions.
+
+### Endpoint
+
+| Endpoint | Method | Description |
+|----------|--------|-------------|
+| `/engines/diffusers/v1/images/generations` | POST | Generate an image from a text prompt |
+
+### Supported parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
+| `prompt` | string | Required. The text description of the image to generate. |
+| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |
+
+### Response format
+
+The API returns a JSON response with the generated image encoded in base64:
+
+```json
+{
+  "data": [
+    {
+      "b64_json": "<base64-encoded-image-data>"
+    }
+  ]
+}
+```
+
+### Example: Generate an image
+
+```bash
+curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "stable-diffusion:Q4",
+    "prompt": "A picture of a nice cat",
+    "size": "512x512"
+  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
+```
+
+This command:
+1. Sends a POST request to the Diffusers image generation endpoint
+2. Specifies the model, prompt, and output image size
+3. Extracts the base64-encoded image from the response using `jq`
+4. Decodes the base64 data and saves it as `image.png`
+
+
 ## DMR native endpoints

 These endpoints are specific to Docker Model Runner for model management:
@@ -378,4 +436,4 @@ console.log(response.choices[0].message.content);

 - [IDE and tool integrations](ide-integrations.md) - Configure Cline, Continue, Cursor, and other tools
 - [Configuration options](configuration.md) - Adjust context size and runtime parameters
-- [Inference engines](inference-engines.md) - Learn about llama.cpp and vLLM options
+- [Inference engines](inference-engines.md) - Learn about llama.cpp, vLLM, and Diffusers options
````
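
The endpoint added above mirrors OpenAI's `/v1/images/generations` shape, so an OpenAI-compatible client can plausibly call it directly. A minimal Python sketch, assuming the OpenAI SDK tolerates the engine-scoped base URL and the `512x512` size value (neither is confirmed by this commit):

```python
# Sketch: point the OpenAI SDK at DMR's Diffusers engine path.
# Assumptions: the SDK's images.generate() maps onto the documented
# endpoint, and DMR ignores the (required-by-SDK) API key.
import base64

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:12434/engines/diffusers/v1",
    api_key="unused",  # the docs' curl examples send no auth header
)

result = client.images.generate(
    model="stable-diffusion:Q4",
    prompt="A picture of a nice cat",
    size="512x512",
)

# Per the documented response format, the image arrives base64-encoded.
with open("image.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```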

content/manuals/ai/model-runner/inference-engines.md

Lines changed: 113 additions & 22 deletions

````diff
@@ -1,27 +1,28 @@
 ---
 title: Inference engines
-description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner.
+description: Learn about the llama.cpp, vLLM, and Diffusers inference engines in Docker Model Runner.
 weight: 50
-keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu
+keywords: Docker, ai, model runner, llama.cpp, vllm, diffusers, inference, gguf, safetensors, cuda, gpu, image generation, stable diffusion
 ---

-Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**.
+Docker Model Runner supports three inference engines: **llama.cpp**, **vLLM**, and **Diffusers**.
 Each engine has different strengths, supported platforms, and model format
 requirements. This guide helps you choose the right engine and configure it for
 your use case.

 ## Engine comparison

-| Feature | llama.cpp | vLLM |
-|---------|-----------|------|
-| **Model formats** | GGUF | Safetensors, HuggingFace |
-| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only |
-| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only |
-| **CPU inference** | Yes | No |
-| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited |
-| **Memory efficiency** | High (with quantization) | Moderate |
-| **Throughput** | Good | High (with batching) |
-| **Best for** | Local development, resource-constrained environments | Production, high throughput |
+| Feature | llama.cpp | vLLM | Diffusers |
+|---------|-----------|------|-------------------------------------|
+| **Model formats** | GGUF | Safetensors, HuggingFace | DDUF |
+| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only | Linux (x86_64, ARM64) |
+| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | NVIDIA CUDA only |
+| **CPU inference** | Yes | No | No |
+| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited | Limited |
+| **Memory efficiency** | High (with quantization) | Moderate | Moderate |
+| **Throughput** | Good | High (with batching) | Good |
+| **Best for** | Local development, resource-constrained environments | Production, high throughput | Image generation |
+| **Use case** | Text generation (LLMs) | Text generation (LLMs) | Image generation (Stable Diffusion) |

 ## llama.cpp

@@ -205,9 +206,95 @@ $ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
 | Apple Silicon Mac | llama.cpp |
 | Production deployment | vLLM (if hardware supports it) |

-## Running both engines
+## Diffusers

-You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes
+[Diffusers](https://github.com/huggingface/diffusers) is an inference engine
+for image generation models, including Stable Diffusion. Unlike llama.cpp and
+vLLM, which focus on text generation with LLMs, Diffusers enables you to generate
+images from text prompts.
+
+### Platform support
+
+| Platform | GPU | Support status |
+|----------|-----|----------------|
+| Linux x86_64 | NVIDIA CUDA | Supported |
+| Linux ARM64 | NVIDIA CUDA | Supported |
+| Windows | - | Not supported |
+| macOS | - | Not supported |
+
+> [!IMPORTANT]
+> Diffusers requires an NVIDIA GPU with CUDA support. It does not support
+> CPU-only inference.
+
+### Setting up Diffusers
+
+Install the Model Runner with the Diffusers backend:
+
+```console
+$ docker model reinstall-runner --backend diffusers --gpu cuda
+```
+
+Verify the installation:
+
+```console
+$ docker model status
+Docker Model Runner is running
+
+Status:
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
+vllm: vLLM binary not found
+diffusers: running diffusers version: 0.36.0
+```
+
+### Pulling Diffusers models
+
+Pull a Stable Diffusion model:
+
+```console
+$ docker model pull stable-diffusion:Q4
+```
+
+### Generating images with Diffusers
+
+Diffusers uses an image generation API endpoint. To generate an image:
+
+```console
+$ curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "stable-diffusion:Q4",
+    "prompt": "A picture of a nice cat",
+    "size": "512x512"
+  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
+```
+
+This command:
+1. Sends a POST request to the Diffusers image generation endpoint
+2. Specifies the model, prompt, and output image size
+3. Extracts the base64-encoded image from the response
+4. Decodes it and saves it as `image.png`
+
+### Diffusers API endpoint
+
+When using Diffusers, specify the engine in the API path:
+
+```text
+POST /engines/diffusers/v1/images/generations
+```
+
+### Supported parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
+| `prompt` | string | Required. The text description of the image to generate. |
+| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |
+
+## Running multiple engines
+
+You can run llama.cpp, vLLM, and Diffusers simultaneously. Docker Model Runner routes
 requests to the appropriate engine based on the model or explicit engine selection.

 Check which engines are running:
````
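
For script use, the curl-and-jq pipeline in the hunk above reduces to one plain HTTP call. Here is a sketch that assumes only the endpoint, request fields, and response shape the new page documents (the `requests` dependency and the timeout are illustrative choices):

```python
# Sketch: call the Diffusers image generation endpoint directly.
import base64

import requests

resp = requests.post(
    "http://localhost:12434/engines/diffusers/v1/images/generations",
    json={
        "model": "stable-diffusion:Q4",
        "prompt": "A picture of a nice cat",
        "size": "512x512",
    },
    timeout=600,  # illustrative; first-run generation can be slow
)
resp.raise_for_status()

# Decode the base64 payload from data[0].b64_json and write the image.
with open("image.png", "wb") as f:
    f.write(base64.b64decode(resp.json()["data"][0]["b64_json"]))
```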

````diff
@@ -217,17 +304,21 @@ $ docker model status
 Docker Model Runner is running

 Status:
-llama.cpp: running llama.cpp version: c22473b
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
 vllm: running vllm version: 0.11.0
+diffusers: running diffusers version: 0.36.0
 ```

 ### Engine-specific API paths

-| Engine | API path |
-|--------|----------|
-| llama.cpp | `/engines/llama.cpp/v1/...` |
-| vLLM | `/engines/vllm/v1/...` |
-| Auto-select | `/engines/v1/...` |
+| Engine | API path | Use case |
+|--------|----------|----------|
+| llama.cpp | `/engines/llama.cpp/v1/chat/completions` | Text generation |
+| vLLM | `/engines/vllm/v1/chat/completions` | Text generation |
+| Diffusers | `/engines/diffusers/v1/images/generations` | Image generation |
+| Auto-select | `/engines/v1/chat/completions` | Text generation (auto-selects engine) |

 ## Managing inference engines

@@ -238,7 +329,7 @@ $ docker model install-runner --backend <engine> [--gpu <type>]
 ```

 Options:
-- `--backend`: `llama.cpp` or `vllm`
+- `--backend`: `llama.cpp`, `vllm`, or `diffusers`
 - `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)

 ### Reinstall an engine
````
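
The engine-scoped paths in the table above are what let text and image workloads share one runner. A sketch of a session that exercises two engines side by side — the text model name is a hypothetical placeholder for any pulled GGUF model, and the `requests` dependency is assumed:

```python
# Sketch: route requests explicitly by engine path, per the table above.
import requests

BASE = "http://localhost:12434/engines"

# Text generation through the llama.cpp path (OpenAI-compatible shape).
chat = requests.post(
    f"{BASE}/llama.cpp/v1/chat/completions",
    json={
        "model": "ai/smollm2",  # hypothetical; any pulled GGUF model works
        "messages": [{"role": "user", "content": "Describe a cozy cat."}],
    },
)
print(chat.json()["choices"][0]["message"]["content"])

# Image generation through the Diffusers path, as documented in the diff.
image = requests.post(
    f"{BASE}/diffusers/v1/images/generations",
    json={
        "model": "stable-diffusion:Q4",
        "prompt": "A cozy cat",
        "size": "512x512",
    },
)
print(len(image.json()["data"][0]["b64_json"]), "base64 characters returned")
```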
