 ---
 title: Inference engines
-description: Learn about the llama.cpp and vLLM inference engines in Docker Model Runner.
+description: Learn about the llama.cpp, vLLM, and Diffusers inference engines in Docker Model Runner.
 weight: 50
-keywords: Docker, ai, model runner, llama.cpp, vllm, inference, gguf, safetensors, cuda, gpu
+keywords: Docker, ai, model runner, llama.cpp, vllm, diffusers, inference, gguf, safetensors, cuda, gpu, image generation, stable diffusion
 ---
 
-Docker Model Runner supports two inference engines: **llama.cpp** and **vLLM**.
+Docker Model Runner supports three inference engines: **llama.cpp**, **vLLM**, and **Diffusers**.
 Each engine has different strengths, supported platforms, and model format
 requirements. This guide helps you choose the right engine and configure it for
 your use case.
 
 ## Engine comparison
 
-| Feature | llama.cpp | vLLM |
-|---------|-----------|------|
-| **Model formats** | GGUF | Safetensors, HuggingFace |
-| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only |
-| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only |
-| **CPU inference** | Yes | No |
-| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited |
-| **Memory efficiency** | High (with quantization) | Moderate |
-| **Throughput** | Good | High (with batching) |
-| **Best for** | Local development, resource-constrained environments | Production, high throughput |
+| Feature | llama.cpp | vLLM | Diffusers |
+|---------|-----------|------|-----------|
+| **Model formats** | GGUF | Safetensors, HuggingFace | DDUF |
+| **Platforms** | All (macOS, Windows, Linux) | Linux x86_64 only | Linux (x86_64, ARM64) |
+| **GPU support** | NVIDIA, AMD, Apple Silicon, Vulkan | NVIDIA CUDA only | NVIDIA CUDA only |
+| **CPU inference** | Yes | No | No |
+| **Quantization** | Built-in (Q4, Q5, Q8, etc.) | Limited | Limited |
+| **Memory efficiency** | High (with quantization) | Moderate | Moderate |
+| **Throughput** | Good | High (with batching) | Good |
+| **Best for** | Local development, resource-constrained environments | Production, high throughput | Image generation |
+| **Use case** | Text generation (LLMs) | Text generation (LLMs) | Image generation (Stable Diffusion) |
 
 ## llama.cpp
 
@@ -205,9 +206,95 @@ $ docker model configure --hf_overrides '{"max_model_len": 8192}' ai/model-vllm
 | Apple Silicon Mac | llama.cpp |
 | Production deployment | vLLM (if hardware supports it) |
 
-## Running both engines
+## Diffusers
 
-You can run both llama.cpp and vLLM simultaneously. Docker Model Runner routes
+[Diffusers](https://github.com/huggingface/diffusers) is an inference engine
+for image generation models, including Stable Diffusion. Unlike llama.cpp and
+vLLM, which focus on text generation with LLMs, Diffusers generates images
+from text prompts.
+
+### Platform support
+
+| Platform | GPU | Support status |
+|----------|-----|----------------|
+| Linux x86_64 | NVIDIA CUDA | Supported |
+| Linux ARM64 | NVIDIA CUDA | Supported |
+| Windows | - | Not supported |
+| macOS | - | Not supported |
+
+> [!IMPORTANT]
+> Diffusers requires an NVIDIA GPU with CUDA support. It does not support
+> CPU-only inference.
+
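+Before installing, you can confirm that Docker can see a CUDA-capable GPU. A
+common smoke test (the CUDA image tag below is illustrative; adjust it to your
+setup):
+
+```console
+$ docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi
+```
+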
+### Setting up Diffusers
+
+Install the Model Runner with the Diffusers backend:
+
+```console
+$ docker model reinstall-runner --backend diffusers --gpu cuda
+```
+
+Verify the installation:
+
+```console
+$ docker model status
+Docker Model Runner is running
+
+Status:
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
+vllm: vLLM binary not found
+diffusers: running diffusers version: 0.36.0
+```
+
+### Pulling Diffusers models
+
+Pull a Stable Diffusion model:
+
+```console
+$ docker model pull stable-diffusion:Q4
+```
+
+### Generating images with Diffusers
+
+Diffusers exposes an image generation endpoint rather than a chat completions
+endpoint. To generate an image:
+
+```console
+$ curl -s -X POST http://localhost:12434/engines/diffusers/v1/images/generations \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "stable-diffusion:Q4",
+    "prompt": "A picture of a nice cat",
+    "size": "512x512"
+  }' | jq -r '.data[0].b64_json' | base64 -d > image.png
+```
+
+This command:
+
+1. Sends a POST request to the Diffusers image generation endpoint
+2. Specifies the model, prompt, and output image size
+3. Extracts the base64-encoded image from the response
+4. Decodes it and saves it as `image.png`
+
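+If the request succeeds, `image.png` is a decodable PNG. A quick sanity check
+(assuming the `file` utility is available; exact output varies with the
+requested size and model):
+
+```console
+$ file image.png
+image.png: PNG image data, 512 x 512, 8-bit/color RGB, non-interlaced
+```
+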
+### Diffusers API endpoint
+
+When using Diffusers, specify the engine in the API path:
+
+```text
+POST /engines/diffusers/v1/images/generations
+```
+
+### Supported parameters
+
+| Parameter | Type | Description |
+|-----------|------|-------------|
+| `model` | string | Required. The model identifier (e.g., `stable-diffusion:Q4`). |
+| `prompt` | string | Required. The text description of the image to generate. |
+| `size` | string | Image dimensions in `WIDTHxHEIGHT` format (e.g., `512x512`). |
+
+## Running multiple engines
+
+You can run llama.cpp, vLLM, and Diffusers simultaneously. Docker Model Runner routes
 requests to the appropriate engine based on the model or explicit engine selection.
 
 Check which engines are running:
@@ -217,17 +304,21 @@ $ docker model status
 Docker Model Runner is running
 
 Status:
-llama.cpp: running llama.cpp version: c22473b
+llama.cpp: running llama.cpp version: 34ce48d
+mlx: not installed
+sglang: sglang package not installed
 vllm: running vllm version: 0.11.0
+diffusers: running diffusers version: 0.36.0
 ```
 
 ### Engine-specific API paths
 
-| Engine | API path |
-|--------|----------|
-| llama.cpp | `/engines/llama.cpp/v1/...` |
-| vLLM | `/engines/vllm/v1/...` |
-| Auto-select | `/engines/v1/...` |
+| Engine | API path | Use case |
+|--------|----------|----------|
+| llama.cpp | `/engines/llama.cpp/v1/chat/completions` | Text generation |
+| vLLM | `/engines/vllm/v1/chat/completions` | Text generation |
+| Diffusers | `/engines/diffusers/v1/images/generations` | Image generation |
+| Auto-select | `/engines/v1/chat/completions` | Text generation (auto-selects engine) |
 
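+For example, a text generation request against the auto-select path (a minimal
+sketch: `ai/smollm2` is an illustrative model name, and the payload assumes the
+OpenAI-compatible chat completions format these endpoints expose):
+
+```console
+$ curl -s -X POST http://localhost:12434/engines/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "model": "ai/smollm2",
+    "messages": [{"role": "user", "content": "Hello!"}]
+  }' | jq -r '.choices[0].message.content'
+```
+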
 ## Managing inference engines
 
@@ -238,7 +329,7 @@ $ docker model install-runner --backend <engine> [--gpu <type>]
 ```
 
 Options:
-- `--backend`: `llama.cpp` or `vllm`
+- `--backend`: `llama.cpp`, `vllm`, or `diffusers`
 - `--gpu`: `cuda`, `rocm`, `vulkan`, or `metal` (depends on platform)
 
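+For example, to install the vLLM backend with CUDA support (combining the
+options above):
+
+```console
+$ docker model install-runner --backend vllm --gpu cuda
+```
+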
 ### Reinstall an engine