
Commit ba47a16

Add dmrlet container orchestrator for AI inference
This commit introduces dmrlet, a purpose-built container orchestrator designed specifically for AI inference workloads. Unlike Kubernetes, dmrlet focuses exclusively on running stateless inference containers with zero configuration overhead. Multi-GPU mapping "just works" without YAML, device plugins, or node selectors.

The orchestrator supports multiple inference backends, including llama.cpp, vLLM, and SGLang, with automatic backend detection based on model format. It provides seamless multi-GPU allocation and management, along with auto-scaling based on QPS, latency, and GPU utilization metrics.

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
1 parent 232f06b commit ba47a16

File tree

26 files changed: +5270, -46 lines

README.md

Lines changed: 109 additions & 0 deletions
@@ -415,6 +415,115 @@ in the form of [a Helm chart and static YAML](charts/docker-model-runner/README.
If you are interested in a specific Kubernetes use-case, please start a
discussion on the issue tracker.

## dmrlet: Container Orchestrator for AI Inference

dmrlet is a purpose-built container orchestrator for AI inference workloads. Unlike Kubernetes, it focuses exclusively on running stateless inference containers with zero configuration overhead. Multi-GPU mapping "just works" without YAML, device plugins, or node selectors.

### Key Features

| Feature | Kubernetes | dmrlet |
|---------|------------|--------|
| Multi-GPU setup | Device plugins + node selectors + resource limits YAML | `dmrlet serve llama3 --gpus all` |
| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |
| Time to first inference | Minutes (pod scheduling, image pull) | Seconds (model already local) |
| Model management | External (mount PVCs, manage yourself) | Integrated with Docker Model Runner store |

### Building dmrlet

```bash
# Build the dmrlet binary
go build -o dmrlet ./cmd/dmrlet

# Verify it works
./dmrlet --help
```

### Usage

**Start the daemon:**
```bash
# Start in foreground
dmrlet daemon

# With custom socket path
dmrlet daemon --socket /tmp/dmrlet.sock
```

**Serve a model:**
```bash
# Auto-detect backend and GPUs
dmrlet serve llama3.2

# Specify backend
dmrlet serve llama3.2 --backend vllm

# Specify GPU allocation
dmrlet serve llama3.2 --gpus 0,1
dmrlet serve llama3.2 --gpus all

# Multiple replicas
dmrlet serve llama3.2 --replicas 2

# Backend-specific options
dmrlet serve llama3.2 --ctx-size 4096   # llama.cpp context size
dmrlet serve llama3.2 --gpu-memory 0.8  # vLLM GPU memory utilization
```

**List running models:**
```bash
dmrlet ps
# MODEL     BACKEND    REPLICAS  GPUS       ENDPOINTS        STATUS
# llama3.2  llama.cpp  1         [0,1,2,3]  localhost:30000  healthy
```

**View logs:**
```bash
dmrlet logs llama3.2     # Last 100 lines
dmrlet logs llama3.2 -f  # Follow logs
```

**Scale replicas:**
```bash
dmrlet scale llama3.2 4  # Scale to 4 replicas
```

**Stop a model:**
```bash
dmrlet stop llama3.2
dmrlet stop --all  # Stop all models
```

**Check status:**
```bash
dmrlet status
# DAEMON: running
# SOCKET: /var/run/dmrlet.sock
#
# GPUS:
#   GPU 0: NVIDIA A100 80GB  81920MB  (in use: llama3.2)
#   GPU 1: NVIDIA A100 80GB  81920MB  (available)
#
# MODELS: 1 running
```

### Supported Backends

- **llama.cpp** - Default backend for GGUF models
- **vLLM** - High-throughput serving for safetensors models
- **SGLang** - Fast serving with RadixAttention

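The backend is picked automatically from the model format, as the table and serve examples above suggest. Here is a minimal sketch of what format-based detection can look like; the function name and dispatch logic are illustrative assumptions, not dmrlet's actual internals:

```go
// Illustrative sketch of format-based backend detection; names and
// logic are assumptions, not the actual dmrlet implementation.
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// detectBackend maps a model file format to a serving backend:
// GGUF defaults to llama.cpp, safetensors to vLLM.
func detectBackend(modelPath string) string {
	switch strings.ToLower(filepath.Ext(modelPath)) {
	case ".gguf":
		return "llama.cpp"
	case ".safetensors":
		return "vllm"
	default:
		return "llama.cpp" // llama.cpp is the default backend
	}
}

func main() {
	fmt.Println(detectBackend("llama3.2-Q4_K_M.gguf")) // llama.cpp
	fmt.Println(detectBackend("model.safetensors"))    // vllm
}
```
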
### Architecture

```
dmrlet daemon
├── GPU Manager        - Auto-detect and allocate GPUs
├── Container Manager  - Docker-based container lifecycle
├── Service Registry   - Endpoint discovery with load balancing
├── Health Monitor     - Auto-restart unhealthy containers
├── Auto-scaler        - Scale based on QPS/latency/GPU utilization
└── Log Aggregator     - Centralized log collection
```

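The Auto-scaler component in the tree above reacts to QPS, latency, and GPU utilization. A rough sketch of that kind of scaling decision follows; the metric fields and thresholds are invented for illustration, since dmrlet's real signals and cutoffs are not among the files excerpted here:

```go
// Hypothetical sketch of a QPS/latency/GPU-utilization scaling rule;
// field names and thresholds are invented for illustration.
package main

import "fmt"

// metrics is a snapshot of the signals the auto-scaler watches.
type metrics struct {
	qpsPerReplica float64 // requests/s handled by each replica
	p95LatencyMS  float64 // 95th-percentile request latency
	gpuUtil       float64 // 0.0-1.0 across allocated GPUs
}

// desiredReplicas adds a replica when any signal runs hot, removes one
// when all signals are cold, and clamps the result to [min, max].
func desiredReplicas(current, min, max int, m metrics) int {
	switch {
	case m.qpsPerReplica > 50 || m.p95LatencyMS > 500 || m.gpuUtil > 0.9:
		current++
	case m.qpsPerReplica < 10 && m.p95LatencyMS < 100 && m.gpuUtil < 0.3:
		current--
	}
	if current < min {
		current = min
	}
	if current > max {
		current = max
	}
	return current
}

func main() {
	hot := metrics{qpsPerReplica: 80, p95LatencyMS: 620, gpuUtil: 0.95}
	fmt.Println(desiredReplicas(2, 1, 8, hot)) // 3
}
```
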
## Community

For general questions and discussion, please use [Docker Model Runner's Slack channel](https://dockercommunity.slack.com/archives/C09H9P5E57B).

cmd/dmrlet/commands/daemon.go

Lines changed: 86 additions & 0 deletions
@@ -0,0 +1,86 @@
package commands

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"

	"github.com/docker/model-runner/pkg/dmrlet/daemon"
	"github.com/spf13/cobra"
)

var (
	containerdAddress string
	modelStorePath    string
)

func newDaemonCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:   "daemon",
		Short: "Start the dmrlet daemon",
		Long: `Start the dmrlet daemon process.

The daemon manages inference containers, handles GPU allocation,
and provides the API for other dmrlet commands.

Examples:
  # Start daemon with default settings
  dmrlet daemon

  # Start with custom socket path
  dmrlet daemon --socket /tmp/dmrlet.sock

  # Start with custom containerd address
  dmrlet daemon --containerd /run/containerd/containerd.sock`,
		RunE: runDaemon,
	}

	cmd.Flags().StringVar(&containerdAddress, "containerd", "/run/containerd/containerd.sock",
		"Path to containerd socket")
	cmd.Flags().StringVar(&modelStorePath, "store", "",
		"Path to model store (default: ~/.docker/model-runner/models)")

	return cmd
}

func runDaemon(cmd *cobra.Command, args []string) error {
	config := daemon.DefaultConfig()
	config.SocketPath = socketPath
	config.ContainerdAddress = containerdAddress
	if modelStorePath != "" {
		config.ModelStorePath = modelStorePath
	}

	d, err := daemon.New(config)
	if err != nil {
		return fmt.Errorf("failed to create daemon: %w", err)
	}

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Handle signals
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)

	// Start daemon
	if err := d.Start(ctx); err != nil {
		return fmt.Errorf("failed to start daemon: %w", err)
	}

	fmt.Printf("dmrlet daemon started on %s\n", config.SocketPath)

	// Wait for signal
	sig := <-sigCh
	fmt.Printf("\nReceived signal %v, shutting down...\n", sig)

	// Graceful shutdown
	if err := d.Stop(ctx); err != nil {
		return fmt.Errorf("failed to stop daemon: %w", err)
	}

	fmt.Println("Daemon stopped")
	return nil
}
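
Two notes on the code above: `socketPath` is not declared in this file, so it is presumably bound to the `--socket` flag elsewhere in the `commands` package and shared by the other commands. And although `pkg/dmrlet/daemon` is not shown in this excerpt, the fields set in `runDaemon` imply a config along these lines (an inferred sketch, not the actual source):

```go
// Sketch inferred from runDaemon above; not the actual
// pkg/dmrlet/daemon source, which is not shown in this excerpt.
package daemon

// Config carries the daemon's startup settings.
type Config struct {
	SocketPath        string // Unix socket the daemon API listens on
	ContainerdAddress string // containerd socket for container lifecycle
	ModelStorePath    string // Docker Model Runner model store location
}

// DefaultConfig returns defaults matching values shown elsewhere in
// this commit (e.g. /var/run/dmrlet.sock in `dmrlet status` output).
func DefaultConfig() Config {
	return Config{
		SocketPath:        "/var/run/dmrlet.sock",
		ContainerdAddress: "/run/containerd/containerd.sock",
	}
}
```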

cmd/dmrlet/commands/logs.go

Lines changed: 74 additions & 0 deletions
@@ -0,0 +1,74 @@
package commands

import (
	"context"
	"fmt"
	"os"
	"os/signal"
	"syscall"

	"github.com/docker/model-runner/pkg/dmrlet/daemon"
	"github.com/spf13/cobra"
)

var (
	logsFollow bool
	logsTail   int
)

func newLogsCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:   "logs MODEL",
		Short: "View logs for a model",
		Long: `View logs from the inference containers for a model.

Examples:
  # View last 100 lines
  dmrlet logs llama3.2

  # Follow logs in real-time
  dmrlet logs llama3.2 -f

  # View last 50 lines
  dmrlet logs llama3.2 --tail 50`,
		Args: cobra.ExactArgs(1),
		RunE: runLogs,
	}

	cmd.Flags().BoolVarP(&logsFollow, "follow", "f", false, "Follow log output")
	cmd.Flags().IntVar(&logsTail, "tail", 100, "Number of lines to show from the end")

	return cmd
}

func runLogs(cmd *cobra.Command, args []string) error {
	model := args[0]

	client := daemon.NewClient(socketPath)

	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// Handle Ctrl+C
	sigCh := make(chan os.Signal, 1)
	signal.Notify(sigCh, syscall.SIGINT, syscall.SIGTERM)
	go func() {
		<-sigCh
		cancel()
	}()

	logChan, err := client.StreamLogs(ctx, model, logsTail, logsFollow)
	if err != nil {
		return fmt.Errorf("failed to get logs: %w", err)
	}

	for line := range logChan {
		if line.Timestamp.IsZero() {
			fmt.Print(line.Message)
		} else {
			fmt.Printf("[%s] %s\n", line.Timestamp.Format("2006-01-02 15:04:05"), line.Message)
		}
	}

	return nil
}
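
`StreamLogs` delivers entries over a channel; the consumer loop above implies a log-line type roughly like this (inferred from field usage, since the client package is not part of this excerpt):

```go
// Sketch inferred from the loop in runLogs; not the actual client source.
package daemon

import "time"

// LogLine is one log entry streamed from an inference container.
type LogLine struct {
	Timestamp time.Time // zero value when no timestamp was parsed
	Message   string
}
```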

cmd/dmrlet/commands/ps.go

Lines changed: 81 additions & 0 deletions
@@ -0,0 +1,81 @@
package commands

import (
	"context"
	"fmt"
	"os"
	"strings"
	"text/tabwriter"

	"github.com/docker/model-runner/pkg/dmrlet/daemon"
	"github.com/spf13/cobra"
)

func newPsCmd() *cobra.Command {
	cmd := &cobra.Command{
		Use:   "ps",
		Short: "List running inference containers",
		Long: `List all running inference containers.

Examples:
  # List all running models
  dmrlet ps`,
		RunE: runPs,
	}

	return cmd
}

func runPs(cmd *cobra.Command, args []string) error {
	client := daemon.NewClient(socketPath)

	resp, err := client.List(context.Background())
	if err != nil {
		return fmt.Errorf("failed to list models: %w", err)
	}

	if len(resp.Models) == 0 {
		fmt.Println("No models running")
		return nil
	}

	w := tabwriter.NewWriter(os.Stdout, 0, 0, 2, ' ', 0)
	fmt.Fprintln(w, "MODEL\tBACKEND\tREPLICAS\tGPUS\tENDPOINTS\tSTATUS")

	for _, m := range resp.Models {
		gpus := formatGPUList(m.GPUs)
		endpoints := formatEndpoints(m.Endpoints)

		fmt.Fprintf(w, "%s\t%s\t%d\t%s\t%s\t%s\n",
			m.Model,
			m.Backend,
			m.Replicas,
			gpus,
			endpoints,
			m.Status,
		)
	}

	return w.Flush()
}

func formatGPUList(gpus []int) string {
	if len(gpus) == 0 {
		return "-"
	}
	strs := make([]string, len(gpus))
	for i, g := range gpus {
		strs[i] = fmt.Sprintf("%d", g)
	}
	return "[" + strings.Join(strs, ",") + "]"
}

func formatEndpoints(endpoints []string) string {
	if len(endpoints) == 0 {
		return "-"
	}
	if len(endpoints) == 1 {
		return endpoints[0]
	}
	return strings.Join(endpoints, ",")
}
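
For completeness, the `List` response consumed above implies types along these lines (again inferred from field usage, not taken from the client package):

```go
// Sketch inferred from runPs above; not the actual client source.
package daemon

// ModelInfo describes one running model as printed by `dmrlet ps`.
type ModelInfo struct {
	Model     string   // model name, e.g. llama3.2
	Backend   string   // llama.cpp, vllm, or sglang
	Replicas  int      // number of running replicas
	GPUs      []int    // GPU indices allocated to this model
	Endpoints []string // host:port endpoints serving requests
	Status    string   // e.g. healthy
}

// ListResponse is the daemon's reply to a List request.
type ListResponse struct {
	Models []ModelInfo
}
```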
