Conversation

@ericcurtin commented Feb 3, 2026

dmrlet is a "Kubelet for AI" that runs inference containers directly
with zero YAML overhead. It provides a simple CLI to serve models:

dmrlet serve ai/smollm2

Pulls the model, starts an inference container, and exposes an OpenAI-compatible API

Key features:

  • Reuses existing pkg/distribution for model management
  • containerd integration for container lifecycle
  • GPU detection and passthrough (NVIDIA/AMD)
  • Auto port allocation (30000-30999 range)
  • Health checking with configurable timeout
  • Backend auto-detection (llama-server for GGUF, vLLM for safetensors; see the sketch below)

Commands: serve, stop, list, pull, version
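
To illustrate the backend auto-detection listed above, a minimal sketch of extension-based selection might look like this; the package, constants, and function names are illustrative, not the ones used in this PR.

```go
package backend

import (
	"path/filepath"
	"strings"
)

// Backend names here are illustrative placeholders.
const (
	backendLlamaServer = "llama-server"
	backendVLLM        = "vllm"
)

// detectBackend picks an inference backend from the model file extension:
// GGUF files go to llama-server, anything else (e.g. safetensors) to vLLM.
func detectBackend(modelPath string) string {
	if strings.EqualFold(filepath.Ext(modelPath), ".gguf") {
		return backendLlamaServer
	}
	return backendVLLM
}
```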

@gemini-code-assist

Summary of Changes

Hello @ericcurtin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates dmrlet, a new, specialized container orchestrator tailored for AI inference. The primary goal is to streamline the deployment and operational management of AI models by abstracting away the complexities often associated with general-purpose container orchestration. It focuses on providing an intuitive experience for handling GPU resources and scaling inference services, allowing users to quickly get models into production with minimal setup.

Highlights

  • New Container Orchestrator: Introduces dmrlet, a purpose-built container orchestrator specifically designed for AI inference workloads, offering a simplified alternative to general-purpose orchestrators.
  • Zero-Configuration Multi-GPU Support: Enables automatic multi-GPU mapping and management without requiring complex YAML configurations, device plugins, or node selectors, making GPU allocation 'just work'.
  • Flexible Backend Integration: Supports multiple popular AI inference backends, including llama.cpp, vLLM, and SGLang, with automatic backend detection based on model format.
  • Integrated Auto-Scaling: Implements auto-scaling capabilities for inference models based on key metrics such as Queries Per Second (QPS), latency, and GPU utilization, ensuring efficient resource usage (see the sketch after this list).
  • Comprehensive CLI Tooling: Provides a full suite of command-line interface (CLI) commands for managing the dmrlet daemon, serving models, listing running instances, viewing logs, scaling replicas, stopping models, and checking system status.
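
To make the auto-scaling highlight concrete, here is a minimal sketch of the kind of threshold check such a scaler might run; the config fields and function below are assumptions for illustration, not the actual types in pkg/dmrlet/autoscaler/scaler.go.

```go
package autoscaler

// scalingConfig holds hypothetical per-replica ceilings; the real ScalingConfig
// in this PR may differ.
type scalingConfig struct {
	MaxQPSPerReplica float64
	MaxLatencyMs     float64
	MaxGPUUtilPct    float64
}

// wantsScaleUp reports whether any observed metric exceeds its ceiling,
// which would trigger adding a replica (subject to cooldowns and delays).
func wantsScaleUp(cfg scalingConfig, qps, latencyMs, gpuUtilPct float64) bool {
	return qps > cfg.MaxQPSPerReplica ||
		latencyMs > cfg.MaxLatencyMs ||
		gpuUtilPct > cfg.MaxGPUUtilPct
}
```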


Changelog
  • README.md
    • Added extensive documentation for dmrlet, covering its features, build process, usage examples for all CLI commands, supported inference backends, and architectural overview.
  • cmd/dmrlet/commands/daemon.go
    • New file implementing the dmrlet daemon command, which starts and manages the core orchestrator process, including configuration for socket path, containerd address, and model store, with graceful shutdown handling.
  • cmd/dmrlet/commands/logs.go
    • New file implementing the dmrlet logs command, allowing users to view and follow real-time logs from inference containers for a specified model.
  • cmd/dmrlet/commands/ps.go
    • New file implementing the dmrlet ps command, which lists all running inference models, displaying details such as backend, replica count, assigned GPUs, endpoints, and current status.
  • cmd/dmrlet/commands/root.go
    • New file defining the root command for the dmrlet CLI, setting up persistent flags and registering all subcommands for daemon management and model operations.
  • cmd/dmrlet/commands/scale.go
    • New file implementing the dmrlet scale command, enabling users to dynamically adjust the number of replicas for a deployed model.
  • cmd/dmrlet/commands/serve.go
    • New file implementing the dmrlet serve command, which initiates the serving of an AI model, allowing specification of the inference backend, GPU allocation, replica count, and backend-specific parameters.
  • cmd/dmrlet/commands/status.go
    • New file implementing the dmrlet status command, providing a summary of the dmrlet daemon's operational state, detected GPU inventory, and the number of active models.
  • cmd/dmrlet/commands/stop.go
    • New file implementing the dmrlet stop command, used to terminate a specific deployed model or all running models.
  • cmd/dmrlet/main.go
    • New file serving as the entry point for the dmrlet CLI application.
  • go.mod
    • Updated Go module dependencies, including downgrading github.com/containerd/containerd/v2 and github.com/containerd/platforms versions.
    • Added github.com/spf13/cobra and github.com/spf13/pflag as direct dependencies, along with several new indirect dependencies.
  • go.sum
    • Updated Go module checksums to reflect the changes in go.mod, including new indirect dependencies and version adjustments.
  • pkg/dmrlet/autoscaler/metrics.go
    • New file defining the Metrics struct and Collector for gathering performance metrics (QPS, latency, GPU utilization) from inference containers, including fetching from HTTP endpoints and nvidia-smi.
  • pkg/dmrlet/autoscaler/scaler.go
    • New file defining the Scaler for auto-scaling models based on collected metrics, incorporating ScalingConfig, ScaleAction, and logic for evaluating scaling decisions with cooldowns and delays.
  • pkg/dmrlet/container/manager.go
    • New file implementing a Manager for container lifecycle management, currently utilizing the Docker CLI as its backend, handling creation, starting, stopping, removal, restarting, and log attachment for containers.
  • pkg/dmrlet/container/spec.go
    • New file defining Backend types, BackendConfig for various inference backends (llama.cpp, vLLM, SGLang), and a SpecBuilder to construct detailed container specifications based on model and GPU options.
  • pkg/dmrlet/daemon/api.go
    • New file implementing the APIServer for dmrlet, providing an HTTP API over a Unix socket for CLI commands to interact with the daemon, defining request/response structures for all daemon operations.
  • pkg/dmrlet/daemon/daemon.go
    • New file containing the core Daemon orchestrator logic, integrating GPU management, container management, service discovery, health checking, autoscaling, log aggregation, and model store integration to manage ModelDeployment lifecycles.
  • pkg/dmrlet/gpu/allocator.go
    • New file defining AllocationStrategy and Allocator for managing GPU allocation, including logic for parsing GPU specifications and allocating GPUs based on various strategies (all, single, specific, round-robin).
  • pkg/dmrlet/gpu/detector.go
    • New file defining GPUType, GPU struct, and Detector for identifying available GPUs (NVIDIA, AMD, Apple Silicon) in the system using platform-specific tools like nvidia-smi or sysctl.
  • pkg/dmrlet/gpu/inventory.go
    • New file defining Inventory to manage the detected GPUs, track their availability, and mark them as in use or available for allocation.
  • pkg/dmrlet/health/checker.go
    • New file implementing a Checker for monitoring the health of deployed services, probing endpoints, updating health status in the service registry, and handling container restarts based on a configurable RestartPolicy.
  • pkg/dmrlet/logging/aggregator.go
    • New file implementing a LogAggregator that collects, buffers, and streams logs from inference containers, utilizing a ring buffer for storage and providing methods for historical and streaming access.
  • pkg/dmrlet/service/registry.go
    • New file implementing a Registry for service discovery, tracking deployed model containers, their endpoints, health status, and providing methods for registration, unregistration, and endpoint lookup with round-robin load balancing (see the sketch after this changelog).
  • pkg/dmrlet/store/integration.go
    • New file implementing Integration with the Docker Model Runner's local model store, providing functionality to locate model files, list available models, and retrieve model metadata.
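
As an illustration of the round-robin endpoint lookup described for pkg/dmrlet/service/registry.go, such a registry might look roughly like the sketch below; the fields and method are assumptions, not the actual Registry API in this PR.

```go
package service

import (
	"fmt"
	"sync"
)

// registry is an illustrative stand-in: it tracks healthy endpoints per model
// plus a round-robin cursor, guarded by a mutex.
type registry struct {
	mu        sync.Mutex
	endpoints map[string][]string // model name -> endpoint URLs
	next      map[string]int      // model name -> round-robin cursor
}

// Endpoint returns the next endpoint for a model in round-robin order.
func (r *registry) Endpoint(model string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	eps := r.endpoints[model]
	if len(eps) == 0 {
		return "", fmt.Errorf("no endpoints registered for %s", model)
	}
	if r.next == nil {
		r.next = make(map[string]int)
	}
	idx := r.next[model] % len(eps)
	r.next[model] = idx + 1
	return eps[idx], nil
}
```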


@sourcery-ai bot left a comment


Hey - I've found 4 issues and left some high-level feedback:

  • In pkg/dmrlet/container/manager.go, NewManager takes a containerd address but unconditionally selects a Docker CLI runtime and ignores the address, which is confusing given the daemon config and README; consider either wiring the address into an actual containerd-based runtime or renaming/removing the parameter to match the current behavior.
  • In daemon.scaleUp you ignore errors from container.NewSpecBuilder and modelStore.GetModelPath (using _ for the error), which can lead to creating containers with an empty model path or an unsupported backend; propagate or handle these errors so scaling up fails fast instead of silently misconfiguring replicas.
  • The daemon client’s error handling in pkg/dmrlet/daemon/api.go (Client.Serve) assumes the error body is JSON-decodable into a string, but the server uses http.Error (plain text), so the decode will likely fail and drop the real message; consider reading the body as raw bytes for non-200 responses and returning that content directly in the error.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `pkg/dmrlet/container/manager.go`, `NewManager` takes a `containerd` address but unconditionally selects a Docker CLI runtime and ignores the address, which is confusing given the daemon config and README; consider either wiring the address into an actual containerd-based runtime or renaming/removing the parameter to match the current behavior.
- In `daemon.scaleUp` you ignore errors from `container.NewSpecBuilder` and `modelStore.GetModelPath` (using `_` for the error), which can lead to creating containers with an empty model path or an unsupported backend; propagate or handle these errors so scaling up fails fast instead of silently misconfiguring replicas.
- The daemon client’s error handling in `pkg/dmrlet/daemon/api.go` (`Client.Serve`) assumes the error body is JSON-decodable into a string, but the server uses `http.Error` (plain text), so the decode will likely fail and drop the real message; consider reading the body as raw bytes for non-200 responses and returning that content directly in the error.

## Individual Comments

### Comment 1
<location> `pkg/dmrlet/daemon/daemon.go:590-593` </location>
<code_context>
+	return d.logAggregator.StreamLogs(context.Background(), deployment.Containers[0], lines, follow)
+}
+
+func (d *Daemon) allocatePort() int {
+	port := d.nextPort
+	d.nextPort++
+	return port
+}
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Port allocation is not concurrency-safe and can race between Serve/scaleUp calls

allocatePort updates d.nextPort without synchronization. Serve calls it under d.mu, but scaleUp calls it without holding the lock, so concurrent calls can race and assign duplicate ports. Please protect nextPort with d.mu (or an atomic), or move port assignment under a shared lock so all callers synchronize consistently.
</issue_to_address>

### Comment 2
<location> `pkg/dmrlet/daemon/api.go:500-503` </location>
<code_context>
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		return c.fetchStatsEndpoint(ctx, endpoint)
+	}
</code_context>

<issue_to_address>
**issue (bug_risk):** Client.Serve error handling assumes JSON string body but server uses http.Error with plain text

Because the server uses http.Error for non-200 responses, the body is plain text. Decoding it as JSON into a string will usually fail and drop the real error message. Instead, read resp.Body as raw bytes and surface that content in the error (falling back to resp.Status if the body is empty), and apply the same pattern to the other client methods that ignore the error body.
</issue_to_address>

### Comment 3
<location> `README.md:427` </location>
<code_context>
+| Feature | Kubernetes | dmrlet |
+|---------|------------|--------|
+| Multi-GPU setup | Device plugins + node selectors + resource limits YAML | `dmrlet serve llama3 --gpus all` |
+| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |
+| Time to first inference | Minutes (pod scheduling, image pull) | Seconds (model already local) |
+| Model management | External (mount PVCs, manage yourself) | Integrated with Docker Model Runner store |
</code_context>

<issue_to_address>
**suggestion (typo):** Consider adding "of" for smoother grammar in this table entry.

Change “50+ lines YAML minimum” to “50+ lines of YAML minimum” or “at least 50 lines of YAML” for clearer grammar.

```suggestion
| Config overhead | 50+ lines of YAML minimum | Zero YAML, CLI-only |
```
</issue_to_address>

### Comment 4
<location> `README.md:502` </location>
<code_context>
+# DAEMON: running
+# SOCKET: /var/run/dmrlet.sock
+#
+# GPUS:
+#   GPU 0:  NVIDIA A100 80GB  81920MB  (in use: llama3.2)
+#   GPU 1:  NVIDIA A100 80GB  81920MB  (available)
</code_context>

<issue_to_address>
**issue (typo):** Typo: "GPUS" should be "GPUs".

In the status example, change the header label from "GPUS" to "GPUs" to use the correct plural form.

```suggestion
# GPUs:
```
</issue_to_address>


Comment on lines 590 to 593
```go
func (d *Daemon) allocatePort() int {
	port := d.nextPort
	d.nextPort++
	return port
}
```

issue (bug_risk): Port allocation is not concurrency-safe and can race between Serve/scaleUp calls

allocatePort updates d.nextPort without synchronization. Serve calls it under d.mu, but scaleUp calls it without holding the lock, so concurrent calls can race and assign duplicate ports. Please protect nextPort with d.mu (or an atomic), or move port assignment under a shared lock so all callers synchronize consistently.
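
One way to address this, sketched under the assumption that the daemon can carry a small dedicated helper: serialize port handout behind its own mutex so it is safe whether or not d.mu is held. The names below are illustrative, not part of this PR.

```go
package daemon

import "sync"

// portAllocator is a hypothetical helper: a dedicated mutex serializes port
// handout so Serve and scaleUp cannot race on the counter, without re-entering
// the daemon's main lock.
type portAllocator struct {
	mu   sync.Mutex
	next int // seeded with the base of the port range, e.g. 30000
}

// allocate hands out each port exactly once across concurrent callers.
func (p *portAllocator) allocate() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	port := p.next
	p.next++
	return port
}
```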

Comment on lines 500 to 503
```go
if resp.StatusCode != http.StatusOK {
	var errMsg string
	json.NewDecoder(resp.Body).Decode(&errMsg)
	return nil, fmt.Errorf("daemon error: %s", errMsg)
}
```

issue (bug_risk): Client.Serve error handling assumes JSON string body but server uses http.Error with plain text

Because the server uses http.Error for non-200 responses, the body is plain text. Decoding it as JSON into a string will usually fail and drop the real error message. Instead, read resp.Body as raw bytes and surface that content in the error (falling back to resp.Status if the body is empty), and apply the same pattern to the other client methods that ignore the error body.
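
A minimal sketch of that suggestion, assuming io, strings, and fmt are imported in this file: read the plain-text body that http.Error produced and fall back to the status line when the body is empty.

```go
if resp.StatusCode != http.StatusOK {
	// http.Error writes plain text, so read it raw instead of JSON-decoding.
	body, _ := io.ReadAll(resp.Body)
	msg := strings.TrimSpace(string(body))
	if msg == "" {
		msg = resp.Status
	}
	return nil, fmt.Errorf("daemon error: %s", msg)
}
```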

README.md Outdated
| Feature | Kubernetes | dmrlet |
|---------|------------|--------|
| Multi-GPU setup | Device plugins + node selectors + resource limits YAML | `dmrlet serve llama3 --gpus all` |
| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |

suggestion (typo): Consider adding "of" for smoother grammar in this table entry.

Change “50+ lines YAML minimum” to “50+ lines of YAML minimum” or “at least 50 lines of YAML” for clearer grammar.

Suggested change
| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |
| Config overhead | 50+ lines of YAML minimum | Zero YAML, CLI-only |

README.md Outdated
# DAEMON: running
# SOCKET: /var/run/dmrlet.sock
#
# GPUS:

issue (typo): Typo: "GPUS" should be "GPUs".

In the status example, change the header label from "GPUS" to "GPUs" to use the correct plural form.

Suggested change
# GPUS:
# GPUs:


@gemini-code-assist bot left a comment


Code Review

This pull request introduces dmrlet, a new container orchestrator for AI inference. The changes are extensive, adding a new CLI tool and several backend packages for managing containers, GPUs, services, and more. The overall architecture is well-designed, with clear separation of concerns between components like the daemon, container manager, GPU allocator, and service registry.

My review focuses on improving the robustness and correctness of the implementation. I've identified a few high-priority issues, including the use of the docker CLI instead of the Go SDK, which can be brittle, and some bugs in the API client related to log streaming and error handling. I've also included some medium-severity suggestions to address potential race conditions, incomplete features, and hardcoded values.

Overall, this is a great addition with a solid foundation. Addressing these points will make dmrlet more reliable and maintainable.

Comment on lines 263 to 264
```go
// DockerRuntime implements Runtime using Docker CLI.
type DockerRuntime struct{}
```

high

The DockerRuntime implementation relies on shelling out to the docker CLI. This approach is brittle and can lead to issues:

  • Fragility: It depends on the docker binary being in the system's PATH.
  • Parsing Instability: Methods like Inspect and List parse the text output of Docker commands. This output is not a stable API and can change between Docker versions, which would break dmrlet.
  • Security: While there are no obvious command injections with the current usage, shelling out is generally less secure than using a proper API.

A more robust and maintainable solution would be to use the official Docker Go SDK (github.com/docker/docker/client). It provides a stable, typed API for interacting with the Docker daemon, eliminating the need for command execution and output parsing.
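
As a rough illustration of that suggestion (not a drop-in replacement for this PR's Manager), the sketch below checks a container's state through the SDK's typed API; exact types and option structs vary between SDK versions, so treat it as illustrative.

```go
package container

import (
	"context"

	"github.com/docker/docker/client"
)

// inspectRunning reports whether a container is running, using the typed
// Docker API instead of parsing `docker inspect` text output.
func inspectRunning(ctx context.Context, id string) (bool, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return false, err
	}
	defer cli.Close()

	info, err := cli.ContainerInspect(ctx, id)
	if err != nil {
		return false, err
	}
	return info.State != nil && info.State.Running, nil
}
```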

Comment on lines 500 to 504
```go
if resp.StatusCode != http.StatusOK {
	var errMsg string
	json.NewDecoder(resp.Body).Decode(&errMsg)
	return nil, fmt.Errorf("daemon error: %s", errMsg)
}
```

high

When an error occurs on the server, http.Error is used, which writes a plain text response. However, the client attempts to decode the error response as JSON. This will fail and result in an unhelpful error message for the user. The client should read the response body as plain text to get the actual error message from the server.

```go
if resp.StatusCode != http.StatusOK {
	body, _ := io.ReadAll(resp.Body)
	return nil, fmt.Errorf("daemon error: %s", string(body))
}
```

Comment on lines 642 to 655
```go
buf := make([]byte, 4096)
for {
	n, err := resp.Body.Read(buf)
	if n > 0 {
		// Parse and send log lines
		// This is simplified - real implementation would properly parse
		ch <- logging.LogLine{
			Message: string(buf[:n]),
		}
	}
	if err != nil {
		return
	}
}
```

high

The StreamLogs client implementation reads raw byte chunks from the HTTP response body. This can lead to garbled or incomplete log lines in the output, as a single log message might be split across multiple reads, or multiple small messages might be combined into one. It also doesn't handle client-side cancellation during streaming.

To ensure each log line is processed correctly, you should use a bufio.Scanner to read the stream line-by-line and check the context in the loop to make it more robust.

```go
scanner := bufio.NewScanner(resp.Body)
for scanner.Scan() {
	select {
	case <-ctx.Done():
		return
	case ch <- logging.LogLine{Message: scanner.Text() + "\n"}:
	}
}
```

Comment on lines 66 to 70
```go
if line.Timestamp.IsZero() {
	fmt.Print(line.Message)
} else {
	fmt.Printf("[%s] %s\n", line.Timestamp.Format("2006-01-02 15:04:05"), line.Message)
}
```

medium

This logic for printing logs is more complex than necessary. The associated client.StreamLogs implementation sends pre-formatted lines (once a related issue is fixed). Therefore, this loop can be simplified to just print the received message directly.

```go
fmt.Print(line.Message)
```

Comment on lines 163 to 165
```go
// Parse Prometheus format metrics
// This is simplified - real implementation would use prometheus client
return endpointMetrics{}, nil
```

medium

The implementation for parsing Prometheus metrics is currently a stub and does not actually parse any metrics. This means that autoscaling will not work correctly for backends that only expose Prometheus metrics. This should be implemented to provide full metrics support.
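
A minimal sketch of real parsing, assuming github.com/prometheus/common/expfmt were added as a dependency; which metric names to query depends on the backend exposing /metrics, so the helper below is illustrative.

```go
package autoscaler

import (
	"io"

	"github.com/prometheus/common/expfmt"
)

// gaugeValue parses a Prometheus text exposition stream and returns the first
// sample of the named gauge, or 0 if the metric is absent.
func gaugeValue(r io.Reader, name string) (float64, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(r)
	if err != nil {
		return 0, err
	}
	fam, ok := families[name]
	if !ok || len(fam.GetMetric()) == 0 {
		return 0, nil
	}
	return fam.GetMetric()[0].GetGauge().GetValue(), nil
}
```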

```go
modelFile := "/models"
// For llama.cpp, we need to specify the .gguf file
if b.config.Backend == BackendLlamaCpp {
	modelFile = "/models/model.gguf"
}
```

medium

The model file path is hardcoded to /models/model.gguf for the llama.cpp backend. This assumes that the model file within the mounted directory is always named model.gguf. This might not always be the case, making the system brittle. It would be more robust to discover the actual model filename from the model store or make it configurable.
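
One possible direction, assuming path/filepath and fmt are imported in this file and that the mounted directory holds a single GGUF file: glob for it instead of hardcoding model.gguf. The helper name is illustrative.

```go
// ggufPath returns the first .gguf file under modelDir instead of relying on a
// hardcoded model.gguf name.
func ggufPath(modelDir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(modelDir, "*.gguf"))
	if err != nil {
		return "", err
	}
	if len(matches) == 0 {
		return "", fmt.Errorf("no .gguf file found in %s", modelDir)
	}
	return matches[0], nil
}
```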

```go
// Create containers for each replica
for i := 0; i < replicas; i++ {
	port := d.allocatePort()
	containerID := fmt.Sprintf("%s-%d", sanitizeID(config.Model), i)
```

medium

The container ID is generated using a format string "%s-%d". When scaling up and down, if a container with a specific index is removed and then a new one is created, it might get the same index, leading to the same container ID. If the old container is not fully removed by the runtime yet, this can cause a name conflict.

Consider using a more robust method for generating unique container IDs, such as appending a short random string or a timestamp, to avoid potential race conditions.
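
A sketch of the random-suffix idea, assuming crypto/rand and fmt are imported and reusing the sanitizeID helper quoted above:

```go
// containerID keeps the readable model-index prefix but appends a short random
// hex suffix so a recreated replica never reuses the exact name of a container
// that is still being torn down.
func containerID(model string, index int) string {
	suffix := make([]byte, 4)
	if _, err := rand.Read(suffix); err != nil {
		// crypto/rand should not fail; fall back to the plain prefix if it does.
		return fmt.Sprintf("%s-%d", sanitizeID(model), index)
	}
	return fmt.Sprintf("%s-%d-%x", sanitizeID(model), index, suffix)
}
```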

@ericcurtin changed the title from "Add dmrlet container orchestrator for AI inference" to "add dmrlet - lightweight node agent for Docker Model Runner" on Feb 4, 2026