Conversation

@ericcurtin commented Feb 3, 2026

dmrlet is a "Kubelet for AI" that runs inference containers directly
with zero YAML overhead. It provides a simple CLI to serve models:

dmrlet serve ai/smollm2

Pulls the model, starts an inference container, and exposes an OpenAI-compatible API

Key features:

  • Reuses existing pkg/distribution for model management
  • containerd integration for container lifecycle
  • GPU detection and passthrough (NVIDIA/AMD)
  • Auto port allocation (30000-30999 range)
  • Health checking with configurable timeout
  • Backend auto-detection (llama-server for GGUF, vLLM for safetensors; see the sketch below)

Commands: serve, stop, list, pull, version
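
To illustrate the backend auto-detection listed above, a minimal sketch of extension-based selection might look like this; the package, constants, and function names are illustrative, not the ones used in this PR.

```go
package backend

import (
	"path/filepath"
	"strings"
)

// Backend names here are illustrative placeholders.
const (
	backendLlamaServer = "llama-server"
	backendVLLM        = "vllm"
)

// detectBackend picks an inference backend from the model file extension:
// GGUF files go to llama-server, anything else (e.g. safetensors) to vLLM.
func detectBackend(modelPath string) string {
	if strings.EqualFold(filepath.Ext(modelPath), ".gguf") {
		return backendLlamaServer
	}
	return backendVLLM
}
```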

@gemini-code-assist

Summary of Changes

Hello @ericcurtin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates dmrlet, a new, specialized container orchestrator tailored for AI inference. The primary goal is to streamline the deployment and operational management of AI models by abstracting away the complexities often associated with general-purpose container orchestration. It focuses on providing an intuitive experience for handling GPU resources and scaling inference services, allowing users to quickly get models into production with minimal setup.

Highlights

  • New Container Orchestrator: Introduces dmrlet, a purpose-built container orchestrator specifically designed for AI inference workloads, offering a simplified alternative to general-purpose orchestrators.
  • Zero-Configuration Multi-GPU Support: Enables automatic multi-GPU mapping and management without requiring complex YAML configurations, device plugins, or node selectors, making GPU allocation 'just work'.
  • Flexible Backend Integration: Supports multiple popular AI inference backends, including llama.cpp, vLLM, and SGLang, with automatic backend detection based on model format.
  • Integrated Auto-Scaling: Implements auto-scaling capabilities for inference models based on key metrics such as Queries Per Second (QPS), latency, and GPU utilization, ensuring efficient resource usage (see the sketch after this list).
  • Comprehensive CLI Tooling: Provides a full suite of command-line interface (CLI) commands for managing the dmrlet daemon, serving models, listing running instances, viewing logs, scaling replicas, stopping models, and checking system status.
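
To make the auto-scaling highlight concrete, here is a minimal sketch of the kind of threshold check such a scaler might run; the config fields and function below are assumptions for illustration, not the actual types in pkg/dmrlet/autoscaler/scaler.go.

```go
package autoscaler

// scalingConfig holds hypothetical per-replica ceilings; the real ScalingConfig
// in this PR may differ.
type scalingConfig struct {
	MaxQPSPerReplica float64
	MaxLatencyMs     float64
	MaxGPUUtilPct    float64
}

// wantsScaleUp reports whether any observed metric exceeds its ceiling,
// which would trigger adding a replica (subject to cooldowns and delays).
func wantsScaleUp(cfg scalingConfig, qps, latencyMs, gpuUtilPct float64) bool {
	return qps > cfg.MaxQPSPerReplica ||
		latencyMs > cfg.MaxLatencyMs ||
		gpuUtilPct > cfg.MaxGPUUtilPct
}
```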


Changelog
  • README.md
    • Added extensive documentation for dmrlet, covering its features, build process, usage examples for all CLI commands, supported inference backends, and architectural overview.
  • cmd/dmrlet/commands/daemon.go
    • New file implementing the dmrlet daemon command, which starts and manages the core orchestrator process, including configuration for socket path, containerd address, and model store, with graceful shutdown handling.
  • cmd/dmrlet/commands/logs.go
    • New file implementing the dmrlet logs command, allowing users to view and follow real-time logs from inference containers for a specified model.
  • cmd/dmrlet/commands/ps.go
    • New file implementing the dmrlet ps command, which lists all running inference models, displaying details such as backend, replica count, assigned GPUs, endpoints, and current status.
  • cmd/dmrlet/commands/root.go
    • New file defining the root command for the dmrlet CLI, setting up persistent flags and registering all subcommands for daemon management and model operations.
  • cmd/dmrlet/commands/scale.go
    • New file implementing the dmrlet scale command, enabling users to dynamically adjust the number of replicas for a deployed model.
  • cmd/dmrlet/commands/serve.go
    • New file implementing the dmrlet serve command, which initiates the serving of an AI model, allowing specification of the inference backend, GPU allocation, replica count, and backend-specific parameters.
  • cmd/dmrlet/commands/status.go
    • New file implementing the dmrlet status command, providing a summary of the dmrlet daemon's operational state, detected GPU inventory, and the number of active models.
  • cmd/dmrlet/commands/stop.go
    • New file implementing the dmrlet stop command, used to terminate a specific deployed model or all running models.
  • cmd/dmrlet/main.go
    • New file serving as the entry point for the dmrlet CLI application.
  • go.mod
    • Updated Go module dependencies, including downgrading github.com/containerd/containerd/v2 and github.com/containerd/platforms versions.
    • Added github.com/spf13/cobra and github.com/spf13/pflag as direct dependencies, along with several new indirect dependencies.
  • go.sum
    • Updated Go module checksums to reflect the changes in go.mod, including new indirect dependencies and version adjustments.
  • pkg/dmrlet/autoscaler/metrics.go
    • New file defining the Metrics struct and Collector for gathering performance metrics (QPS, latency, GPU utilization) from inference containers, including fetching from HTTP endpoints and nvidia-smi.
  • pkg/dmrlet/autoscaler/scaler.go
    • New file defining the Scaler for auto-scaling models based on collected metrics, incorporating ScalingConfig, ScaleAction, and logic for evaluating scaling decisions with cooldowns and delays.
  • pkg/dmrlet/container/manager.go
    • New file implementing a Manager for container lifecycle management, currently utilizing the Docker CLI as its backend, handling creation, starting, stopping, removal, restarting, and log attachment for containers.
  • pkg/dmrlet/container/spec.go
    • New file defining Backend types, BackendConfig for various inference backends (llama.cpp, vLLM, SGLang), and a SpecBuilder to construct detailed container specifications based on model and GPU options.
  • pkg/dmrlet/daemon/api.go
    • New file implementing the APIServer for dmrlet, providing an HTTP API over a Unix socket for CLI commands to interact with the daemon, defining request/response structures for all daemon operations.
  • pkg/dmrlet/daemon/daemon.go
    • New file containing the core Daemon orchestrator logic, integrating GPU management, container management, service discovery, health checking, autoscaling, log aggregation, and model store integration to manage ModelDeployment lifecycles.
  • pkg/dmrlet/gpu/allocator.go
    • New file defining AllocationStrategy and Allocator for managing GPU allocation, including logic for parsing GPU specifications and allocating GPUs based on various strategies (all, single, specific, round-robin).
  • pkg/dmrlet/gpu/detector.go
    • New file defining GPUType, GPU struct, and Detector for identifying available GPUs (NVIDIA, AMD, Apple Silicon) in the system using platform-specific tools like nvidia-smi or sysctl.
  • pkg/dmrlet/gpu/inventory.go
    • New file defining Inventory to manage the detected GPUs, track their availability, and mark them as in use or available for allocation.
  • pkg/dmrlet/health/checker.go
    • New file implementing a Checker for monitoring the health of deployed services, probing endpoints, updating health status in the service registry, and handling container restarts based on a configurable RestartPolicy.
  • pkg/dmrlet/logging/aggregator.go
    • New file implementing a LogAggregator that collects, buffers, and streams logs from inference containers, utilizing a ring buffer for storage and providing methods for historical and streaming access.
  • pkg/dmrlet/service/registry.go
    • New file implementing a Registry for service discovery, tracking deployed model containers, their endpoints, health status, and providing methods for registration, unregistration, and endpoint lookup with round-robin load balancing (see the sketch after this changelog).
  • pkg/dmrlet/store/integration.go
    • New file implementing Integration with the Docker Model Runner's local model store, providing functionality to locate model files, list available models, and retrieve model metadata.
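
As an illustration of the round-robin endpoint lookup described for pkg/dmrlet/service/registry.go, such a registry might look roughly like the sketch below; the fields and method are assumptions, not the actual Registry API in this PR.

```go
package service

import (
	"fmt"
	"sync"
)

// registry is an illustrative stand-in: it tracks healthy endpoints per model
// plus a round-robin cursor, guarded by a mutex.
type registry struct {
	mu        sync.Mutex
	endpoints map[string][]string // model name -> endpoint URLs
	next      map[string]int      // model name -> round-robin cursor
}

// Endpoint returns the next endpoint for a model in round-robin order.
func (r *registry) Endpoint(model string) (string, error) {
	r.mu.Lock()
	defer r.mu.Unlock()
	eps := r.endpoints[model]
	if len(eps) == 0 {
		return "", fmt.Errorf("no endpoints registered for %s", model)
	}
	if r.next == nil {
		r.next = make(map[string]int)
	}
	idx := r.next[model] % len(eps)
	r.next[model] = idx + 1
	return eps[idx], nil
}
```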


@sourcery-ai bot left a comment


Hey - I've found 4 issues and left some high-level feedback:

  • In pkg/dmrlet/container/manager.go, NewManager takes a containerd address but unconditionally selects a Docker CLI runtime and ignores the address, which is confusing given the daemon config and README; consider either wiring the address into an actual containerd-based runtime or renaming/removing the parameter to match the current behavior.
  • In daemon.scaleUp you ignore errors from container.NewSpecBuilder and modelStore.GetModelPath (using _ for the error), which can lead to creating containers with an empty model path or an unsupported backend; propagate or handle these errors so scaling up fails fast instead of silently misconfiguring replicas.
  • The daemon client’s error handling in pkg/dmrlet/daemon/api.go (Client.Serve) assumes the error body is JSON-decodable into a string, but the server uses http.Error (plain text), so the decode will likely fail and drop the real message; consider reading the body as raw bytes for non-200 responses and returning that content directly in the error.
Prompt for AI Agents
Please address the comments from this code review:

## Overall Comments
- In `pkg/dmrlet/container/manager.go`, `NewManager` takes a `containerd` address but unconditionally selects a Docker CLI runtime and ignores the address, which is confusing given the daemon config and README; consider either wiring the address into an actual containerd-based runtime or renaming/removing the parameter to match the current behavior.
- In `daemon.scaleUp` you ignore errors from `container.NewSpecBuilder` and `modelStore.GetModelPath` (using `_` for the error), which can lead to creating containers with an empty model path or an unsupported backend; propagate or handle these errors so scaling up fails fast instead of silently misconfiguring replicas.
- The daemon client’s error handling in `pkg/dmrlet/daemon/api.go` (`Client.Serve`) assumes the error body is JSON-decodable into a string, but the server uses `http.Error` (plain text), so the decode will likely fail and drop the real message; consider reading the body as raw bytes for non-200 responses and returning that content directly in the error.

## Individual Comments

### Comment 1
<location> `pkg/dmrlet/daemon/daemon.go:590-593` </location>
<code_context>
+	return d.logAggregator.StreamLogs(context.Background(), deployment.Containers[0], lines, follow)
+}
+
+func (d *Daemon) allocatePort() int {
+	port := d.nextPort
+	d.nextPort++
+	return port
+}
+
</code_context>

<issue_to_address>
**issue (bug_risk):** Port allocation is not concurrency-safe and can race between Serve/scaleUp calls

allocatePort updates d.nextPort without synchronization. Serve calls it under d.mu, but scaleUp calls it without holding the lock, so concurrent calls can race and assign duplicate ports. Please protect nextPort with d.mu (or an atomic), or move port assignment under a shared lock so all callers synchronize consistently.
</issue_to_address>

### Comment 2
<location> `pkg/dmrlet/daemon/api.go:500-503` </location>
<code_context>
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		return c.fetchStatsEndpoint(ctx, endpoint)
+	}
</code_context>

<issue_to_address>
**issue (bug_risk):** Client.Serve error handling assumes JSON string body but server uses http.Error with plain text

Because the server uses http.Error for non-200 responses, the body is plain text. Decoding it as JSON into a string will usually fail and drop the real error message. Instead, read resp.Body as raw bytes and surface that content in the error (falling back to resp.Status if the body is empty), and apply the same pattern to the other client methods that ignore the error body.
</issue_to_address>

### Comment 3
<location> `README.md:427` </location>
<code_context>
+| Feature | Kubernetes | dmrlet |
+|---------|------------|--------|
+| Multi-GPU setup | Device plugins + node selectors + resource limits YAML | `dmrlet serve llama3 --gpus all` |
+| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |
+| Time to first inference | Minutes (pod scheduling, image pull) | Seconds (model already local) |
+| Model management | External (mount PVCs, manage yourself) | Integrated with Docker Model Runner store |
</code_context>

<issue_to_address>
**suggestion (typo):** Consider adding "of" for smoother grammar in this table entry.

Change “50+ lines YAML minimum” to “50+ lines of YAML minimum” or “at least 50 lines of YAML” for clearer grammar.

```suggestion
| Config overhead | 50+ lines of YAML minimum | Zero YAML, CLI-only |
```
</issue_to_address>

### Comment 4
<location> `README.md:502` </location>
<code_context>
+# DAEMON: running
+# SOCKET: /var/run/dmrlet.sock
+#
+# GPUS:
+#   GPU 0:  NVIDIA A100 80GB  81920MB  (in use: llama3.2)
+#   GPU 1:  NVIDIA A100 80GB  81920MB  (available)
</code_context>

<issue_to_address>
**issue (typo):** Typo: "GPUS" should be "GPUs".

In the status example, change the header label from "GPUS" to "GPUs" to use the correct plural form.

```suggestion
# GPUs:
```
</issue_to_address>


Comment on lines 590 to 593
```go
func (d *Daemon) allocatePort() int {
	port := d.nextPort
	d.nextPort++
	return port
}
```

issue (bug_risk): Port allocation is not concurrency-safe and can race between Serve/scaleUp calls

allocatePort updates d.nextPort without synchronization. Serve calls it under d.mu, but scaleUp calls it without holding the lock, so concurrent calls can race and assign duplicate ports. Please protect nextPort with d.mu (or an atomic), or move port assignment under a shared lock so all callers synchronize consistently.
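
One way to address this, sketched under the assumption that the daemon can carry a small dedicated helper: serialize port handout behind its own mutex so it is safe whether or not d.mu is held. The names below are illustrative, not part of this PR.

```go
package daemon

import "sync"

// portAllocator is a hypothetical helper: a dedicated mutex serializes port
// handout so Serve and scaleUp cannot race on the counter, without re-entering
// the daemon's main lock.
type portAllocator struct {
	mu   sync.Mutex
	next int // seeded with the base of the port range, e.g. 30000
}

// allocate hands out each port exactly once across concurrent callers.
func (p *portAllocator) allocate() int {
	p.mu.Lock()
	defer p.mu.Unlock()
	port := p.next
	p.next++
	return port
}
```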

Comment on lines 500 to 503
```go
if resp.StatusCode != http.StatusOK {
	var errMsg string
	json.NewDecoder(resp.Body).Decode(&errMsg)
	return nil, fmt.Errorf("daemon error: %s", errMsg)
}
```

issue (bug_risk): Client.Serve error handling assumes JSON string body but server uses http.Error with plain text

Because the server uses http.Error for non-200 responses, the body is plain text. Decoding it as JSON into a string will usually fail and drop the real error message. Instead, read resp.Body as raw bytes and surface that content in the error (falling back to resp.Status if the body is empty), and apply the same pattern to the other client methods that ignore the error body.
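
A minimal sketch of that suggestion, assuming io, strings, and fmt are imported in this file: read the plain-text body that http.Error produced and fall back to the status line when the body is empty.

```go
if resp.StatusCode != http.StatusOK {
	// http.Error writes plain text, so read it raw instead of JSON-decoding.
	body, _ := io.ReadAll(resp.Body)
	msg := strings.TrimSpace(string(body))
	if msg == "" {
		msg = resp.Status
	}
	return nil, fmt.Errorf("daemon error: %s", msg)
}
```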

README.md Outdated
| Feature | Kubernetes | dmrlet |
|---------|------------|--------|
| Multi-GPU setup | Device plugins + node selectors + resource limits YAML | `dmrlet serve llama3 --gpus all` |
| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |

suggestion (typo): Consider adding "of" for smoother grammar in this table entry.

Change “50+ lines YAML minimum” to “50+ lines of YAML minimum” or “at least 50 lines of YAML” for clearer grammar.

Suggested change
| Config overhead | 50+ lines YAML minimum | Zero YAML, CLI-only |
| Config overhead | 50+ lines of YAML minimum | Zero YAML, CLI-only |

README.md Outdated
# DAEMON: running
# SOCKET: /var/run/dmrlet.sock
#
# GPUS:

issue (typo): Typo: "GPUS" should be "GPUs".

In the status example, change the header label from "GPUS" to "GPUs" to use the correct plural form.

Suggested change
# GPUS:
# GPUs:


@gemini-code-assist bot left a comment


Code Review

This pull request introduces dmrlet, a new container orchestrator for AI inference. The changes are extensive, adding a new CLI tool and several backend packages for managing containers, GPUs, services, and more. The overall architecture is well-designed, with clear separation of concerns between components like the daemon, container manager, GPU allocator, and service registry.

My review focuses on improving the robustness and correctness of the implementation. I've identified a few high-priority issues, including the use of the docker CLI instead of the Go SDK, which can be brittle, and some bugs in the API client related to log streaming and error handling. I've also included some medium-severity suggestions to address potential race conditions, incomplete features, and hardcoded values.

Overall, this is a great addition with a solid foundation. Addressing these points will make dmrlet more reliable and maintainable.

Comment on lines 263 to 264
```go
// DockerRuntime implements Runtime using Docker CLI.
type DockerRuntime struct{}
```

high

The DockerRuntime implementation relies on shelling out to the docker CLI. This approach is brittle and can lead to issues:

  • Fragility: It depends on the docker binary being in the system's PATH.
  • Parsing Instability: Methods like Inspect and List parse the text output of Docker commands. This output is not a stable API and can change between Docker versions, which would break dmrlet.
  • Security: While there are no obvious command injections with the current usage, shelling out is generally less secure than using a proper API.

A more robust and maintainable solution would be to use the official Docker Go SDK (github.com/docker/docker/client). It provides a stable, typed API for interacting with the Docker daemon, eliminating the need for command execution and output parsing.
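
As a rough illustration of that suggestion (not a drop-in replacement for this PR's Manager), the sketch below checks a container's state through the SDK's typed API; exact types and option structs vary between SDK versions, so treat it as illustrative.

```go
package container

import (
	"context"

	"github.com/docker/docker/client"
)

// inspectRunning reports whether a container is running, using the typed
// Docker API instead of parsing `docker inspect` text output.
func inspectRunning(ctx context.Context, id string) (bool, error) {
	cli, err := client.NewClientWithOpts(client.FromEnv, client.WithAPIVersionNegotiation())
	if err != nil {
		return false, err
	}
	defer cli.Close()

	info, err := cli.ContainerInspect(ctx, id)
	if err != nil {
		return false, err
	}
	return info.State != nil && info.State.Running, nil
}
```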

Comment on lines 500 to 504
```go
if resp.StatusCode != http.StatusOK {
	var errMsg string
	json.NewDecoder(resp.Body).Decode(&errMsg)
	return nil, fmt.Errorf("daemon error: %s", errMsg)
}
```

high

When an error occurs on the server, http.Error is used, which writes a plain text response. However, the client attempts to decode the error response as JSON. This will fail and result in an unhelpful error message for the user. The client should read the response body as plain text to get the actual error message from the server.

```go
if resp.StatusCode != http.StatusOK {
	body, _ := io.ReadAll(resp.Body)
	return nil, fmt.Errorf("daemon error: %s", string(body))
}
```

Comment on lines 642 to 655
```go
buf := make([]byte, 4096)
for {
	n, err := resp.Body.Read(buf)
	if n > 0 {
		// Parse and send log lines
		// This is simplified - real implementation would properly parse
		ch <- logging.LogLine{
			Message: string(buf[:n]),
		}
	}
	if err != nil {
		return
	}
}
```

high

The StreamLogs client implementation reads raw byte chunks from the HTTP response body. This can lead to garbled or incomplete log lines in the output, as a single log message might be split across multiple reads, or multiple small messages might be combined into one. It also doesn't handle client-side cancellation during streaming.

To ensure each log line is processed correctly, you should use a bufio.Scanner to read the stream line-by-line and check the context in the loop to make it more robust.

```go
scanner := bufio.NewScanner(resp.Body)
for scanner.Scan() {
	select {
	case <-ctx.Done():
		return
	case ch <- logging.LogLine{Message: scanner.Text() + "\n"}:
	}
}
```

Comment on lines 66 to 70
```go
if line.Timestamp.IsZero() {
	fmt.Print(line.Message)
} else {
	fmt.Printf("[%s] %s\n", line.Timestamp.Format("2006-01-02 15:04:05"), line.Message)
}
```

medium

This logic for printing logs is more complex than necessary. The associated client.StreamLogs implementation sends pre-formatted lines (once a related issue is fixed). Therefore, this loop can be simplified to just print the received message directly.

```go
fmt.Print(line.Message)
```

Comment on lines 163 to 165
```go
// Parse Prometheus format metrics
// This is simplified - real implementation would use prometheus client
return endpointMetrics{}, nil
```

medium

The implementation for parsing Prometheus metrics is currently a stub and does not actually parse any metrics. This means that autoscaling will not work correctly for backends that only expose Prometheus metrics. This should be implemented to provide full metrics support.
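
A minimal sketch of real parsing, assuming github.com/prometheus/common/expfmt were added as a dependency; which metric names to query depends on the backend exposing /metrics, so the helper below is illustrative.

```go
package autoscaler

import (
	"io"

	"github.com/prometheus/common/expfmt"
)

// gaugeValue parses a Prometheus text exposition stream and returns the first
// sample of the named gauge, or 0 if the metric is absent.
func gaugeValue(r io.Reader, name string) (float64, error) {
	var parser expfmt.TextParser
	families, err := parser.TextToMetricFamilies(r)
	if err != nil {
		return 0, err
	}
	fam, ok := families[name]
	if !ok || len(fam.GetMetric()) == 0 {
		return 0, nil
	}
	return fam.GetMetric()[0].GetGauge().GetValue(), nil
}
```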

```go
modelFile := "/models"
// For llama.cpp, we need to specify the .gguf file
if b.config.Backend == BackendLlamaCpp {
	modelFile = "/models/model.gguf"
}
```

medium

The model file path is hardcoded to /models/model.gguf for the llama.cpp backend. This assumes that the model file within the mounted directory is always named model.gguf. This might not always be the case, making the system brittle. It would be more robust to discover the actual model filename from the model store or make it configurable.
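
One possible direction, assuming path/filepath and fmt are imported in this file and that the mounted directory holds a single GGUF file: glob for it instead of hardcoding model.gguf. The helper name is illustrative.

```go
// ggufPath returns the first .gguf file under modelDir instead of relying on a
// hardcoded model.gguf name.
func ggufPath(modelDir string) (string, error) {
	matches, err := filepath.Glob(filepath.Join(modelDir, "*.gguf"))
	if err != nil {
		return "", err
	}
	if len(matches) == 0 {
		return "", fmt.Errorf("no .gguf file found in %s", modelDir)
	}
	return matches[0], nil
}
```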

```go
// Create containers for each replica
for i := 0; i < replicas; i++ {
	port := d.allocatePort()
	containerID := fmt.Sprintf("%s-%d", sanitizeID(config.Model), i)
```

medium

The container ID is generated using a format string "%s-%d". When scaling up and down, if a container with a specific index is removed and then a new one is created, it might get the same index, leading to the same container ID. If the old container is not fully removed by the runtime yet, this can cause a name conflict.

Consider using a more robust method for generating unique container IDs, such as appending a short random string or a timestamp, to avoid potential race conditions.
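
A sketch of the random-suffix idea, assuming crypto/rand and fmt are imported and reusing the sanitizeID helper quoted above:

```go
// containerID keeps the readable model-index prefix but appends a short random
// hex suffix so a recreated replica never reuses the exact name of a container
// that is still being torn down.
func containerID(model string, index int) string {
	suffix := make([]byte, 4)
	if _, err := rand.Read(suffix); err != nil {
		// crypto/rand should not fail; fall back to the plain prefix if it does.
		return fmt.Sprintf("%s-%d", sanitizeID(model), index)
	}
	return fmt.Sprintf("%s-%d-%x", sanitizeID(model), index, suffix)
}
```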

@ericcurtin changed the title from "Add dmrlet container orchestrator for AI inference" to "add dmrlet - lightweight node agent for Docker Model Runner" on Feb 4, 2026