A Fedora 43 Docker/Podman container that is Toolbx-compatible (usable as a Fedora toolbox) for serving LLMs with vLLM on AMD Ryzen AI Max “Strix Halo” (gfx1151). Built on TheRock nightly builds of ROCm.
Update: This toolbox now ships with a custom build of ROCm/RCCL that enables native RDMA/RoCE v2 support for Strix Halo (gfx1151). This allows you to connect two nodes via a low-latency interconnect (e.g., Intel E810) and run vLLM with Tensor Parallelism (TP=2), effectively acting as a single 256 GB unified-memory GPU.
👉 Read the Full RDMA Cluster Setup Guide for hardware requirements and configuration instructions.
This repository is part of the Strix Halo AI Toolboxes project. Check out the website for an overview of all toolboxes, tutorials, and host configuration guides.
This is a hobby project maintained in my spare time. If you find these toolboxes and tutorials useful, you can buy me a coffee to support the work! ☕
- Tested Models (Benchmarks)
- 1) Toolbx vs Docker/Podman
- 2) Quickstart — Fedora Toolbx
- 3) Quickstart — Ubuntu (Distrobox)
- 4) Testing the API
- 5) Use a Web UI for Chatting
- 6) Host Configuration
- 7) Distributed Clustering (RDMA/RoCE)
View full benchmarks at: https://kyuz0.github.io/amd-strix-halo-vllm-toolboxes/
Table Key: Cell values represent Max Context Length (GPU Memory Utilization).
| Model | TP | 1 Req | 4 Reqs | 8 Reqs | 16 Reqs |
|---|---|---|---|---|---|
| meta-llama/Meta-Llama-3.1-8B-Instruct | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| google/gemma-3-12b-it | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| openai/gpt-oss-20b | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| Qwen/Qwen3-14B-AWQ | 1 | 40k (0.95) | 40k (0.95) | 40k (0.95) | 40k (0.95) |
| btbtyler09/Qwen3-Coder-30B-A3B-Instruct-gptq-4bit | 1 | 256k (0.95) | 256k (0.95) | 256k (0.95) | 256k (0.95) |
| btbtyler09/Qwen3-Coder-30B-A3B-Instruct-gptq-8bit | 1 | 256k (0.95) | 256k (0.95) | 256k (0.95) | 256k (0.95) |
| dazipe/Qwen3-Next-80B-A3B-Instruct-GPTQ-Int4A16 | 1 | 256k (0.95) | 256k (0.95) | 256k (0.95) | 256k (0.95) |
| openai/gpt-oss-120b | 1 | 128k (0.95) | 128k (0.95) | 128k (0.95) | 128k (0.95) |
| zai-org/GLM-4.7-Flash | 1 | 198k (0.95) | 198k (0.95) | 198k (0.95) | 198k (0.95) |
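To relate the table values to actual launch flags, here is a hedged sketch of serving one of the listed models at the context length and memory utilization shown above (illustrative only; the `start-vllm` wizard described further down sets these flags for you):

```bash
# Illustrative: serve openai/gpt-oss-20b with a 128k context window at 95%
# GPU memory utilization, matching the "128k (0.95)" entry in the table.
vllm serve openai/gpt-oss-20b \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.95 \
  --port 8000
```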
The `kyuz0/vllm-therock-gfx1151:latest` image can be used in two ways:
- Fedora Toolbx (recommended for development): Toolbx shares your HOME and user, so models/configs live on the host. Great for iterating quickly while keeping the host clean.
- Docker/Podman (recommended for deployment/perf): Use for running vLLM as a service (host networking, IPC tuning, etc.). Always mount a host directory for model weights so they stay outside the container.
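For the deployment case, a minimal `podman run` sketch is shown below. It is not the exact command the project uses; the model name, cache path, and entrypoint invocation are assumptions you may need to adjust. It only illustrates the shape of a deployment: expose the GPU devices, relax seccomp, use host networking/IPC, and bind-mount a host directory so model weights stay outside the container.

```bash
# Hedged sketch of running vLLM as a service with Podman.
# Device/group/seccomp flags mirror the toolbox setup below; the mounted
# Hugging Face cache keeps downloaded weights on the host.
podman run --rm -it \
  --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render \
  --security-opt seccomp=unconfined \
  --network host --ipc host \
  -v "$HOME/.cache/huggingface:/root/.cache/huggingface" \
  docker.io/kyuz0/vllm-therock-gfx1151:latest \
  vllm serve openai/gpt-oss-20b --port 8000
```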
Recommended: Use the included `refresh_toolbox.sh` script. It pulls the latest image and creates the toolbox with the correct parameters:

```bash
./refresh_toolbox.sh
```

InfiniBand / RDMA Support: The script automatically detects if a fast InfiniBand link is active (it checks `/dev/infiniband`). If found, it sets up the container to expose these devices, enabling high-performance clustering.
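To check manually what the script will detect, it is enough to list the device nodes it looks for (assuming your RDMA NIC driver is loaded on the host):

```bash
# If this directory exists and lists devices (e.g. uverbs0), refresh_toolbox.sh
# will expose them to the container.
ls -l /dev/infiniband
```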
Manual Creation:
To manually create a toolbox that exposes the GPU and relaxes seccomp:
```bash
toolbox create vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  -- --device /dev/dri --device /dev/kfd \
  --group-add video --group-add render --security-opt seccomp=unconfined
```

Enter it:

```bash
toolbox enter vllm
```

Model storage: Models are downloaded to `~/.cache/huggingface` by default. This directory is shared with the host if you created the toolbox correctly, so downloads persist.
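To confirm the cache really is shared, compare the directory from the host and from inside the toolbox; both should show the same contents (the `hub` subdirectory only appears after the first download):

```bash
# Run this both on the host and inside the toolbox; the listings should match.
ls ~/.cache/huggingface/hub
```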
The toolbox includes a TUI wizard called `start-vllm`, which comes with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

Cache note: vLLM writes compiled kernels to `~/.cache/vllm/`.
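If compiled artifacts ever get stale (for example after pulling a newer image), it should be safe to clear that cache; vLLM will recompile on the next launch:

```bash
# Remove vLLM's compiled-kernel cache; it is rebuilt automatically.
rm -rf ~/.cache/vllm
```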
Ubuntu’s toolbox package still breaks GPU access, so use Distrobox instead:
```bash
distrobox create -n vllm \
  --image docker.io/kyuz0/vllm-therock-gfx1151:latest \
  --additional-flags "--device /dev/kfd --device /dev/dri --group-add video --group-add render --security-opt seccomp=unconfined"
distrobox enter vllm
```

Verification: Run `rocm-smi` to check GPU status.
The toolbox includes a TUI wizard called `start-vllm`, which comes with pre-configured models and handles the launch flags for you. This is the easiest way to get started.

```bash
start-vllm
```

Once the server is up, hit the OpenAI-compatible endpoint:
```bash
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"Qwen/Qwen2.5-7B-Instruct","messages":[{"role":"user","content":"Hello! Test the performance."}]}'
```

You should receive a JSON response with a reply in `choices[0].message.content`.
If you don't want to specify the model name by hand, you can query the currently deployed model first and reuse it:

```bash
MODEL=$(curl -s http://localhost:8000/v1/models | jq -r '.data[0].id')
curl -X POST http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d "{
    \"model\": \"$MODEL\",
    \"messages\":[{\"role\":\"user\",\"content\":\"Hello! Test the performance.\"}]
  }"
```

If vLLM is on a remote server, expose port 8000 via SSH port forwarding:
```bash
ssh -L 0.0.0.0:8000:localhost:8000 <vllm-host>
```

Then you can start HuggingFace ChatUI like this (on your host):
```bash
docker run -p 3000:3000 \
  --add-host=host.docker.internal:host-gateway \
  -e OPENAI_BASE_URL=http://host.docker.internal:8000/v1 \
  -e OPENAI_API_KEY=dummy \
  -v chat-ui-data:/data \
  ghcr.io/huggingface/chat-ui-db
```

This should work on any Strix Halo machine. For a complete list of available hardware, see the Strix Halo Hardware Database.
| Component | Specification |
|---|---|
| Test Machine | Framework Desktop |
| CPU | Ryzen AI MAX+ 395 "Strix Halo" |
| System Memory | 128 GB RAM |
| GPU Memory | 512 MB allocated in BIOS |
| Host OS | Fedora 43, Linux 6.18.5-200.fc43.x86_64 |
Add these boot parameters to enable unified memory while reserving a minimum of 4 GiB for the OS (max 124 GiB for iGPU):
```
iommu=pt amdgpu.gttsize=126976 ttm.pages_limit=32505856
```
| Parameter | Purpose |
|---|---|
| `iommu=pt` | Sets the IOMMU to "Pass-Through" mode. This helps performance, reducing overhead for both the RDMA NIC and the iGPU's unified memory access. |
| `amdgpu.gttsize=126976` | Caps GPU unified memory at 124 GiB; 126976 MiB ÷ 1024 = 124 GiB. |
| `ttm.pages_limit=32505856` | Caps pinned memory at 124 GiB; 32505856 × 4 KiB = 126976 MiB = 124 GiB. |
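If you want to reserve a different amount of memory for the OS, both values follow from the desired iGPU budget in GiB. This tiny helper just restates the arithmetic from the table:

```bash
# Derive amdgpu.gttsize (MiB) and ttm.pages_limit (4 KiB pages) from a
# desired unified-memory budget, e.g. 124 GiB.
GPU_GIB=124
echo "amdgpu.gttsize=$(( GPU_GIB * 1024 ))"        # 124 * 1024 = 126976 MiB
echo "ttm.pages_limit=$(( GPU_GIB * 1024 * 256 ))" # 126976 MiB / 4 KiB = 32505856 pages
```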
Apply the changes:
```bash
# Edit /etc/default/grub to add the parameters to GRUB_CMDLINE_LINUX, then:
sudo grub2-mkconfig -o /boot/grub2/grub.cfg
sudo reboot
```
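After the reboot, confirm that the parameters were picked up by the running kernel:

```bash
# All three parameters should appear in the kernel command line.
grep -E 'iommu|gttsize|pages_limit' /proc/cmdline
```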
This toolbox supports high-performance clustering of multiple Strix Halo nodes using InfiniBand or RoCE v2 (e.g., Intel E810). This enables Tensor Parallelism across machines with extremely low latency (~5 µs).
Detailed Documentation: RDMA Cluster Setup Guide
Key Features:
- Custom RCCL Patch: Use of a custom-built `librccl.so` to support RDMA on `gfx1151`.
- Easy Setup: `refresh_toolbox.sh` automatically detects and exposes RDMA devices.
- Cluster Management: Included `start-vllm-cluster` TUI for managing Ray and vLLM.
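As a quick sanity check before following the cluster guide, you can verify that the RDMA device is visible; this assumes the standard rdma-core utilities are installed (on the host or in the toolbox):

```bash
# Lists RDMA devices (e.g. the Intel E810 port) and their link state;
# an active link reports PORT_ACTIVE.
ibv_devinfo | grep -E 'hca_id|state'
```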