
Releases: ai-dynamo/dynamo

Dynamo v0.9.0

12 Feb 03:46
76c1889


Dynamo v0.9.0 Release Notes

Summary

Dynamo v0.9.0 completes the infrastructure decoupling started in v0.8.0, expands multimodal and diffusion model support across all three backends, and introduces smarter scheduling with predictive load estimation and routing hints.

Infrastructure Modernization

The new Event Plane—built on high-performance ZMQ transport with MessagePack serialization—joins the Discovery Plane and Request Plane to form a fully decoupled communication architecture. Dynamo deployments no longer require NATS or etcd: Kubernetes-native service discovery replaces etcd, KV router queries run over the native Dynamo endpoint instead of NATS, and the Event Plane provides a transport-agnostic pub/sub layer for system events. These changes simplify deployment topology and reduce operational dependencies.
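
To make the pub/sub model concrete, here is a minimal, self-contained sketch of ZMQ transport with MessagePack framing in Python (using pyzmq and msgpack). The topic name and payload fields are hypothetical illustrations, not Dynamo's actual Event Plane wire format.

```python
# Minimal illustration of ZMQ pub/sub with MessagePack framing, in the spirit
# of the Event Plane. Topic and payload fields are hypothetical examples.
import threading
import time

import msgpack  # pip install msgpack
import zmq      # pip install pyzmq

ENDPOINT = "tcp://127.0.0.1:5556"
TOPIC = b"kv_events"  # hypothetical topic name


def publisher() -> None:
    ctx = zmq.Context.instance()
    pub = ctx.socket(zmq.PUB)
    pub.bind(ENDPOINT)
    time.sleep(0.2)  # give the subscriber time to connect (slow-joiner)
    event = {"worker_id": 42, "blocks_stored": 8, "ts": time.time()}
    # Multipart frame: topic for ZMQ prefix filtering, then MessagePack body.
    pub.send_multipart([TOPIC, msgpack.packb(event)])
    pub.close()


def subscriber() -> None:
    ctx = zmq.Context.instance()
    sub = ctx.socket(zmq.SUB)
    sub.connect(ENDPOINT)
    sub.setsockopt(zmq.SUBSCRIBE, TOPIC)
    topic, body = sub.recv_multipart()
    print(topic.decode(), msgpack.unpackb(body))
    sub.close()


if __name__ == "__main__":
    t = threading.Thread(target=subscriber)
    t.start()
    publisher()
    t.join()
```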

Multimodal & Diffusion

Dynamo expanded multimodal support across all three backends in this release. Encoder disaggregation is now available for both vLLM (via the Embedding Cache connector) and TRT-LLM (via a standalone encoder), allowing encoding to run on a separate GPU from prefill/decode. Dynamo can now serve multimodal SGLang workloads on a single GPU instead of requiring a full E/PD split. We also added first-class support for diffusion-based language models — LLaDA2.0 can now be served alongside autoregressive models in the same Dynamo deployment.
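
The encoder-disaggregation pattern is essentially a content-addressed cache: the encoder hashes the media bytes, stores the embedding under that hash, and prefill/decode workers fetch it by key instead of re-encoding. A minimal illustrative sketch follows; the class and method names are assumptions, not the EC connector's API.

```python
# Illustrative content-addressed embedding cache for encoder disaggregation.
# Names and the in-memory store are hypothetical; Dynamo's EC connector is
# a separate, production implementation.
import hashlib
from typing import Dict, Optional

import numpy as np


class EmbeddingCache:
    def __init__(self) -> None:
        self._store: Dict[str, np.ndarray] = {}

    @staticmethod
    def key_for(media_bytes: bytes) -> str:
        # The content hash doubles as the cache key, so identical images
        # across requests map to the same stored embedding.
        return hashlib.sha256(media_bytes).hexdigest()

    def put(self, media_bytes: bytes, embedding: np.ndarray) -> str:
        key = self.key_for(media_bytes)
        self._store[key] = embedding
        return key

    def get(self, key: str) -> Optional[np.ndarray]:
        return self._store.get(key)


# Encoder side: compute the embedding once and publish only the hash.
cache = EmbeddingCache()
image = b"...raw image bytes..."
key = cache.put(image, np.random.rand(1, 1024).astype(np.float32))

# Prefill/decode side: fetch by hash instead of re-encoding the image.
embedding = cache.get(key)
assert embedding is not None and embedding.shape == (1, 1024)
```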

Scheduling Intelligence

The Router gained output block tracking with fractional decay for predictive load estimation, expected output token awareness, and support for routing hints from external orchestrators like the Kubernetes Gateway API Inference Extension (GAIE). The Planner added a Kalman filter and mooncake-style warmup for more accurate load prediction, along with SLA-driven autoscaling for MoE DEP/TEP configurations. The Profiler was enhanced with PVC model cache support and model name validation.
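
The idea behind output block tracking with fractional decay can be sketched in a few lines: each routed request charges its worker with the output blocks it is expected to produce, and that charge decays on every scheduling tick as generation drains. The decay factor, block size, and API below are illustrative assumptions, not the Router's actual implementation.

```python
# Minimal sketch of predictive load estimation with fractional decay.
# The decay factor and block accounting are illustrative, not the router's
# actual algorithm or tuning.
from dataclasses import dataclass
from typing import Dict


@dataclass
class WorkerLoad:
    # Estimated outstanding output blocks, decayed each tick.
    blocks: float = 0.0


class DecayingLoadEstimator:
    def __init__(self, decay: float = 0.9) -> None:
        self.decay = decay
        self.workers: Dict[str, WorkerLoad] = {}

    def on_route(self, worker_id: str, expected_output_tokens: int,
                 block_size: int = 16) -> None:
        # Charge the worker for the blocks this request is expected to produce.
        load = self.workers.setdefault(worker_id, WorkerLoad())
        load.blocks += expected_output_tokens / block_size

    def tick(self) -> None:
        # Fractional decay: older load contributes less as generation drains.
        for load in self.workers.values():
            load.blocks *= self.decay

    def pick_worker(self) -> str:
        return min(self.workers, key=lambda w: self.workers[w].blocks)


est = DecayingLoadEstimator(decay=0.9)
for w in ("worker-a", "worker-b"):
    est.workers[w] = WorkerLoad()
est.on_route("worker-a", expected_output_tokens=256)
est.tick()
print(est.pick_worker())  # worker-b, since worker-a still carries decayed load
```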

Kubernetes & Observability

The Operator added rollout restart for DynamoGraphDeployments, observability metrics, tolerations/affinity for GPU-specific scheduling, and improved restart reliability. Distributed tracing now spans the full request path including TCP transport, and the Prometheus metrics stack was simplified with multi-registry scrape support.


First-Time Contributors

We welcome 14 new contributors to the Dynamo project:

  • @siclait contributed a PR that truncates HttpError messages to 8192 characters to prevent ValueError on long messages (#5020).
  • @smatta-star contributed a PR that adds auto-generated OpenAPI spec and helper binary for the frontend (#4802).
  • @shpgy-shpgy contributed a PR that fixes multimodal processing error when handling pure text conversations (#5088).
  • @chay1045 contributed a PR that fixes hidden stop tokens appearing in output by returning None instead (#5238).
  • @wenqiglantz contributed a PR that adds prompt embeds support for pre-computed inference inputs in vLLM (#4739).
  • @yurekami contributed a PR that preserves original model path for frontend config downloads (#5102).
  • @erezzarum contributed a PR that fixes NIXL CUDA12 + CUDA13 build compatibility (#5000).
  • @soodoshll contributed a PR that fixes usage returning None when using text mode with vLLM (#5336).
  • @ls-2018 contributed a PR that fixes tag error handling (#5236).
  • @debermudez contributed a PR that updates aiperf to v0.4.0 (#5331).
  • @wangshangsam contributed a PR that updates vLLM import paths to align with upstream main (#5447).
  • @AbhiOnGithub contributed a PR that adds __all__ exports and __repr__ methods for improved debugging (#5606).
  • @davilu-nvidia contributed a PR that resolves SGLang E/P/D multimodal routing issues (#5500).
  • @adityapuranik99 contributed a PR that adds cupy-cuda12x to SGLang extras for CUDA compatibility (#5627).

Major Features & Improvements

Infrastructure Modernization

Discovery Plane

  • K8s-Native Service Discovery: Enabled Kubernetes-based discovery in GAIE and updated Helm charts/RBAC to support etcd-less deployments, allowing Kubernetes users to deploy without running a separate etcd cluster (#5303, #5432, #5364).
  • etcd Reliability: Resolved potential deadlocks in legacy etcd usage and updated examples to run without etcd, ensuring stable startup for users still on etcd-based discovery (#5091, #5422).
  • List-and-Watch Diffing: Resolved diffing logic issue where worker metadata updates (e.g., LoRA adapter additions) were not picked up, causing stale routing decisions (#5318).

Request Plane

  • NATS Dependency Removal: Migrated KV router worker queries to the native Dynamo endpoint to reduce NATS traffic (#5451), made NATS optional for KV-aware routing in approximate mode so local development works without a NATS server (#5237), fixed NATS container startup failure caused by invalid --max_payload CLI flag by moving it to config file (#5384), and cleaned up asymmetric request plane configuration in launch scripts (#5245).

Event Plane

  • Event Plane Architecture: Introduced a transport-agnostic Event Plane with MessagePack serialization and auto-discovery, decoupling system events (KV cache transfers, notifications) from direct NATS dependency. Added high-performance ZMQ transport as a scalable alternative for latency-sensitive event channels while preserving NATS for backward compatibility (#5674, #5614, #5624).
  • Event Plane NATS Init: Corrected NATS initialization logic based on --event-plane argument across all backends, preventing silent failures when NATS is not configured (#5750).
  • ZMQ Transport Timeout: Added receive timeout for ZMQ transport to prevent indefinite hangs when a publisher is unavailable (#5804).
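
For reference, the receive-timeout behavior maps onto ZMQ's standard RCVTIMEO socket option. The generic pyzmq sketch below (endpoint and timeout are placeholders, not Dynamo's transport code) shows how a bounded recv surfaces a missing publisher as a retryable error instead of a hang.

```python
# Generic pyzmq sketch: bound a blocking receive so a missing publisher
# surfaces as a timeout instead of an indefinite hang.
import zmq

ctx = zmq.Context.instance()
sub = ctx.socket(zmq.SUB)
sub.setsockopt(zmq.RCVTIMEO, 2000)       # fail recv() after 2 s
sub.setsockopt(zmq.SUBSCRIBE, b"")       # subscribe to everything
sub.connect("tcp://127.0.0.1:5556")      # endpoint is illustrative

try:
    message = sub.recv()                 # raises zmq.Again on timeout
    print("got event:", message)
except zmq.Again:
    print("no publisher within 2 s; backing off and retrying")
finally:
    sub.close()
    ctx.term()
```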

Networking

  • IPv6 Support: Added IPv6 support for SGLang disaggregation with proper address formatting, enabling deployments on IPv6-only networks (#5521).

Multimodal & Diffusion

SGLang

  • Aggregated Multimodal: Enabled Dynamo to serve multimodal SGLang workloads on a single GPU, removing the previous requirement for a 2-GPU E/PD split (#5450).
  • Diffusion LM Support: Enabled Dynamo to serve diffusion-based language models (LLaDA2.0) through the SGLang backend, using existing Dynamo infrastructure for pre/post processing with a new diffusion handler (#5533).
  • Multi-Image Qwen EC: Resolved multi-image bug in the Dynamo EC connector that dropped images beyond the first in multimodal requests (#5514).

TensorRT-LLM

  • Standalone Encoder: Added encoder disaggregation support to Dynamo's TRT-LLM integration, enabling encoding to run on a separate GPU from prefill/decode (#4668).
  • Multimodal Tokenizer Reuse: Optimized Dynamo's multimodal request pipeline for TRT-LLM by reusing the tokenizer across requests instead of reinitializing per request, reducing per-request latency (#5217).

vLLM

  • Embedding Cache Connector: Added the Embedding Cache (EC) connector to Dynamo's vLLM integration for encoder disaggregation, where the encoder stores embeddings by hash and PD workers consume them from cache—eliminating redundant encoding and reducing TTFT. Also enabled multiple image inputs per request and parallelized image loading (#5162, #5463, #5444).
  • Prompt Embeds Support: Added pre-computed embeddings as a secure input method to Dynamo, allowing applications to transform sensitive data into embeddings before submission for improved privacy and flexible prompt engineering (#4739).
  • EPD Refactor: Refactored Dynamo's EPD handler to orchestrate the full encode-to-PD flow (processor → encoder → processor → PD), supporting multiple multimodal data items per request instead of just one (#4994).
  • Decode Worker Qwen-VL: Resolved disaggregated decode crash for Qwen2.5-VL models caused by missing image_grid_thw data needed for mRoPE position encoding (#5281).
  • EPD Sampling Params: Corrected sampling params parsing in Dynamo's vLLM EPD flow that could silently produce incorrect generation parameters (#5833).

Performance & Hardware

  • SGLang Stream Output: Enforced stream_output=True in SGLang ServerArgs, switching from cumulative-to-delta token conversion to direct disjoint segment passthrough—reducing per-token processing overhead in streaming responses (#5510).
  • Multimodal Payload Optimization: Removed serialization/deserialization in gather_multi_model_data, significantly reducing latency for requests with large base64-encoded payloads (#5485).
  • Zero Copy TCP Decoder: Implemented zero copy decoder with bounded worker pool for TCP ingress, eliminating memory leaks under high concurrency and reducing per-message allocations (#5376).
  • MoE Data Parallel Tuning: Reduced VLLM_MOE_DP_CHUNK_SIZE to 384, lowering HBM footprint enough to enable inference on 16xH200 MoE configurations that previously hit OOM (#5307).
  • TRT-LLM GB200 Support: Resolved memory allocation failure on GB200 hardware (#5328) and updated the Wide-EP disaggregated GB200 recipe for compatibility with latest TRT-LLM version (#5383).

Router

  • Router Scheduling Intelligence: Added output block tracking with fractional decay for predictive load estimation (#5452), plumbed expected output tokens so the router can account for generation length when distributing requests (#5181), and added a flag to disable decode KV reuse assumption so the router computes actual block hashes for more accurate cache-hit predictions (#5350).
  • Routing Hints from Headers: Added support for reading routing hints from request headers, allowing external orchestrators (e.g., GAIE) to influence routing decisions without modifying the request body (#5502).
  • PrefillComplete Hook: Implemented PrefillComplete handling in Dyn...

Dynamo v0.8.1

23 Jan 07:37
5ea7ff0


Dynamo v0.8.1 Release Notes

Summary

Dynamo 0.8.1 is a patch release that adds profiler enhancements for Kubernetes deployments and fixes bugs affecting SGLang and worker identification. This release adds support for mounting model cache PVCs to profiler pods, fixes YAML configuration parsing for boolean flags in SGLang, resolves container build issues for CUDA 13 SGLang environments, and corrects a pod hash calculation issue that could affect worker identification in Kubernetes.

Base Branch: release/0.8.0

Major Features & Improvements

Kubernetes Deployment

  • Profiler Model Cache PVC Support: Added ability to mount model cache PVCs to profiler pods when specified in DynamoGraphDeploymentRequest, enabling profilers to access pre-downloaded model weights without re-downloading (#5212).

Bug Fixes

  • SGLang YAML Config Parsing: Fixed YAML config parsing for store_true arguments (e.g., trust-remote-code, enable-metrics) that were incorrectly converted to --flag true instead of just --flag, breaking boolean configuration options (#5513).
  • SGLang CUDA 13 Container Build: Fixed NVIDIA package installation in the SGLang CUDA 13 container to install CuDNN 9.16+ based on CUDA version, resolving PyTorch 2.9.1 compatibility issues with nn.Conv3d that caused performance degradation and excessive memory usage in multimodal workloads (#5461).
  • Worker ID Precision Loss: Fixed routing failures caused by f64 precision loss when worker/instance IDs exceeded 2^53, which caused approximately half of workers in large deployments to be unreachable for KV cache routing decisions (#5471).
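
The precision issue above is easy to reproduce: an IEEE-754 double has a 53-bit significand, so integer IDs above 2^53 can silently collapse to a neighboring value.

```python
# Why worker IDs above 2**53 break when routed through an f64:
# IEEE-754 doubles have a 53-bit significand, so consecutive integers
# above 2**53 collapse to the same float value.
worker_id = 2**53 + 1          # a 64-bit integer ID, as in large deployments
as_float = float(worker_id)    # what an f64 field stores

print(worker_id)               # 9007199254740993
print(int(as_float))           # 9007199254740992, a *different* worker
assert int(as_float) != worker_id
```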

Documentation

  • DGDR SLA Profiler Compatibility: Documented that DynamoGraphDeploymentRequest profiling configurations using camelCase field names and model cache PVC options require Dynamo 0.8.1 or later (#5492).

Known Issues

For known issues in this release, refer to the Known Issues section in the Dynamo v0.8.0 Release Notes.

Dynamo v0.8.0

15 Jan 22:24
115531a


Dynamo v0.8.0 Release Notes

Summary

Dynamo 0.8.0 continues the journey toward production-grade LLM serving with a Kubernetes-native architecture, expanded multimodal and agentic support, and enterprise-ready observability. This release reduces infrastructure complexity while providing a seamless experience regardless of which LLM framework you choose:

  • SGLang
  • TRT-LLM
  • vLLM

Kubernetes-Native Infrastructure

To address the scaling limitations of etcd and NATS, Dynamo 0.8.0 makes both optional for the discovery and request planes: Kubernetes-native service discovery via EndpointSlices replaces etcd, and a transport-agnostic request plane with TCP as the default replaces NATS. Validation webhooks catch CRD errors at submission time, and the operator manages health checks and scaling directly. These changes leverage Kubernetes primitives rather than working around them.
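
For readers new to EndpointSlice-based discovery, the sketch below shows the underlying Kubernetes primitive using the official Python client. The namespace and label selector are hypothetical, and Dynamo's own discovery plane is a separate implementation.

```python
# Generic illustration of EndpointSlice-based discovery using the official
# Kubernetes Python client. The namespace and label selector are hypothetical;
# Dynamo's discovery plane is a separate implementation.
from kubernetes import client, config, watch

config.load_incluster_config()           # or config.load_kube_config() locally
discovery = client.DiscoveryV1Api()

w = watch.Watch()
for event in w.stream(
    discovery.list_namespaced_endpoint_slice,
    namespace="dynamo",                                   # hypothetical namespace
    label_selector="app.kubernetes.io/part-of=dynamo",    # hypothetical label
    timeout_seconds=60,
):
    slice_ = event["object"]
    ready = [
        addr
        for ep in (slice_.endpoints or [])
        if ep.conditions and ep.conditions.ready
        for addr in (ep.addresses or [])
    ]
    print(event["type"], slice_.metadata.name, "ready endpoints:", ready)
```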

Multimodal Support

Dynamo 0.8.0 expands multimodal support across all backends. Audio inference for vLLM enables models like Qwen2-Audio, and a new frontend video decoder handles video input with configurable frame sampling for SGLang, TRT-LLM, and vLLM. Llama4 multimodal now works in disaggregated prefill/decode mode, and KV-aware routing supports multimodal requests end-to-end for TensorRT-LLM. Security controls allow operators to restrict multimodal content sources in production environments.

Agentic Workflows

As AI applications evolve from single-turn inference toward autonomous agents that reason, plan, and take action, Dynamo is building the infrastructure to support these workflows. Tool calling is now available for DeepSeek V3/R1/V3.2, Qwen3 Coder, and Jamba model families—the models powering today's most capable agents. Named and Required tool choice modes give explicit control over tool selection, and schema-aware type conversion ensures parameter values match their declared types. The nvext extension field provides worker_id, TTFT, and per-request timing for debugging multi-step agent pipelines.
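
Schema-aware type conversion means an argument the model emits as the string "3" arrives at the tool as the integer its JSON schema declares. A framework-agnostic sketch of that coercion step (not Dynamo's parser) looks like this:

```python
# Framework-agnostic sketch of schema-aware argument conversion for tool
# calls: values the model emits as strings are coerced to the types the
# tool's JSON schema declares. Not Dynamo's actual parser.
import json
from typing import Any, Dict

CASTS = {
    "integer": int,
    "number": float,
    "boolean": lambda v: str(v).strip().lower() in ("true", "1", "yes"),
    "string": str,
}


def coerce_arguments(raw_args: str, schema: Dict[str, Any]) -> Dict[str, Any]:
    """Parse the model's JSON arguments and cast each value per the schema."""
    args = json.loads(raw_args)
    props = schema.get("properties", {})
    converted = {}
    for name, value in args.items():
        declared = props.get(name, {}).get("type")
        cast = CASTS.get(declared)
        converted[name] = cast(value) if cast else value
    return converted


weather_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "days": {"type": "integer"},
        "metric": {"type": "boolean"},
    },
}

# A model frequently emits every argument as a string:
print(coerce_arguments('{"city": "Oslo", "days": "3", "metric": "true"}',
                       weather_schema))
# {'city': 'Oslo', 'days': 3, 'metric': True}
```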

Disaggregated Serving Performance

Prefill/decode disaggregation at scale requires efficient coordination. Local KV indexers for SGLang and TRT-LLM reduce coordination overhead with the central indexer, while dynamic rejection thresholds and early rejection protect decode workers from overload. Request cancellation now propagates cleanly during prefill-to-decode transitions, and frontend-based prefill routing for SGLang simplifies deployment topology. Non-blocking radix snapshots and async-first NIXL APIs improve transfer throughput across workers.

Production Observability and Resilience

Operating LLM infrastructure requires visibility and fault tolerance. Unified distributed tracing now propagates context from frontend through SGLang and vLLM backends, enabling end-to-end request debugging. A new Planner Grafana dashboard provides real-time SLA monitoring, and per-request metrics include prefill timing and KV cache hit rates. A complete CUDA fault injection framework enables GPU resilience testing in Kubernetes, allowing teams to validate recovery behavior before failures occur in production.

Multi-LoRA Serving

Dynamo 0.8.0 introduces comprehensive multi-LoRA serving for vLLM backends. Deterministic adapter ID generation enables consistent routing across replicas, while new management APIs and a local registry simplify adapter lifecycle. KV-aware routing now extends to LoRA requests, allowing the router to consider both prompt prefix and adapter state when selecting workers. Ready-to-use Kubernetes examples with MinIO sync demonstrate production-grade LoRA deployment patterns.
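
Deterministic adapter IDs can be derived by hashing a stable adapter identifier, so every replica computes the same ID without coordination. The scheme below is an illustrative assumption (identifier format, truncation width), not Dynamo's actual implementation.

```python
# Illustrative deterministic adapter-ID scheme: hash a stable identifier so
# every replica derives the same ID without a coordination service.
# The truncation width and identifier format are assumptions, not Dynamo's.
import hashlib


def adapter_id(adapter_uri: str, bits: int = 63) -> int:
    digest = hashlib.sha256(adapter_uri.encode("utf-8")).digest()
    # Take the leading bytes and mask to a positive 63-bit integer so the ID
    # fits safely in signed 64-bit fields.
    return int.from_bytes(digest[:8], "big") & ((1 << bits) - 1)


uri = "s3://adapters/customer-support/v3"   # hypothetical adapter location
print(adapter_id(uri))
# Any replica hashing the same URI derives the same ID, so KV-aware routing
# can key on (prompt prefix, adapter ID) consistently across the fleet.
```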


First-Time Contributors

  • @yuekaizhang contributed a PR that adds vLLM multimodal audio support for Qwen2-Audio models (#2760)!
  • @Dilu-Bilu contributed a PR that adds a guide for Speculative Decoding in vLLM using Eagle3 (#3895)!
  • @sozercan contributed a PR that updates the AKS deployment guide (#3651)
  • @nv-oviya contributed a PR that adds the CUDA fault injection library foundation (#4038)!
  • @flpanbin contributed a PR that adds dynamic default max_tokens support for vLLM backend (#4156)!
  • @AryanBagade contributed a PR that adds output token counter to frontend metrics (#4202)
  • @Spycsh contributed a PR that enables Intel Gaudi accelerators on Dynamo (#4209)!
  • @tangcy98 contributed a PR that adds tool call parser support for DeepSeek V3 and R1 (#4253)!
  • @chandlj contributed a PR that allows users to set --kv-transfer-config for vLLM (#4317)
  • @2ez4bz contributed a PR that enables autodeploy as a backend for TRT-LLM (#4347)!
  • @zhongxuanwang-nv contributed a PR that adds the nvext extension field to OpenAI APIs with worker_id reporting (#4372)!
  • @vladnosiv contributed a PR that fixes KV events config in aggregated router SGLang example (#4391)
  • @dmitrygx contributed a PR that fixes IPv6 support for SGLang ZMQ endpoint (#4403)
  • @nancya-nv contributed a PR that fixes model registration for SGLang multimodal workers (#4512)
  • @Monokaix contributed a PR that makes hostnames more descriptive and simplifies DNS check commands (#4551)
  • @c-fteixeira contributed a PR that disables etcd PodDisruptionBudget by default in Helm (#4602)
  • @hypdeb contributed a PR that fixes vLLM deprecation of disable_log_requests (#4659)
  • @esoba contributed a PR that adds logprobs support to TRT-LLM backend (#4759)!
  • @gtbai contributed a PR that installs Run:ai model streamer for vLLM (#4848)!
  • @MatejKosec contributed a PR that fixes vLLM multi-node support for TP and DP modes (#5006)!

Major Features & Improvements

Infrastructure Modernization

etcd Dependency Removal

Kubernetes-native service discovery replaces etcd dependency for simpler K8s deployments.

  • Kubernetes-Native Service Discovery: Introduced pluggable discovery system (#4070) with Kubernetes-native implementation via EndpointSlices (#4136), made K8s discovery the default (#5024), and added instance unregistration for clean scaling (#4459).
  • etcd-free Router and Operator: Updated Operator for etcd-less operation (#4214), made Router use the discovery pattern instead of etcd (#4244, #4597), removed etcd client dependency (#4489), and disabled etcd PodDisruptionBudget by default (#4602).
  • FileStore Auto-Expiring Leases: File-based key-value store entries now support automatic expiration with configurable TTL for local development without etcd. (#4301)
  • Remove Static Mode: Removed static endpoint functionality. Deployments using static endpoints must migrate to discovery-based endpoints. (#4235)
  • Namespace Computation: Normalized Dynamo namespace computation for consistent service discovery. (#5231)

NATS Dependency Removal

Dynamo 0.8.0 introduces a transport-agnostic request plane, enabling deployments without NATS for simpler infrastructure.

  • Transport-Agnostic Request Plane: Introduced transport-agnostic request plane (#4246), added --request-plane CLI flag for tcp, http, or nats selection (#4365), and made TCP the default transport (#4845).
  • NATS Infrastructure Cleanup: Made NATS metrics conditional on NATS usage (#4442), removed legacy stats handler (#4680), added Helm option to fully disable NATS deployment (#5035), and cleaned up internal NATS code (#4513, #4591).
  • Decentralized Router with NATS Core: Added support for NATS Core event routing mode as an alternative to JetStream. (#4921)

Multimodality Support

Expanded multimodal support across all backends with video, audio, and improved media handling.

  • vLLM Multimodal Audio: Added audio support for multimodal inference with Qwen2-Audio models (#2760), enabled efficient decoded media transfer via NIXL (#3988), and added security controls for multimodal requests (#4556).
  • Frontend Video Decoder: Added video decoder in the frontend preprocessor (#4719) with runtime-configurable settings for frame sampling and memory limits (#5011). Supports all multimodal backends (SGLang, TRT-LLM, vLLM).
  • Llama4 Multimodal Disaggregated Support: Migrated Llama4 multimodal support to disaggregated serving architecture. (#4213)
  • KV-Aware Routing Multimodal Support: Added multimodal support to KV-aware routing with standalone TRT-LLM example. (#4577)

OpenAI API

Enhanced OpenAI-compatible API with tool calling support for popular models.

  • DeepSeek Tool Calling: Added tool call parser for DeepSeek V3 and R1 (#4253), chat template support for V3.2 (#4797), and V3.2 tool calling support (#4822).
  • Qwen3 Coder Tool Parser: Added support for the Qwen3Coder tool-call format with detection and parsing of tool calls. (#4415)
  • Tool Choice Support: Added support for Named and Required tool choice modes, enabling explicit control over which tools the model uses. (#4722)
  • Jamba Tool Parsers: Added Jamba parser configuration for tool call parsing. (#4776)
  • Tool Definitions to Parsers: Tool definitions with parameter metadata can now be supplied to improve parsing accuracy. Parameter values are automatically converted to correct types based on schemas. (#4948)
  • prompt_tokens_details Support: Added prompt_tokens_details field in usage response for detailed token accounting. (#4239)
  • nvext Extension Field: Added nvext extension field to OpenAI APIs with worker_id reporting (#4372), and added TTFT and total request time (#4880).
  • include_stop_str_in_output Support: Added support for include_stop_str_in_output field in completions. (#4924)

Version Upgrades

  • **SGLan...

Dynamo v0.7.1

15 Dec 19:34
15f1a73


Dynamo v0.7.1 - Release Notes

Summary

Dynamo 0.7.1 is a patch release focusing on tool calling support, NIXL performance improvements, and preprocessing fixes. This release significantly expands function calling capabilities with new tool parsers for DeepSeek V3/R1 models and XML Coder format, improves NIXL concurrency and byte handling for better distributed inference performance, and fixes a critical preprocessor issue with stop token handling.

Base Branch: release/0.7.0.post1

Full Changelog

Performance and Framework Support

  • NIXL Byte Handling: Refactored how bytes are passed to NIXL in the nixl_connect module (#4860) to improve memory handling efficiency and compatibility with NIXL's native byte processing requirements for distributed KV cache transfers.
  • NIXL Concurrency Improvements: Enhanced concurrency support in the nixl_connect module (#4862) to enable better parallel processing of NIXL operations, improving throughput for disaggregated inference workloads with multiple concurrent requests.

Tool Calling Support

  • DeepSeek V3/R1 Tool Parser: Added toolcall parser support for DeepSeek V3 and DeepSeek R1 models (#4861) enabling function calling capabilities with these popular open-weight reasoning models for agentic workflows and structured output generation.
  • XML Coder Tool Parser: Implemented XML Coder tool parser format (#4859) providing an additional function calling format option for models that use XML-based tool definitions and responses.
  • Tool Call Configuration Types: Refactored tool call configuration with new config types (#4857) improving type safety, validation, and extensibility of tool calling configuration options across supported models and parsers.

Bug Fixes

  • Preprocessor Stop Field: Fixed preprocessor to properly populate the "stop" field in request handling (#4858) ensuring stop sequences are correctly propagated through the inference pipeline and models properly terminate generation at specified stop tokens.
  • min_tokens with ignore_eos: Fixed an issue where setting ignore_eos=true would automatically override min_tokens to equal max_tokens (#4908) ensuring users can continue generation past the EOS token without being forced to generate the maximum number of tokens.

Dynamo v0.7.0.post1

06 Dec 04:17
41a72ab


Dynamo v0.7.0.post1 - Release Notes

Summary

Dynamo 0.7.0.post1 is a minor release focusing on a TensorRT-LLM version upgrade, observability enhancements, and critical bug fixes. This release upgrades TensorRT-LLM to version 1.2.0rc3 with updated KV cache transfer defaults, adds comprehensive KServe health check endpoints for production monitoring, and resolves metrics visibility issues in Kubernetes deployments.

Base Branch: release/0.7.0

Full Changelog

Performance and Framework Support

  • TensorRT-LLM 1.2.0rc3: Upgraded TensorRT-LLM dependency to version 1.2.0rc3 (#4645) with updated KV cache transfer configuration changing the default from UCX-only to NIXL with UCX backend for improved memory transfer performance, making UCX KVCache opt-in rather than default.

Fault Tolerance & Observability

  • KServe gRPC Health Endpoints: Added gRPC health check endpoints for system monitoring, including ServerLive, ServerReady, and ModelReady (#4708), enabling verification of server liveness, overall system readiness, and per-model availability for improved Kubernetes integration with liveness and readiness probes (a minimal client-side probe is sketched after this list).
  • KServe HTTP Metrics Endpoint: Added configurable HTTP metrics endpoint to KServe gRPC service (#4400) enabling concurrent execution of HTTP metrics and gRPC servers with custom host and port parameters for improved observability in production deployments.
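
On the client side, the standard KServe v2 health RPCs can be probed with tritonclient's gRPC client. This assumes the endpoint speaks the stock KServe v2 gRPC surface; host, port, and model name are placeholders.

```python
# Client-side probe of the KServe v2 health RPCs (ServerLive, ServerReady,
# ModelReady) using tritonclient's gRPC client. Host, port, and model name
# are placeholders; this assumes a stock KServe v2 gRPC surface.
import tritonclient.grpc as grpcclient  # pip install tritonclient[grpc]

client = grpcclient.InferenceServerClient(url="localhost:8787")

print("server live:",  client.is_server_live())
print("server ready:", client.is_server_ready())
print("model ready:",  client.is_model_ready("my-model"))
```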

Documentation

  • TensorRT-LLM Multimodal EPD: Updated TensorRT-LLM commit reference from v1.2.0rc2 to v1.2.0rc3 in multimodal EPD documentation (#4713) to ensure users build with the correct tested version for Encode-Prefill-Decode feature, aligning with the upgraded TensorRT-LLM dependency in this release.

Bug Fixes

  • LMCache Prometheus Metrics: Fixed LMCache metrics visibility when PROMETHEUS_MULTIPROC_DIR is explicitly set in Kubernetes deployments (#4654) by implementing a dual-registry approach that resolves Prometheus registry conflicts, ensuring lmcache:* metrics are properly exposed in production environments (the generic pattern is sketched after this list).
  • KvEventPublisher Signature: Fixed KvEventPublisher method signature (#4754) to resolve compatibility issues with KV block manager event publishing system and prevent runtime errors in disaggregated deployments.
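
The dual-registry fix above builds on the standard prometheus_client multi-process recipe; the generic pattern (not LMCache's or Dynamo's code) is:

```python
# Generic prometheus_client multi-process pattern: when
# PROMETHEUS_MULTIPROC_DIR is set, per-process metric files are aggregated
# into a dedicated registry at scrape time. Standard upstream recipe,
# not LMCache's or Dynamo's code.
import os

from prometheus_client import CollectorRegistry, generate_latest, multiprocess

# PROMETHEUS_MULTIPROC_DIR must point at a writable directory shared by all
# worker processes *before* any metrics are created.
os.environ.setdefault("PROMETHEUS_MULTIPROC_DIR", "/tmp/prom_multiproc")
os.makedirs(os.environ["PROMETHEUS_MULTIPROC_DIR"], exist_ok=True)

# A separate registry aggregates the per-process files; the default global
# registry keeps serving any single-process metrics, hence "dual registry".
multiproc_registry = CollectorRegistry()
multiprocess.MultiProcessCollector(multiproc_registry)

print(generate_latest(multiproc_registry).decode())
```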

Known Issues

Helm Chart Image Tag Mismatch

There is a version format inconsistency between the Helm chart's appVersion and the Docker image tags in this release:

| Resource | Format | Value |
| --- | --- | --- |
| Git tag | PEP 440 | v0.7.0.post1 |
| Helm chart | SemVer | dynamo-platform-0.7.0-post1.tgz |
| Docker image | PEP 440 | kubernetes-operator:0.7.0.post1 |

Impact: Deploying the dynamo-platform Helm chart without overriding the image tag will fail with ImagePullBackOff. This occurs because the Helm chart's appVersion uses SemVer format (0.7.0-post1 with a hyphen), but the actual Docker images on nvcr.io use PEP 440 format (0.7.0.post1 with a dot).

Workaround: Explicitly override the image tag during Helm installation:

helm install dynamo-platform \
  --set "dynamo-operator.controllerManager.manager.image.tag=0.7.0.post1" \
  # ... other options

Or in a values file:

dynamo-operator:
  controllerManager:
    manager:
      image:
        tag: "0.7.0.post1"

Dynamo v0.7.0

26 Nov 19:54
f49d687


Dynamo v0.7.0 - Release Notes

Summary

Dynamo 0.7.0 focuses on production‑grade serving from configuration to deployment, infrastructure modernization, and expanded multimodal and performance capabilities. In this release, the path from AIConfigurator configs through planner layouts, Grove scheduling, and Kubernetes operators is hardened for real‑world operations, while the control plane is simplified with more cloud‑native building blocks. Dynamo seamlessly supports all major LLM frameworks:

  • SGLang
  • TensorRT-LLM
  • vLLM

Modular KV Block Manager

We are introducing KV Block Manager as a standalone pip-installable wheel, which decouples KV cache management from the serving stack, allowing more flexible integration options across inference engines and frameworks. KVBM currently supports TensorRT-LLM and vLLM with planned SGLang compatibility, and now, as a standalone implementation, can support inference frameworks like Triton. This modularity will enable operators to build the production topology that best fits their needs.

Production‑Grade Serving

Dynamo 0.7.0 strengthens the path from config to live traffic by connecting AIConfigurator deployment specs to planner layouts and Grove’s Kubernetes‑native scheduling. AIConfigurator lets you describe models and topologies once, and Grove turns those into correctly scheduled, autoscaled pods for multi‑node and disaggregated inference in your cluster, backed by finer‑grained fault tolerance, runtime health checks, and operator/CRD lifecycle automation (AIConfigurator, Grove).

Infrastructure Modernization

This release modernizes Dynamo’s infrastructure by beginning the removal of legacy NATS and etcd dependencies in favor of HTTP/TCP request‑plane transports and native Kubernetes service discovery. A filesystem‑backed key‑value store supports simple non‑distributed deployments, while refined system port configuration and a unified service discovery interface make the stack more cloud‑native, predictable to operate, and easier to test across environments.

Multimodal, Performance, and Framework Enhancements

Dynamo 0.7.0 sets the stage for multimodal support with base64 and HTTP image URL handling, unified media decoding and fetching paths, and refreshed vLLM multimodal examples for clearer setup. Performance and framework features advance with TensorRT‑LLM CUDA graphs and SGLang warmup optimizations.

First-Time Contributors

  • @YAMY1234 contributed a PR that adds automated prefill warmup for SGLang that reduces initial TTFT (#4058)!
  • @lynnmatrix contributed a PR that fixes a syntax error in dynamo.code-workspace (#4055)

Major Features & Improvements

Infrastructure Modernization

NATS Removal

  • Alternative Request Plane Transports: Implemented HTTP and TCP transport alternatives for the request plane via DYN_REQUEST_PLANE environment variable (#4307), enabling flexible deployment without NATS dependency and improving system simplicity.
  • Port Allocation Refactor: Removed port reservation system in favor of manual specification for more predictable deployments (#4142).
  • System Port Configuration: Deprecated DYN_SYSTEM_ENABLED in favor of DYN_SYSTEM_PORT for clearer configuration semantics (#4082).

ETCD Removal

  • Kubernetes-Based Service Discovery: Added native Kubernetes service discovery via EndpointSlices and metadata endpoint (#4136, #4150), enabling the Inference Gateway's Endpoint Picker to discover Dynamo workers without etcd for streamlined cloud-native deployments.
  • Operator Discovery Backend Configuration: Added --discovery-backend flag to the Kubernetes operator enabling selection between etcd (default) and Kubernetes native service discovery, supporting migration away from etcd dependency (#4268).
  • Filesystem KeyValueStore: Implemented filesystem-backed KeyValueStore for non-distributed deployments (#4138).

Multimodal

  • Multimodal Support: Added base64 and HTTP image URL support to vLLM workers for multimodal inference (#3967, #4114), an image decoder in the frontend for preprocessing multimodal inputs (#3971), a vLLM multimodal example for improved clarity and maintainability (#3634), media decoder and fetcher options in the MDC for flexible multimodal handling (#4094), media URL passthrough in the OpenAI preprocessor for streamlined multimodal request handling (#3733), and multimodal example port allocation that matches vLLM components for consistency (#4163).
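
Client-side, base64 images follow the standard OpenAI chat-completions multimodal format. A hedged example against an OpenAI-compatible frontend (base URL, port, model name, and image path are placeholders) looks like this:

```python
# Client-side sketch of a base64 image request in the OpenAI
# chat-completions format. Base URL, port, model name, and image path are
# placeholders; adjust to your Dynamo frontend deployment.
import base64

from openai import OpenAI  # pip install openai

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

with open("cat.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="Qwen/Qwen2.5-VL-7B-Instruct",   # placeholder multimodal model
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }],
)
print(response.choices[0].message.content)
```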

OpenAI API

  • OpenAI API Compatibility: Added support for skip_special_tokens parameter in /v1/completions and /v1/chat/completions endpoints for flexible token handling (#4175) and rejection of unsupported parameters with 400 Bad Request response for /v1/chat/completions endpoint (#4021) and /v1/completions (#4140).
  • OpenAI API Batch Completions: Enhanced HTTP completion endpoint to accept arrays of prompts and generate multiple completions per prompt for improved throughput (#3953).

Version Upgrades

  • CUDA 13 Support: Upgraded TensorRT-LLM containers to CUDA 13.0-based images, enabling compatibility with next-generation GPU architectures and the latest CUDA toolkit features (#4405).
  • TensorRT-LLM 1.2.0rc2: Updated TensorRT-LLM to version 1.2.0rc2 with enhanced engine initialization, improved sampling parameter handling, and better ARM64 support (#4405).
  • SGLang 0.5.3.post4: Updated SGLang to version 0.5.3.post4 with performance improvements, bug fixes, and enhanced compatibility for distributed serving (#4227).

Performance and Framework Support

  • TensorRT-LLM CUDA Graphs: Implemented ForwardPassCallback API from TensorRT-LLM to register end of forward pass callbacks, enabling CUDA graphs for improved inference performance (#3297).
  • SGLang Warmup Optimization: Added dummy warmup request for SGLang prefill workers to reduce cold start latency (#4058).

Fault Tolerance & Observability

  • ETCD Resilience: Implemented ETCD high availability client failover with lease keep-alive resilience for improved distributed coordination reliability (#3868) and lease watch resilience to handle ETCD server failures gracefully (#3950).
  • Runtime Health Checks: Added --runtime-check flag to sanity_check.py for validating runtime environment configuration (#4102).
  • CPU Metrics Dashboard: Added CPU metrics to Grafana Dynamo Dashboard for comprehensive system monitoring (#3908).
  • Frontend Metrics: Added an output token counter to frontend metrics to improve usage tracking and observability (#4231).
  • vLLM Prefill Metrics: Added prefill worker metrics support for vLLM to improve observability of prefill operations (#3949).
  • Graceful Import Error Handling: Added try-except blocks with AttributeError handling for optional dependency imports to prevent crashes when modules are installed but missing expected attributes (#4392).

Kubernetes Deployment

  • Operator Deployment and Isolation: Enabled cluster-wide and namespace-restricted operators to coexist in the same cluster, updated dynamoNamespace to use Kubernetes namespace plus DGD name, and removed cluster-wide logic from namespace-restricted operators to improve security and multi-tenancy (#3966, #4126, #3934).
  • Operator and CRD Lifecycle Automation: Introduced the dynamoModel Custom Resource Definition for model lifecycle management and added CI checks to ensure operator code generation keeps CRDs and generated code in sync (#4166, #4139).
  • Profiler Storage Migration to ConfigMaps: Migrated profiler result storage from PVCs to ConfigMaps, removed deprecated PVC manipulation scripts, and eliminated PVC logic from the profiler planner to simplify deployment and reduce storage overhead ([#3981](https://github...

Dynamo v0.6.1

06 Nov 21:49
4ad03ae


Dynamo v0.6.1 Release Notes

Summary

Dynamo 0.6.1 focuses on improving production readiness, our disaggregated inference architecture, and, as always, performance optimization. In addition, Dynamo 0.6.1 contains the second tranche of UX improvements and upgrades to provide a world-class dev experience. Dynamo seamlessly supports all major LLM frameworks:

  • TensorRT-LLM
  • vLLM
  • SGLang

Production Readiness: Kubernetes deployment capabilities matured with comprehensive operator improvements including vLLM DP across multiple nodes, automated DGDR profiling as a Kubernetes CR, and intelligent Grove resource allocation. Pre-deployment validation prevents configuration errors before cluster deployment. The build system was streamlined with Docker refactoring and devcontainer standardization for pytest compatibility.

KV Router: KV Router architecture evolved to support disaggregated prefill/decode serving with the prefill router now integrated directly into the frontend. Radix tree operations gained non-blocking locks for better concurrency, while Python bindings released the GIL during operations. Metrics collection expanded with TensorRT-LLM Prometheus support and a redesigned composition-based API.

Developer Experience: Documentation underwent a major reorganization to improve clarity and navigation, with content restructured into logical categories and broken links fixed. New guides cover KVBM connector APIs, KV Smart Router benchmarking, and request cancellation for all backends. Model recipes expanded with clearly stated GPU requirements as well as the first Qwen3-32B-FP8 recipe. Lastly, deployment guides added AIConfigurator examples for disaggregated inference.

Major Features & Improvements

Performance and Framework Support

  • AIPerf Benchmarking: Updated benchmarking infrastructure by replacing genai-perf with aiperf across components/backends and benchmarking scripts (#3528, #3533, #3306) to standardize performance testing.
  • Profiling Automation: Added support for YAML config input for pre-deployment sweep script (#3622) and automatic profiling config generation (#3787) to streamline performance optimization workflows.
  • GKE Examples: Published GKE deployment examples (#2721) showcasing cloud platform compatibility.
  • GB200 Support: Enhanced SGLang with experimental GB200 FP4 support and updated GB200 FP8 commands (#3745) for latest hardware optimizations.
  • API Enhancements: Extended TensorRequest and TensorResponse to contain extra parameters (#3761) and added echo parameter validation for /v1/completions (#3813) for enhanced API capabilities.
  • Python Performance: Optimized Python bindings with GIL release for radix tree operations and added dump_tree_as_events functionality (#3748) to improve concurrency.
  • Model Management: Improved frontend with model config files (tokenizer.json et al.) retrieved from MX (#3659) and added Python binding for model download (#3593) to simplify model management.
  • Metrics Optimization: Cached compiled regex patterns in Prometheus metrics filtering (#3825) for performance optimization.

Fault Tolerance & Observability

  • Exception Handling: Implemented TensorRT-LLM exception catching (#3544) for improved error handling.
  • Request Cancellation: Enabled request cancellation during or before stream establishment (#3635) to prevent resource leaks.
  • Metrics Infrastructure: Added TensorRT-LLM Prometheus metrics support with prefixing and filtering (#3676), completed TensorRT-LLM and SGLang metrics validation (#3842), and redesigned metrics API from Trait to composition (#3687) for cleaner observability architecture.
  • Audit Logging: Implemented NATS sink for audit logging (#3732) for comprehensive system tracking.
  • Test Monitoring: Added test metrics upload (#3648) for continuous quality monitoring.
  • Deployment Validation: Added multiple _core*.so detection in sanity_check.py (#3803) to prevent deployment issues.

Kubernetes Deployment

  • Pre-Deployment Validation: Added pre-deployment checks (#3573) to validate cluster readiness.
  • Multi-Node vLLM: Enabled vLLM data parallelism multi-node support in operator (#3595) for distributed deployments.
  • Deployment Simplification: Streamlined GAIE deployment with blackbox available via simple flag (#3591) and enabled routers sync in EPP (#3657) for simplified deployment workflows.
  • E2E Testing: Added e2e Dynamo deploy tests (#3243) for comprehensive validation.
  • DGDR (DynamoGraphDeploymentRequest): Added DGDR custom resource (#3489), refactored DGDR to use profiler's native configuration format (#3758), turned profiling k8s jobs into sample DGDR requests (#3864), and removed deploy/utils RBAC (#3771) to improve operator functionality.
  • Grove Integration: Implemented Grove detection with automatic usage when available (#3789) for intelligent resource allocation.
  • MoE Testing: Added vLLM MoE Kubernetes functional tests (#3672) for backend validation.
  • Docker Refactoring: Refactored Docker builds by moving EPP build dockerfile (#3555), removing redundant COPY in dev stage of framework Dockerfiles (#3690), and removing unused build args with updated comments (#3688).
  • Dev Environment: Standardized development environment with devcontainer configuration using /workspace paths (#3870) and removed hardcoded /workspace paths across tests (#3888) for pytest compatibility.
  • Config Organization: Moved engine configs out of components directory (#3772) for better organization.
  • EPP Simplification: Removed component parameter from EPP (#3831) to simplify configuration.

KV Block Manager

  • GPU-to-Disk Offload: Enabled KVBM GPU offload to disk, bypassing the CPU (#3510), to support performance benchmarking efforts.
  • Architecture Simplification: Eliminated ETCD from leader-worker initialization (#3202) to simplify KVBM architecture and reduce dependencies.

Scheduling

Planner

  • Prefill Discovery: Added prefill workers to discovery (#3709) for disaggregated serving support.
  • Profiling Jobs: Planner's pre-deployment profiling job is now implemented as a DGDR custom resource for improved operator integration.

Router

  • Request Cleanup: Implemented router frees request from slot manager on stopped requests (#3623) to prevent memory leaks.
  • Data Parallelism Routing: Added DP rank routing (#3597) for data parallelism support.
  • Radix Tree Concurrency: Implemented non-blocking lock for radix uploading and read lock for radix downloading (#3655) to improve concurrency.
  • Prefill/Decode Disaggregation: Baked prefill router into frontend, supporting vLLM initially (#3762) as a major architectural enhancement for disaggregated inference.

Other

Python Bindings & API

  • ABI Compatibility: Built Python package with ABI compatibility (cross py3.10+) (#3571) for broader Python version support.
  • KServe Support: Added Python binding for KServe gRPC frontend (#3739) to support standard inference protocols.

Runtime Improvements

  • Mutex Optimization: Replaced std::sync::Mutex with parking_lot::Mutex in runtime (#3740) for performance optimization.
  • Optional Dependencies: Made nats_client optional internally...

v0.6.0

28 Oct 12:57
e02605b


Dynamo 0.6.0 Release Notes

Dynamo v0.6.0 strengthens Dynamo's production readiness with comprehensive fault tolerance and observability capabilities, advanced Kubernetes deployment infrastructure, and a vastly improved developer experience with better documentation and a more unified experience across the LLM inference engines (see Support Matrix for details):

  • NVIDIA TRT-LLM
  • vLLM
  • SGLang

Fault Tolerance & Observability: Request cancellation across all backends ensures clean resource cleanup and prevents resource leaks. Coordinated shutdown processes eliminate VRAM leaks, while automatic worker inhibition prevents cascading failures. Unified metrics collection, OTEL/Tempo distributed tracing, and audit logging provide complete request visibility. Troubleshoot issues faster with end-to-end tracking across all processes and real-time system monitoring.

Developer Experience & Deployment Infrastructure: Dev containers for all frameworks (vLLM, SGLang, TensorRT-LLM) streamline local development. An overhaul of our documentation provides more consistency in the user path for each of the different frameworks. Custom chat templates and comprehensive documentation guides accelerate time to production. Multi-node Kubernetes examples demonstrate proper startup ordering and ARM64 support enables Dynamo deployment across an even larger set of hardware configurations. Automated planner integration and cluster-wide operator installation simplify deployments at scale. The SLA-aware planner with automated profiling optimizes resource allocation, while prefill-aware routing across all backends improves efficiency. Enhanced KV Block Manager adds prefill/decode disaggregation, disk offloading with access pattern filtering, and comprehensive metrics for fine-grained control.


Major Features and Improvements

1. Performance and Framework Support

  • Published recipes, performance sweeps, and benchmarks on InferenceMax, showcasing performance gains and TCO benefits from using Dynamo
  • Added Rayon compute pool for CPU-intensive operations (#2969), improved snapshot performance with reverse lookup (#3370), and optimized request processing with event-driven metrics updates (#3207) to optimize performance.
  • Added ability to run without etcd (#2281) for simplified deployments in controlled environments.
  • Added custom chat templates (#3165, #3332, #3362).
  • Parsers library with JSON-based parsers, parallel tool calling (#3188), reasoning transformation (#3295), and GPT-OSS reasoning integration (#3321)

2. Fault Tolerance & Observability

  • Implemented request cancellation across all backends (vLLM #3465, TensorRT-LLM #3193, SGLang #3465) enabling clean resource cleanup and preventing resource leaks
  • Implemented coordinated shutdown processes (#3481, #3513) with SIGINT/SIGTERM handling and vLLM engine cleanup (#2898) to prevent VRAM leaks and ensure clean service restarts
  • Unified metrics collection with cross-process instrumentation (#2243), Python Metrics Registry (#3341), and OTEL/Tempo visualization (#3307, #3160)
  • Added distributed tracing context support to Python bindings (#3160) and OTEL exporter with Tempo visualization (#3307) for end-to-end request tracking
  • Improved error messaging (#3587, #3210, #3549) to quickly identify and resolve deployment issues.
  • Standardized Prometheus naming conventions (#3035), added SGLang/vLLM passthrough metrics (#3539), custom NIM backend metrics (#3266), and configurable histogram buckets (#3562).
  • Implemented audit logging for chat completions (#3062) and comprehensive system status tracking with uptime monitoring (#2354, #3411)

3. Kubernetes Deployment

  • Deployed multi-node examples for TensorRT-LLM, vLLM, and SGLang (#3100, #3462) with startup ordering, resource coordination, and multinode operator behavior documentation (#3506).
  • Enabled ARM64 build support (#3146) for broader deployment compatibility
  • Added tolerations support (#2445), custom annotations, vLLM compilation cache (#3257), and improved Grove integration.
  • Enhanced conditional backend workflows (#3141), operator build per-commit (#3712), and improved container build metrics (#3461).
  • Implemented cluster-wide operator installation (#3199), and improved CRD documentation (#3504) for improved security and management
  • Automated planner deployment in Kubernetes operator with cluster-wide service account setup (#3520).
  • Switched K8s FT tests to AIPerf, replacing genai-perf (#3289); added a legacy client with AIPerf for FT tests to support both testing modes (#3415).

4. KV Block Manager

  • Enabled PD disaggregation in vLLM to support prefill/decode separation (#3352).
  • Added disk-offloading filtering to selectively offload based on access patterns (#3532), extending SSD lifespan.
  • Added KVBM metrics (#3561) for cache-hit analysis to optimize memory utilization; added FullyContiguous layouts (#3090) for optimal data transfer efficiency.

5. Scheduling

Planner

  • Deployed SLA planner with automated parallelization mapping, pre-deployment profiling (#3441), and performance prediction to optimize resource allocation.

Router

  • Implemented approximate KV routing and prefill-aware routing; enabled prefill router support across all backends (#3401, #3329, #3471, #3498).

6. User Experience and Documentation

  • Added comprehensive guides and documentation: tool calling (#2866), KVBM (#3578, #3759), vLLM KVBM 2P2D example (#3526), KV Router A/B testing guide (#3742), standalone routing Python bindings (#3308), multinode operator behavior (#3506), planner quickstart guide (#3358), deployment examples, and API references with OpenAPI routes (#3480); added GPU details for model recipes (#3707).
  • Major documentation reorganization (#3756, #3440) and added version switcher (#3711); restructured source code for better packaging (#3201), added component auto-discovery (#3348), fixed non-editable installs (#3478), and improved cross-platform compatibility (#3044).
  • Dev containers for vLLM, SGLang, and TensorRT-LLM to streamline local development (#3228, #3576).
  • Added framework-specific test markers (#2561) and unit tests for custom Jinja templates for vLLM, TRT-LLM, and SGLang (#3165, #3332, #3362, #3472).

7. Others:

  • Tool Calling & Reasoning: Reasoning parser transformation to extract reasoning tokens (#3295); e2e test for reasoning_effort on gpt-oss to validate reasoning modes (#3421).
  • Multimodal: Multimodal EPD for SGLang to support image inputs (#3230).

8. Bug Fixes

  • Fixed TensorRT-LLM multinode command (#3311, #3373)
  • Removed VSWA user prompts (#3404)
  • Fixed circular rust dependencies (#3609), corrected commit info copying (#3670), resolved CUDA lock issues (#3704), and improved container security (#3367).
  • Updated distroless Go container for OpenSSL (#3486)
  • Attached dynamo namespace label to Grove PodCliques (#3359)
  • Fixed router registration to etcd (#3302)

Framework Updates

  • Upgraded NIXL to 0.6.0 (#3550)
  • Upgraded vLLM to 0.11.0 (#3422)
  • Upgraded SGLang container and version (#3647)

--

Known Issues

  • Building the Dynamo custom end-point picker (EPP) fails due to the incorrect filepath for patch
  • DS-R1 recipe fails due to the SLA profiling process crashing
  • GPT-OSS recipe does not work out-of-the-box

What's Next

Following this 0.6.0 release and a minor 0.6.1 release, in our next major release (0.7.0) we will continue to prioritize improving performance across top models, ensuring robust fault-tolerance & observability in production scenarios, more backend integrations, production-grade cache management support, smarter routing/scheduling strategies, and developer-friendly UX.

We'd love to hear your feedback and comments - please open up an issue for any feature requests and chat with us on Discord!


Release Assets

Python Wheels:

Rust Crates:

Containers:

  • TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.6.0 NGC
  • vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.6.0 NGC
  • SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.6.0 NGC
  • Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.6.0 NGC

Helm Charts:


Dynamo v0.5.1

14 Oct 02:02
3ecc1fb


Dynamo Release v0.5.1

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models at data-center scale. We're an open-source-first project under the Apache 2.0 license, built in Rust for performance and Python for extensibility. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see Support Matrix for details):

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release delivers major advances in KV routing capabilities with the new vLLM prefill router and commit router, comprehensive canary health checks across all backends, and significant tool calling enhancements. We strengthened production reliability with request cancellation support, improved Kubernetes deployment workflows, and expanded multinode capabilities. Lastly, we enhanced KVBM performance with vectorized memory transfers and tighter integration with TensorRT-LLM v1.1.0rc5.

Major Features and Improvements

1. Advanced KV Routing & Cache Management

KV Router

  • Introduced vLLM prefill router for optimized prefill phase handling (#3155)
  • Implemented KV commit router for improved cache consistency (#3024)
  • Added router benchmarking capabilities with mooncake-style testing (#3068, #2828)
  • Enabled router to optionally skip tracking active blocks during prefill and cached blocks during decode (#3135)
  • Router replicas with state-sharing for improved scalability (continued from v0.4.1)

KVBM (KV Block Manager)

  • Implemented vectorized copy between pinned memory and device memory for improved transfer performance (#2989)
  • Enhanced KVBM transfer context v2 (#2873)
  • Added KV indexer metrics for better observability (#2905)
  • Updated integration with TensorRT-LLM v1.1.0rc5 connector API (#2979, #3119)
  • Improved error handling with early stop for missing CPU/disk configuration (#2997)

2. Enhanced Health Checks & Reliability

Canary Health Checks

  • Implemented canary health check framework (#2903)
  • Added TensorRT-LLM canary health check with BOS token support (#3082, #3145)
  • Deployed SGLang canary health check (#3103, #3123)
  • Enabled vLLM prefill-specific health check payload (#3126)

Request Management

  • Added request cancellation support for unary requests (#3004)
  • Enabled vLLM abort while engine generates next token (#3102)
  • Implemented router-level request rejection for better resource management

3. Tool Calling & Reasoning Enhancements

  • Enabled tool calling with stream=True support (#2932)
  • Added Deepseek V3.1 tool parser with library refactoring (#2832)
  • Implemented Granite class reasoning parser (#2936)
  • Enhanced GPT-OSS frontend with Harmony tool calling and reasoning parsers (#2999)
  • Added finish reason tool_calls for non-streaming responses (#3087)
  • Fixed null tools processing via minijinja (#3340)

4. Kubernetes & Deployment Improvements

Grove Integration

  • Updated to official Grove 0.1.0-alpha release (#3030)
  • Added planner manifest support for Grove (#3203)

Deployment Enhancements

  • Installed Dynamo operator cluster-wide by default (#3199)
  • Added multinode K8s examples for TensorRT-LLM and vLLM (#3100)
  • Enabled in-cluster performance benchmarks with kubectl one-liner (#3144)
  • Implemented namespace isolation for improved multi-tenancy (#2394, #2970)
  • Added virtual connector for 3rd party deployments (#2913)
  • Improved SGLang multinode handling in operator (#3151)

5. Observability & Metrics

  • Added HTTP queue metrics for NIM frontend request tracking (#2914)
  • Implemented NIM FE runtime config metrics with periodic polling (#3107)
  • Added metrics labels for multimodal workloads (#2835)
  • Implemented frontend disconnect metrics (#2953)
  • Unified component metric names to prevent Kubernetes label collisions (continued from v0.4.1)

6. Frontend & Model Support

  • Added support for serving multiple models from single endpoint (continued from v0.4.1)
  • Implemented --custom-jinja-template argument for custom chat templates (#2829)
  • Added chat_template_kwargs parameter to v1/chat/completion (#3016)
  • Enabled framework tokenization/detokenization (#3134)
  • Implemented ModelExpress Dynamo integration (#3191)
  • Added SLA Planner support for TensorRT-LLM (#2980) and SGLang MoE models (#3185)

7. Performance & Optimization

  • Refactored discovery ModelManager to use parking_lot::RwLock (#2902)
  • Ported vLLM port allocator to Rust bindings for improved performance (#3125)
  • Implemented JailedStream for better resource management (#3034)
  • Added generic tensor type for inference (#2746)
  • Updated benchmarking and deployment utilities (#2933, #2973, #3098)

8. Bug Fixes

  • Fixed OpenAI-compliant usage stats for streaming responses (#3022)
  • Resolved token loss bug in final packet (#2985)
  • Fixed aggregate logprobs calculation (#2928)
  • Corrected Harmony parser streaming behavior (#3074)
  • Fixed router slot manager force expire requests (#2840)
  • Resolved metrics collection and namespace sanitization issues (#2868)
  • Fixed polling from exhausted stream in preprocessor (#3349)
  • Addressed KVBM fully contiguous memory region size bug (#3175)

Documentation

  • Revamped Kubernetes documentation (#3173)
  • Created deployment and benchmarking recipes for Llama3-70B and GPT-OSS-120B (#2792)
  • Added AWS ECS deployment example for Dynamo vLLM (#2415, #3381)
  • Published Python runtime request cancellation examples (#2893)
  • Added health check and structured logs documentation (#2805)
  • Created mermaid diagrams showcasing KV router features (#3184)
  • Updated consistent hashing documentation for KV events (#2981)
  • Published profiling-related documentation updates (#2816)
  • Fixed broken links and Sphinx structural errors (#3186, #3342)

Build, CI, and Test

  • Restructured TensorRT-LLM and SGLang to follow container strategy structure (#3009, #2803)
  • Moved to ARC runners for CI (#2904)
  • Added SGLang functional tests (#2943)
  • Implemented fault injection tests for Kubernetes (#3194)
  • Added concurrency checks to auto-cancel running actions (#2438)
  • Created broken links checker (#2927)
  • Converted vLLM multimodal examples to pytest framework (continued from v0.4.1)
  • Updated TensorRT-LLM to v1.1.0rc5 (#3119)

Migration Notes

  • Component metric names continue to use the dynamo_component_* pattern. Ensure dashboards and alerting rules are updated accordingly.
  • The Dynamo operator now installs cluster-wide by default. If namespace-scoped installation is required, use the appropriate Helm values.
  • TensorRT-LLM has been updated to v1.1.0rc5, which includes KVBM integration changes. Review the updated connector API if using custom integrations.
  • The Multinode Multimodal Guide works only with release v0.5.0. Users requiring multinode multimodal functionality should continue using v0.5.0 until support is restored in a future release.
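
To help with the dashboard update called out above, here is a minimal sketch that lists metric names exposed by a Prometheus server and keeps only those under the dynamo_component_* prefix. The Prometheus address is an assumption; the HTTP API endpoint used is standard Prometheus.

```python
# Minimal sketch: list dynamo_component_* metric names from a Prometheus server
# (the server URL is an assumption; adjust to your monitoring stack).
import requests

PROM_URL = "http://localhost:9090"  # hypothetical Prometheus address
names = requests.get(f"{PROM_URL}/api/v1/label/__name__/values", timeout=10).json()["data"]
dynamo_metrics = [n for n in names if n.startswith("dynamo_component_")]
print("\n".join(sorted(dynamo_metrics)))
```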

Looking Forward
This release strengthens Dynamo's production readiness with advanced KV routing, comprehensive health monitoring, and robust request management. The enhanced Kubernetes integration and multinode support enable seamless scaling for enterprise deployments. With improved observability and the new prefill router, teams can now optimize both throughput and latency for diverse workload patterns. These capabilities set the stage for even more sophisticated routing strategies and performance optimizations in future releases!

Release Assets
Python Wheels:

Containers:

  • TensorRT-LLM Runtime: nvcr.io/nvidia/ai-dynamo/tensorrtllm-runtime:0.5.1 (NGC)
  • vLLM Runtime: nvcr.io/nvidia/ai-dynamo/vllm-runtime:0.5.1 (NGC)
  • SGLang Runtime: nvcr.io/nvidia/ai-dynamo/sglang-runtime:0.5.1 (NGC)
  • Dynamo Kubernetes Operator: nvcr.io/nvidia/ai-dynamo/kubernetes-operator:0.5.1 (NGC)

Helm Charts:

Contributors
We welcome new contributors in this release: @blarson-b10, @lixuwei2333, @GavinZhu-GMI, @nv-hwoo
Full Changelog: v0.5.0...v0.5.1

Dynamo Release v0.5.0

18 Sep 22:47
65f12d7


Dynamo 0.5.0 Release Notes

Dynamo is a high-performance, low-latency inference framework designed to serve generative AI models—across any framework, architecture, or deployment scale. It's an open-source project under the Apache 2.0 license. Dynamo is available for installation via pip wheels and containers from NVIDIA NGC.

Dynamo supports multiple large language model (LLM) inference engines (see the Support Matrix for details):

  • NVIDIA TensorRT-LLM
  • vLLM
  • SGLang

Release Highlights

This release introduces TRT-LLM integration for KV cache management and adds KServe gRPC support and tool-calling capabilities. We also delivered major improvements to system reliability, with request cancellation features and improved observability.


Major Features and Improvements

1. Fault Tolerance & Observability

  • Implemented end-to-end request cancellation (#2158, #2500) with Python context propagation; a conceptual sketch follows this list
  • Implemented DRT shutdown on vLLM engine failures (#2698)
  • Added fast-fail validation for NATS JetStream requirements to prevent silent failures (#2590)
  • Unified metrics across all components with model labels for vLLM (#2474), TensorRT-LLM (#2666), and SGLang (#2679)
  • Standardized Prometheus metrics naming and sanitization with KvStats integration (#2733, #2704)
  • Added automatic uptime tracking and auto-start of metrics collection upon NATS service creation (#2587, #2664), improving observability readiness
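
For readers unfamiliar with the cancellation pattern, here is a minimal, framework-agnostic asyncio sketch of how a client disconnect can propagate and cancel in-flight generation work. It illustrates the concept only; the names below are illustrative and this is not Dynamo's actual cancellation API.

```python
# Conceptual sketch of end-to-end request cancellation with asyncio
# (not Dynamo's API; names here are illustrative).
import asyncio

async def generate_tokens(prompt: str) -> str:
    try:
        await asyncio.sleep(10)  # stand-in for a long-running generation call
        return f"response to {prompt!r}"
    except asyncio.CancelledError:
        print("generation cancelled, freeing backend resources")
        raise

async def main() -> None:
    task = asyncio.create_task(generate_tokens("hello"))
    await asyncio.sleep(0.1)   # simulate the client disconnecting early
    task.cancel()              # cancellation propagates into generate_tokens
    try:
        await task
    except asyncio.CancelledError:
        print("request fully cancelled")

asyncio.run(main())
```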

2. Kubernetes Deployments

  • Integrated Grove and KAI scheduler into Dynamo Cloud Helm chart for multi-node deployments (#2755)
  • Implemented auto-injection of kai-scheduler annotations and labels with parent DGD Kubernetes name support (#2748, #2774)
  • Deployed an EPP-aware Dynamo gateway that prevents double tokenization for optimized routing (#2633, #2559)
  • Integrated Model Express client for optimized model downloads with URL injection support (#2574, #2769)

3. KV Cache Management & Transfer

  • Integrated Dynamo KVBM connector API with TensorRT-LLM for G2-G3 offloading and onboarding (#2544)
  • Added support for user selection among multiple KV transfer connectors (nixl, kvbm, lmcache) (#2517)
  • Added detailed KV Block Manager metrics for match, offload, and onboard operations (#2626, #2673)

4. Planning & Routing

Router

  • Separated the frontend and the Router via Python bindings for KvPushRouter, so the two can be scaled independently (#2658, #2548)
  • Implemented warm restarts via durable KV event consumers and radix snapshotting for router persistence (#2756, #2740, #2800)

Planner

  • Added comprehensive tests for replica calculation and planner scaling with automated Kubernetes deployment validation (#2525)
  • Added SLA planner dry-run mode with a CLI to simulate workloads, generate plots, and expose optional Prometheus metrics (#2557)

5. Others

Tool Calling

  • Introduced parsers library (#2542) supporting multiple reasoning and tool-calling formats.
  • Implemented multiple tool-calling parsers, including Pythonic (#2788), Harmony (#2796), and JSON-based parsers with normal text parsing alongside tool calls (#2709)
  • Added support for separating reasoning from visible text (#2555) along with GPT-OSS reasoning parser integration (#2656)
  • Added support for custom logits processors in the TensorRT-LLM backend, enabling in-place logits modification during generation (#2613, #2702)
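
To illustrate what in-place logits modification looks like in general, here is a hedged, engine-agnostic sketch of a logits processor that bans one token id by setting its logit to -inf. The exact callback signature TensorRT-LLM expects may differ; treat the interface below as an assumption.

```python
# Conceptual logits-processor sketch (engine-agnostic; the signature is an
# assumption, not the exact TensorRT-LLM callback interface).
import numpy as np

BANNED_TOKEN_ID = 128001  # hypothetical token id to suppress

def ban_token_processor(token_ids: list[int], logits: np.ndarray) -> np.ndarray:
    """Modify logits in place so the banned token can never be sampled."""
    logits[BANNED_TOKEN_ID] = float("-inf")
    return logits

# Tiny usage example over a fake vocabulary of 130,000 logits.
logits = np.zeros(130_000, dtype=np.float32)
ban_token_processor([1, 2, 3], logits)
assert logits[BANNED_TOKEN_ID] == float("-inf")
```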

Multimodal Support Expansion

  • Added complete multimodal deployment examples for Llava and Qwen, with video support using vLLM v1 (#2628, #2694, #2738)
  • Added Encode Worker and NIXL support for TensorRT-LLM multimodal disaggregated flows (#2452)

Infrastructure & Performance

  • Added comprehensive KServe gRPC support for industry-standard model inference protocol (#2638)
  • Enhanced Hugging Face integration with HF_HOME and HF_ENDPOINT environment variable support (#2642, #2637)
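
The variables mentioned above are the standard ones honored by huggingface_hub; a minimal sketch follows, assuming a cache path and mirror URL of your choosing (both are placeholders).

```python
# Minimal sketch: point model downloads at a custom cache and HF endpoint.
# HF_HOME and HF_ENDPOINT are standard Hugging Face variables; the path and
# URL below are placeholders.
import os

os.environ["HF_HOME"] = "/models/hf-cache"               # where models are cached
os.environ["HF_ENDPOINT"] = "https://hf-mirror.example"  # alternative HF endpoint

# Set these before importing huggingface_hub or starting the worker so that
# downloads pick them up.
```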

Developer Experience

  • Added Devcontainer improvements with enhanced documentation and SGLang-specific setup (#2255, #2578, #2741)
  • Added logging setup for Kubernetes with Loki integration and Grafana dashboards (#2699)
  • Added benchmarking guide with GenAI-Perf integration and automated performance comparison (#2620)
  • Updated TensorRT-LLM to 1.0.0rc6 and simplified Eagle model configuration (#2606, #2661)

Bug Fixes

  • Improved Hugging Face download speeds with better API client configuration (#2566)
  • Added missing Prometheus to runtime images for SGLang and general runtime (#2565, #2689)
  • Fixed kv-event-config command line respect and environment variable overrides (#2627, #2640)
  • Enhanced pytest robustness and fixed parsing errors with proper timeout handling (#2676, #2572)
  • Resolved metrics registration timing issues and prevented early returns from affecting measurements (#2664, #2576)

Documentation

  • Created SNS aggregated Kubernetes example and simplified sphinx build process (#2773, #2519)
  • Streamlined cloud installation documentation and deployment guides (#2818)
  • Updated benchmarking framework documentation with comprehensive deployment guides (#2620)
  • Updated supported models documentation for multimodal and SGLang container build instructions (#2651, #2707)

Build, CI, and Test

  • Added replica calculation and planner scaling tests with automated Kubernetes deployment validation (#2525)
  • Added vLLM sanity testing support on GitHub Actions with build optimizations (#2526)
  • Optimized CI job execution for docs-only changes and Rust-specific changes (#2775)
  • Enabled KVBM in vLLM container with improved virtual environment handling (#2763)
  • Enhanced test reliability with proper KVBM test exclusions and determinism testing (#2611)
  • Fixed concurrency settings to prevent main branch run cancellations (#2780)
  • Improved container build process with default dev builds for vLLM (#2837)

Migration Notes

  • Parser Integration: New parsing capabilities require updated CLI flags for reasoning and tool calling features
  • Container Updates: Runtime images now include Prometheus by default; review monitoring configurations

Looking Forward

This release sets the stage for more features on our H2 roadmap, including KVBM and end-to-end performance benchmarking, and improved fault tolerance and request rejection at every level. We will focus on significantly updating documentation and examples for a better experience, and we will include Kubernetes benchmark scripts for the most popular models.

Release Assets

Python Wheels:

Rust Crates:

Containers:

Helm Charts:


Contributors

We welcome new contributors in this release:
@jasonqinzhou, @michaelfeil, @ahinsutime, @bhuvan002, @WaelBKZ, @hhk7734, @Michaelgathara, @KavinKrishnan, @michaelshin

Full Changelog: v0.4.1...v0.5.0