[TRTLLM-9778][feat] Implement python based kv_cache_v2 scheduler#11307
peaceh-nv wants to merge 1 commit into NVIDIA:main
Conversation
Commit: Implement python based max utilization scheduler based on kv cache v2
Signed-off-by: Peace He <peaceh@nvl72162-T06.cm.cluster>
📝 Walkthrough

The pull request introduces KVCacheV2-based scheduling enhancements to the PyTorch executor system. Changes include: simplifying scheduler selection logic in the utility module, adding scheduler-aware resource preparation tracking to cache managers, introducing KVCacheV2MaxUtilizationScheduler with capacity management logic, and adding corresponding policy wrappers for integration. A new test validates scheduling consistency across policies.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Scheduler as PyCapacityScheduler
    participant Policy as KVCacheV2Policy
    participant KVMgr as KVCacheManagerV2
    participant ResMgr as ResourceManager
    Client->>Scheduler: schedule_request(active_requests)
    Scheduler->>Scheduler: _create_policy() for KVCacheV2
    Scheduler->>Policy: schedule(scheduler, active_requests)
    Policy->>Scheduler: delegate scheduling logic
    Scheduler->>Scheduler: partition into scheduled/paused/generation
    Scheduler->>ResMgr: prepare_resources(context, generation)
    ResMgr->>ResMgr: check _scheduler_prepared_resources flag
    alt Scheduler Already Prepared
        ResMgr->>ResMgr: reset flag, skip allocation
    else Scheduler Did Not Prepare
        ResMgr->>KVMgr: _prepare_resources_guaranteed_no_evict()
        KVMgr->>KVMgr: allocate/adjust KV cache blocks
        KVMgr-->>ResMgr: updated request lists
    end
    ResMgr-->>Scheduler: prepared context/generation requests
    Policy->>Policy: apply prepare_resources if available
    Scheduler-->>Client: scheduled/paused/generation requests
```
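The flag handshake above can be sketched in a few lines of Python. This is a minimal illustration assuming the attribute and method names shown in the diagram (`_scheduler_prepared_resources`, `_prepare_resources_guaranteed_no_evict`); the actual signatures in the PR may differ:

```python
# Minimal sketch of the prepare_resources handshake from the diagram above.
# All names are taken from the walkthrough/diagram and are assumptions here.
class ResourceManagerSketch:

    def __init__(self, kv_cache_manager):
        self.kv_cache_manager = kv_cache_manager
        # Set by the scheduler when it already allocated blocks while scheduling.
        self._scheduler_prepared_resources = False

    def prepare_resources(self, context_requests, generation_requests):
        if self._scheduler_prepared_resources:
            # Max-utilization path: scheduler already allocated; reset and skip.
            self._scheduler_prepared_resources = False
            return context_requests, generation_requests
        # Default path (GUARANTEED_NO_EVICT): allocate/adjust KV cache blocks now.
        return self.kv_cache_manager._prepare_resources_guaranteed_no_evict(
            context_requests, generation_requests)
```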
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/scheduler.py`:
- Around line 292-307: The custom inner class ScheduledBatch used when calling
request_context(rm.is_draft, scheduled_batch) is missing the all_requests method
expected by request_context, causing an AttributeError in draft mode; fix by
either instantiating/using the existing ScheduledRequests type (which implements
all_requests) instead of ScheduledBatch, or add an all_requests(self) method to
ScheduledBatch that returns the combined list of context_requests +
generation_requests (or the same structure ScheduledRequests.all_requests
returns), and ensure scheduled_batch refers to that implementation before
calling request_context.
In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py`:
- Around line 118-121: Add an explicit length equality assertion before zipping
the results to prevent silent truncation by zip(): check that
len(texts_no_evict) == len(texts_max_util) (the outputs returned by the two
llm.generate calls) and raise/fail the test if they differ, then keep the
existing for i, (no_evict, max_util) in enumerate(zip(...)) loop to compare
elements; reference the variables texts_no_evict, texts_max_util and the
zip-based comparison loop when making the change.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

42-43: Prefer module-qualified imports for scheduler types. Consider importing the scheduler module and referencing scheduler.SimpleScheduler / scheduler.SimpleUnifiedScheduler to keep the namespace intact. As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py (2)

1-13: Prefer module-qualified imports in this test. To keep the namespace intact, import the modules and reference types via the module (e.g., import tensorrt_llm as trtllm, import tensorrt_llm.llmapi as llmapi). As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

36-66: Use Google-style docstrings for new helpers/tests. create_llm and the test docstrings are plain prose; please switch to Google-style (Args/Returns) so they're Sphinx-friendly. As per coding guidelines, Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1692-1711: Use Google-style docstrings for new resource-prep APIs. The new prepare_resources helper docstrings are not in Google style; please update them to include Args/Returns for Sphinx parsing. As per coding guidelines, Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.

tensorrt_llm/_torch/pyexecutor/scheduler.py (2)

6-16: Prefer module-qualified imports for KVCacheV2 helpers. Consider importing the module and referencing types via the module namespace (e.g., kv_cache_manager_v2._KVCache) to avoid adding unqualified names. As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

237-291: Use Google-style docstrings for KVCacheV2MaxUtilizationScheduler. The new class/method docstrings are prose blocks; please switch to Google-style (Args/Returns) for consistency and Sphinx parsing. As per coding guidelines, Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx. A sketch of the requested style follows this list.
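For reference, a Google-style docstring of the kind these nitpicks request might look like the sketch below; the helper name create_llm comes from the test, while the parameters and return description are illustrative assumptions:

```python
def create_llm(model_dir, scheduler_policy):
    """Build an LLM configured with the given capacity scheduler policy.

    Args:
        model_dir: Path to the model checkpoint directory.
        scheduler_policy: Capacity scheduler policy to use, e.g.
            GUARANTEED_NO_EVICT or MAX_UTILIZATION.

    Returns:
        A configured LLM instance ready for generate() calls.
    """
    ...  # construction details omitted in this sketch
```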
```python
from .resource_manager import request_context

rm = self.kv_cache_manager
evicted_requests = []

# Create a ScheduledRequests object for context management
class ScheduledBatch:

    def __init__(self):
        self.context_requests = context_requests
        self.generation_requests = generation_requests

scheduled_batch = ScheduledBatch()

with request_context(rm.is_draft, scheduled_batch):
    new_generation_batch: RequestList = []
```
Fix draft-mode crash: ScheduledBatch lacks all_requests.
request_context expects scheduled_requests.all_requests(). The local ScheduledBatch doesn’t implement it, so draft-mode scheduling will raise AttributeError. Use ScheduledRequests or add all_requests.
🐛 Proposed fix

```diff
-        # Create a ScheduledRequests object for context management
-        class ScheduledBatch:
-
-            def __init__(self):
-                self.context_requests = context_requests
-                self.generation_requests = generation_requests
-
-        scheduled_batch = ScheduledBatch()
+        # Create a ScheduledRequests object for context management
+        scheduled_batch = ScheduledRequests()
+        scheduled_batch.context_requests = context_requests
+        scheduled_batch.generation_requests = generation_requests
```

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/pyexecutor/scheduler.py` around lines 292-307, the
custom inner class ScheduledBatch used when calling request_context(rm.is_draft,
scheduled_batch) is missing the all_requests method expected by request_context,
causing an AttributeError in draft mode; fix by either instantiating/using the
existing ScheduledRequests type (which implements all_requests) instead of
ScheduledBatch, or add an all_requests(self) method to ScheduledBatch that
returns the combined list of context_requests + generation_requests (or the same
structure ScheduledRequests.all_requests returns), and ensure scheduled_batch
refers to that implementation before calling request_context.
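The second option the prompt describes (keeping a local batch class but adding all_requests) could look like this sketch; it assumes all_requests should return the concatenation of the two request lists, matching what the comment says ScheduledRequests provides:

```python
# Sketch of the alternative fix: give the local batch class the
# all_requests() method that request_context expects. The concatenation
# below assumes ScheduledRequests.all_requests behaves the same way.
class ScheduledBatch:

    def __init__(self, context_requests, generation_requests):
        self.context_requests = context_requests
        self.generation_requests = generation_requests

    def all_requests(self):
        # Combined view over both request lists, as used in draft mode.
        return self.context_requests + self.generation_requests
```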
```python
for i, (no_evict, max_util) in enumerate(zip(texts_no_evict, texts_max_util)):
    assert no_evict == max_util, (
        f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
    )
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, locate and examine the file
cd /repo && find . -name "test_kv_cache_v2_scheduler.py" -type f 2>/dev/null | head -20
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 118.

🏁 Script executed:

```shell
# Find the file in the current repository
find . -name "test_kv_cache_v2_scheduler.py" -type f 2>/dev/null | head -20
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 125.

🏁 Script executed:

```shell
# Examine the file around lines 118-121
cat -n ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py | sed -n '100,140p'
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 1215.

🏁 Script executed:

```shell
# Check the beginning of the file for copyright header
head -20 ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 530.

🏁 Script executed:

```shell
# Get more context around the comparison code
cat -n ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py | sed -n '80,125p'
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 2009.

🏁 Script executed:

```shell
# Check another test file to see the copyright header pattern
head -30 ./tests/unittest/_torch/executor/test_overlap_scheduler.py 2>/dev/null | head -30
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 973.

🏁 Script executed:

```shell
# Look for similar zip() patterns in the file to see if this is a broader pattern
grep -n "zip(" ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 149.
Add explicit length check before comparing outputs to prevent silent truncation by zip().
Both llm.generate() calls receive identical inputs (prompts and sampling_config), so they should produce the same number of outputs. However, zip() silently truncates on length mismatch, which could hide bugs if one call returns fewer outputs. Add an explicit assertion to catch this.
Suggested fix

```diff
+    assert len(texts_no_evict) == len(texts_max_util), (
+        f"Output length mismatch: {len(texts_no_evict)} vs {len(texts_max_util)}"
+    )
     for i, (no_evict, max_util) in enumerate(zip(texts_no_evict, texts_max_util)):
         assert no_evict == max_util, (
             f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
         )
```

🧰 Tools

🪛 Ruff (0.14.14)
[warning] 118-118: zip() without an explicit strict= parameter. Add explicit value for parameter strict= (B905)
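As the Ruff B905 hint suggests, on Python 3.10+ the length check can instead be folded into the loop with strict=True, which raises ValueError on a length mismatch rather than truncating silently; a sketch:

```python
# Alternative to the explicit length assertion (requires Python 3.10+):
# zip(..., strict=True) raises ValueError if the lists differ in length.
for i, (no_evict, max_util) in enumerate(
        zip(texts_no_evict, texts_max_util, strict=True)):
    assert no_evict == max_util, (
        f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
    )
```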
🤖 Prompt for AI Agents
In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` around lines
118-121, add an explicit length equality assertion before zipping the results
to prevent silent truncation by zip(): check that len(texts_no_evict) ==
len(texts_max_util) (the outputs returned by the two llm.generate calls) and
raise/fail the test if they differ, then keep the existing for i, (no_evict,
max_util) in enumerate(zip(...)) loop to compare elements; reference the
variables texts_no_evict, texts_max_util and the zip-based comparison loop when
making the change.
```python
if is_kv_cache_v2:
    # For KVCacheManagerV2, use specialized policies
    if self.scheduler_policy == CapacitySchedulerPolicy.GUARANTEED_NO_EVICT:
        return KVCacheV2DummyPolicy()
```
We can remove the KVCacheV2DummyPolicy completely as it will fail under certain cases.
We can leave it like this for now for MTP support to avoid the merge conflict issue. Leave a comment here as a notice.
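The notice could be as simple as a comment next to the early return; the wording below is illustrative, not from the PR:

```python
if self.scheduler_policy == CapacitySchedulerPolicy.GUARANTEED_NO_EVICT:
    # NOTE: KVCacheV2DummyPolicy is known to fail in some cases; it is kept
    # for now only to avoid merge conflicts with the upcoming MTP support
    # and should be removed once that lands.
    return KVCacheV2DummyPolicy()
```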
```python
else:
    scheduled_requests.append(request)

return scheduled_requests, scheduled_disagg_gen_init_requests, []
```
For max_util, we may pause requests. However, why is the paused request list empty here?
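If pausing were wired through, the third element of the return could carry the paused requests; a hypothetical sketch (paused_requests and the capacity check are assumptions, not code from the diff):

```python
# Hypothetical shape for a max-util scheduler that actually reports pauses.
def schedule_max_util(candidates, has_capacity):
    scheduled_requests, paused_requests = [], []
    for request in candidates:
        if has_capacity(request):
            scheduled_requests.append(request)
        else:
            # Requests pushed out to free capacity would be reported here
            # instead of being dropped via an always-empty list.
            paused_requests.append(request)
    return scheduled_requests, [], paused_requests
```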
```python
return scheduled_requests, scheduled_disagg_gen_init_requests, []

def prepare_resources(self, context_requests: RequestList,
```
The overall design exposes a lot of the details of allocating KV cache that should be hidden inside the kv cache manager (e.g., direct access to internal variables like kv_cache_map). The ideal solution would be an API like prepare_blocks_if_schedulable that hides the resource allocation details and only tells the scheduler whether we have enough resources to schedule this request or not.
If this will take a long time to design the API and overall structure, we can do it in a following PR.
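A hypothetical shape for such an API, purely as a sketch: the method name prepare_blocks_if_schedulable comes from the comment above, everything else is assumed:

```python
# Hypothetical encapsulated API: the scheduler learns only yes/no, while
# block allocation and eviction details stay inside the kv cache manager.
class KVCacheManagerV2Facade:

    def prepare_blocks_if_schedulable(self, request) -> bool:
        """Try to reserve KV cache blocks for the request.

        Returns:
            True if blocks were reserved and the request can be scheduled;
            False if capacity is insufficient. No internal state (such as
            kv_cache_map) is exposed to the scheduler either way.
        """
        raise NotImplementedError  # allocation details hidden in this sketch
```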
```python
    For other policies (GUARANTEED_NO_EVICT), we allocate resources here.
    """
    # Check if the scheduler already prepared resources this round
    if self._scheduler_prepared_resources:
```
The overall design of kv cache manager v2 allocates resources in the scheduling stage. We can delete the prepare-resources path here and only keep the assertion.
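Under that design the branch would shrink to an assertion; a sketch of what the comment seems to suggest (shown as a standalone method; names are taken from the excerpt above, not the actual implementation):

```python
def prepare_resources(self, scheduled_batch):
    # KV cache manager v2 allocates during scheduling, so by the time this
    # runs the scheduler must already have prepared resources.
    assert self._scheduler_prepared_resources, (
        "scheduler is expected to allocate KV cache blocks during scheduling")
    self._scheduler_prepared_resources = False
```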
```python
evicted_requests = []

# Create a ScheduledRequests object for context management
class ScheduledBatch:
```
Can it be a module-level class instead of a local class here?
Implement python based max utilization scheduler based on kv cache v2