
[TRTLLM-9778][feat] Implement python based kv_cache_v2 scheduler#11307

Open
peaceh-nv wants to merge 1 commit into NVIDIA:main from peaceh-nv:peaceh-max-util

Conversation


@peaceh-nv peaceh-nv commented Feb 5, 2026

Implement a Python-based max-utilization scheduler based on KV cache v2:

  • Move KVCacheV2DummyScheduler and KVCacheV2MaxUtilizationScheduler to SimpleUnifiedScheduler.
  • The KV cache manager V2 based max-util scheduler needs to run prepare_resources inside schedule_request, so no separate prepare_resources call is needed afterwards (see the sketch after this list).
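A minimal sketch of that handshake, assuming a hypothetical try_allocate_blocks call on the manager (the real allocation path in KVCacheManagerV2 is more involved):

def schedule_request(self, active_requests):
    """Max-util scheduling sketch: allocate KV blocks while picking the batch."""
    scheduled = []
    for request in active_requests:
        # try_allocate_blocks is a hypothetical stand-in for the manager's
        # real allocation/eviction logic.
        if self.kv_cache_manager.try_allocate_blocks(request):
            scheduled.append(request)
    # Mark resources as prepared so the later prepare_resources call is a no-op.
    self.kv_cache_manager._scheduler_prepared_resources = True
    return scheduled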

…on kv cache v2

Signed-off-by: Peace He <peaceh@nvl72162-T06.cm.cluster>
@peaceh-nv peaceh-nv requested review from a team as code owners February 5, 2026 07:56
@peaceh-nv peaceh-nv requested a review from joyang-nv February 5, 2026 07:56
@peaceh-nv peaceh-nv changed the title TRTLLM-9778 : Implement python based kv cache v2 max_util scheduler [TRTLLM-9778][feat] : Implement python based kv cache v2 max_util scheduler Feb 5, 2026

coderabbitai bot commented Feb 5, 2026

📝 Walkthrough

The pull request introduces KVCacheV2-based scheduling enhancements to the PyTorch executor system. Changes include: simplifying scheduler selection logic in the utility module, adding scheduler-aware resource preparation tracking to cache managers, introducing KVCacheV2MaxUtilizationScheduler with capacity management logic, and adding corresponding policy wrappers for integration. A new test validates scheduling consistency across policies.

Changes

Cohort / File(s) / Summary
Scheduler Selection and Simplification
tensorrt_llm/_torch/pyexecutor/_util.py
Removed KVCacheV2DummyScheduler handling; simplified scheduler selection to favor SimpleUnifiedScheduler when KVCacheV2 is present or the Python scheduler is enabled, with a straightforward BindCapacityScheduler fallback.
Resource Preparation and Scheduling Coordination
tensorrt_llm/_torch/pyexecutor/resource_manager.py
Added a _scheduler_prepared_resources internal flag to KVCacheManager and KVCacheManagerV2 to track per-round scheduler resource allocation. Introduced _prepare_resources_guaranteed_no_evict() helper methods to decouple scheduler-driven from manager-driven resource preparation (see the sketch after this listing).
KVCacheV2 Scheduler Classes and Policy Integration
tensorrt_llm/_torch/pyexecutor/scheduler.py
Added KVCacheV2MaxUtilizationScheduler for capacity-aware scheduling with eviction logic and block allocation. Introduced KVCacheV2DummyPolicy and KVCacheV2MaxUtilizationPolicy wrappers for policy-based scheduler wiring. Extended PyCapacityScheduler._create_policy() to instantiate KVCacheV2 policies when using KVCacheManagerV2. Added prepare_resources delegation in SimpleUnifiedScheduler and related classes.
KVCacheV2 Scheduling Test
tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py
New test module validating output consistency between GUARANTEED_NO_EVICT and MAX_UTILIZATION scheduling policies using KvCacheConfigV2 with deterministic sampling across multiple prompts.
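Putting the pieces above together, a minimal sketch of the intended handshake in resource_manager.py (method bodies are assumptions based on the summaries, not the actual implementation):

class KVCacheManagerV2:
    def __init__(self):
        # Set by the scheduler when it allocates blocks during scheduling.
        self._scheduler_prepared_resources = False

    def prepare_resources(self, scheduled_batch):
        if self._scheduler_prepared_resources:
            # Scheduler-driven round (e.g. MAX_UTILIZATION): blocks were
            # already allocated inside schedule_request; just reset the flag.
            self._scheduler_prepared_resources = False
            return
        # Manager-driven round (GUARANTEED_NO_EVICT): allocate here.
        self._prepare_resources_guaranteed_no_evict(scheduled_batch)

    def _prepare_resources_guaranteed_no_evict(self, scheduled_batch):
        ...  # allocate/adjust KV cache blocks for the batch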

Sequence Diagram

sequenceDiagram
    participant Client
    participant Scheduler as PyCapacityScheduler
    participant Policy as KVCacheV2Policy
    participant KVMgr as KVCacheManagerV2
    participant ResMgr as ResourceManager

    Client->>Scheduler: schedule_request(active_requests)
    Scheduler->>Scheduler: _create_policy() for KVCacheV2
    Scheduler->>Policy: schedule(scheduler, active_requests)
    Policy->>Scheduler: delegate scheduling logic
    Scheduler->>Scheduler: partition into scheduled/paused/generation
    Scheduler->>ResMgr: prepare_resources(context, generation)
    ResMgr->>ResMgr: check _scheduler_prepared_resources flag
    alt Scheduler Already Prepared
        ResMgr->>ResMgr: reset flag, skip allocation
    else Scheduler Did Not Prepare
        ResMgr->>KVMgr: _prepare_resources_guaranteed_no_evict()
        KVMgr->>KVMgr: allocate/adjust KV cache blocks
        KVMgr-->>ResMgr: updated request lists
    end
    ResMgr-->>Scheduler: prepared context/generation requests
    Policy->>Policy: apply prepare_resources if available
    Scheduler-->>Client: scheduled/paused/generation requests

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 1 passed | ❌ 2 failed

❌ Failed checks (2 warnings)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 51.85%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them.
  • Description check ⚠️ Warning — The PR description lacks required detail: it is missing an explanation of the issue/motivation and test coverage details, and has incomplete checklist items. Resolution: expand the Description section with clear motivation and a technical explanation, add specific test cases under Test Coverage, and complete the PR checklist.

✅ Passed checks (1 passed)
  • Title check — The title accurately summarizes the main change: implementing a Python-based KV cache V2 scheduler, reflected across _util.py, resource_manager.py, scheduler.py, and the new test.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

42-43: Prefer module-qualified imports for scheduler types.

Consider importing the scheduler module and referencing scheduler.SimpleScheduler / scheduler.SimpleUnifiedScheduler to keep the namespace intact. Per the coding guidelines: "Always maintain the namespace when importing Python modules, even if only one class or function from a module is used."
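For example (the import path is illustrative; _util.py could equally use a relative import):

# Instead of: from .scheduler import SimpleScheduler, SimpleUnifiedScheduler
from tensorrt_llm._torch.pyexecutor import scheduler

# ...then reference the types through the module namespace:
# scheduler.SimpleScheduler, scheduler.SimpleUnifiedScheduler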

tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py (2)

1-13: Prefer module-qualified imports in this test.

To keep the namespace intact, import the modules and reference types via the module (e.g., import tensorrt_llm as trtllm, import tensorrt_llm.llmapi as llmapi). Per the coding guidelines: "Always maintain the namespace when importing Python modules, even if only one class or function from a module is used."


36-66: Use Google-style docstrings for new helpers/tests.

create_llm and the test docstrings are plain prose; please switch to Google style (Args/Returns) so they are Sphinx-friendly. Per the coding guidelines: "Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx." See the example below.
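For instance, a Google-style docstring for create_llm might look like this (parameter names are assumptions):

def create_llm(scheduler_policy):
    """Create an LLM configured with KvCacheConfigV2 for the given policy.

    Args:
        scheduler_policy: Capacity scheduler policy to exercise, e.g.
            GUARANTEED_NO_EVICT or MAX_UTILIZATION.

    Returns:
        An LLM instance with deterministic sampling, suitable for comparing
        outputs across scheduling policies.
    """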

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1692-1711: Use Google-style docstrings for new resource-prep APIs.

The new prepare_resources helper docstrings are not in Google style; please update them to include Args/Returns for Sphinx parsing. Per the coding guidelines: "Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx."

tensorrt_llm/_torch/pyexecutor/scheduler.py (2)

6-16: Prefer module-qualified imports for KVCacheV2 helpers.

Consider importing the module and referencing types via the module namespace (e.g., kv_cache_manager_v2._KVCache) to avoid adding unqualified names. Per the coding guidelines: "Always maintain the namespace when importing Python modules, even if only one class or function from a module is used."


237-291: Use Google-style docstrings for KVCacheV2MaxUtilizationScheduler.

The new class/method docstrings are prose blocks; please switch to Google style (Args/Returns) for consistency and Sphinx parsing. Per the coding guidelines: "Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx."

Comment on lines +292 to +307
from .resource_manager import request_context

rm = self.kv_cache_manager
evicted_requests = []

# Create a ScheduledRequests object for context management
class ScheduledBatch:

    def __init__(self):
        self.context_requests = context_requests
        self.generation_requests = generation_requests

scheduled_batch = ScheduledBatch()

with request_context(rm.is_draft, scheduled_batch):
    new_generation_batch: RequestList = []
⚠️ Potential issue | 🟠 Major

Fix draft-mode crash: ScheduledBatch lacks all_requests.

request_context expects scheduled_requests.all_requests(). The local ScheduledBatch doesn’t implement it, so draft-mode scheduling will raise AttributeError. Use ScheduledRequests or add all_requests.

🐛 Proposed fix
-        # Create a ScheduledRequests object for context management
-        class ScheduledBatch:
-
-            def __init__(self):
-                self.context_requests = context_requests
-                self.generation_requests = generation_requests
-
-        scheduled_batch = ScheduledBatch()
+        # Create a ScheduledRequests object for context management
+        scheduled_batch = ScheduledRequests()
+        scheduled_batch.context_requests = context_requests
+        scheduled_batch.generation_requests = generation_requests

Comment on lines +118 to +121
for i, (no_evict, max_util) in enumerate(zip(texts_no_evict, texts_max_util)):
    assert no_evict == max_util, (
        f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
    )

⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Scripts executed:

# Locate the test file
find . -name "test_kv_cache_v2_scheduler.py" -type f 2>/dev/null | head -20

# Examine the file around lines 118-121
cat -n ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py | sed -n '100,140p'

# Check the beginning of the file for a copyright header
head -20 ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py

# Get more context around the comparison code
cat -n ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py | sed -n '80,125p'

# Check another test file for the copyright header pattern
head -30 ./tests/unittest/_torch/executor/test_overlap_scheduler.py 2>/dev/null | head -30

# Look for similar zip() patterns in the file
grep -n "zip(" ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py


Add explicit length check before comparing outputs to prevent silent truncation by zip().

Both llm.generate() calls receive identical inputs (prompts and sampling_config), so they should produce the same number of outputs. However, zip() silently truncates on length mismatch, which could hide bugs if one call returns fewer outputs. Add an explicit assertion to catch this.

Suggested fix
+    assert len(texts_no_evict) == len(texts_max_util), (
+        f"Output length mismatch: {len(texts_no_evict)} vs {len(texts_max_util)}"
+    )
     for i, (no_evict, max_util) in enumerate(zip(texts_no_evict, texts_max_util)):
         assert no_evict == max_util, (
             f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
         )
🧰 Tools
🪛 Ruff (0.14.14)

[warning] 118-118: zip() without an explicit strict= parameter — add an explicit value for strict= (B905)
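Alternatively, zip(..., strict=True) (Python 3.10+) resolves B905 and makes a length mismatch raise ValueError, which also covers the missing length check:

for i, (no_evict, max_util) in enumerate(
        zip(texts_no_evict, texts_max_util, strict=True)):
    assert no_evict == max_util, (
        f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
    )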


@peaceh-nv peaceh-nv changed the title [TRTLLM-9778][feat] : Implement python based kv cache v2 max_util scheduler [TRTLLM-9778][feat] Implement python based kv cache v2 max_util scheduler Feb 5, 2026
@peaceh-nv peaceh-nv changed the title [TRTLLM-9778][feat] Implement python based kv cache v2 max_util scheduler [TRTLLM-9778][feat] Implement python based kv_cache_v2 scheduler Feb 5, 2026
@peaceh-nv peaceh-nv changed the title [TRTLLM-9778][feat] Implement python based kv_cache_v2 scheduler [TRTLLM-9778][feat] Implement python based kv_cache_v2 scheduler Feb 5, 2026
if is_kv_cache_v2:
    # For KVCacheManagerV2, use specialized policies
    if self.scheduler_policy == CapacitySchedulerPolicy.GUARANTEED_NO_EVICT:
        return KVCacheV2DummyPolicy()
Member comment:

We can remove the KVCacheV2DummyPolicy completely as it will fail under certain cases.

Member comment:

We can leave it like this for now, for MTP support, to avoid merge conflicts. Leave a comment here as a notice.

else:
    scheduled_requests.append(request)

return scheduled_requests, scheduled_disagg_gen_init_requests, []
Member comment:

For max_util, we may pause requests. Why, then, is the paused request list empty here?
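If pausing were wired through, the return might look like this sketch, where paused_requests is a hypothetical local collecting the requests the max-util policy backed off from:

# paused_requests (hypothetical) would be filled during eviction/back-off,
# then surfaced to the executor instead of a hard-coded empty list.
return scheduled_requests, scheduled_disagg_gen_init_requests, paused_requests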


return scheduled_requests, scheduled_disagg_gen_init_requests, []

def prepare_resources(self, context_requests: RequestList,
Member comment:

The overall design exposes many details of KV cache allocation that should stay hidden inside the KV cache manager (e.g., direct access to internal variables like kv_cache_map). The ideal solution would be an API like prepare_blocks_if_schedulable that hides the resource-allocation details and only tells the scheduler whether there are enough resources to schedule the request.

If designing that API and the overall structure will take a long time, we can do it in a follow-up PR.
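A rough sketch of the kind of API suggested above (the name prepare_blocks_if_schedulable comes from the comment; the signature and semantics are assumptions):

class KVCacheManagerV2:
    def prepare_blocks_if_schedulable(self, request) -> bool:
        """Reserve the KV blocks this request needs for its next step.

        Returns:
            True if blocks were reserved and the request can be scheduled,
            False if enough capacity could not be freed.
        """
        # kv_cache_map lookups, eviction decisions, and block allocation
        # stay encapsulated here instead of leaking into the scheduler.
        ...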

For other policies (GUARANTEED_NO_EVICT), we allocate resources here.
"""
# Check if the scheduler already prepared resources this round
if self._scheduler_prepared_resources:
Member comment:

The overall design of KV cache manager v2 allocates resources in the scheduling stage. We can delete the resource preparation here and keep only an assertion.
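In sketch form (the assertion message is illustrative):

def prepare_resources(self, scheduled_batch):
    # KVCacheManagerV2 allocates blocks during scheduling, so by the time
    # this runs there must be nothing left to prepare.
    assert self._scheduler_prepared_resources, (
        "KVCacheManagerV2 expects the scheduler to allocate KV blocks "
        "inside schedule_request")
    self._scheduler_prepared_resources = False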

evicted_requests = []

# Create a ScheduledRequests object for context management
class ScheduledBatch:
Member comment:

Can this be a module-level class rather than a local class here?
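A module-level version might look like this; it would also resolve the missing all_requests flagged earlier (sketch only):

class ScheduledBatch:
    """Module-level container mirroring ScheduledRequests for request_context."""

    def __init__(self, context_requests, generation_requests):
        self.context_requests = context_requests
        self.generation_requests = generation_requests

    def all_requests(self):
        # request_context iterates over every request in the batch.
        return self.context_requests + self.generation_requests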
