[TRTLLM-9778][feat] Implement python based kv_cache_v2 scheduler#11307
peaceh-nv wants to merge 1 commit into NVIDIA:main
Conversation
Commit: Implement python based max utilization scheduler based on kv cache v2
Signed-off-by: Peace He <peaceh@nvl72162-T06.cm.cluster>
📝 Walkthrough

The pull request introduces KVCacheV2-based scheduling enhancements to the PyTorch executor system. Changes include: simplifying scheduler selection logic in the utility module, adding scheduler-aware resource preparation tracking to cache managers, introducing KVCacheV2MaxUtilizationScheduler with capacity management logic, and adding corresponding policy wrappers for integration. A new test validates scheduling consistency across policies.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Client
    participant Scheduler as PyCapacityScheduler
    participant Policy as KVCacheV2Policy
    participant KVMgr as KVCacheManagerV2
    participant ResMgr as ResourceManager
    Client->>Scheduler: schedule_request(active_requests)
    Scheduler->>Scheduler: _create_policy() for KVCacheV2
    Scheduler->>Policy: schedule(scheduler, active_requests)
    Policy->>Scheduler: delegate scheduling logic
    Scheduler->>Scheduler: partition into scheduled/paused/generation
    Scheduler->>ResMgr: prepare_resources(context, generation)
    ResMgr->>ResMgr: check _scheduler_prepared_resources flag
    alt Scheduler Already Prepared
        ResMgr->>ResMgr: reset flag, skip allocation
    else Scheduler Did Not Prepare
        ResMgr->>KVMgr: _prepare_resources_guaranteed_no_evict()
        KVMgr->>KVMgr: allocate/adjust KV cache blocks
        KVMgr-->>ResMgr: updated request lists
    end
    ResMgr-->>Scheduler: prepared context/generation requests
    Policy->>Policy: apply prepare_resources if available
    Scheduler-->>Client: scheduled/paused/generation requests
```
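The flag handshake above can be sketched in a few lines of Python. This is a minimal illustration assuming the attribute and method names shown in the diagram (`_scheduler_prepared_resources`, `_prepare_resources_guaranteed_no_evict`); the actual signatures in the PR may differ:

```python
# Minimal sketch of the prepare_resources handshake from the diagram above.
# All names are taken from the walkthrough/diagram and are assumptions here.
class ResourceManagerSketch:

    def __init__(self, kv_cache_manager):
        self.kv_cache_manager = kv_cache_manager
        # Set by the scheduler when it already allocated blocks while scheduling.
        self._scheduler_prepared_resources = False

    def prepare_resources(self, context_requests, generation_requests):
        if self._scheduler_prepared_resources:
            # Max-utilization path: scheduler already allocated; reset and skip.
            self._scheduler_prepared_resources = False
            return context_requests, generation_requests
        # Default path (GUARANTEED_NO_EVICT): allocate/adjust KV cache blocks now.
        return self.kv_cache_manager._prepare_resources_guaranteed_no_evict(
            context_requests, generation_requests)
```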
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@tensorrt_llm/_torch/pyexecutor/scheduler.py`:
- Around line 292-307: The custom inner class ScheduledBatch used when calling
request_context(rm.is_draft, scheduled_batch) is missing the all_requests method
expected by request_context, causing an AttributeError in draft mode; fix by
either instantiating/using the existing ScheduledRequests type (which implements
all_requests) instead of ScheduledBatch, or add an all_requests(self) method to
ScheduledBatch that returns the combined list of context_requests +
generation_requests (or the same structure ScheduledRequests.all_requests
returns), and ensure scheduled_batch refers to that implementation before
calling request_context.
In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py`:
- Around line 118-121: Add an explicit length equality assertion before zipping
the results to prevent silent truncation by zip(): check that
len(texts_no_evict) == len(texts_max_util) (the outputs returned by the two
llm.generate calls) and raise/fail the test if they differ, then keep the
existing for i, (no_evict, max_util) in enumerate(zip(...)) loop to compare
elements; reference the variables texts_no_evict, texts_max_util and the
zip-based comparison loop when making the change.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/_util.py (1)

42-43: Prefer module-qualified imports for scheduler types. Consider importing the scheduler module and referencing scheduler.SimpleScheduler / scheduler.SimpleUnifiedScheduler to keep the namespace intact. As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py (2)

1-13: Prefer module-qualified imports in this test. To keep the namespace intact, import the modules and reference types via the module (e.g., import tensorrt_llm as trtllm, import tensorrt_llm.llmapi as llmapi). As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

36-66: Use Google-style docstrings for new helpers/tests. create_llm and the test docstrings are plain prose; please switch to Google-style (Args/Returns) so they're Sphinx-friendly. As per coding guidelines, Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.

tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

1692-1711: Use Google-style docstrings for new resource-prep APIs. The new prepare_resources helper docstrings are not in Google style; please update them to include Args/Returns for Sphinx parsing. As per coding guidelines, Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx.

tensorrt_llm/_torch/pyexecutor/scheduler.py (2)

6-16: Prefer module-qualified imports for KVCacheV2 helpers. Consider importing the module and referencing types via the module namespace (e.g., kv_cache_manager_v2._KVCache) to avoid adding unqualified names. As per coding guidelines, Always maintain the namespace when importing Python modules, even if only one class or function from a module is used.

237-291: Use Google-style docstrings for KVCacheV2MaxUtilizationScheduler. The new class/method docstrings are prose blocks; please switch to Google-style (Args/Returns) for consistency and Sphinx parsing. As per coding guidelines, Use Google-style docstrings for Python classes and functions, which can be parsed by Sphinx. A sketch of the requested style follows this list.
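For reference, a Google-style docstring of the kind these nitpicks request might look like the sketch below; the helper name create_llm comes from the test, while the parameters and return description are illustrative assumptions:

```python
def create_llm(model_dir, scheduler_policy):
    """Build an LLM configured with the given capacity scheduler policy.

    Args:
        model_dir: Path to the model checkpoint directory.
        scheduler_policy: Capacity scheduler policy to use, e.g.
            GUARANTEED_NO_EVICT or MAX_UTILIZATION.

    Returns:
        A configured LLM instance ready for generate() calls.
    """
    ...  # construction details omitted in this sketch
```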
```python
from .resource_manager import request_context

rm = self.kv_cache_manager
evicted_requests = []

# Create a ScheduledRequests object for context management
class ScheduledBatch:

    def __init__(self):
        self.context_requests = context_requests
        self.generation_requests = generation_requests

scheduled_batch = ScheduledBatch()

with request_context(rm.is_draft, scheduled_batch):
    new_generation_batch: RequestList = []
```
Fix draft-mode crash: ScheduledBatch lacks all_requests.
request_context expects scheduled_requests.all_requests(). The local ScheduledBatch doesn’t implement it, so draft-mode scheduling will raise AttributeError. Use ScheduledRequests or add all_requests.
🐛 Proposed fix

```diff
-        # Create a ScheduledRequests object for context management
-        class ScheduledBatch:
-
-            def __init__(self):
-                self.context_requests = context_requests
-                self.generation_requests = generation_requests
-
-        scheduled_batch = ScheduledBatch()
+        # Create a ScheduledRequests object for context management
+        scheduled_batch = ScheduledRequests()
+        scheduled_batch.context_requests = context_requests
+        scheduled_batch.generation_requests = generation_requests
```

🤖 Prompt for AI Agents
In `@tensorrt_llm/_torch/pyexecutor/scheduler.py` around lines 292-307, the
custom inner class ScheduledBatch used when calling request_context(rm.is_draft,
scheduled_batch) is missing the all_requests method expected by request_context,
causing an AttributeError in draft mode; fix by either instantiating/using the
existing ScheduledRequests type (which implements all_requests) instead of
ScheduledBatch, or add an all_requests(self) method to ScheduledBatch that
returns the combined list of context_requests + generation_requests (or the same
structure ScheduledRequests.all_requests returns), and ensure scheduled_batch
refers to that implementation before calling request_context.
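The second option the prompt describes (keeping a local batch class but adding all_requests) could look like this sketch; it assumes all_requests should return the concatenation of the two request lists, matching what the comment says ScheduledRequests provides:

```python
# Sketch of the alternative fix: give the local batch class the
# all_requests() method that request_context expects. The concatenation
# below assumes ScheduledRequests.all_requests behaves the same way.
class ScheduledBatch:

    def __init__(self, context_requests, generation_requests):
        self.context_requests = context_requests
        self.generation_requests = generation_requests

    def all_requests(self):
        # Combined view over both request lists, as used in draft mode.
        return self.context_requests + self.generation_requests
```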
```python
for i, (no_evict, max_util) in enumerate(zip(texts_no_evict, texts_max_util)):
    assert no_evict == max_util, (
        f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
    )
```
🧩 Analysis chain

🏁 Script executed:

```shell
# First, locate and examine the file
cd /repo && find . -name "test_kv_cache_v2_scheduler.py" -type f 2>/dev/null | head -20
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 118.

🏁 Script executed:

```shell
# Find the file in the current repository
find . -name "test_kv_cache_v2_scheduler.py" -type f 2>/dev/null | head -20
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 125.

🏁 Script executed:

```shell
# Examine the file around lines 118-121
cat -n ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py | sed -n '100,140p'
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 1215.

🏁 Script executed:

```shell
# Check the beginning of the file for copyright header
head -20 ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 530.

🏁 Script executed:

```shell
# Get more context around the comparison code
cat -n ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py | sed -n '80,125p'
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 2009.

🏁 Script executed:

```shell
# Check another test file to see the copyright header pattern
head -30 ./tests/unittest/_torch/executor/test_overlap_scheduler.py 2>/dev/null | head -30
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 973.

🏁 Script executed:

```shell
# Look for similar zip() patterns in the file to see if this is a broader pattern
grep -n "zip(" ./tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py
```

Repository: NVIDIA/TensorRT-LLM. Length of output: 149.
Add explicit length check before comparing outputs to prevent silent truncation by zip().
Both llm.generate() calls receive identical inputs (prompts and sampling_config), so they should produce the same number of outputs. However, zip() silently truncates on length mismatch, which could hide bugs if one call returns fewer outputs. Add an explicit assertion to catch this.
Suggested fix

```diff
+    assert len(texts_no_evict) == len(texts_max_util), (
+        f"Output length mismatch: {len(texts_no_evict)} vs {len(texts_max_util)}"
+    )
     for i, (no_evict, max_util) in enumerate(zip(texts_no_evict, texts_max_util)):
         assert no_evict == max_util, (
             f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
         )
```

🧰 Tools

🪛 Ruff (0.14.14)
[warning] 118-118: zip() without an explicit strict= parameter. Add explicit value for parameter strict= (B905)
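As the Ruff B905 hint suggests, on Python 3.10+ the length check can instead be folded into the loop with strict=True, which raises ValueError on a length mismatch rather than truncating silently; a sketch:

```python
# Alternative to the explicit length assertion (requires Python 3.10+):
# zip(..., strict=True) raises ValueError if the lists differ in length.
for i, (no_evict, max_util) in enumerate(
        zip(texts_no_evict, texts_max_util, strict=True)):
    assert no_evict == max_util, (
        f"Output mismatch at index {i}:\nNO_EVICT: {no_evict}\nMAX_UTIL: {max_util}"
    )
```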
🤖 Prompt for AI Agents
In `@tests/unittest/_torch/executor/test_kv_cache_v2_scheduler.py` around lines
118-121, add an explicit length equality assertion before zipping the results
to prevent silent truncation by zip(): check that len(texts_no_evict) ==
len(texts_max_util) (the outputs returned by the two llm.generate calls) and
raise/fail the test if they differ, then keep the existing for i, (no_evict,
max_util) in enumerate(zip(...)) loop to compare elements; reference the
variables texts_no_evict, texts_max_util and the zip-based comparison loop when
making the change.
```python
if is_kv_cache_v2:
    # For KVCacheManagerV2, use specialized policies
    if self.scheduler_policy == CapacitySchedulerPolicy.GUARANTEED_NO_EVICT:
        return KVCacheV2DummyPolicy()
```
We can remove the KVCacheV2DummyPolicy completely as it will fail under certain cases.
We can leave it like this for now for MTP support to avoid the merge conflict issue. Leave a comment here as a notice.
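The notice could be as simple as a comment next to the early return; the wording below is illustrative, not from the PR:

```python
if self.scheduler_policy == CapacitySchedulerPolicy.GUARANTEED_NO_EVICT:
    # NOTE: KVCacheV2DummyPolicy is known to fail in some cases; it is kept
    # for now only to avoid merge conflicts with the upcoming MTP support
    # and should be removed once that lands.
    return KVCacheV2DummyPolicy()
```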
```python
else:
    scheduled_requests.append(request)

return scheduled_requests, scheduled_disagg_gen_init_requests, []
```
For max_util, we may pause requests. However, why is the paused request list empty here?
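If pausing were wired through, the third element of the return could carry the paused requests; a hypothetical sketch (paused_requests and the capacity check are assumptions, not code from the diff):

```python
# Hypothetical shape for a max-util scheduler that actually reports pauses.
def schedule_max_util(candidates, has_capacity):
    scheduled_requests, paused_requests = [], []
    for request in candidates:
        if has_capacity(request):
            scheduled_requests.append(request)
        else:
            # Requests pushed out to free capacity would be reported here
            # instead of being dropped via an always-empty list.
            paused_requests.append(request)
    return scheduled_requests, [], paused_requests
```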
```python
return scheduled_requests, scheduled_disagg_gen_init_requests, []

def prepare_resources(self, context_requests: RequestList,
```
The overall design exposes a lot of the details of allocating KV cache that should be hidden inside the kv cache manager (e.g., direct access to internal variables like kv_cache_map). The ideal solution would be an API like prepare_blocks_if_schedulable that hides the resource allocation details and only tells the scheduler whether we have enough resources to schedule this request or not.
If this will take a long time to design the API and overall structure, we can do it in a following PR.
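A hypothetical shape for such an API, purely as a sketch: the method name prepare_blocks_if_schedulable comes from the comment above, everything else is assumed:

```python
# Hypothetical encapsulated API: the scheduler learns only yes/no, while
# block allocation and eviction details stay inside the kv cache manager.
class KVCacheManagerV2Facade:

    def prepare_blocks_if_schedulable(self, request) -> bool:
        """Try to reserve KV cache blocks for the request.

        Returns:
            True if blocks were reserved and the request can be scheduled;
            False if capacity is insufficient. No internal state (such as
            kv_cache_map) is exposed to the scheduler either way.
        """
        raise NotImplementedError  # allocation details hidden in this sketch
```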
```python
    For other policies (GUARANTEED_NO_EVICT), we allocate resources here.
    """
    # Check if the scheduler already prepared resources this round
    if self._scheduler_prepared_resources:
```
The overall design of kv cache manager v2 allocates resources in the scheduling stage. We can delete the prepare-resources path here and only keep the assertion.
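Under that design the branch would shrink to an assertion; a sketch of what the comment seems to suggest (shown as a standalone method; names are taken from the excerpt above, not the actual implementation):

```python
def prepare_resources(self, scheduled_batch):
    # KV cache manager v2 allocates during scheduling, so by the time this
    # runs the scheduler must already have prepared resources.
    assert self._scheduler_prepared_resources, (
        "scheduler is expected to allocate KV cache blocks during scheduling")
    self._scheduler_prepared_resources = False
```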
```python
evicted_requests = []

# Create a ScheduledRequests object for context management
class ScheduledBatch:
```
Can it be a module-level class instead of a local class here?
Implement python based max utilization scheduler based on kv cache v2