Severity: High - Will bite you in production
The code itself admits there are memory issues. From litgpt/utils.py:316:
```python
# as a workaround hack, the cross entropy computation is chunked to force it to
# deallocate on the go, reducing the memory spike's magnitude
```
When the developers are calling their own code a "workaround hack", that's not great.
Specific problems:
KV Cache management is sketchy:
```python
# litgpt/model.py:66
if self.mask_cache is not None and self.mask_cache.shape[-1] < value:
    print(
        f"Warning: KV cache has length {self.mask_cache.shape[-1]} < {value}..."
    )
```
This just prints a warning at runtime. It should do one of the following instead (a sketch of the fail-fast option follows this list):
- Fix the cache size automatically
- Raise an exception BEFORE trying to use it
- Not get into this state in the first place
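A minimal sketch of the fail-fast option, written as a standalone helper (the function name `ensure_kv_cache_capacity` is hypothetical and litgpt's actual cache attributes may differ):

```python
from typing import Optional

import torch

def ensure_kv_cache_capacity(mask_cache: Optional[torch.Tensor], required_len: int) -> None:
    # Hypothetical helper, not litgpt API: fail fast instead of printing a
    # warning when the cached mask is too short for the requested sequence.
    if mask_cache is None:
        raise RuntimeError(
            "KV cache is not initialized; build the cache (e.g. via a "
            "set_kv_cache()-style call) before running generation."
        )
    if mask_cache.shape[-1] < required_len:
        raise ValueError(
            f"KV cache length {mask_cache.shape[-1]} is smaller than the "
            f"requested sequence length {required_len}; re-initialize the "
            f"cache with a large enough max_seq_length."
        )
```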
The chunked cross-entropy thing:
Yes, it works. But it's papering over a real issue - the backward pass is allocating way more memory than it should. This suggests one of the following (a simplified sketch of the chunking workaround follows this list):
- Something's not getting deallocated properly
- The computation graph is keeping references it shouldn't
- The CUDA memory allocator isn't being triggered when it should
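For context, here is roughly what the chunking workaround amounts to. This is a simplified sketch, not litgpt's actual `chunked_cross_entropy` from utils.py; the chunk size and ignore index are illustrative:

```python
import torch
import torch.nn.functional as F

def chunked_cross_entropy(logits: torch.Tensor, targets: torch.Tensor,
                          chunk_size: int = 128) -> torch.Tensor:
    # Flatten to (batch * seq_len, vocab) and (batch * seq_len,).
    logits = logits.reshape(-1, logits.size(-1))
    targets = targets.reshape(-1)
    # Compute the loss chunk by chunk so the large intermediate buffers of one
    # chunk can be freed before the next is processed. This lowers the peak
    # allocation but does not fix the underlying memory behaviour.
    losses = [
        F.cross_entropy(chunk_logits, chunk_targets, reduction="sum", ignore_index=-100)
        for chunk_logits, chunk_targets in zip(logits.split(chunk_size),
                                               targets.split(chunk_size))
    ]
    n_valid = (targets != -100).sum().clamp(min=1)
    return torch.stack(losses).sum() / n_valid
```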
What happens in practice:
- Random OOMs during training that are hard to reproduce
- Memory usage creeps up over time in long-running inference servers
- If you try to use the full context length with large batch sizes, good luck
Fix it properly:
- Profile the memory with PyTorch's memory profiler - find the actual leak (see the sketch after this list)
- Make KV cache lifecycle explicit - init, use, clear, destroy
- Add a proper memory budget system instead of chunking hacks
- Test with full context lengths and realistic batch sizes
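A self-contained sketch of the profiling step, using PyTorch's memory snapshot tooling. Note that `_record_memory_history` and `_dump_snapshot` are semi-private APIs whose availability and signatures vary across PyTorch versions, and the model and shapes below are stand-ins rather than litgpt code:

```python
import torch
import torch.nn.functional as F

# Stand-in model and data; swap in the real training step to profile litgpt.
model = torch.nn.Linear(4096, 32000).cuda()
inputs = torch.randn(8, 4096, device="cuda")
targets = torch.randint(0, 32000, (8,), device="cuda")

# Record allocator events, run one step, then dump a snapshot that can be
# opened at https://pytorch.org/memory_viz to see which allocations drive
# the peak and what is still holding references afterwards.
torch.cuda.memory._record_memory_history(max_entries=100_000)

loss = F.cross_entropy(model(inputs), targets)
loss.backward()

torch.cuda.memory._dump_snapshot("memory_snapshot.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```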