check-in license #1
Merged
lvhan028 merged 1 commit into InternLM:main from lvhan028:add-license on Jun 18, 2023
Conversation
No description provided.
roy-shih pushed a commit to roy-shih/lmdeploy that referenced this pull request on Nov 24, 2025:
This commit implements four high-priority kernels to bridge the gap between
the TurboMind CUDA backend and the PyTorch Triton backend, enabling
cross-platform deployment:
1. GELU and Mul kernel (activation.py)
- Fused GELU activation + elementwise multiply
- Follows TurboMind's GELU formula: x * 0.5 * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
- Auto-tuned for different vocab sizes
- Estimated speedup: 1.2-1.5x vs unfused PyTorch
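For illustration, a minimal Triton sketch of the fused GELU-and-mul pattern described above, assuming contiguous inputs. The kernel and wrapper names are hypothetical, not necessarily the API in activation.py, and the real kernel adds auto-tuning; tanh is expressed through `tl.sigmoid` for numerical stability:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _gelu_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    y = tl.load(y_ptr + offs, mask=mask).to(tl.float32)
    # tanh-approximation GELU from the formula above, with
    # tanh(z) = 2*sigmoid(2z) - 1 to avoid overflow in exp()
    inner = 0.7978845608028654 * (x + 0.044715 * x * x * x)  # sqrt(2/pi)*(x + 0.044715*x^3)
    gelu = x * 0.5 * (1.0 + (2.0 * tl.sigmoid(2.0 * inner) - 1.0))
    tl.store(out_ptr + offs, (gelu * y).to(out_ptr.dtype.element_ty), mask=mask)


def gelu_and_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Fused out = gelu(x) * y over flat, contiguous tensors (sketch only)."""
    out = torch.empty_like(x)
    n = x.numel()
    _gelu_mul_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out
```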
2. Top-K Sampling kernel (topk_sampling.py)
- High-performance top-k sampling with softmax normalization
- Iterative max-finding approach optimized for Triton
- Includes topk_filter for logits filtering
- Reference PyTorch implementation for testing
- Critical for inference quality
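A plain-PyTorch reference for the top-k path, in the spirit of the reference implementation the commit says it ships for testing; the function names here are illustrative, not necessarily those in topk_sampling.py:

```python
import torch


def topk_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest logits per row; mask the rest with -inf.

    Note: ties at the k-th value keep all tied tokens.
    """
    kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))


def topk_sample(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample one token id per row from the renormalized top-k distribution."""
    probs = torch.softmax(topk_filter(logits, k), dim=-1)
    return torch.multinomial(probs, num_samples=1)
```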
3. Top-P (Nucleus) Sampling kernel (topp_sampling.py)
- Nucleus sampling with cumulative probability threshold
- Greedy nucleus selection for Triton efficiency
- Fused softmax + cumsum + sampling
- topp_filter for pre-sampling logits filtering
- Reference implementation included
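Similarly, a hedged PyTorch reference for nucleus filtering and sampling; the Triton kernel fuses softmax, cumulative sum, and sampling into one pass, whereas this sketch spells the steps out:

```python
import torch


def topp_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Mask tokens outside the nucleus: the smallest set whose cumulative
    probability reaches p, taken in descending-probability order."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = probs.cumsum(dim=-1)
    # Drop a token only if the cumulative mass *before* it already reaches p,
    # so the first token crossing the threshold is still kept.
    drop_sorted = (cum - probs) >= p
    # Scatter the sorted-order mask back to the original token order.
    drop = torch.zeros_like(drop_sorted).scatter(-1, sorted_idx, drop_sorted)
    return logits.masked_fill(drop, float("-inf"))


def topp_sample(logits: torch.Tensor, p: float) -> torch.Tensor:
    probs = torch.softmax(topp_filter(logits, p), dim=-1)
    return torch.multinomial(probs, num_samples=1)
```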
4. Embedding Lookup + Position Encoding kernel (embedding_lookup.py)
- Fused embedding lookup + position encoding
- Three variants:
* embedding_lookup: Basic lookup
* embedding_lookup_pos_encoding: Fused lookup + pos encoding + scaling
* add_position_encoding: Add pos encoding to existing embeddings
- Auto-tuned for different hidden dimensions
- Memory bandwidth optimized with vectorized loads
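A PyTorch-level sketch of the fused variant, assuming "scaling" means multiplying the gathered embeddings by a constant before adding position encodings; the `scale` parameter and the function signature are assumptions, and the real Triton kernel performs all three steps in a single memory pass:

```python
import torch


def embedding_lookup_pos_encoding(
    token_ids: torch.Tensor,    # [batch, seq_len] integer token ids
    embed_table: torch.Tensor,  # [vocab_size, hidden] embedding weights
    pos_table: torch.Tensor,    # [max_seq_len, hidden] position encodings
    scale: float = 1.0,         # assumed scaling factor, e.g. sqrt(hidden)
) -> torch.Tensor:
    seq_len = token_ids.shape[-1]
    hidden = embed_table[token_ids] * scale   # row gather from the table
    return hidden + pos_table[:seq_len]       # broadcast-add positions
```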
Additionally:
- test_gelu_kernel.py: Comprehensive correctness and performance tests
These kernels address critical gaps identified in KERNEL_MIGRATION_CHECKLIST.md:
- Sampling: the PyTorch backend previously had only multinomial sampling; it now has Top-K/Top-P
- Activation: Extended from SiLU to include GELU
- Embedding: Enables fused prefill operations
Performance targets (vs TurboMind CUDA):
- GELU and Mul: ≥95% (simple elementwise)
- Embedding Lookup: ≥90% (memory-bound)
- Top-K/Top-P Sampling: ≥85% (compute-bound)
All kernels support:
- FP16/BF16/FP32 precision
- Auto-tuning for optimal performance
- Cross-platform (CUDA/ROCm/Intel XPU via Triton)
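The auto-tuning mentioned above is presumably Triton's `@triton.autotune` mechanism; a minimal sketch of the pattern, with placeholder configs rather than the actual tuned values:

```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 256}, num_warps=4),
        triton.Config({"BLOCK": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune whenever the problem size changes
)
@triton.jit
def _tuned_copy_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)
```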
Resolves tasks from KERNEL_TODO_QUICK_REF.md:
- Task InternLM#8: GELU and Mul ✅
- Task InternLM#1: Top-K Sampling ✅
- Task InternLM#2: Top-P Sampling ✅
- Task InternLM#10: Embedding + Pos Encoding ✅
Next steps:
- Performance benchmarking on GPU
- Integration tests with lmdeploy models
- KV Cache quantization kernels (INT4/INT8)