check-in license #1
Merged
lvhan028 merged 1 commit into InternLM:main from lvhan028:add-license on Jun 18, 2023
Conversation
No description provided.
roy-shih pushed a commit to roy-shih/lmdeploy that referenced this pull request on Nov 24, 2025:
This commit implements four high-priority kernels to bridge the gap between
the TurboMind CUDA backend and the PyTorch Triton backend, enabling
cross-platform deployment:
1. GELU and Mul kernel (activation.py)
- Fused GELU activation + elementwise multiply
- Follows TurboMind's GELU formula: x * 0.5 * (1 + tanh(sqrt(2/π) * (x + 0.044715 * x^3)))
- Auto-tuned for different vocab sizes
- Estimated speedup: 1.2-1.5x vs unfused PyTorch
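For illustration, a minimal Triton sketch of the fused GELU-and-mul pattern described above, assuming contiguous inputs. The kernel and wrapper names are hypothetical, not necessarily the API in activation.py, and the real kernel adds auto-tuning; tanh is expressed through `tl.sigmoid` for numerical stability:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def _gelu_mul_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(axis=0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    x = tl.load(x_ptr + offs, mask=mask).to(tl.float32)
    y = tl.load(y_ptr + offs, mask=mask).to(tl.float32)
    # tanh-approximation GELU from the formula above, with
    # tanh(z) = 2*sigmoid(2z) - 1 to avoid overflow in exp()
    inner = 0.7978845608028654 * (x + 0.044715 * x * x * x)  # sqrt(2/pi)*(x + 0.044715*x^3)
    gelu = x * 0.5 * (1.0 + (2.0 * tl.sigmoid(2.0 * inner) - 1.0))
    tl.store(out_ptr + offs, (gelu * y).to(out_ptr.dtype.element_ty), mask=mask)


def gelu_and_mul(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """Fused out = gelu(x) * y over flat, contiguous tensors (sketch only)."""
    out = torch.empty_like(x)
    n = x.numel()
    _gelu_mul_kernel[(triton.cdiv(n, 1024),)](x, y, out, n, BLOCK=1024)
    return out
```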
2. Top-K Sampling kernel (topk_sampling.py)
- High-performance top-k sampling with softmax normalization
- Iterative max-finding approach optimized for Triton
- Includes topk_filter for logits filtering
- Reference PyTorch implementation for testing
- Critical for inference quality
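A plain-PyTorch reference for the top-k path, in the spirit of the reference implementation the commit says it ships for testing; the function names here are illustrative, not necessarily those in topk_sampling.py:

```python
import torch


def topk_filter(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Keep the k largest logits per row; mask the rest with -inf.

    Note: ties at the k-th value keep all tied tokens.
    """
    kth = torch.topk(logits, k, dim=-1).values[..., -1, None]
    return logits.masked_fill(logits < kth, float("-inf"))


def topk_sample(logits: torch.Tensor, k: int) -> torch.Tensor:
    """Sample one token id per row from the renormalized top-k distribution."""
    probs = torch.softmax(topk_filter(logits, k), dim=-1)
    return torch.multinomial(probs, num_samples=1)
```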
3. Top-P (Nucleus) Sampling kernel (topp_sampling.py)
- Nucleus sampling with cumulative probability threshold
- Greedy nucleus selection for Triton efficiency
- Fused softmax + cumsum + sampling
- topp_filter for pre-sampling logits filtering
- Reference implementation included
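Similarly, a hedged PyTorch reference for nucleus filtering and sampling; the Triton kernel fuses softmax, cumulative sum, and sampling into one pass, whereas this sketch spells the steps out:

```python
import torch


def topp_filter(logits: torch.Tensor, p: float) -> torch.Tensor:
    """Mask tokens outside the nucleus: the smallest set whose cumulative
    probability reaches p, taken in descending-probability order."""
    sorted_logits, sorted_idx = torch.sort(logits, descending=True, dim=-1)
    probs = torch.softmax(sorted_logits, dim=-1)
    cum = probs.cumsum(dim=-1)
    # Drop a token only if the cumulative mass *before* it already reaches p,
    # so the first token crossing the threshold is still kept.
    drop_sorted = (cum - probs) >= p
    # Scatter the sorted-order mask back to the original token order.
    drop = torch.zeros_like(drop_sorted).scatter(-1, sorted_idx, drop_sorted)
    return logits.masked_fill(drop, float("-inf"))


def topp_sample(logits: torch.Tensor, p: float) -> torch.Tensor:
    probs = torch.softmax(topp_filter(logits, p), dim=-1)
    return torch.multinomial(probs, num_samples=1)
```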
4. Embedding Lookup + Position Encoding kernel (embedding_lookup.py)
- Fused embedding lookup + position encoding
- Three variants:
* embedding_lookup: Basic lookup
* embedding_lookup_pos_encoding: Fused lookup + pos encoding + scaling
* add_position_encoding: Add pos encoding to existing embeddings
- Auto-tuned for different hidden dimensions
- Memory bandwidth optimized with vectorized loads
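A PyTorch-level sketch of the fused variant, assuming "scaling" means multiplying the gathered embeddings by a constant before adding position encodings; the `scale` parameter and the function signature are assumptions, and the real Triton kernel performs all three steps in a single memory pass:

```python
import torch


def embedding_lookup_pos_encoding(
    token_ids: torch.Tensor,    # [batch, seq_len] integer token ids
    embed_table: torch.Tensor,  # [vocab_size, hidden] embedding weights
    pos_table: torch.Tensor,    # [max_seq_len, hidden] position encodings
    scale: float = 1.0,         # assumed scaling factor, e.g. sqrt(hidden)
) -> torch.Tensor:
    seq_len = token_ids.shape[-1]
    hidden = embed_table[token_ids] * scale   # row gather from the table
    return hidden + pos_table[:seq_len]       # broadcast-add positions
```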
Additionally:
- test_gelu_kernel.py: Comprehensive correctness and performance tests
These kernels address critical gaps identified in KERNEL_MIGRATION_CHECKLIST.md:
- Sampling: the PyTorch backend previously had only multinomial sampling; it now has Top-K/Top-P
- Activation: Extended from SiLU to include GELU
- Embedding: Enables fused prefill operations
Performance targets (vs TurboMind CUDA):
- GELU and Mul: ≥95% (simple elementwise)
- Embedding Lookup: ≥90% (memory-bound)
- Top-K/Top-P Sampling: ≥85% (compute-bound)
All kernels support:
- FP16/BF16/FP32 precision
- Auto-tuning for optimal performance
- Cross-platform (CUDA/ROCm/Intel XPU via Triton)
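The auto-tuning mentioned above is presumably Triton's `@triton.autotune` mechanism; a minimal sketch of the pattern, with placeholder configs rather than the actual tuned values:

```python
import triton
import triton.language as tl


@triton.autotune(
    configs=[
        triton.Config({"BLOCK": 256}, num_warps=4),
        triton.Config({"BLOCK": 1024}, num_warps=8),
    ],
    key=["n_elements"],  # re-tune whenever the problem size changes
)
@triton.jit
def _tuned_copy_kernel(x_ptr, out_ptr, n_elements, BLOCK: tl.constexpr):
    pid = tl.program_id(0)
    offs = pid * BLOCK + tl.arange(0, BLOCK)
    mask = offs < n_elements
    tl.store(out_ptr + offs, tl.load(x_ptr + offs, mask=mask), mask=mask)
```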
Resolves tasks from KERNEL_TODO_QUICK_REF.md:
- Task InternLM#8: GELU and Mul ✅
- Task InternLM#1: Top-K Sampling ✅
- Task InternLM#2: Top-P Sampling ✅
- Task InternLM#10: Embedding + Pos Encoding ✅
Next steps:
- Performance benchmarking on GPU
- Integration tests with lmdeploy models
- KV Cache quantization kernels (INT4/INT8)