Commit a1e8a6d
Update table of contents.
Signed-off-by: Bo Li <22713281+bobboli@users.noreply.github.com>
1 parent e053376

docs/source/blogs/tech_blog/blog16_Accelerating_Long_Context_Inference_with_Skip_Softmax_Attention.md

Lines changed: 20 additions & 6 deletions
@@ -4,6 +4,20 @@ In the previous [tech blog](https://github.com/heyuhhh/TensorRT-LLM/blob/user/yu

In this blog, we introduce **Skip Softmax Attention**, a drop-in sparse attention technique designed to accelerate existing pretrained models that use standard attention mechanisms such as MHA, GQA, or MLA. Skip Softmax Attention is built on top of the Flash Attention algorithm and only requires modifying the existing **attention kernels**. Thanks to this simplicity, the end-to-end performance gain is more predictable. In addition, since it only approximates the attention kernel computation, it remains compatible with nearly all other features, such as FP8 attention, KV cache reuse, and chunked prefill.

+## Table of Contents
+- [Accelerating Long-Context Inference with Skip Softmax Attention](#accelerating-long-context-inference-with-skip-softmax-attention)
+  - [Table of Contents](#table-of-contents)
+  - [Method Overview](#method-overview)
+  - [Example Usage](#example-usage)
+  - [Accuracy Evaluation](#accuracy-evaluation)
+  - [Performance Benchmark](#performance-benchmark)
+    - [Kernel Performance](#kernel-performance)
+    - [End-to-end Performance](#end-to-end-performance)
+  - [Reproduction](#reproduction)
+    - [Accuracy evaluation (LongBench V1/V2)](#accuracy-evaluation-longbench-v1v2)
+    - [End-to-end performance (TTFT/TPOT)](#end-to-end-performance-ttfttpot)
+  - [Conclusion](#conclusion)
+
## Method Overview

The idea of Skip Softmax Attention is to compare the local maximum ($\tilde{m}_i^{(j)}$) of $Q \cdot K^T$ with the running global maximum ($m_i^{(j)}$), and skip the softmax (exp) and BMM2 calculation for blocks that are below a certain threshold $\lambda$:
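
To make the criterion concrete, the sketch below shows where the skip fits inside a Flash-Attention-style block loop. This is an illustrative NumPy reconstruction, not the actual TensorRT-LLM kernel: the function name `skip_softmax_attention`, the default block size, and the exact threshold test `exp(m_local - m) < lam` are assumptions made for exposition.

```python
# Illustrative sketch of the block-skipping idea (assumed details, not the real kernel).
import numpy as np

def skip_softmax_attention(q, k, v, lam=1e-4, block=64):
    """Attention for a single query row over blocked K/V with online softmax.

    q: (d,), k: (n, d), v: (n, dv); `lam` is the skip threshold.
    """
    m = -np.inf                  # running global maximum of q @ k^T
    l = 0.0                      # running softmax denominator
    acc = np.zeros(v.shape[1])   # running (unnormalized) output

    for start in range(0, k.shape[0], block):
        s = q @ k[start:start + block].T   # BMM1 scores for this block
        m_local = s.max()                  # local maximum of the block

        # Skip criterion (assumed form): if even the largest score in this
        # block carries negligible softmax weight relative to the running
        # global maximum, skip the exp (softmax) and BMM2 for this block.
        if np.exp(m_local - m) < lam:
            continue

        m_new = max(m, m_local)
        correction = np.exp(m - m_new)     # rescale previously accumulated stats
        p = np.exp(s - m_new)              # softmax numerator (exp)
        acc = acc * correction + p @ v[start:start + block]  # BMM2
        l = l * correction + p.sum()
        m = m_new

    return acc / l
```

Skipped blocks simply contribute nothing to the running sums; the surviving blocks follow the standard online-softmax recurrence, which is why the technique can be implemented by modifying the attention kernels alone.
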
@@ -215,12 +229,12 @@ trtllm-eval --model Qwen/Qwen3-30B-A3B-Instruct-2507 \
--max_batch_size 1 --max_num_tokens 262144 \
--extra_llm_api_options extra_llm_api_options.yaml \
longbench_v2 \
-# Medium subset of LongBench V2
---length medium \
-# Truncate the prompt length to 256k
---max_input_length 256000 \
-# Dump dataset for perf benching
---output_dir ${OUTPUT_DIR}
+# Medium subset of LongBench V2
+--length medium \
+# Truncate the prompt length to 256k
+--max_input_length 256000 \
+# Dump dataset for perf benching
+--output_dir ${OUTPUT_DIR}
```

### End-to-end performance (TTFT/TPOT)
