[Bugfix][310P]: fix unusable mmencoder & swiglu via padding #6706

Tflowers-0129 wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Summary of Changes

Hello @Tflowers-0129, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request delivers critical bug fixes and enhancements for vLLM's operations on the Ascend 310P platform. The primary focus is on rectifying padding-related issues in key neural network components such as linear layers and multi-head attention, and on optimizing the SwiGLU activation. By introducing specialized linear layer implementations with precise padding and weight format handling, and by integrating native NPU operations, the changes aim to significantly improve the stability, compatibility, and performance of models running on Ascend hardware.
Code Review
This pull request introduces fixes for padding issues on the Ascend 310P platform for mmencoder attention and swiglu activation. The changes include introducing 310P-specific linear layers that handle automatic padding for MLP modules, updating the swiglu activation to use a more robust NPU kernel, and adding padding logic within the mmencoder attention layer for unaligned head sizes. Unit tests for the new linear layers are also added. The changes are well-structured and address the described issues.
I've identified a potential issue in vllm_ascend/utils.py where the QKVParallelLinear layer is not being mapped to a 310P-specific implementation, which could lead to incorrect weight processing on 310P devices. Please see the detailed comment.
As per the repository's style guide, here are the suggested PR title and summary:
Suggested PR Title:

[Ops][BugFix] Fix mmencoder and swiglu padding on 310P

Suggested PR Summary:
### What this PR does / why we need it?
This PR fixes issues with `mmencoder` (multi-modal encoder) attention and `swiglu` activation on the Ascend 310P platform, specifically when input dimensions are not aligned.
- For `swiglu`, the MLP linear layers (`gate_up_proj`, `down_proj`) did not handle padding for unaligned intermediate dimensions. This PR introduces 310P-specific linear layers that automatically pad weights and activations to be aligned to 32 bytes, which is required for `npu_swiglu`.
- For `mmencoder`, the attention mechanism failed when the head size was not a multiple of 16. This PR adds logic to pad the query, key, and value tensors within the attention layer to a 16-byte alignment before calling `_npu_flash_attention_unpad`.
- It also replaces the manual `silu`+`mul` implementation with the optimized `torch_npu.npu_swiglu` kernel.
- New unit tests are added to verify the padding logic in the new linear layers.
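As an illustration of the attention-side padding described above, here is a minimal sketch, assuming a 16-element alignment and a hypothetical helper name (only `_npu_flash_attention_unpad` is named by this PR; its actual code may differ):

```python
import torch
import torch.nn.functional as F

ALIGN = 16  # assumed per-head alignment required by the 310P kernel


def pad_qkv_head_size(query: torch.Tensor, key: torch.Tensor,
                      value: torch.Tensor, head_size: int):
    """Zero-pad the last (head-size) dim of q/k/v to a multiple of ALIGN.

    Hypothetical helper; the PR performs this inside the attention layer
    just before calling _npu_flash_attention_unpad.
    """
    pad = (-head_size) % ALIGN  # elements to add; 0 if already aligned
    if pad:
        # F.pad's trailing (left, right) pair applies to the last dimension
        query = F.pad(query, (0, pad))
        key = F.pad(key, (0, pad))
        value = F.pad(value, (0, pad))
    return query, key, value, head_size + pad
```

Note that the softmax scale must still be derived from the original head size; zero padding does not change the q·k dot products (a point one of the reviews below also makes).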
### Does this PR introduce _any_ user-facing change?
No. This is a bug fix for a specific hardware backend and does not change any user-facing APIs.
### How was this patch tested?
- New unit tests have been added in `tests/ut/_310p/ops/test_linear_310p.py` to verify the padding and alignment logic for the new 310P-specific linear layers.
- CI passed with these new tests.

```python
"ColumnParallelLinear": AscendColumnParallelLinear310,
"RowParallelLinear": AscendRowParallelLinear310,
"MergedColumnParallelLinear": AscendMergedColumnParallelLinear310,
"QKVParallelLinear": AscendQKVParallelLinear,
```
The registration for QKVParallelLinear uses AscendQKVParallelLinear from the generic ops directory, not a 310P-specific implementation. The generic AscendQKVParallelLinear calls the non-310P AscendColumnParallelLinear's __init__, which means it will use AscendUnquantizedLinearMethod instead of AscendUnquantizedLinearMethod310. This will cause the qkv_proj layer to miss the 310P-specific weight processing logic (e.g., NZ format conversion in AscendUnquantizedLinearMethod310), which can lead to performance degradation or incorrect behavior on 310P devices.
To fix this, you should create an AscendQKVParallelLinear310 class in vllm_ascend/_310p/ops/linear.py that correctly uses the 310P-specific linear layer logic, and register it here.
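A minimal sketch of the suggested class, assuming the method swap happens in `__init__` the way the other 310P layers in this PR appear to do it (their actual mechanism may differ):

```python
# Hypothetical sketch for vllm_ascend/_310p/ops/linear.py; constructor
# arguments are forwarded untouched and only the quant-method swap is shown.
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               UnquantizedLinearMethod)


class AscendQKVParallelLinear310(QKVParallelLinear):
    """QKV projection wired to the 310P-specific linear method so that
    qkv_proj gets the 310P weight processing (e.g. NZ format conversion)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # AscendUnquantizedLinearMethod310 is the method this PR adds in
        # the same module, so it would be in scope here.
        if isinstance(self.quant_method, UnquantizedLinearMethod):
            self.quant_method = AscendUnquantizedLinearMethod310()
```

It would then be registered as `"QKVParallelLinear": AscendQKVParallelLinear310` in the mapping shown above.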
| raise RuntimeError(f"1D param mismatch: param={tuple(param_data.shape)} loaded={tuple(loaded.shape)}") | ||
|
|
||
|
|
||
| class AscendUnquantizedLinearMethod310(UnquantizedLinearMethod): |
Why does 310P need its own UnquantizedLinearMethod? How does it differ from other devices?
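For context, the summary above says these 310P layers pad weights and activations to 32-byte alignment. A minimal sketch of that idea with illustrative names (the real `AscendUnquantizedLinearMethod310` also performs NZ format conversion and may pad differently):

```python
import torch
import torch.nn.functional as F

ALIGN_BYTES = 32  # alignment stated in the PR summary for npu_swiglu


def pad_weight_rows(weight: torch.Tensor) -> torch.Tensor:
    """Zero-pad the output (row) dim so its byte size is 32-aligned.

    Illustrative helper only, not the PR's actual code.
    """
    align_elems = ALIGN_BYTES // weight.element_size()  # 16 for fp16
    pad = (-weight.shape[0]) % align_elems
    if pad:
        weight = F.pad(weight, (0, 0, 0, pad))  # extra zero rows at the end
    return weight
```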
```diff
@@ -0,0 +1,111 @@
+import math
```

```python
orig_shape = x.shape
if x.dim() > 2:
    x = x.contiguous().view(-1, orig_shape[-1])
```
Does this situation actually occur on the 300I DUO? The A2 code has no such logic and runs stably.
```python
query = query.view(bsz * q_len, self.num_heads, self.head_size)
key = key.view(bsz * kv_len, self.num_kv_heads, self.head_size)
value = value.view(bsz * kv_len, self.num_kv_heads, self.head_size)
head_size_real = int(query.shape[-1])

query = query.view(bsz * q_len, self.num_heads, head_size_real)
key = key.view(bsz * kv_len, self.num_kv_heads, head_size_real)
value = value.view(bsz * kv_len, self.num_kv_heads, head_size_real)
```
`head_size_real = query.shape[-1]` equals `self.head_size`, so this change is redundant.
```python
scale = getattr(self, "scale", None)
if scale is None:
    head_size_orig = getattr(self, "head_size_orig", None)
    if head_size_orig is not None:
        scale = float(head_size_orig) ** -0.5
    else:
        scale = float(head_size_real) ** -0.5
```
This is redundant code: no class attributes such as `self.head_size_orig` or `self.scale` exist here. Using `scale = self.head_size ** -0.5` covers both the padded and unpadded cases.
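A short demonstration of why this simplification is safe (shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

head_size, pad = 72, 8  # 72 -> 80 after 16-element alignment
q, k = torch.randn(1, head_size), torch.randn(1, head_size)
q_pad, k_pad = F.pad(q, (0, pad)), F.pad(k, (0, pad))

# Zero padding leaves the q.k dot product unchanged, so a single
# expression based on the original head size covers both paths:
assert torch.allclose(q @ k.T, q_pad @ k_pad.T)
scale = head_size ** -0.5
```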
```python
self.input_size = int(input_size)
self.output_size = int(output_size)
```

```python
self.quant_method = quant_config.get_quant_method(self, prefix=prefix)


class AscendColumnParallelLinear310(ColumnParallelLinear):
```
The ColumnParallelLinear class additionally affects qwen3-next's convolution layers, glm4.1v's patch-merger proj layer, whisper's q_proj layer, and others.
```python
return out, out_bias


class AscendRowParallelLinear310(RowParallelLinear):
```
The RowParallelLinear class additionally affects dense layers. The down-projection layer is named inconsistently across models (down_proj, fc2, dense_4h_to_h, etc.), so the behavior here is uncontrollable.
```python
return out, (self.bias if self.skip_bias_add else None)


class AscendMergedColumnParallelLinear310(MergedColumnParallelLinear):
```
Some models also use MergedColumnParallelLinear for kv_proj.
```python
)


class AscendReplicatedLinear310(ReplicatedLinear):
```
ReplicatedLinear is used for MoE gating layers, qwen3-next's non-TP parallel layers, and the proj layers of VL models; padding here will corrupt the model's context/hidden dimensions.
```python
if x.dim() > 2:
    x = x.contiguous().view(-1, orig_shape[-1])

out = torch_npu.npu_swiglu(x)
```
The performance gain from this fused operator is very low here: under 1% end-to-end on the 300I DUO. It could also be made conditional:

```python
h = x.shape[-1] // 2
if x.shape[-1] % 32 == 0:  # fused kernel needs an aligned last dim
    out = torch_npu.npu_swiglu(x)
else:
    out = F.silu(x[..., :h]) * x[..., h:]
```
This pull request has conflicts; please resolve them before we can evaluate the pull request.