[Bugfix][310P]: fix unusable mmencoder & swiglu via padding #6706

Tflowers-0129 wants to merge 1 commit into vllm-project:main
Conversation
Signed-off-by: Tflowers-0129 <2906339855@qq.com>
Summary of Changes

Hello @Tflowers-0129, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request delivers critical bug fixes and enhancements for vLLM's operations on the Ascend 310P platform. The primary focus is on rectifying padding-related issues in key neural network components such as linear layers and multi-head attention, and on optimizing the SwiGLU activation. By introducing specialized linear layer implementations with precise padding and weight format handling, and by integrating native NPU operations, the changes aim to significantly improve the stability, compatibility, and performance of models running on Ascend hardware.
Code Review
This pull request introduces fixes for padding issues on the Ascend 310P platform for mmencoder attention and swiglu activation. The changes include introducing 310P-specific linear layers that handle automatic padding for MLP modules, updating the swiglu activation to use a more robust NPU kernel, and adding padding logic within the mmencoder attention layer for unaligned head sizes. Unit tests for the new linear layers are also added. The changes are well-structured and address the described issues.
I've identified a potential issue in vllm_ascend/utils.py where the QKVParallelLinear layer is not being mapped to a 310P-specific implementation, which could lead to incorrect weight processing on 310P devices. Please see the detailed comment.
As per the repository's style guide, here are the suggested PR title and summary:
Suggested PR Title:

[Ops][BugFix] Fix mmencoder and swiglu padding on 310P

Suggested PR Summary:
### What this PR does / why we need it?
This PR fixes issues with `mmencoder` (multi-modal encoder) attention and `swiglu` activation on the Ascend 310P platform, specifically when input dimensions are not aligned.
- For `swiglu`, the MLP linear layers (`gate_up_proj`, `down_proj`) did not handle padding for unaligned intermediate dimensions. This PR introduces 310P-specific linear layers that automatically pad weights and activations to be aligned to 32 bytes, which is required for `npu_swiglu`.
- For `mmencoder`, the attention mechanism failed when the head size was not a multiple of 16. This PR adds logic to pad the query, key, and value tensors within the attention layer to a 16-byte alignment before calling `_npu_flash_attention_unpad`.
- It also replaces the manual `silu`+`mul` implementation with the optimized `torch_npu.npu_swiglu` kernel.
- New unit tests are added to verify the padding logic in the new linear layers.
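As an illustration of the attention-side padding described above, here is a minimal sketch, assuming a 16-element alignment and a hypothetical helper name (only `_npu_flash_attention_unpad` is named by this PR; its actual code may differ):

```python
import torch
import torch.nn.functional as F

ALIGN = 16  # assumed per-head alignment required by the 310P kernel


def pad_qkv_head_size(query: torch.Tensor, key: torch.Tensor,
                      value: torch.Tensor, head_size: int):
    """Zero-pad the last (head-size) dim of q/k/v to a multiple of ALIGN.

    Hypothetical helper; the PR performs this inside the attention layer
    just before calling _npu_flash_attention_unpad.
    """
    pad = (-head_size) % ALIGN  # elements to add; 0 if already aligned
    if pad:
        # F.pad's trailing (left, right) pair applies to the last dimension
        query = F.pad(query, (0, pad))
        key = F.pad(key, (0, pad))
        value = F.pad(value, (0, pad))
    return query, key, value, head_size + pad
```

Note that the softmax scale must still be derived from the original head size; zero padding does not change the q·k dot products (a point one of the reviews below also makes).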
### Does this PR introduce _any_ user-facing change?
No. This is a bug fix for a specific hardware backend and does not change any user-facing APIs.
### How was this patch tested?
- New unit tests have been added in `tests/ut/_310p/ops/test_linear_310p.py` to verify the padding and alignment logic for the new 310P-specific linear layers.
- CI passed with these new tests.

```python
"ColumnParallelLinear": AscendColumnParallelLinear310,
"RowParallelLinear": AscendRowParallelLinear310,
"MergedColumnParallelLinear": AscendMergedColumnParallelLinear310,
"QKVParallelLinear": AscendQKVParallelLinear,
```
The registration for QKVParallelLinear uses AscendQKVParallelLinear from the generic ops directory, not a 310P-specific implementation. The generic AscendQKVParallelLinear calls the non-310P AscendColumnParallelLinear's __init__, which means it will use AscendUnquantizedLinearMethod instead of AscendUnquantizedLinearMethod310. This will cause the qkv_proj layer to miss the 310P-specific weight processing logic (e.g., NZ format conversion in AscendUnquantizedLinearMethod310), which can lead to performance degradation or incorrect behavior on 310P devices.
To fix this, you should create an AscendQKVParallelLinear310 class in vllm_ascend/_310p/ops/linear.py that correctly uses the 310P-specific linear layer logic, and register it here.
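A minimal sketch of the suggested class, assuming the method swap happens in `__init__` the way the other 310P layers in this PR appear to do it (their actual mechanism may differ):

```python
# Hypothetical sketch for vllm_ascend/_310p/ops/linear.py; constructor
# arguments are forwarded untouched and only the quant-method swap is shown.
from vllm.model_executor.layers.linear import (QKVParallelLinear,
                                               UnquantizedLinearMethod)


class AscendQKVParallelLinear310(QKVParallelLinear):
    """QKV projection wired to the 310P-specific linear method so that
    qkv_proj gets the 310P weight processing (e.g. NZ format conversion)."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # AscendUnquantizedLinearMethod310 is the method this PR adds in
        # the same module, so it would be in scope here.
        if isinstance(self.quant_method, UnquantizedLinearMethod):
            self.quant_method = AscendUnquantizedLinearMethod310()
```

It would then be registered as `"QKVParallelLinear": AscendQKVParallelLinear310` in the mapping shown above.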
| raise RuntimeError(f"1D param mismatch: param={tuple(param_data.shape)} loaded={tuple(loaded.shape)}") | ||
|
|
||
|
|
||
| class AscendUnquantizedLinearMethod310(UnquantizedLinearMethod): |
Why does 310P need its own UnquantizedLinearMethod? How does it differ from other devices?
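For context, the summary above says these 310P layers pad weights and activations to 32-byte alignment. A minimal sketch of that idea with illustrative names (the real `AscendUnquantizedLinearMethod310` also performs NZ format conversion and may pad differently):

```python
import torch
import torch.nn.functional as F

ALIGN_BYTES = 32  # alignment stated in the PR summary for npu_swiglu


def pad_weight_rows(weight: torch.Tensor) -> torch.Tensor:
    """Zero-pad the output (row) dim so its byte size is 32-aligned.

    Illustrative helper only, not the PR's actual code.
    """
    align_elems = ALIGN_BYTES // weight.element_size()  # 16 for fp16
    pad = (-weight.shape[0]) % align_elems
    if pad:
        weight = F.pad(weight, (0, 0, 0, pad))  # extra zero rows at the end
    return weight
```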
```diff
@@ -0,0 +1,111 @@
+import math
```

```python
orig_shape = x.shape
if x.dim() > 2:
    x = x.contiguous().view(-1, orig_shape[-1])
```
Does this situation actually occur on the 300I DUO? The A2 code has no such logic and runs stably.
```python
query = query.view(bsz * q_len, self.num_heads, self.head_size)
key = key.view(bsz * kv_len, self.num_kv_heads, self.head_size)
value = value.view(bsz * kv_len, self.num_kv_heads, self.head_size)
head_size_real = int(query.shape[-1])

query = query.view(bsz * q_len, self.num_heads, head_size_real)
key = key.view(bsz * kv_len, self.num_kv_heads, head_size_real)
value = value.view(bsz * kv_len, self.num_kv_heads, head_size_real)
```
`head_size_real = query.shape[-1]` equals `self.head_size`, so this change is redundant.
```python
scale = getattr(self, "scale", None)
if scale is None:
    head_size_orig = getattr(self, "head_size_orig", None)
    if head_size_orig is not None:
        scale = float(head_size_orig) ** -0.5
    else:
        scale = float(head_size_real) ** -0.5
```
This is redundant code: no class attributes such as `self.head_size_orig` or `self.scale` exist here. Using `scale = self.head_size ** -0.5` covers both the padded and unpadded cases.
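A short demonstration of why this simplification is safe (shapes here are illustrative):

```python
import torch
import torch.nn.functional as F

head_size, pad = 72, 8  # 72 -> 80 after 16-element alignment
q, k = torch.randn(1, head_size), torch.randn(1, head_size)
q_pad, k_pad = F.pad(q, (0, pad)), F.pad(k, (0, pad))

# Zero padding leaves the q.k dot product unchanged, so a single
# expression based on the original head size covers both paths:
assert torch.allclose(q @ k.T, q_pad @ k_pad.T)
scale = head_size ** -0.5
```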
```python
self.input_size = int(input_size)
self.output_size = int(output_size)
```

```python
self.quant_method = quant_config.get_quant_method(self, prefix=prefix)


class AscendColumnParallelLinear310(ColumnParallelLinear):
```
The ColumnParallelLinear class additionally affects qwen3-next's convolution layers, glm4.1v's patch-merger proj layer, whisper's q_proj layer, and others.
```python
return out, out_bias


class AscendRowParallelLinear310(RowParallelLinear):
```
The RowParallelLinear class additionally affects dense layers. The down-projection layer is named inconsistently across models (down_proj, fc2, dense_4h_to_h, etc.), so the behavior here is uncontrollable.
```python
return out, (self.bias if self.skip_bias_add else None)


class AscendMergedColumnParallelLinear310(MergedColumnParallelLinear):
```
Some models also use MergedColumnParallelLinear for kv_proj.
```python
)


class AscendReplicatedLinear310(ReplicatedLinear):
```
ReplicatedLinear is used for MoE gating layers, qwen3-next's non-TP parallel layers, and the proj layers of VL models; padding here will corrupt the model's context/hidden dimensions.
```python
if x.dim() > 2:
    x = x.contiguous().view(-1, orig_shape[-1])

out = torch_npu.npu_swiglu(x)
```
The performance gain from this fused operator is very low here: under 1% end-to-end on the 300I DUO. It could also be made conditional:

```python
h = x.shape[-1] // 2
if x.shape[-1] % 32 == 0:  # fused kernel needs an aligned last dim
    out = torch_npu.npu_swiglu(x)
else:
    out = F.silu(x[..., :h]) * x[..., h:]
```
This pull request has conflicts; please resolve them before we can evaluate the pull request.