
[Bugfix][310P]: fix mmencoder&swiglu cannot use via padding #6706

Open
Tflowers-0129 wants to merge 1 commit into vllm-project:main from Tflowers-0129:paddingnewnew

Conversation

@Tflowers-0129
Contributor

@Tflowers-0129 Tflowers-0129 commented Feb 11, 2026

What this PR does / why we need it?

Does this PR introduce any user-facing change?

How was this patch tested?

Signed-off-by: Tflowers-0129 <2906339855@qq.com>
@gemini-code-assist
Contributor

Summary of Changes

Hello @Tflowers-0129, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request delivers critical bug fixes and enhancements for vLLM's operations on the Ascend 310P platform. The primary focus is on rectifying padding-related issues in key neural network components like linear layers and multi-head attention, and optimizing the SwiGLU activation. By introducing specialized linear layer implementations with precise padding and weight format handling, and integrating native NPU operations, the changes aim to significantly improve the stability, compatibility, and performance of models running on Ascend hardware.

Highlights

  • Linear Layer Implementations for Ascend 310P: Introduced a comprehensive set of custom linear layers (ColumnParallelLinear, RowParallelLinear, MergedColumnParallelLinear, ReplicatedLinear) specifically designed for the Ascend 310P, incorporating padding logic and NPU-specific weight handling.
  • SwiGLU Activation Bugfix: Resolved a bug in the AscendSiluAndMul310 activation function by switching to torch_npu.npu_swiglu, which correctly handles multi-dimensional inputs and improves performance (a sketch of this forward path follows this list).
  • MMEncoder Attention Padding Correction: Fixed padding issues within the AscendMMEncoderAttention310's forward_oot method by dynamically padding query, key, and value tensors to a 16-byte alignment before NPU flash attention, and unpadding the output.
  • Weight Padding and Format Handling: Implemented robust padding mechanisms (_align_up, _pad_last_dim) and NPU format casting (ACL_FORMAT_FRACTAL_NZ) for linear layer weights to ensure proper operation and efficiency on Ascend hardware.
  • Unit Tests for Linear Operations: Added new unit tests to validate the correct behavior of the Ascend 310P linear operations, including weight processing, padding alignment, and quantization method initialization.
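
For illustration, here is a minimal sketch of the reshape-and-fuse pattern described in the SwiGLU highlight above. The function name silu_and_mul_310 is hypothetical; the PR's actual forward method may differ.

import torch
import torch_npu


def silu_and_mul_310(x: torch.Tensor) -> torch.Tensor:
    # Flatten any leading dimensions to 2D before the fused kernel,
    # as the changelog below describes.
    orig_shape = x.shape
    if x.dim() > 2:
        x = x.contiguous().view(-1, orig_shape[-1])
    # Fused kernel: silu(x[..., :h]) * x[..., h:], where h = last_dim // 2.
    out = torch_npu.npu_swiglu(x)
    # Restore the original leading dimensions with the halved last dimension.
    if len(orig_shape) > 2:
        out = out.view(*orig_shape[:-1], orig_shape[-1] // 2)
    return out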


Changelog
  • tests/ut/_310p/ops/test_linear_310p.py
    • Added unit tests for Ascend 310P linear operations, verifying weight processing, padding alignment, and quantization method initialization.
  • vllm_ascend/_310p/ops/activation.py
    • Updated the AscendSiluAndMul310 forward method to use torch_npu.npu_swiglu for improved performance and correct handling of multi-dimensional inputs, including reshaping and un-reshaping.
  • vllm_ascend/_310p/ops/linear.py
    • Added new custom linear layer implementations (AscendLinearBase310, AscendUnquantizedLinearMethod310, AscendColumnParallelLinear310, AscendRowParallelLinear310, AscendMergedColumnParallelLinear310, AscendReplicatedLinear310) to support Ascend 310P, incorporating padding logic, weight loading, and NPU format casting (a sketch of the padding helpers follows this changelog).
  • vllm_ascend/_310p/ops/mm_encoder_attention.py
    • Modified the forward_oot method in AscendMMEncoderAttention310 to dynamically pad query, key, and value tensors to a 16-byte alignment before NPU flash attention, and unpad the output, resolving potential padding issues.
    • Updated the calculation of scale_value to be more robust.
  • vllm_ascend/utils.py
    • Registered the newly introduced Ascend 310P linear operations and MRotaryEmbedding into the vLLM Ascend custom operator registry.
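
A rough sketch of what the _align_up and _pad_last_dim helpers named in the highlights might look like; the exact signatures in the PR may differ.

import torch
import torch.nn.functional as F


def _align_up(value: int, alignment: int) -> int:
    # Round up to the next multiple of `alignment`, e.g. _align_up(300, 32) == 320.
    return ((value + alignment - 1) // alignment) * alignment


def _pad_last_dim(x: torch.Tensor, alignment: int) -> torch.Tensor:
    # Zero-pad the last dimension so its size becomes a multiple of `alignment`.
    pad = _align_up(x.shape[-1], alignment) - x.shape[-1]
    return F.pad(x, (0, pad)) if pad else x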


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces fixes for padding issues on the Ascend 310P platform for mmencoder attention and swiglu activation. The changes include introducing 310P-specific linear layers that handle automatic padding for MLP modules, updating the swiglu activation to use a more robust NPU kernel, and adding padding logic within the mmencoder attention layer for unaligned head sizes. Unit tests for the new linear layers are also added. The changes are well-structured and address the described issues.

I've identified a potential issue in vllm_ascend/utils.py where the QKVParallelLinear layer is not being mapped to a 310P-specific implementation, which could lead to incorrect weight processing on 310P devices. Please see the detailed comment.

As per the repository's style guide, here are the suggested PR title and summary:

Suggested PR Title:

[Ops][BugFix] Fix mmencoder and swiglu padding on 310P

Suggested PR Summary:

### What this PR does / why we need it?

This PR fixes issues with `mmencoder` (multi-modal encoder) attention and `swiglu` activation on the Ascend 310P platform, specifically when input dimensions are not aligned.

- For `swiglu`, the MLP linear layers (`gate_up_proj`, `down_proj`) did not handle padding for unaligned intermediate dimensions. This PR introduces 310P-specific linear layers that automatically pad weights and activations to be aligned to 32 bytes, which is required for `npu_swiglu`.
- For `mmencoder`, the attention mechanism failed when the head size was not a multiple of 16. This PR adds logic to pad the query, key, and value tensors within the attention layer to a 16-byte alignment before calling `_npu_flash_attention_unpad`.
- It also replaces the manual `silu`+`mul` implementation with the optimized `torch_npu.npu_swiglu` kernel.
- New unit tests are added to verify the padding logic in the new linear layers.

### Does this PR introduce _any_ user-facing change?

No. This is a bug fix for a specific hardware backend and does not change any user-facing APIs.

### How was this patch tested?

- New unit tests have been added in `tests/ut/_310p/ops/test_linear_310p.py` to verify the padding and alignment logic for the new 310P-specific linear layers.
- CI passed with these new tests.
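
As an aside, a sketch under assumptions of the q/k/v padding the suggested summary above describes (pad the head dimension to a multiple of 16 before _npu_flash_attention_unpad, then slice the output back); the helper name is illustrative only, not the PR's code.

import torch
import torch.nn.functional as F


def _pad_heads_to_16(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor):
    # Pad the head dimension up to a multiple of 16 so the NPU flash-attention
    # kernel accepts it; return the original size so the output can be un-padded.
    head_size = q.shape[-1]
    target = ((head_size + 15) // 16) * 16
    if target != head_size:
        pad = (0, target - head_size)
        q, k, v = F.pad(q, pad), F.pad(k, pad), F.pad(v, pad)
    return q, k, v, head_size

# After the attention kernel runs, the output is sliced back:
#   output = output[..., :head_size]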

"ColumnParallelLinear": AscendColumnParallelLinear310,
"RowParallelLinear": AscendRowParallelLinear310,
"MergedColumnParallelLinear": AscendMergedColumnParallelLinear310,
"QKVParallelLinear": AscendQKVParallelLinear,

high

The registration for QKVParallelLinear uses AscendQKVParallelLinear from the generic ops directory, not a 310P-specific implementation. The generic AscendQKVParallelLinear calls the non-310P AscendColumnParallelLinear's __init__, which means it will use AscendUnquantizedLinearMethod instead of AscendUnquantizedLinearMethod310. This will cause the qkv_proj layer to miss the 310P-specific weight processing logic (e.g., NZ format conversion in AscendUnquantizedLinearMethod310), which can lead to performance degradation or incorrect behavior on 310P devices.

To fix this, you should create an AscendQKVParallelLinear310 class in vllm_ascend/_310p/ops/linear.py that correctly uses the 310P-specific linear layer logic, and register it here.
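
As a rough illustration of that suggestion (not the PR's code), a minimal AscendQKVParallelLinear310 could mirror the other 310P layers by installing AscendUnquantizedLinearMethod310 and then be registered in vllm_ascend/utils.py. The constructor swap below assumes AscendUnquantizedLinearMethod310 keeps the no-argument constructor of UnquantizedLinearMethod.

from vllm.model_executor.layers.linear import QKVParallelLinear

from vllm_ascend._310p.ops.linear import AscendUnquantizedLinearMethod310


class AscendQKVParallelLinear310(QKVParallelLinear):
    """Sketch: qkv_proj routed through the 310P-specific weight processing."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Swap in the 310P method so process_weights_after_loading applies the
        # 310P handling (padding, ACL_FORMAT_FRACTAL_NZ cast) to the qkv weights.
        self.quant_method = AscendUnquantizedLinearMethod310()


# In vllm_ascend/utils.py, register it in place of the generic class:
#     "QKVParallelLinear": AscendQKVParallelLinear310,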

@github-actions

👋 Hi! Thank you for contributing to the vLLM Ascend project. The following points will speed up your PR merge:

  • A PR should do only one thing; smaller PRs enable faster reviews.
  • Every PR should include unit tests and end-to-end tests to ensure it works and is not broken by other future PRs.
  • Write the commit message by filling in the PR description, to help reviewers and future developers understand the change.

If CI fails, you can run linting and testing checks locally according to Contributing and Testing.

raise RuntimeError(f"1D param mismatch: param={tuple(param_data.shape)} loaded={tuple(loaded.shape)}")


class AscendUnquantizedLinearMethod310(UnquantizedLinearMethod):


Why does 310P need its own new UnquantizedLinearMethod? How does it differ from other devices?

@@ -0,0 +1,111 @@
import math

Remember to add the license header.

Comment on lines +27 to +29
orig_shape = x.shape
if x.dim() > 2:
    x = x.contiguous().view(-1, orig_shape[-1])

Does this situation actually occur on the 300I DUO? The A2 code has no such logic and runs stably.

Comment on lines -39 to +47
query = query.view(bsz * q_len, self.num_heads, self.head_size)
key = key.view(bsz * kv_len, self.num_kv_heads, self.head_size)
value = value.view(bsz * kv_len, self.num_kv_heads, self.head_size)
head_size_real = int(query.shape[-1])

query = query.view(bsz * q_len, self.num_heads, head_size_real)
key = key.view(bsz * kv_len, self.num_kv_heads, head_size_real)
value = value.view(bsz * kv_len, self.num_kv_heads, head_size_real)

head_size_real = query.shape[-1] = self.head_size, so this is a redundant change.

Comment on lines +61 to +68
scale = getattr(self, "scale", None)
if scale is None:
    head_size_orig = getattr(self, "head_size_orig", None)
    if head_size_orig is not None:
        scale = float(head_size_orig) ** -0.5
    else:
        scale = float(head_size_real) ** -0.5


This is redundant code: class attributes such as self.head_size_orig and self.scale do not exist. Using scale = self.head_size ** -0.5 covers both the padded and un-padded cases.

Comment on lines +210 to +211
self.input_size = int(input_size)
self.output_size = int(output_size)

Redundant explicit type casts.

self.quant_method = quant_config.get_quant_method(self, prefix=prefix)


class AscendColumnParallelLinear310(ColumnParallelLinear):

The ColumnParallelLinear class additionally affects qwen3-next's convolution layers, glm4.1v's patch-merger proj layer, whisper's q_proj layer, and others.

return out, out_bias


class AscendRowParallelLinear310(RowParallelLinear):

The RowParallelLinear class additionally affects dense layers; each model names its down projection differently (down_proj, fc2, dense_4h_to_h, etc.), so the behavior is uncontrollable.

return out, (self.bias if self.skip_bias_add else None)


class AscendMergedColumnParallelLinear310(MergedColumnParallelLinear):

Some models also use MergedColumnParallelLinear for kv_proj.

)


class AscendReplicatedLinear310(ReplicatedLinear):

ReplicatedLinear is used for MoE gating layers, qwen3-next's non-TP-parallel layers, and the proj layers of VL models; padding here will produce abnormal context dimensions in those models.

if x.dim() > 2:
    x = x.contiguous().view(-1, orig_shape[-1])

out = torch_npu.npu_swiglu(x)

The fused operator here yields very little performance benefit, less than 1% end-to-end on the 300I DUO. You could also guard the call, for example:

if x.shape[-1] % 32 == 0:
    out = torch_npu.npu_swiglu(x)
else:
    h = x.shape[-1] // 2
    out = F.silu(x[..., :h]) * x[..., h:]

@github-actions

This pull request has conflicts; please resolve them before we can evaluate the pull request.
