
support glm-4.7-flash #4320

Merged
lvhan028 merged 2 commits into InternLM:main from RunningLeon:glm4-lite
Feb 5, 2026

Conversation

@RunningLeon
Collaborator

Motivation

support zai-org/GLM-4.7-Flash

Requires Issue #4283.
Follows PR #4303.

Modification

Please briefly describe what modification is made in this PR.

BC-breaking (Optional)

Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.

Use cases (Optional)

If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.

Checklist

  1. Pre-commit or other linting tools are used to fix the potential lint issues.
  2. The modification is covered by complete unit tests. If not, please add more unit tests to ensure the correctness.
  3. If the modification has a dependency on downstream projects of a newer version, this PR should be tested with all supported versions of downstream projects.
  4. The documentation has been modified accordingly, like docstring or example tutorials.

Copilot AI review requested due to automatic review settings February 4, 2026 10:53
Contributor

Copilot AI left a comment

Pull request overview

This PR adds support for the glm-4.7-flash family by wiring its Hugging Face config and architecture into the existing DeepSeek v2 implementation, and makes rotary-embedding handling compatible with newer Transformers (v5-style) configurations.

Changes:

  • Introduces config-agnostic helpers get_rope_parameters and get_rope_theta, and updates multiple models (DeepSeek v2/v3.2/MTP, Phi-3-MoE, MiniCPM3, InternLM2-VE, Llama4-vision) and build_rotary_embedding_from_config to use them, improving compatibility with rope_parameters / rope_theta layouts (a minimal sketch of such helpers follows this list).
  • Adds a Glm4MoeLiteModelConfigBuilder and a new module-map entry so that Glm4MoeLiteForCausalLM (GLM-4.7-Flash) is served via the DeepSeek v2 path with appropriate defaults.
  • Tightens FA3 selection in the CUDA attention backend by factoring in head size (head_size <= 256) when deciding to enable FlashAttention-3.
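As noted in the first bullet above, here is a minimal sketch of what such config-agnostic rope helpers could look like. The names get_rope_parameters / get_rope_theta come from this PR, but the bodies below are illustrative assumptions about reconciling the legacy rope_scaling/rope_theta attributes with the newer rope_parameters layout, not the merged lmdeploy implementation.

# Illustrative sketch only; not the actual lmdeploy implementation.
def get_rope_parameters(config):
    """Return the rope scaling/parameters dict from either config layout."""
    # newer transformers-v5-style configs expose `rope_parameters`
    params = getattr(config, 'rope_parameters', None)
    if params is not None:
        return params
    # older configs expose `rope_scaling`
    return getattr(config, 'rope_scaling', None)


def get_rope_theta(config, default: float = 10000.0):
    """Return the rope base frequency regardless of where it is stored."""
    theta = getattr(config, 'rope_theta', None)
    if theta is not None:
        return theta
    params = get_rope_parameters(config) or {}
    return params.get('rope_theta', default)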

Reviewed changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.

Summary per file:

  • lmdeploy/pytorch/nn/rotary_embedding.py: Adds the get_rope_parameters / get_rope_theta helpers and routes all rotary-scaling utilities and build_rotary_embedding_from_config through them to handle both legacy rope_scaling and new rope_parameters/rope_theta configs.
  • lmdeploy/pytorch/models/phi3_moe.py: Switches Phi-3-MoE’s rotary setup to use get_rope_theta and get_rope_parameters so it works with both old and new config layouts.
  • lmdeploy/pytorch/models/module_map.py: Maps Glm4MoeLiteForCausalLM to deepseek_v2.DeepseekV2ForCausalLM to run GLM-4.7-Flash via the DeepSeek v2 implementation.
  • lmdeploy/pytorch/models/minicpm3.py: Updates MiniCPM3 rotary embedding construction to fetch base and scaling via the new rope helper functions.
  • lmdeploy/pytorch/models/llama4.py: Uses get_rope_theta in Llama4VisionRotaryEmbedding so the 2D vision RoPE frequency grid honors both rope_theta and rope_parameters variants.
  • lmdeploy/pytorch/models/internlm2_ve.py: Aligns InternLM2-VE’s rotary setup with get_rope_theta/get_rope_parameters instead of directly using config.rope_scaling / config.rope_theta.
  • lmdeploy/pytorch/models/deepseek_v32.py: Makes DeepSeek-V3.2 attention and model-level rotary embedding read rope parameters via the shared helpers, improving compatibility with updated configs.
  • lmdeploy/pytorch/models/deepseek_v2.py: Updates DeepSeek v2 attention and model rotary embedding construction to use get_rope_parameters/get_rope_theta, including a safer default for a missing scaling factor.
  • lmdeploy/pytorch/models/deepseek_mtp.py: Adjusts DeepSeek multi-token-predictor attention and rotary embedding setup to use the new rope helpers, consistent with DeepSeek v2.
  • lmdeploy/pytorch/configurations/glm4.py: Adds Glm4MoeLiteModelConfigBuilder (subclassing DeepseekV2ModelConfigBuilder) to adapt glm4_moe_lite HF configs for the DeepSeek v2 backend, with reasonable default MoE attributes (a sketch of this wiring follows the list).
  • lmdeploy/pytorch/backends/cuda/attention/__init__.py: Extends the FA3 enablement logic to consider head_size and passes it through from the Triton attention builder when deciding whether to use FlashAttention-3.
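As referenced in the glm4.py entry above, the wiring described there could look roughly like the sketch below. The class and architecture names come from the PR summary; the import path, the model_type match, and the defaulted attribute are assumptions for illustration, not the merged code.

# module_map.py: route the GLM-4.7-Flash architecture onto the DeepSeek v2 model.
MODULE_MAP = {
    'Glm4MoeLiteForCausalLM': 'lmdeploy.pytorch.models.deepseek_v2.DeepseekV2ForCausalLM',
}

# configurations/glm4.py: reuse the DeepSeek v2 config builder and fill in MoE
# defaults that a glm4_moe_lite HF config may not define.
from lmdeploy.pytorch.configurations.deepseek_v2 import DeepseekV2ModelConfigBuilder  # assumed import path


class Glm4MoeLiteModelConfigBuilder(DeepseekV2ModelConfigBuilder):

    @classmethod
    def condition(cls, hf_config):
        # assumed match condition on the HF model_type of GLM-4.7-Flash
        return getattr(hf_config, 'model_type', None) == 'glm4_moe_lite'

    @classmethod
    def build(cls, hf_config, model_path: str = None, **kwargs):
        # hypothetical default: patch a DeepSeek-v2-specific attribute that the
        # GLM config may omit before delegating to the parent builder
        if not hasattr(hf_config, 'topk_method'):
            hf_config.topk_method = 'greedy'
        return super().build(hf_config, model_path, **kwargs)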


Comment on lines 113 to 121
         return dict()

     params = FopeParameters()
-    rope_scaling = config.rope_scaling
+    rope_scaling = get_rope_parameters(config=config)
     params.num_inv_freq = rope_scaling.get('fope_num_inv_freq', rope_scaling.get('num_inv_freq', params.num_inv_freq))
     params.num_key_value_heads = config.num_key_value_heads
     params.fope_sep_head = rope_scaling['fope_sep_head']

Copilot AI Feb 4, 2026

_get_fope_parameters still checks config.rope_scaling directly when determining whether FOPE is enabled, but then reads the actual values via get_rope_parameters. If a model provides FOPE settings through config.rope_parameters (as in newer transformers v5-style configs) without defining rope_scaling, is_fope will always be false and FOPE will never be enabled even though parameters are present. Consider using get_rope_parameters (or a unified helper) consistently in the FOPE-presence check so that both legacy rope_scaling and new rope_parameters-based configs are handled correctly.
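A hedged sketch of the change this comment suggests: derive both the presence check and the values from the same helper, so that rope_parameters-only configs can also enable FOPE. The helper name _has_fope and the key names are illustrative, taken from the quoted snippet rather than the actual code.

def _has_fope(config) -> bool:
    # look in whichever layout the config uses, via the same helper that
    # later reads the FOPE values
    rope_params = get_rope_parameters(config=config) or {}
    return 'fope_sep_head' in rope_params or 'fope_num_inv_freq' in rope_params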

Comment on lines 124 to 129
if rope_type == 'linear':
    emb_type = RopeType.LinearScaling
if rope_type == 'dynamic':
    emb_type = RopeType.DynamicNTKScaling
else:
    raise RuntimeError(f'Unsupported rope type: {rope_type}')

Copilot AI Feb 4, 2026

The rope type selection logic uses two separate if statements with the else attached only to the second one, so a valid rope_type == 'linear' will still fall through to the else and raise RuntimeError('Unsupported rope type: linear'). To avoid incorrectly rejecting supported types, the second branch should be elif rope_type == 'dynamic': (or equivalent mutually exclusive logic) so that only truly unknown rope types hit the error path.
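For clarity, the shape of the fix the comment asks for, using the names from the quoted snippet:

# mutually exclusive branches: only truly unknown rope types reach the error path
if rope_type == 'linear':
    emb_type = RopeType.LinearScaling
elif rope_type == 'dynamic':
    emb_type = RopeType.DynamicNTKScaling
else:
    raise RuntimeError(f'Unsupported rope type: {rope_type}')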

Comment on lines 35 to 47
@@ -44,7 +44,7 @@ def _enable_fa3(alibi: bool, learnable_sink: bool, block_sparse_size: int):
     Returns:
         True if FA3 should be enabled, False otherwise.
     """
-    enable = not alibi and not learnable_sink and block_sparse_size == 1
+    enable = not alibi and not learnable_sink and block_sparse_size == 1 and head_size <= 256

Copilot AI Feb 4, 2026

The documentation for _enable_fa3 lists the conditions for enabling FA3 but does not mention the new head_size <= 256 requirement that is enforced in the implementation. To keep the behavior discoverable and avoid confusion when larger head sizes silently disable FA3, please update the docstring bullet list to include the head-size constraint.
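A possible shape for the docstring update this comment asks for; the wording of the condition list is illustrative, and the head_size parameter is taken from the diff above.

def _enable_fa3(alibi: bool, learnable_sink: bool, block_sparse_size: int, head_size: int):
    """Check whether FlashAttention-3 can be used.

    FA3 is enabled only when all of the following hold:
      * alibi is not used,
      * there is no learnable attention sink,
      * block_sparse_size == 1,
      * head_size <= 256 (larger head sizes fall back to other attention backends).

    Returns:
        True if FA3 should be enabled, False otherwise.
    """
    return not alibi and not learnable_sink and block_sparse_size == 1 and head_size <= 256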

lvhan028 added the enhancement (New feature or request) label on Feb 4, 2026
lvhan028 requested a review from CUHKSZzxy on February 4, 2026 at 12:18
@lvhan028
Collaborator

lvhan028 commented Feb 5, 2026

Please resolve the conflicts.

lvhan028 requested a review from grimoire on February 5, 2026 at 02:57
@lvhan028
Collaborator

lvhan028 commented Feb 5, 2026

Please also update the supported models in the README and user guide.

@lvhan028
Collaborator

lvhan028 commented Feb 5, 2026

@zhulinJulia24 may add glm-4.7-flash into test cases

lvhan028 merged commit 958acbd into InternLM:main on Feb 5, 2026
5 checks passed

Labels

enhancement New feature or request
