Conversation
Pull request overview
This PR adds support for the glm-4.7-flash family by wiring its Hugging Face config and architecture into the existing DeepSeek v2 implementation, and makes rotary-embedding handling compatible with newer Transformers (v5-style) configurations.
Changes:
- Introduces config-agnostic helpers `get_rope_parameters` and `get_rope_theta`, and updates multiple models (DeepSeek v2/v3.2/MTP, Phi-3-MoE, MiniCPM3, InternLM2-VE, Llama4-vision) and `build_rotary_embedding_from_config` to use them, improving compatibility with `rope_parameters`/`rope_theta` layouts (see the sketch after this list).
- Adds a `Glm4MoeLiteModelConfigBuilder` and a new module-map entry so that `Glm4MoeLiteForCausalLM` (GLM-4.7-Flash) is served via the DeepSeek v2 path with appropriate defaults.
- Tightens FA3 selection in the CUDA attention backend by factoring in head size (`head_size <= 256`) when deciding to enable FlashAttention-3.
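For context, here is a minimal sketch of what such config-agnostic rope helpers could look like. This is an illustration only; the actual helpers in `lmdeploy/pytorch/nn/rotary_embedding.py` may differ in naming and fallback order.

```python
# Illustrative sketch only -- not the actual lmdeploy implementation.
# `config` is assumed to be a Hugging Face PretrainedConfig-like object.

def get_rope_parameters(config):
    """Return the rope scaling/parameters dict, preferring the newer
    transformers-v5 style `rope_parameters` over the legacy `rope_scaling`."""
    params = getattr(config, 'rope_parameters', None)
    if params is None:
        params = getattr(config, 'rope_scaling', None)
    return params


def get_rope_theta(config, default: float = 10000.0):
    """Return the rotary base, falling back from `rope_theta` to a value
    nested inside the rope parameters dict, then to a default."""
    theta = getattr(config, 'rope_theta', None)
    if theta is None:
        params = get_rope_parameters(config) or {}
        theta = params.get('rope_theta', default)
    return theta
```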
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `lmdeploy/pytorch/nn/rotary_embedding.py` | Adds `get_rope_parameters` / `get_rope_theta` helpers and routes all rotary-scaling utilities and `build_rotary_embedding_from_config` through them to handle both legacy `rope_scaling` and new `rope_parameters`/`rope_theta` configs. |
| `lmdeploy/pytorch/models/phi3_moe.py` | Switches Phi-3-MoE’s rotary setup to use `get_rope_theta` and `get_rope_parameters` so it works with both old and new config layouts. |
| `lmdeploy/pytorch/models/module_map.py` | Maps `Glm4MoeLiteForCausalLM` to `deepseek_v2.DeepseekV2ForCausalLM` to run GLM-4.7-Flash via the DeepSeek v2 implementation. |
| `lmdeploy/pytorch/models/minicpm3.py` | Updates MiniCPM3 rotary embedding construction to fetch base and scaling via the new rope helper functions. |
| `lmdeploy/pytorch/models/llama4.py` | Uses `get_rope_theta` in `Llama4VisionRotaryEmbedding` so the 2D vision RoPE frequency grid honors both `rope_theta` and `rope_parameters` variants. |
| `lmdeploy/pytorch/models/internlm2_ve.py` | Aligns InternLM2-VE’s rotary setup with `get_rope_theta`/`get_rope_parameters` instead of directly using `config.rope_scaling` / `config.rope_theta`. |
| `lmdeploy/pytorch/models/deepseek_v32.py` | Makes DeepSeek-V3.2 attention and model-level rotary embedding read rope parameters via the shared helpers, improving compatibility with updated configs. |
| `lmdeploy/pytorch/models/deepseek_v2.py` | Updates DeepSeek v2 attention and model rotary embedding construction to use `get_rope_parameters`/`get_rope_theta`, including a safer default for a missing `factor`. |
| `lmdeploy/pytorch/models/deepseek_mtp.py` | Adjusts DeepSeek multi-token-predictor attention and rotary embedding setup to use the new rope helpers, consistent with DeepSeek v2. |
| `lmdeploy/pytorch/configurations/glm4.py` | Adds `Glm4MoeLiteModelConfigBuilder` (subclassing `DeepseekV2ModelConfigBuilder`) to adapt `glm4_moe_lite` HF configs for the DeepSeek v2 backend, with reasonable default MoE attributes. |
| `lmdeploy/pytorch/backends/cuda/attention/__init__.py` | Extends the FA3 enablement logic to consider `head_size` and passes it through from the Triton attention builder when deciding whether to use FlashAttention-3. |
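The module-map change itself amounts to a one-line routing entry; roughly as below (the real dictionary name and module prefix in lmdeploy may differ, this only illustrates the wiring):

```python
# Illustrative only: route the GLM-4.7-Flash architecture name onto the
# existing DeepSeek v2 model class.
MODULE_MAP = {
    'Glm4MoeLiteForCausalLM': 'deepseek_v2.DeepseekV2ForCausalLM',
}
```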
```diff
         return dict()

     params = FopeParameters()
-    rope_scaling = config.rope_scaling
+    rope_scaling = get_rope_parameters(config=config)
     params.num_inv_freq = rope_scaling.get('fope_num_inv_freq', rope_scaling.get('num_inv_freq', params.num_inv_freq))
     params.num_key_value_heads = config.num_key_value_heads
     params.fope_sep_head = rope_scaling['fope_sep_head']
```
_get_fope_parameters still checks config.rope_scaling directly when determining whether FOPE is enabled, but then reads the actual values via get_rope_parameters. If a model provides FOPE settings through config.rope_parameters (as in newer transformers v5-style configs) without defining rope_scaling, is_fope will always be false and FOPE will never be enabled even though parameters are present. Consider using get_rope_parameters (or a unified helper) consistently in the FOPE-presence check so that both legacy rope_scaling and new rope_parameters-based configs are handled correctly.
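One possible shape of the fix, assuming the presence check keys off the FOPE fields themselves (the actual gating condition in the PR is not shown in the quoted hunk):

```python
def _get_fope_parameters(config):
    # Use the unified helper for both the presence check and the value reads,
    # so configs that only define `rope_parameters` also enable FOPE.
    rope_scaling = get_rope_parameters(config=config) or {}
    if 'fope_sep_head' not in rope_scaling:  # assumed FOPE marker
        return dict()

    params = FopeParameters()
    params.num_inv_freq = rope_scaling.get('fope_num_inv_freq',
                                           rope_scaling.get('num_inv_freq', params.num_inv_freq))
    params.num_key_value_heads = config.num_key_value_heads
    params.fope_sep_head = rope_scaling['fope_sep_head']
    return params
```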
```python
if rope_type == 'linear':
    emb_type = RopeType.LinearScaling
if rope_type == 'dynamic':
    emb_type = RopeType.DynamicNTKScaling
else:
    raise RuntimeError(f'Unsupported rope type: {rope_type}')
```
The rope type selection logic uses two separate if statements with the else attached only to the second one, so a valid rope_type == 'linear' will still fall through to the else and raise RuntimeError('Unsupported rope type: linear'). To avoid incorrectly rejecting supported types, the second branch should be elif rope_type == 'dynamic': (or equivalent mutually exclusive logic) so that only truly unknown rope types hit the error path.
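With the names from the quoted hunk, the mutually exclusive version would read:

```python
if rope_type == 'linear':
    emb_type = RopeType.LinearScaling
elif rope_type == 'dynamic':
    emb_type = RopeType.DynamicNTKScaling
else:
    raise RuntimeError(f'Unsupported rope type: {rope_type}')
```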
```diff
@@ -44,7 +44,7 @@ def _enable_fa3(alibi: bool, learnable_sink: bool, block_sparse_size: int):
     Returns:
         True if FA3 should be enabled, False otherwise.
     """
-    enable = not alibi and not learnable_sink and block_sparse_size == 1
+    enable = not alibi and not learnable_sink and block_sparse_size == 1 and head_size <= 256
```
The documentation for _enable_fa3 lists the conditions for enabling FA3 but does not mention the new head_size <= 256 requirement that is enforced in the implementation. To keep the behavior discoverable and avoid confusion when larger head sizes silently disable FA3, please update the docstring bullet list to include the head-size constraint.
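For illustration, the docstring could be brought in line with the new condition. The signature below assumes `head_size` is passed in as the diff suggests; the real parameter order and return style in lmdeploy may differ.

```python
def _enable_fa3(alibi: bool, learnable_sink: bool, block_sparse_size: int, head_size: int) -> bool:
    """Check whether FlashAttention-3 can be used.

    FA3 is enabled only when alibi is not used, there is no learnable sink,
    block_sparse_size == 1, and head_size <= 256.

    Returns:
        True if FA3 should be enabled, False otherwise.
    """
    return not alibi and not learnable_sink and block_sparse_size == 1 and head_size <= 256
```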
Please resolve the conflicts.
Force-pushed from 94063d9 to d7d5cf1.
Please update the supported models in the README and user guide as well.
@zhulinJulia24 may add glm-4.7-flash to the test cases.
Motivation
Support zai-org/GLM-4.7-Flash.

Requires issue #4283.
To be merged after PR #4303.
Modification
- Adds config-agnostic `get_rope_parameters`/`get_rope_theta` helpers and routes the rotary-embedding setup of DeepSeek v2/v3.2/MTP, Phi-3-MoE, MiniCPM3, InternLM2-VE and Llama4-vision through them, so both legacy `rope_scaling` and newer `rope_parameters`/`rope_theta` config layouts are handled.
- Adds a `Glm4MoeLiteModelConfigBuilder` and a module-map entry so that `Glm4MoeLiteForCausalLM` (GLM-4.7-Flash) is served via the DeepSeek v2 implementation.
- Restricts FlashAttention-3 selection in the CUDA attention backend to `head_size <= 256`.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist