Conversation
Pull request overview
This PR adds support for the glm-4.7-flash family by wiring its Hugging Face config and architecture into the existing DeepSeek v2 implementation, and makes rotary-embedding handling compatible with newer Transformers (v5-style) configurations.
Changes:
- Introduces config-agnostic helpers `get_rope_parameters` and `get_rope_theta`, and updates multiple models (DeepSeek v2/v3.2/MTP, Phi-3-MoE, MiniCPM3, InternLM2-VE, Llama4-vision) and `build_rotary_embedding_from_config` to use them, improving compatibility with `rope_parameters`/`rope_theta` layouts (see the sketch after this list).
- Adds a `Glm4MoeLiteModelConfigBuilder` and a new module-map entry so that `Glm4MoeLiteForCausalLM` (GLM-4.7-Flash) is served via the DeepSeek v2 path with appropriate defaults.
- Tightens FA3 selection in the CUDA attention backend by factoring in head size (`head_size <= 256`) when deciding to enable FlashAttention-3.
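For context, here is a minimal sketch of what such config-agnostic rope helpers could look like. This is an illustration only; the actual helpers in `lmdeploy/pytorch/nn/rotary_embedding.py` may differ in naming and fallback order.

```python
# Illustrative sketch only -- not the actual lmdeploy implementation.
# `config` is assumed to be a Hugging Face PretrainedConfig-like object.

def get_rope_parameters(config):
    """Return the rope scaling/parameters dict, preferring the newer
    transformers-v5 style `rope_parameters` over the legacy `rope_scaling`."""
    params = getattr(config, 'rope_parameters', None)
    if params is None:
        params = getattr(config, 'rope_scaling', None)
    return params


def get_rope_theta(config, default: float = 10000.0):
    """Return the rotary base, falling back from `rope_theta` to a value
    nested inside the rope parameters dict, then to a default."""
    theta = getattr(config, 'rope_theta', None)
    if theta is None:
        params = get_rope_parameters(config) or {}
        theta = params.get('rope_theta', default)
    return theta
```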
Reviewed changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| `lmdeploy/pytorch/nn/rotary_embedding.py` | Adds `get_rope_parameters` / `get_rope_theta` helpers and routes all rotary-scaling utilities and `build_rotary_embedding_from_config` through them to handle both legacy `rope_scaling` and new `rope_parameters`/`rope_theta` configs. |
| `lmdeploy/pytorch/models/phi3_moe.py` | Switches Phi-3-MoE’s rotary setup to use `get_rope_theta` and `get_rope_parameters` so it works with both old and new config layouts. |
| `lmdeploy/pytorch/models/module_map.py` | Maps `Glm4MoeLiteForCausalLM` to `deepseek_v2.DeepseekV2ForCausalLM` to run GLM-4.7-Flash via the DeepSeek v2 implementation. |
| `lmdeploy/pytorch/models/minicpm3.py` | Updates MiniCPM3 rotary embedding construction to fetch base and scaling via the new rope helper functions. |
| `lmdeploy/pytorch/models/llama4.py` | Uses `get_rope_theta` in `Llama4VisionRotaryEmbedding` so the 2D vision RoPE frequency grid honors both `rope_theta` and `rope_parameters` variants. |
| `lmdeploy/pytorch/models/internlm2_ve.py` | Aligns InternLM2-VE’s rotary setup with `get_rope_theta`/`get_rope_parameters` instead of directly using `config.rope_scaling` / `config.rope_theta`. |
| `lmdeploy/pytorch/models/deepseek_v32.py` | Makes DeepSeek-V3.2 attention and model-level rotary embedding read rope parameters via the shared helpers, improving compatibility with updated configs. |
| `lmdeploy/pytorch/models/deepseek_v2.py` | Updates DeepSeek v2 attention and model rotary embedding construction to use `get_rope_parameters`/`get_rope_theta`, including a safer default for a missing `factor`. |
| `lmdeploy/pytorch/models/deepseek_mtp.py` | Adjusts DeepSeek multi-token-predictor attention and rotary embedding setup to use the new rope helpers, consistent with DeepSeek v2. |
| `lmdeploy/pytorch/configurations/glm4.py` | Adds `Glm4MoeLiteModelConfigBuilder` (subclassing `DeepseekV2ModelConfigBuilder`) to adapt `glm4_moe_lite` HF configs for the DeepSeek v2 backend, with reasonable default MoE attributes. |
| `lmdeploy/pytorch/backends/cuda/attention/__init__.py` | Extends the FA3 enablement logic to consider `head_size` and passes it through from the Triton attention builder when deciding whether to use FlashAttention-3. |
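The module-map change itself amounts to a one-line routing entry; roughly as below (the real dictionary name and module prefix in lmdeploy may differ, this only illustrates the wiring):

```python
# Illustrative only: route the GLM-4.7-Flash architecture name onto the
# existing DeepSeek v2 model class.
MODULE_MAP = {
    'Glm4MoeLiteForCausalLM': 'deepseek_v2.DeepseekV2ForCausalLM',
}
```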
```diff
         return dict()

     params = FopeParameters()
-    rope_scaling = config.rope_scaling
+    rope_scaling = get_rope_parameters(config=config)
     params.num_inv_freq = rope_scaling.get('fope_num_inv_freq', rope_scaling.get('num_inv_freq', params.num_inv_freq))
     params.num_key_value_heads = config.num_key_value_heads
     params.fope_sep_head = rope_scaling['fope_sep_head']
```
_get_fope_parameters still checks config.rope_scaling directly when determining whether FOPE is enabled, but then reads the actual values via get_rope_parameters. If a model provides FOPE settings through config.rope_parameters (as in newer transformers v5-style configs) without defining rope_scaling, is_fope will always be false and FOPE will never be enabled even though parameters are present. Consider using get_rope_parameters (or a unified helper) consistently in the FOPE-presence check so that both legacy rope_scaling and new rope_parameters-based configs are handled correctly.
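One possible shape of the fix, assuming the presence check keys off the FOPE fields themselves (the actual gating condition in the PR is not shown in the quoted hunk):

```python
def _get_fope_parameters(config):
    # Use the unified helper for both the presence check and the value reads,
    # so configs that only define `rope_parameters` also enable FOPE.
    rope_scaling = get_rope_parameters(config=config) or {}
    if 'fope_sep_head' not in rope_scaling:  # assumed FOPE marker
        return dict()

    params = FopeParameters()
    params.num_inv_freq = rope_scaling.get('fope_num_inv_freq',
                                           rope_scaling.get('num_inv_freq', params.num_inv_freq))
    params.num_key_value_heads = config.num_key_value_heads
    params.fope_sep_head = rope_scaling['fope_sep_head']
    return params
```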
```python
if rope_type == 'linear':
    emb_type = RopeType.LinearScaling
if rope_type == 'dynamic':
    emb_type = RopeType.DynamicNTKScaling
else:
    raise RuntimeError(f'Unsupported rope type: {rope_type}')
```
The rope type selection logic uses two separate if statements with the else attached only to the second one, so a valid rope_type == 'linear' will still fall through to the else and raise RuntimeError('Unsupported rope type: linear'). To avoid incorrectly rejecting supported types, the second branch should be elif rope_type == 'dynamic': (or equivalent mutually exclusive logic) so that only truly unknown rope types hit the error path.
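With the names from the quoted hunk, the mutually exclusive version would read:

```python
if rope_type == 'linear':
    emb_type = RopeType.LinearScaling
elif rope_type == 'dynamic':
    emb_type = RopeType.DynamicNTKScaling
else:
    raise RuntimeError(f'Unsupported rope type: {rope_type}')
```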
```diff
@@ -44,7 +44,7 @@ def _enable_fa3(alibi: bool, learnable_sink: bool, block_sparse_size: int):
     Returns:
         True if FA3 should be enabled, False otherwise.
     """
-    enable = not alibi and not learnable_sink and block_sparse_size == 1
+    enable = not alibi and not learnable_sink and block_sparse_size == 1 and head_size <= 256
```
The documentation for _enable_fa3 lists the conditions for enabling FA3 but does not mention the new head_size <= 256 requirement that is enforced in the implementation. To keep the behavior discoverable and avoid confusion when larger head sizes silently disable FA3, please update the docstring bullet list to include the head-size constraint.
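For illustration, the docstring could be brought in line with the new condition. The signature below assumes `head_size` is passed in as the diff suggests; the real parameter order and return style in lmdeploy may differ.

```python
def _enable_fa3(alibi: bool, learnable_sink: bool, block_sparse_size: int, head_size: int) -> bool:
    """Check whether FlashAttention-3 can be used.

    FA3 is enabled only when alibi is not used, there is no learnable sink,
    block_sparse_size == 1, and head_size <= 256.

    Returns:
        True if FA3 should be enabled, False otherwise.
    """
    return not alibi and not learnable_sink and block_sparse_size == 1 and head_size <= 256
```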
Please resolve the conflicts.
Force-pushed from 94063d9 to d7d5cf1.
Please update the supported models in the README and user guide as well.
@zhulinJulia24 may add glm-4.7-flash to the test cases.
Motivation
Support zai-org/GLM-4.7-Flash.

Requires issue #4283.
To be merged after PR #4303.
Modification
- Adds config-agnostic `get_rope_parameters`/`get_rope_theta` helpers and routes the rotary-embedding setup of DeepSeek v2/v3.2/MTP, Phi-3-MoE, MiniCPM3, InternLM2-VE and Llama4-vision through them, so both legacy `rope_scaling` and newer `rope_parameters`/`rope_theta` config layouts are handled.
- Adds a `Glm4MoeLiteModelConfigBuilder` and a module-map entry so that `Glm4MoeLiteForCausalLM` (GLM-4.7-Flash) is served via the DeepSeek v2 implementation.
- Restricts FlashAttention-3 selection in the CUDA attention backend to `head_size <= 256`.
BC-breaking (Optional)
Does the modification introduce changes that break the backward-compatibility of the downstream repositories?
If so, please describe how it breaks the compatibility and how the downstream projects should modify their code to keep compatibility with this PR.
Use cases (Optional)
If this PR introduces a new feature, it is better to list some use cases here, and update the documentation.
Checklist