fix: Ensure full gradient reduction for Muon with reduce_scatter #7808
nathon-lee wants to merge 11 commits into deepspeedai:master from
Conversation
force-pushed from bc2d301 to ceb84ba
@sfc-gh-truwase Thanks for the review and suggestion! I've updated the implementation to detect Muon usage during initialization and added an assertion to prevent incompatible configurations with reduce_scatter. Also simplified the average_tensor method using the pre-detected flag. Let me know if any further changes are needed!
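The described init-time check might look like the following sketch. The function name, the `use_muon` param-group key, and the flag handling are illustrative assumptions, not the actual DeepSpeed code:

```python
def check_muon_compat(param_groups, reduce_scatter_enabled):
    """Hypothetical helper: detect Muon usage at init time and reject
    the incompatible reduce_scatter configuration up front."""
    # Muon's Newton-Schulz step needs the full gradient on every rank.
    uses_muon = any(g.get("use_muon", False) for g in param_groups)
    assert not (uses_muon and reduce_scatter_enabled), (
        "Muon requires full all-reduce gradient reduction; "
        "set reduce_scatter=false")
    return uses_muon
```

With the flag pre-computed once, `average_tensor` only needs a cheap boolean check instead of re-scanning the param groups on every reduction.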
force-pushed from 08f4845 to 44fc221
force-pushed from 9ee47b4 to e0248ce
force-pushed from 2a0f659 to 1979f00
I've made simple formatting adjustments to comply with the project's YAPF style requirements, including fixing the indentation to use 4 spaces consistently.

@nathon-lee please see this formatting guide: https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites
```python
self.low_precision_master_weights_and_grads = self.master_weights_and_grads_dtype != torch.float32
# Check for Muon optimizer usage
self.uses_muon = any(
```
I think it would be better to maintain this state at per-param-group granularity.
Thank you for your valuable feedback! I appreciate you pointing out the need for per-parameter group tracking. I'll implement the Muon state management at the parameter group level as suggested and reference PR #7776 to ensure alignment with the project's architecture. Let me know if you need any further adjustments!
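Per-param-group tracking, as the reviewer suggests, could be sketched like this. The function names and the `use_muon` key are assumptions for illustration, not DeepSpeed's actual implementation:

```python
def muon_flags_per_group(param_groups):
    """Track Muon usage per param group instead of one engine-wide flag."""
    return [bool(g.get("use_muon", False)) for g in param_groups]

def reduction_op_per_group(param_groups):
    # Groups updated by Muon need the full gradient (all-reduce);
    # the rest can still use the cheaper reduce-scatter path.
    return ["all_reduce" if uses_muon else "reduce_scatter"
            for uses_muon in muon_flags_per_group(param_groups)]
```

The per-group granularity matters because Muon is typically applied only to 2-D weight matrices, while embeddings, biases, and norms sit in a separate group using a different optimizer.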
I think that is all that is needed. Please ping me when ready for review again. Thanks!
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Use ZeRO stage 1 to use BF16 optimizer. (We should have switched to ZeRO1 in deepspeedai#7788, but I missed the change. @sfc-gh-truwase)
- deepspeedai#7790 removed the fallback that allowed bf16 model + fp32 grad accumulation without ZeRO, so that combo now raises NotImplementedError.
- deepspeedai#7788 changed test_bf16_optimizer_fragments to force BF16_Optimizer by setting grad_accum_dtype=fp32, but it kept ZeRO stage 0, which is now invalid after deepspeedai#7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
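A minimal config matching that commit description might look like the sketch below (illustrative values; the batch size is an assumption, and only the `zero_optimization`, `bf16`, and `data_types` sections reflect the combination the commit message refers to):

```python
# Sketch of a DeepSpeed config that forces BF16_Optimizer under ZeRO-1:
# bf16 weights with fp32 gradient accumulation now requires ZeRO stage >= 1.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},  # stage 0 would raise NotImplementedError
    "bf16": {"enabled": True},
    "data_types": {"grad_accum_dtype": "fp32"},
}
```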
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Evoformer tests fail when we run them in parallel with other tests:
```
RuntimeError: Cannot re-initialize CUDA in forked subprocess.
```
This PR adds `@pytest.mark.sequential` to the tests. See the full test log for details: https://github.com/deepspeedai/DeepSpeed/actions/runs/21303530770/job/61326548592

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Fix deepspeedai#7812: This PR makes DeepSpeedEngine cleanup safe for partial initialization, preventing destructor-time tracebacks by guarding access to uninitialized attributes of the DeepSpeed engine.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
force-pushed from aba6a8d to 6c06319
fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter
This commit addresses an issue where cross-partition parameters received incorrect updates when using ZeRO-1/ZeRO-2 with reduce_scatter=true and the Muon optimizer. Muon's Newton-Schulz orthogonalization requires the complete gradient matrix, which is not available on any single rank when reduce_scatter is enabled.
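To see why the full gradient matters: Newton-Schulz orthogonalization operates on the whole 2-D gradient matrix, so a rank holding only a reduce-scatter shard cannot compute it. Below is a NumPy sketch of the iteration, with coefficients from the public Muon reference implementation; this is an illustration, not DeepSpeed's code:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize the full gradient matrix G."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T                    # mixes every entry of the gradient
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

The `X @ X.T` product couples all rows and columns of the gradient, which is exactly the information a single reduce-scatter shard lacks.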
The fix introduces a check for Muon parameters and forces full all-reduce gradient reduction for these cases, ensuring consistent parameter updates across all ranks.
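The resulting reduction choice can be illustrated with a toy NumPy simulation of the two collectives. This stands in for `torch.distributed`; the function and its shard layout are illustrative only:

```python
import numpy as np

def reduce_gradient(local_grads, uses_muon):
    """Toy simulation: all-reduce gives every rank the full averaged
    gradient; reduce-scatter leaves each rank only its own shard."""
    world = len(local_grads)
    full = sum(local_grads) / world                  # averaged full gradient
    if uses_muon:
        return [full.copy() for _ in range(world)]   # forced all-reduce
    shards = np.array_split(full, world)             # reduce-scatter path
    return [shards[rank] for rank in range(world)]
```

With Muon, every rank receives the complete averaged gradient and can run Newton-Schulz consistently; without it, each rank keeps only the shard it owns, which is cheaper but insufficient for Muon.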
Closes #7807