fix: Ensure full gradient reduction for Muon with reduce_scatter #7808
nathon-lee wants to merge 11 commits into deepspeedai:master from
Conversation
force-pushed from bc2d301 to ceb84ba
@sfc-gh-truwase Thanks for the review and suggestion! I've updated the implementation to detect Muon usage during initialization and added an assertion to prevent incompatible configurations with reduce_scatter. Also simplified the average_tensor method using the pre-detected flag. Let me know if any further changes are needed!
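The described init-time check might look like the following sketch. The function name, the `use_muon` param-group key, and the flag handling are illustrative assumptions, not the actual DeepSpeed code:

```python
def check_muon_compat(param_groups, reduce_scatter_enabled):
    """Hypothetical helper: detect Muon usage at init time and reject
    the incompatible reduce_scatter configuration up front."""
    # Muon's Newton-Schulz step needs the full gradient on every rank.
    uses_muon = any(g.get("use_muon", False) for g in param_groups)
    assert not (uses_muon and reduce_scatter_enabled), (
        "Muon requires full all-reduce gradient reduction; "
        "set reduce_scatter=false")
    return uses_muon
```

With the flag pre-computed once, `average_tensor` only needs a cheap boolean check instead of re-scanning the param groups on every reduction.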
force-pushed from 08f4845 to 44fc221
force-pushed from 9ee47b4 to e0248ce
force-pushed from 2a0f659 to 1979f00
I've made simple formatting adjustments to comply with the project's YAPF style requirements, including fixing the indentation to use 4 spaces consistently.

@nathon-lee please see this formatting guide: https://github.com/deepspeedai/DeepSpeed/blob/master/CONTRIBUTING.md#prerequisites
```python
self.low_precision_master_weights_and_grads = self.master_weights_and_grads_dtype != torch.float32
# Check for Muon optimizer usage
self.uses_muon = any(
```
I think it would be better to maintain this state at per-param-group granularity.
Thank you for your valuable feedback! I appreciate you pointing out the need for per-parameter group tracking. I'll implement the Muon state management at the parameter group level as suggested and reference PR #7776 to ensure alignment with the project's architecture. Let me know if you need any further adjustments!
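Per-param-group tracking, as the reviewer suggests, could be sketched like this. The function names and the `use_muon` key are assumptions for illustration, not DeepSpeed's actual implementation:

```python
def muon_flags_per_group(param_groups):
    """Track Muon usage per param group instead of one engine-wide flag."""
    return [bool(g.get("use_muon", False)) for g in param_groups]

def reduction_op_per_group(param_groups):
    # Groups updated by Muon need the full gradient (all-reduce);
    # the rest can still use the cheaper reduce-scatter path.
    return ["all_reduce" if uses_muon else "reduce_scatter"
            for uses_muon in muon_flags_per_group(param_groups)]
```

The per-group granularity matters because Muon is typically applied only to 2-D weight matrices, while embeddings, biases, and norms sit in a separate group using a different optimizer.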
I think that is all that is needed. Please ping me when ready for review again. Thanks!
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Use ZeRO stage 1 to use BF16 optimizer. (We should have switched to ZeRO1 in deepspeedai#7788, but I missed the change. @sfc-gh-truwase)
- deepspeedai#7790 removed the fallback that allowed bf16 model + fp32 grad accumulation without ZeRO, so that combo now raises NotImplementedError.
- deepspeedai#7788 changed test_bf16_optimizer_fragments to force BF16_Optimizer by setting grad_accum_dtype=fp32, but it kept ZeRO stage 0, which is now invalid after deepspeedai#7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
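A minimal config matching that commit description might look like the sketch below (illustrative values; the batch size is an assumption, and only the `zero_optimization`, `bf16`, and `data_types` sections reflect the combination the commit message refers to):

```python
# Sketch of a DeepSpeed config that forces BF16_Optimizer under ZeRO-1:
# bf16 weights with fp32 gradient accumulation now requires ZeRO stage >= 1.
ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 1},  # stage 0 would raise NotImplementedError
    "bf16": {"enabled": True},
    "data_types": {"grad_accum_dtype": "fp32"},
}
```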
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Evoformer tests fail when we run them in parallel with other tests:
```
RuntimeError: Cannot re-initialize CUDA in forked subprocess.
```
This PR adds `@pytest.mark.sequential` to the tests. See the full test log for details: https://github.com/deepspeedai/DeepSpeed/actions/runs/21303530770/job/61326548592

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Fix deepspeedai#7812: This PR makes DeepSpeedEngine cleanup safe for partial initialization, preventing destructor-time tracebacks by guarding access to uninitialized attributes of the DeepSpeed engine.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
force-pushed from aba6a8d to 6c06319
fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter
This commit addresses an issue where cross-partition parameters received incorrect updates when using ZeRO-1/ZeRO-2 with reduce_scatter=true and the Muon optimizer. Muon's Newton-Schulz orthogonalization requires the complete gradient matrix, which is not available on any single rank when reduce_scatter is enabled.
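To see why the full gradient matters: Newton-Schulz orthogonalization operates on the whole 2-D gradient matrix, so a rank holding only a reduce-scatter shard cannot compute it. Below is a NumPy sketch of the iteration, with coefficients from the public Muon reference implementation; this is an illustration, not DeepSpeed's code:

```python
import numpy as np

def newton_schulz(G, steps=5, eps=1e-7):
    """Approximately orthogonalize the full gradient matrix G."""
    a, b, c = 3.4445, -4.7750, 2.0315  # quintic iteration coefficients
    X = G / (np.linalg.norm(G) + eps)  # normalize so the iteration converges
    transposed = G.shape[0] > G.shape[1]
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T                    # mixes every entry of the gradient
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```

The `X @ X.T` product couples all rows and columns of the gradient, which is exactly the information a single reduce-scatter shard lacks.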
The fix introduces a check for Muon parameters and forces full all-reduce gradient reduction for these cases, ensuring consistent parameter updates across all ranks.
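The resulting reduction choice can be illustrated with a toy NumPy simulation of the two collectives. This stands in for `torch.distributed`; the function and its shard layout are illustrative only:

```python
import numpy as np

def reduce_gradient(local_grads, uses_muon):
    """Toy simulation: all-reduce gives every rank the full averaged
    gradient; reduce-scatter leaves each rank only its own shard."""
    world = len(local_grads)
    full = sum(local_grads) / world                  # averaged full gradient
    if uses_muon:
        return [full.copy() for _ in range(world)]   # forced all-reduce
    shards = np.array_split(full, world)             # reduce-scatter path
    return [shards[rank] for rank in range(world)]
```

With Muon, every rank receives the complete averaged gradient and can run Newton-Schulz consistently; without it, each rank keeps only the shard it owns, which is cheaper but insufficient for Muon.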
Closes #7807