[Draft] Muon Optimizer Support for ZeRO3 #7798
PKUWZP wants to merge 15 commits into deepspeedai:master from
Conversation
…s non contiguous version + test + everything else
PKUWZP
left a comment
@pengdurice Also two more comments:
- It seems that we have excessive tensor allocations: multiple torch.empty, torch.zeros, and .clone() calls create memory pressure. Consider reusing buffers where possible.
- Synchronous all_gather: the distributed operations could potentially be overlapped with computation.
I think we need to rework the PR; let's take some time to refine the code.
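As a sketch of the second point above: reusing a preallocated gather buffer and launching the collective with async_op=True lets independent work overlap with communication. The names and the placeholder computation below are illustrative only, not the PR's code.

```python
# Illustrative sketch only (not the PR's code): reuse preallocated gather buffers
# and overlap the all_gather with independent computation via async_op=True.
import torch
import torch.distributed as dist

def gather_momentum_with_overlap(local_momentum, local_grad, gathered_buffers):
    # gathered_buffers: a list of world_size preallocated tensors, reused across
    # steps so we avoid fresh torch.empty allocations on every call.
    work = dist.all_gather(gathered_buffers, local_momentum, async_op=True)

    # Do work that does not depend on the gathered momentum while the collective runs.
    scaled_grad = local_grad * 1.0  # placeholder computation

    work.wait()  # block only when the full momentum is actually needed
    return torch.cat(gathered_buffers), scaled_grad
```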
self.optimizer_swapper.swap_in_optimizer_state(parameter=self.fp32_partitioned_groups_flat[i])
for idx, dest_offset in params_to_subgroup_maps[i]:
    momentum_buffer[idx] = self.optimizer.state[self.fp32_partitioned_groups_flat[i]]["momentum_buffer"].narrow(0, dest_offset, param.partition_numel()).clone()
self.optimizer_swapper.swap_out_optimizer_state(parameter=self.fp32_partitioned_groups_flat[i])
@pengdurice Here is a bug. The variable param refers to the last parameter from the previous loop (for param in self.ipg_buckets[...]), not the parameter corresponding to idx. We should change it to use_muon_params[idx].partition_numel().
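A tiny self-contained illustration of the pitfall (not the PR's code): after the first loop ends, the loop variable keeps its last value, so reusing it inside a later loop silently applies the last element everywhere.

```python
# Stand-alone illustration of the stale loop-variable pitfall described above.
sizes = [3, 5, 7]
for param in sizes:          # `param` ends up bound to the last element, 7
    pass

buggy = [param for _ in range(len(sizes))]         # [7, 7, 7] -- wrong
fixed = [sizes[idx] for idx in range(len(sizes))]  # [3, 5, 7] -- index explicitly
print(buggy, fixed)
```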
yeah, good catch, fixed it.
self.dp_process_group = self.parameter_offload.dp_process_group
self.sequence_parallel_size = groups._get_sequence_parallel_world_size()

self.all2all_process_group = all2all_process_group
@pengdurice Question: where did we set up the all2all_process_group? It seems that it's never set.
It is doubly assigned; there is an identical line nearby, so I deleted this one.
# params_pad = params + [torch.empty_like(params[-1])] * (world_sz - len(params) % world_sz)
grads_pad = [param.grad for param in params] + [torch.empty_like(params[-1].grad)] * (world_sz - len(params) % world_sz)
gathered_momentums_pad = gathered_momentums + [torch.empty_like(gathered_momentums[-1])] * (world_sz - len(gathered_momentums) % world_sz)
for base_i in range(len(params))[::world_sz]:
@pengdurice There's a padding error here. When len(params) % world_sz == 0, this adds world_sz empty tensors instead of 0. Should we change it to: (world_sz - len(params) % world_sz) % world_sz ?
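A quick self-contained check of the proposed fix (world_sz and the list lengths below are made up):

```python
# The extra "% world_sz" makes the padding collapse to 0 when len(params)
# is already a multiple of world_sz, instead of adding world_sz empty tensors.
world_sz = 4
for n in (7, 8):  # 8 is the exactly-divisible case
    pad_current = world_sz - n % world_sz                # 1 for n=7, but 4 for n=8
    pad_proposed = (world_sz - n % world_sz) % world_sz  # 1 for n=7, 0 for n=8
    print(n, pad_current, pad_proposed)
```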
deepspeed/runtime/zero/stage3.py
Outdated
self.reduce_scatter = reduce_scatter

self.use_muon = 'muon' in self.optimizer.__class__.__name__.lower()
@pengdurice This is very fragile and depends purely on the class naming convention. Can we leverage isinstance() instead?
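A minimal sketch of what the isinstance() check could look like; Muon is a placeholder class name here, since the actual class is defined in deepspeed/runtime/zero/muon/muon_optimizer.py and may be named differently.

```python
# Sketch of a type-based check instead of matching on the class-name string.
# `Muon` is a hypothetical name; import the real optimizer class from
# deepspeed/runtime/zero/muon/muon_optimizer.py (actual name may differ).
try:
    from deepspeed.runtime.zero.muon.muon_optimizer import Muon  # hypothetical
except ImportError:
    Muon = None

def uses_muon(optimizer) -> bool:
    """Return True if the wrapped optimizer is (a subclass of) the Muon optimizer."""
    return Muon is not None and isinstance(optimizer, Muon)
```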
self.reduce_scatter = reduce_scatter

self.use_muon = 'muon' in self.optimizer.__class__.__name__.lower()
self.save_muon_momentum_buffer_in_memory = ds_config.get('save_muon_momentum_buffer_in_memory', False)
@pengdurice Can we add save_muon_momentum_buffer_in_memory to the config schema and document it?
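For reference, the flag is read from the top level of ds_config in the excerpt above, so a documented example config might look something like the following; the surrounding keys are just a minimal illustration.

```python
# Assumed placement based on the ds_config.get(...) call above: the flag sits at
# the top level of the DeepSpeed config. Surrounding keys are illustrative only.
ds_config = {
    "train_batch_size": 8,
    "zero_optimization": {"stage": 3},
    # Keep Muon momentum buffers resident in memory instead of swapping them
    # in and out with the NVMe-offloaded optimizer state (default False).
    "save_muon_momentum_buffer_in_memory": True,
}
```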
params_to_subgroup_maps[i].append((idx, dest_offset))
idx += 1
params_size_offset += param.grad.numel()
# if optimizer is swappable, swap in the momentum buffer of the parameters that need to be updated using muon and then swap them out
@pengdurice This doubles NVMe I/O overhead. Can we consider consolidating into a single swap in/out cycle?
The thought is that muon_update is also time consuming, and splitting the muon_update per subgroup may introduce its own overhead. It's worth evaluating which is better ;-)
gathered_params_momentums = self._partitioned_buffers_all_gather(use_muon_params, momentum_buffer, communication_data_type)
for i in params_to_subgroup_maps:
    if self._swappable_optimizer_subgroup(i) and not self.save_muon_momentum_buffer_in_memory:
        self.optimizer_swapper.swap_in_optimizer_state(parameter=self.fp32_partitioned_groups_flat[i])
@pengdurice Again same thing here, can we consolidate the two swaps into one swap?
Hi @pengdurice @PKUWZP, I have a question. I saw there is an option that saves the momentum buffer in memory, yet for the Adam optimizer there is no such option. Is that because for Adam this need is covered by ZeRO offload, while for the Muon optimizer ZeRO offload is not available yet, so this is used as a temporary solution? Thanks!
Hi, thank you for your question. For the Adam optimizer this is handled by the optimizer's own code (see DeepSpeed/deepspeed/runtime/zero/muon/muon_optimizer.py), and since Adam only does element-wise operations there is no need to handle it specially.
However, the Muon update needs cross-element operations, so we have to do it in this file. It is not about offload: the buffer can be on GPU or CPU depending on whether offload is enabled. Hope that answers your question!
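To make the "cross-element operations" point concrete: Muon orthogonalizes the 2-D momentum matrix with a Newton–Schulz style iteration built from matrix products, whereas Adam only needs per-element state. The sketch below is illustrative, not DeepSpeed's implementation; the step count and coefficients follow the commonly cited quintic Newton–Schulz iteration from the reference Muon code.

```python
# Illustrative Muon-style update for one 2-D parameter (not DeepSpeed's code).
# The matrix products in the orthogonalization are why the full momentum matrix
# must be gathered, unlike Adam's purely element-wise update.
import torch

def newton_schulz_orthogonalize(m: torch.Tensor, steps: int = 5) -> torch.Tensor:
    a, b, c = 3.4445, -4.7750, 2.0315  # coefficients from the reference Muon code
    x = m / (m.norm() + 1e-7)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T
    for _ in range(steps):
        s = x @ x.T                    # cross-element: uses the whole matrix
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if transposed else x

def muon_step(param, grad, momentum, lr=0.02, beta=0.95):
    momentum.mul_(beta).add_(grad)                  # element-wise momentum update
    update = newton_schulz_orthogonalize(momentum)  # needs the full 2-D matrix
    param.add_(update, alpha=-lr)
```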
Authors: @pengdurice @PKUWZP
We aim to add the Muon optimizer to ZeRO Stage 3 in this draft PR:
- The momentum buffers are kept in the optimizer state alongside self.fp32_partitioned_groups_flat; when device == NVME, we make sure that the momentum buffers can be swapped in and out along with the other components of the optimizer states.
- The momentum buffers are partitioned in the same way as self.fp32_partitioned_groups_flat to save memory footprint. So, before the Muon update, we need to perform all_gather on top of each data-parallel group rank. The Muon updates of the parameters are also divided across the data-parallel ranks, and the results are all-gathered once all updates are complete. After the all_gather, the momentum buffers are partitioned and flattened again.

Next steps: