Fix loss scaling and backward call of ZenFlow #7793

Antlera · 2026-01-19T23:46:39Z

It looks like this interface is now designed for ZenFlow-like methods, which makes the integration easier and cleaner.
I'm not sure if DeepSpeed × PyTorch prefers keeping this fully PyTorch-aligned instead of adding framework-specific logic. Shall we add a TODO here?

Otherwise, this code LGTM. Thanks!

That's a good point. I wanted to resolve this quickly so that we can pass the full CI test suite, but it makes the core part of the optimizer more ZenFlow-specific. We probably shouldn't cut corners for this. Let me close this for now and then consider a more general approach.

             # TODO: handle these scaling with direct calls to loss.backward()
             if isinstance(self.optimizer, ZeROOptimizer):
-                gas_scaled_loss = self.optimizer.scale_if_loss(gas_scaled_loss)
+                loss = self.optimizer.scale_if_loss(loss)
             elif self.torch_autocast_z0_gradscaler:
-                gas_scaled_loss = self.torch_autocast_z0_gradscaler.scale(gas_scaled_loss)
+                loss = self.torch_autocast_z0_gradscaler.scale(loss)
             with compiled_autograd(self._is_compiled_autograd_enabled, self._compile_kwargs):
                 # ZenFlow requires exclusive control over the backward pass to manage its
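A minimal sketch of the dispatch in the hunk above, for readers following the discussion. It uses plain floats and hypothetical stand-in classes (`FakeZeROOptimizer`, `FakeGradScaler`, and the helper `scale_loss` are illustrative names, not DeepSpeed or PyTorch APIs), and it assumes that `scale_if_loss` and `scale` simply multiply by a scale factor. The point is only that whichever variable is passed through the scaler is the one that must later reach the backward call.

```python
class FakeZeROOptimizer:
    """Hypothetical stand-in for a ZeRO-style optimizer that owns its loss scale."""

    def __init__(self, loss_scale):
        self.loss_scale = loss_scale

    def scale_if_loss(self, value):
        # Assumed behaviour for illustration: multiply by the loss scale.
        return value * self.loss_scale


class FakeGradScaler:
    """Hypothetical stand-in for a GradScaler-like object exposing scale()."""

    def __init__(self, scale_factor):
        self.scale_factor = scale_factor

    def scale(self, value):
        return value * self.scale_factor


def scale_loss(loss, optimizer=None, gradscaler=None):
    """Mirror the if/elif in the hunk: the optimizer path wins, then the grad scaler."""
    if isinstance(optimizer, FakeZeROOptimizer):
        return optimizer.scale_if_loss(loss)
    elif gradscaler is not None:
        return gradscaler.scale(loss)
    # Neither scaler is configured: the loss is passed through unchanged.
    return loss


if __name__ == "__main__":
    print(scale_loss(0.5, optimizer=FakeZeROOptimizer(1024.0)))
    print(scale_loss(0.5, gradscaler=FakeGradScaler(8.0)))
```

The fall-through branch that returns the loss unchanged is an addition for the sketch; the hunk itself only shows the two scaling branches.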

@@ -2376,6 +2376,25 @@ def backward_epilogue(self):
             if self.swap_optimizer:
                 self.optimizer_swapper.post_backward()
+        def backward(self, loss, retain_graph=False):
+            """
+            Backward pass for Stage 3 optimizer.
+            When ZenFlow is enabled, this method handles the specialized backward
+            pass with loss scaling and ZenFlow's prologue/epilogue hooks.
+            """
+            self.backward_prologue()
+            self.enter_backward()
+            if self.custom_loss_scaler:
+                scaled_loss = self.external_loss_scale * loss
+                scaled_loss.backward(retain_graph=retain_graph)
+            else:
+                self.loss_scaler.backward(loss.float(), retain_graph=retain_graph)
+            self.backward_epilogue()
+            self.exit_backward()
         def get_fp32_grad_partitions(self) -> Dict[int, Dict[int, Tensor]]:
             """get fp32 gradient partition dictionary
             accessed as grad_dict[parameter_group_index][parameter_index]
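The call order in the added `backward` method can be checked with a small mock. This is a sketch, not the Stage 3 optimizer: `SketchOptimizer` and `_run_backward` are hypothetical names, the loss is a plain float, and the non-custom branch is assumed to multiply by an internal loss scale before running backward (the diff delegates that to `self.loss_scaler.backward`).

```python
class SketchOptimizer:
    """Hypothetical mock that records the order of calls made by backward()."""

    def __init__(self, custom_loss_scaler=False, external_loss_scale=1.0, loss_scale=1.0):
        self.custom_loss_scaler = custom_loss_scaler
        self.external_loss_scale = external_loss_scale
        self.loss_scale = loss_scale  # stands in for the internal loss scaler
        self.calls = []

    def backward_prologue(self):
        self.calls.append("backward_prologue")

    def enter_backward(self):
        self.calls.append("enter_backward")

    def backward_epilogue(self):
        self.calls.append("backward_epilogue")

    def exit_backward(self):
        self.calls.append("exit_backward")

    def _run_backward(self, scaled_loss):
        # Placeholder for the real autograd call; records the scaled value only.
        self.calls.append(("backward", scaled_loss))

    def backward(self, loss):
        # Same ordering as the diff: prologue, enter, scaled backward, epilogue, exit.
        self.backward_prologue()
        self.enter_backward()
        if self.custom_loss_scaler:
            self._run_backward(self.external_loss_scale * loss)
        else:
            self._run_backward(self.loss_scale * loss)
        self.backward_epilogue()
        self.exit_backward()


if __name__ == "__main__":
    opt = SketchOptimizer(custom_loss_scaler=True, external_loss_scale=4.0)
    opt.backward(0.25)
    print(opt.calls)
```

Recording calls this way makes it easy to assert in a unit test that the scaled loss is the value that reaches backward and that the epilogue always follows it.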

tohtana Jan 20, 2026
