Fix loss scaling and backward call of ZenFlow by tohtana · Pull Request #7793 · deepspeedai/DeepSpeed

tohtana · 2026-01-18T06:22:27Z

@Antlera Fix loss scaling and backward call for ZeRO3 of #7771 (not merged).

The previous change incorrectly moved the loss scaling so that it applied to gas_scaled_loss instead of loss, but the subsequent backward() call still uses the loss variable. As a result, FP16 tests failed because loss scaling was never applied to the tensor that backward() was called on.

This fix restores the correct behavior: loss scaling is applied to the loss variable, which is then used in backward().

Fixes failing tests:
- test_with_autocast.py::test_parameters_match_ddp_after_step[z2_fp16_master_wg_autocast]
- test_with_autocast.py::test_parameters_match_ddp_after_step[z3_fp16_master_wg_autocast]
- test_zero_autocast.py::TestZeroAutoCast::test[dtype1-*]
- test_zero_autocast.py::TestZeroAutoCast::test_safe_modules_conf[dtype1-*]

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
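The mismatch described in this commit message can be illustrated with a toy model. This is only a sketch: `ToyLoss`, `buggy_backward`, and `fixed_backward` are hypothetical names, not DeepSpeed code, and the loss is modeled as `factor * w * x` so that the gradient with respect to `w` exposes whichever scale factor actually reached `backward()`.

```python
class ToyLoss:
    """Hypothetical stand-in for a scalar loss tensor: loss = factor * w * x."""

    def __init__(self, x, factor=1.0):
        self.x = x
        self.factor = factor

    def __mul__(self, k):
        return ToyLoss(self.x, self.factor * k)

    def __truediv__(self, k):
        return ToyLoss(self.x, self.factor / k)

    def backward(self):
        # d(factor * w * x)/dw = factor * x
        return self.factor * self.x


def buggy_backward(loss, loss_scale, gas):
    # The scale is applied to a different variable than the one backward() uses.
    gas_scaled_loss = loss / gas
    gas_scaled_loss = gas_scaled_loss * loss_scale  # scaled value is never consumed
    return loss.backward()  # unscaled gradient


def fixed_backward(loss, loss_scale, gas):
    # The scale is applied to `loss`, the same variable backward() is called on.
    loss = loss * loss_scale
    return loss.backward()  # gradient carries the loss scale


unscaled = buggy_backward(ToyLoss(3.0), loss_scale=1024.0, gas=4)
scaled = fixed_backward(ToyLoss(3.0), loss_scale=1024.0, gas=4)
```

In the buggy variant the gradient comes back without the loss scale, which matches the failure mode described above for the FP16 tests.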

When ZenFlow is enabled, engine.backward() calls optimizer.backward(). Stage 1/2 with ZenFlow uses ZenFlowZeroOptimizer, which has its own backward() method. Stage 3, however, relies on the base ZeROOptimizer.backward(), which expects backward_prologue(loss) to accept and return loss. Stage 3's backward_prologue() takes no arguments and returns nothing, causing a TypeError when it is called via the base class backward() method.

This fix adds a proper backward() method to DeepSpeedZeroOptimizer_Stage3 that handles the ZenFlow backward pass correctly, similar to how ZenFlowZeroOptimizer does it for Stage 1/2.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
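The signature mismatch can be reproduced in isolation. The classes below are hypothetical stand-ins that only mirror the shape of the problem (a base `backward()` that calls `backward_prologue(loss)` and a subclass whose `backward_prologue()` takes no arguments); they are not the actual DeepSpeed classes, and the override shown is just one way such a subclass could avoid the error.

```python
class BaseZeroOptimizer:
    """Hypothetical stand-in for the shared base-class backward flow."""

    def backward_prologue(self, loss):
        return loss

    def backward(self, loss, retain_graph=False):
        # The base flow passes the loss in and expects it back.
        loss = self.backward_prologue(loss)
        return ("backward", loss, retain_graph)


class Stage3Like(BaseZeroOptimizer):
    """Overrides the prologue with a different signature: no loss, no return."""

    def backward_prologue(self):
        self.prologue_ran = True


class Stage3WithOwnBackward(Stage3Like):
    """Adds a backward() that calls the prologue the way this class defines it."""

    def backward(self, loss, retain_graph=False):
        self.backward_prologue()
        return ("backward", loss, retain_graph)


def raises_type_error(optimizer, loss):
    """Return True if optimizer.backward(loss) raises TypeError."""
    try:
        optimizer.backward(loss)
    except TypeError:
        return True
    return False


broken = raises_type_error(Stage3Like(), 1.0)
works = not raises_type_error(Stage3WithOwnBackward(), 1.0)
```

Here `Stage3Like` fails because the inherited `backward()` passes an argument that its `backward_prologue()` does not accept, while the subclass with its own `backward()` does not hit that path.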

Antlera · 2026-01-19T23:46:39Z

deepspeed/runtime/zero/stage3.py

        if self.swap_optimizer:
            self.optimizer_swapper.post_backward()

+    def backward(self, loss, retain_graph=False):


It looks like this interface is now designed for ZenFlow-like methods, which makes the integration easier and cleaner.
I'm not sure if DeepSpeed × PyTorch prefers keeping this fully PyTorch-aligned instead of adding framework-specific logic. Shall we add a TODO here?

Otherwise, this code LGTM. Thanks!

tohtana · Jan 20, 2026

That's a good point. I wanted to resolve this quickly so we can pass the full CI test suite, but it makes the core part of the optimizer more ZenFlow-specific. We probably shouldn't cut corners for this. Let me close this for now and then consider a more general approach.

We have had the full unit test workflow disabled for a while. This PR migrates the full test to our AWS test infra. To make the tests pass, we need to merge these PRs:
- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to having these PRs merged, this PR makes the following changes to the full test workflow and test harness:
- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: GDS requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - ZenFlow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, which does CUDA graph capture checks that fail in forked processes (--forked flag; we can just move it to sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>

tohtana added 2 commits January 17, 2026 21:44

tohtana requested a review from tjruwase as a code owner January 18, 2026 06:22

tohtana changed the title ~~Tohtana/fix zenfulow fp16 loss scaling~~ Fix loss scaling and backward call of ZenFlow Jan 18, 2026

tohtana mentioned this pull request Jan 18, 2026

Fix: ZenFlow Adam integration for updated PyTorch backward flow (#7759) #7771

Open

tohtana requested review from Antlera and removed request for tjruwase January 18, 2026 06:30

tohtana mentioned this pull request Jan 18, 2026

Add full test suite workflow #7795

Merged

Antlera reviewed Jan 19, 2026

Antlera mentioned this pull request Jan 20, 2026

[BUG] ZenFlow Stage 3 with full_warm_up_rounds=0 fails due to missing complete_column_offset attribute #7796

Open

tohtana closed this Jan 20, 2026

tohtana wants to merge 2 commits into deepspeedai:tingfeng/zenflow_fix_backward from tohtana:tohtana/fix_zenfulow_fp16_loss_scaling
