Fix loss scaling and backward call of ZenFlow #7793
Closed
tohtana wants to merge 2 commits into deepspeedai:tingfeng/zenflow_fix_backward from
The previous change incorrectly modified the loss scaling to apply to gas_scaled_loss instead of loss, but the subsequent backward() call still uses the loss variable. This caused FP16 tests to fail because the loss scaling was never applied to the tensor being backward'd.

This fix restores the correct behavior: loss scaling is applied to the loss variable which is then used in backward().

Fixes failing tests:
- test_with_autocast.py::test_parameters_match_ddp_after_step[z2_fp16_master_wg_autocast]
- test_with_autocast.py::test_parameters_match_ddp_after_step[z3_fp16_master_wg_autocast]
- test_zero_autocast.py::TestZeroAutoCast::test[dtype1-*]
- test_zero_autocast.py::TestZeroAutoCast::test_safe_modules_conf[dtype1-*]

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
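The variable mix-up described in this commit message can be illustrated with a toy sketch. This is not DeepSpeed code: `ToyLoss`, `LOSS_SCALE`, `buggy_step`, and `fixed_step` are hypothetical names, and the toy `backward()` simply reports the factor by which gradients would be multiplied, standing in for a real tensor's backward pass.

```python
LOSS_SCALE = 1024.0  # hypothetical static FP16 loss scale


class ToyLoss:
    """Stand-in for a loss tensor; tracks only the multiplier applied to it."""

    def __init__(self, factor=1.0):
        self.factor = factor

    def __mul__(self, k):
        # Multiplication returns a new object, as it would for a tensor.
        return ToyLoss(self.factor * k)

    def backward(self):
        # Report the factor that gradients would be scaled by.
        return self.factor


def buggy_step(loss, gas):
    # Loss scaling is applied to a different variable...
    gas_scaled_loss = loss * (1.0 / gas)
    gas_scaled_loss = gas_scaled_loss * LOSS_SCALE
    # ...but backward() runs on the unscaled `loss`.
    return loss.backward()


def fixed_step(loss, gas):
    # Loss scaling is applied to the same variable that backward() uses.
    # (How gradient accumulation scaling is handled is left out of this sketch.)
    loss = loss * LOSS_SCALE
    return loss.backward()


if __name__ == "__main__":
    print("buggy gradient multiplier:", buggy_step(ToyLoss(), gas=4))
    print("fixed gradient multiplier:", fixed_step(ToyLoss(), gas=4))
```

In the buggy variant the gradients come out unscaled, which is consistent with the symptom described above: only the FP16 tests, which depend on loss scaling, fail.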
When ZenFlow is enabled, engine.backward() calls optimizer.backward(). Stage 1/2 with ZenFlow uses ZenFlowZeroOptimizer which has its own backward() method. However, Stage 3 relies on the base ZeROOptimizer.backward() which expects backward_prologue(loss) to accept and return loss. Stage 3's backward_prologue() takes no arguments and returns nothing, causing a TypeError when called via the base class backward() method.

This fix adds a proper backward() method to DeepSpeedZeroOptimizer_Stage3 that handles the ZenFlow backward pass correctly, similar to how ZenFlowZeroOptimizer does it for Stage 1/2.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
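The signature mismatch described in this commit message can be reproduced in a few lines of plain Python. The classes below are hypothetical stand-ins (`BaseSketch`, `Stage3Broken`, `Stage3Fixed`), not the real DeepSpeed optimizers; they only mirror the shape of the problem: a base `backward()` that passes `loss` to `backward_prologue`, and a subclass whose `backward_prologue` takes no arguments.

```python
class BaseSketch:
    """Stand-in for a base optimizer whose backward() threads loss through the prologue."""

    def backward_prologue(self, loss):
        return loss

    def backward(self, loss, retain_graph=False):
        # The base class assumes the prologue accepts and returns the loss.
        loss = self.backward_prologue(loss)
        return loss


class Stage3Broken(BaseSketch):
    """Overrides the prologue with a no-argument version, like the Stage 3 case described."""

    def backward_prologue(self):
        return None


class Stage3Fixed(Stage3Broken):
    """Adds its own backward() that calls the no-argument prologue correctly."""

    def backward(self, loss, retain_graph=False):
        self.backward_prologue()
        return loss


if __name__ == "__main__":
    try:
        Stage3Broken().backward(1.0)
    except TypeError as exc:
        print("inherited backward() fails:", exc)
    print("overridden backward() returns:", Stage3Fixed().backward(1.0))
```

Overriding `backward()` in the subclass is the approach this commit describes; the review thread on this PR discusses whether a less ZenFlow-specific design would be preferable.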
Antlera
reviewed
Jan 19, 2026
    if self.swap_optimizer:
        self.optimizer_swapper.post_backward()

    def backward(self, loss, retain_graph=False):
Collaborator
It looks like this interface is now designed for ZenFlow-like methods, which makes the integration easier and cleaner.
I'm not sure if DeepSpeed × PyTorch prefers keeping this fully PyTorch-aligned instead of adding framework-specific logic. Shall we add a TODO here?
Collaborator
Otherwise, this code LGTM. Thanks!
Collaborator
Author
That's a good point. I wanted to quickly resolve this so we can pass the full CI test suite, but it makes the core part of the optimizer more ZenFlow-specific. We probably shouldn’t cut corners for this. Let me close this for now and then consider a more general approach.
tohtana added a commit that referenced this pull request Jan 20, 2026
We have had the full unit test workflow disabled for a while. This PR migrates the full test to our AWS test infra.

To make the tests pass, we need to merge these PRs:
- #7786
- #7788
- #7789
- #7790
- #7793
- #7794

In addition to having these PRs merged, this PR makes the following changes to the full test workflow and test harness:
- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: GDS requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - Zenflow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, which does CUDA graph capture checks that fail in forked processes (--forked flag; we can just move it to sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
phalani-paladugu pushed a commit to phalani-paladugu/DeepSpeed that referenced this pull request Jan 29, 2026
We have had the full unit test workflow disabled for a while. This PR migrates the full test to our AWS test infra.

To make the tests pass, we need to merge these PRs:
- deepspeedai#7786
- deepspeedai#7788
- deepspeedai#7789
- deepspeedai#7790
- deepspeedai#7793
- deepspeedai#7794

In addition to having these PRs merged, this PR makes the following changes to the full test workflow and test harness:
- Ignore flags for some known issues:
  - nvme: Requires an actual NVMe device. Our CI currently doesn't have NVMe storage configured.
  - GDS: GDS requires special kernel drivers and NVIDIA Magnum IO to enable direct GPU-to-storage transfers. CI instances don't have this configured.
  - Zenflow:
    1. Stage 3 bugs: The ZenFlow + ZeRO Stage 3 implementation has pre-existing bugs that cause internal pytest errors and worker crashes.
    2. CUDA/fork incompatibility: test_zf_torch_adam.py uses torch.optim.AdamW, which does CUDA graph capture checks that fail in forked processes (--forked flag; we can just move it to sequential tests).
- `/mnt/aio` mount for async I/O tests
- CUTLASS installation for Evoformer tests
- Add `DS_DISABLE_REUSE_DIST_ENV` to the test harness to prevent worker cleanup hangs

Once we merge this PR, we will be able to run the full test manually or at scheduled times.

---------

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: Phalani Paladugu <mailofphalani@gmail.com>
@Antlera Fix loss scaling and backward call for ZeRO3 of #7771 (not merged).