Summary
We repeatedly hit a hard crash in NeMo RL GRPO training using DTensorPolicyWorker (HF/Transformers + FSDP/DTensor) with vLLM generation.
The crash occurs inside `torch.optim.AdamW` under DTensor dispatch:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
In practice this appears after an optimizer offload/refit/checkpoint boundary: optimizer state ends up on CPU, while params/grads are on CUDA.
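For reference, the same class of failure reproduces with plain AdamW and no DTensor involved: if the optimizer state stays on CPU while the parameter and its gradient are on CUDA, `optimizer.step()` raises the same device-mismatch error. A minimal, illustrative sketch (assumes a CUDA device; this is not the NeMo RL code path):

```python
import torch

# One parameter on CUDA, standard AdamW. foreach=False forces the
# _single_tensor_adam path that appears in the traceback below.
param = torch.nn.Parameter(torch.randn(8, device="cuda"))
opt = torch.optim.AdamW([param], lr=1e-3, foreach=False)

# A normal step populates exp_avg / exp_avg_sq on CUDA.
param.grad = torch.randn_like(param)
opt.step()

# Simulate an optimizer offload that is never undone before training resumes.
for state in opt.state.values():
    for key, value in state.items():
        if torch.is_tensor(value):
            state[key] = value.cpu()

param.grad = torch.randn_like(param)
opt.step()  # RuntimeError: Expected all tensors to be on the same device, ...
```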
Environment
- NeMo RL: source checkout from `main` (we were on commit `dacac7e0...` locally; the repo has advanced beyond v0.5.0)
- torch: 2.9.1+cu128 (CUDA 12.8)
- transformers: 4.57.6
- ray: 2.53.0
- vllm: 0.12.0
- GPU: 2× Blackwell 96GB (Ray actor GPU mapping means the failing actor sees its device as `cuda:0` even when it is physical GPU 1)
Repro (high level)
- GRPO (async) with non-colocated generation (vLLM on a different Ray worker group)
- policy worker: `DTensorPolicyWorker`
- optimizer: AdamW
- training proceeds for a while, then crashes inside `DTensorPolicyWorker.train()` during `optimizer.step()`
Crash trace (excerpt)
From our log:
```
ray::DTensorPolicyWorker.train()
  File .../nemo_rl/models/policy/workers/dtensor_policy_worker.py, line 858, in train
    ...
  File .../torch/optim/adam.py, line 247, in step
  File .../torch/optim/adam.py, line 452, in _single_tensor_adam
    ... torch.distributed.tensor ... __torch_dispatch__
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
Suspected root cause
`DTensorPolicyWorker.prepare_for_training()` only reloads optimizer state to CUDA when:
- `offload_optimizer_for_logprob` is enabled, OR
- generation is colocated.
But NeMo RL can offload optimizer state to CPU during refit/checkpointing to free VRAM, and in the non-colocated path we can re-enter training without moving optimizer state back to CUDA, leading to the AdamW DTensor device mismatch.
Proposed fix
Make `prepare_for_training()` always move optimizer state back to CUDA when training resumes (a no-op if the state is already on CUDA).
Minimal patch we applied locally:
```diff
--- a/nemo_rl/models/policy/workers/dtensor_policy_worker.py
+++ b/nemo_rl/models/policy/workers/dtensor_policy_worker.py
@@
-        if (
-            self.optimizer is not None
-            and not self.cpu_offload
-            and (self.offload_optimizer_for_logprob or self.is_generation_colocated)
-        ):
-            self.move_optimizer_to_device("cuda")
+        if self.optimizer is not None and not self.cpu_offload:
+            self.move_optimizer_to_device("cuda")
```

The same change applies to `dtensor_policy_worker_v2.py`.
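The patch relies on the worker's existing `move_optimizer_to_device` helper being cheap when state is already on CUDA. For readers outside the codebase, the generic pattern such a helper implements looks roughly like the sketch below (illustrative only, not the NeMo RL implementation; scalar `step` counters are left alone, since `torch.optim` keeps them on CPU in the non-capturable case):

```python
import torch


def move_optimizer_state_to_device(optimizer: torch.optim.Optimizer, device: str) -> None:
    """Move every optimizer state tensor (exp_avg, exp_avg_sq, ...) to `device`.

    Generic sketch of the pattern; effectively a no-op for tensors already on `device`.
    """
    for state in optimizer.state.values():
        for key, value in state.items():
            if key == "step":
                # Step counters are typically kept on CPU unless capturable/fused is used.
                continue
            if torch.is_tensor(value):
                state[key] = value.to(device, non_blocking=True)
```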
Questions
- Is optimizer offload expected to happen in the non-colocated GRPO flow? If yes, should the DTensor worker always reload optimizer state before stepping?
- Would you accept a PR with the above change (plus a regression test if you have a suitable harness)?
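On the regression-test side, the core assertion could be as small as checking that every floating-point state tensor is back on the training device after `prepare_for_training()`. A hedged sketch (the helper name and test flow are ours, not an existing NeMo RL API):

```python
import torch


def assert_optimizer_state_on_device(
    optimizer: torch.optim.Optimizer, device_type: str = "cuda"
) -> None:
    """Raise if any floating-point optimizer state tensor lives on a different device type."""
    for param, state in optimizer.state.items():
        for key, value in state.items():
            if torch.is_tensor(value) and value.is_floating_point() and key != "step":
                assert value.device.type == device_type, (
                    f"optimizer state {key!r} for a param on {param.device} is on "
                    f"{value.device}, expected device type {device_type!r}"
                )
```

The test itself would offload optimizer state to CPU (as refit/checkpointing does), call `prepare_for_training()`, run the assertion above, and then execute one `train()` step to confirm AdamW no longer raises the device-mismatch error.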