
DTensorPolicyWorker AdamW device mismatch (cpu vs cuda) after optimizer offload in non-colocated GRPO #1869

@banyan-god

Summary

We repeatedly hit a hard crash in NeMo RL GRPO training using DTensorPolicyWorker (HF/Transformers + FSDP/DTensor) with vLLM generation.

The crash occurs inside torch.optim.AdamW under DTensor dispatch:

RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!

In practice this appears after an optimizer offload/refit/checkpoint boundary: optimizer state ends up on CPU, while params/grads are on CUDA.
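For what it's worth, the same class of failure reproduces with plain PyTorch (no DTensor or NeMo RL involved) by leaving AdamW state on CPU while the param and grad live on CUDA. This is a sketch only; foreach=False forces the same _single_tensor_adam path that appears in our trace:

import torch

# Standalone sketch: AdamW state on CPU while the param/grad are on CUDA.
param = torch.nn.Parameter(torch.randn(4, 4, device="cuda"))
opt = torch.optim.AdamW([param], lr=1e-3, foreach=False)  # single-tensor path

# One normal step so exp_avg / exp_avg_sq are created on CUDA.
param.grad = torch.randn_like(param)
opt.step()

# Simulate an offload that is never undone: move optimizer state to CPU.
for state in opt.state.values():
    for key, value in state.items():
        if torch.is_tensor(value):
            state[key] = value.cpu()

# The next step raises:
# RuntimeError: Expected all tensors to be on the same device, ...
param.grad = torch.randn_like(param)
opt.step()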

Environment

  • NeMo RL: source checkout from main (we were on commit dacac7e0... locally; the repo has advanced beyond v0.5.0)
  • torch: 2.9.1+cu128 (CUDA 12.8)
  • transformers: 4.57.6
  • ray: 2.53.0
  • vllm: 0.12.0
  • GPU: 2× Blackwell 96 GB (Ray's actor GPU mapping means the failing actor sees its device as cuda:0 even when it is physical GPU 1)

Repro (high level)

  • GRPO (async) with non-colocated generation (vLLM on a different Ray worker group)
  • policy worker: DTensorPolicyWorker
  • optimizer: AdamW
  • training proceeds for a while, then crashes inside DTensorPolicyWorker.train() during optimizer.step().

Crash trace (excerpt)

From our log:

ray::DTensorPolicyWorker.train()
  File .../nemo_rl/models/policy/workers/dtensor_policy_worker.py, line 858, in train
  ...
  File .../torch/optim/adam.py, line 247, in step
  File .../torch/optim/adam.py, line 452, in _single_tensor_adam
  ... torch.distributed.tensor ... __torch_dispatch__
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
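For reference, a diagnostic of the following shape, placed right before optimizer.step(), makes the mismatch visible (plain PyTorch; report_optimizer_devices is just an illustrative name, not part of NeMo RL):

import torch

def report_optimizer_devices(optimizer: torch.optim.Optimizer) -> None:
    # Print any optimizer state tensor whose device differs from its
    # parameter's device. The scalar "step" counter is skipped because
    # AdamW keeps it on CPU by default (non-capturable, non-fused).
    for group in optimizer.param_groups:
        for p in group["params"]:
            state = optimizer.state.get(p, {})
            mismatched = {
                k: v.device
                for k, v in state.items()
                if k != "step" and torch.is_tensor(v) and v.device != p.device
            }
            if mismatched:
                print(f"param on {p.device}, mismatched state: {mismatched}")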

Suspected root cause

DTensorPolicyWorker.prepare_for_training() only reloads optimizer state to CUDA when:

  • offload_optimizer_for_logprob is enabled OR
  • generation is colocated.

However, NeMo RL can offload optimizer state to CPU during refit/checkpointing to free VRAM. In the non-colocated path, training can then resume without the optimizer state ever being moved back to CUDA, which produces the AdamW DTensor device mismatch above.

Proposed fix

Make prepare_for_training() always move optimizer state back to CUDA when training resumes (no-op if already on CUDA).

Minimal patch we applied locally:

--- a/nemo_rl/models/policy/workers/dtensor_policy_worker.py
+++ b/nemo_rl/models/policy/workers/dtensor_policy_worker.py
@@
-        if (
-            self.optimizer is not None
-            and not self.cpu_offload
-            and (self.offload_optimizer_for_logprob or self.is_generation_colocated)
-        ):
-            self.move_optimizer_to_device("cuda")
+        if self.optimizer is not None and not self.cpu_offload:
+            self.move_optimizer_to_device("cuda")

Same change for dtensor_policy_worker_v2.py.
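For context, we expect move_optimizer_to_device("cuda") to be cheap when the state is already resident. Conceptually the reload amounts to something like this sketch (not the actual NeMo RL implementation):

import torch

def move_optimizer_state_to_device(optimizer: torch.optim.Optimizer, device: str) -> None:
    # Sketch only: move every tensor in the optimizer state (exp_avg,
    # exp_avg_sq, ...) to the target device. Tensors already there are left
    # untouched, so calling this unconditionally is effectively a no-op after
    # the first time. The scalar "step" counter stays where the optimizer
    # created it.
    target = torch.device(device)
    for state in optimizer.state.values():
        for key, value in state.items():
            if key == "step" or not torch.is_tensor(value):
                continue
            if value.device != target:
                state[key] = value.to(target, non_blocking=True)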

Questions

  • Is optimizer offload expected to happen in the non-colocated GRPO flow? If yes, should the DTensor worker always reload optimizer state before stepping?
  • Would you accept a PR with the above change (plus a regression test if you have a suitable harness)?
