Summary
We repeatedly hit a hard crash in NeMo RL GRPO training using DTensorPolicyWorker (HF/Transformers + FSDP/DTensor) with vLLM generation.
The crash occurs inside `torch.optim.AdamW` under DTensor dispatch:

```
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
In practice this appears after an optimizer offload/refit/checkpoint boundary: optimizer state ends up on CPU, while params/grads are on CUDA.
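For reference, the same class of failure reproduces with plain AdamW and no DTensor involved: if the optimizer state stays on CPU while the parameter and its gradient are on CUDA, `optimizer.step()` raises the same device-mismatch error. A minimal, illustrative sketch (assumes a CUDA device; this is not the NeMo RL code path):

```python
import torch

# One parameter on CUDA, standard AdamW. foreach=False forces the
# _single_tensor_adam path that appears in the traceback below.
param = torch.nn.Parameter(torch.randn(8, device="cuda"))
opt = torch.optim.AdamW([param], lr=1e-3, foreach=False)

# A normal step populates exp_avg / exp_avg_sq on CUDA.
param.grad = torch.randn_like(param)
opt.step()

# Simulate an optimizer offload that is never undone before training resumes.
for state in opt.state.values():
    for key, value in state.items():
        if torch.is_tensor(value):
            state[key] = value.cpu()

param.grad = torch.randn_like(param)
opt.step()  # RuntimeError: Expected all tensors to be on the same device, ...
```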
Environment
- NeMo RL: source checkout from `main` (we were on commit `dacac7e0...` locally; the repo has advanced beyond v0.5.0)
- torch: 2.9.1+cu128 (CUDA 12.8)
- transformers: 4.57.6
- ray: 2.53.0
- vllm: 0.12.0
- GPU: 2× Blackwell 96GB (Ray actor GPU mapping means the failing actor sees its device as `cuda:0` even when it is physical GPU 1)
Repro (high level)
- GRPO (async) with non-colocated generation (vLLM on a different Ray worker group)
- policy worker: `DTensorPolicyWorker`
- optimizer: AdamW
- training proceeds for a while, then crashes inside `DTensorPolicyWorker.train()` during `optimizer.step()`
Crash trace (excerpt)
From our log:
```
ray::DTensorPolicyWorker.train()
  File .../nemo_rl/models/policy/workers/dtensor_policy_worker.py, line 858, in train
    ...
  File .../torch/optim/adam.py, line 247, in step
  File .../torch/optim/adam.py, line 452, in _single_tensor_adam
    ... torch.distributed.tensor ... __torch_dispatch__
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cpu!
```
Suspected root cause
`DTensorPolicyWorker.prepare_for_training()` only reloads optimizer state to CUDA when:
- `offload_optimizer_for_logprob` is enabled, OR
- generation is colocated.
But NeMo RL can offload optimizer state to CPU during refit/checkpointing to free VRAM, and in the non-colocated path we can re-enter training without moving optimizer state back to CUDA, leading to the AdamW DTensor device mismatch.
Proposed fix
Make `prepare_for_training()` always move optimizer state back to CUDA when training resumes (a no-op if the state is already on CUDA).
Minimal patch we applied locally:
```diff
--- a/nemo_rl/models/policy/workers/dtensor_policy_worker.py
+++ b/nemo_rl/models/policy/workers/dtensor_policy_worker.py
@@
-        if (
-            self.optimizer is not None
-            and not self.cpu_offload
-            and (self.offload_optimizer_for_logprob or self.is_generation_colocated)
-        ):
-            self.move_optimizer_to_device("cuda")
+        if self.optimizer is not None and not self.cpu_offload:
+            self.move_optimizer_to_device("cuda")
```

The same change applies to `dtensor_policy_worker_v2.py`.
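The patch relies on the worker's existing `move_optimizer_to_device` helper being cheap when state is already on CUDA. For readers outside the codebase, the generic pattern such a helper implements looks roughly like the sketch below (illustrative only, not the NeMo RL implementation; scalar `step` counters are left alone, since `torch.optim` keeps them on CPU in the non-capturable case):

```python
import torch


def move_optimizer_state_to_device(optimizer: torch.optim.Optimizer, device: str) -> None:
    """Move every optimizer state tensor (exp_avg, exp_avg_sq, ...) to `device`.

    Generic sketch of the pattern; effectively a no-op for tensors already on `device`.
    """
    for state in optimizer.state.values():
        for key, value in state.items():
            if key == "step":
                # Step counters are typically kept on CPU unless capturable/fused is used.
                continue
            if torch.is_tensor(value):
                state[key] = value.to(device, non_blocking=True)
```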
Questions
- Is optimizer offload expected to happen in the non-colocated GRPO flow? If yes, should the DTensor worker always reload optimizer state before stepping?
- Would you accept a PR with the above change (plus a regression test if you have a suitable harness)?
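On the regression-test side, the core assertion could be as small as checking that every floating-point state tensor is back on the training device after `prepare_for_training()`. A hedged sketch (the helper name and test flow are ours, not an existing NeMo RL API):

```python
import torch


def assert_optimizer_state_on_device(
    optimizer: torch.optim.Optimizer, device_type: str = "cuda"
) -> None:
    """Raise if any floating-point optimizer state tensor lives on a different device type."""
    for param, state in optimizer.state.items():
        for key, value in state.items():
            if torch.is_tensor(value) and value.is_floating_point() and key != "step":
                assert value.device.type == device_type, (
                    f"optimizer state {key!r} for a param on {param.device} is on "
                    f"{value.device}, expected device type {device_type!r}"
                )
```

The test itself would offload optimizer state to CPU (as refit/checkpointing does), call `prepare_for_training()`, run the assertion above, and then execute one `train()` step to confirm AdamW no longer raises the device-mismatch error.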