Description
What happened + What you expected to happen
The 'evaluation_parallel_to_training' feature does not work properly when the 'step' call of an environment takes a long time. Training and evaluation then run sequentially, train -> eval -> train -> eval, rather than in parallel, [train, eval] -> [train, eval].
The feature still works properly on the old API stack.
The reproduction script below shows this behavior. It is based on the evaluation_parallel_to_training.py example script, but the environment is replaced with a simple one that sleeps for 5 seconds in its 'step' call. Running it for 4 iterations gives the following outputs:
New API Stack
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return | return p0 |
|---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------|
| PPO_env_1a881_00000 | TERMINATED | 172.21.137.227:13070 | 4 | 40.0991 | 4 | 1 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
Old API Stack
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return |
|---------------------+------------+----------------------+--------+------------------+------+-------------------|
| PPO_env_537d5_00000 | TERMINATED | 172.21.137.227:19015 | 4 | 25.0681 | 4 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
We can see that:
- New API: 40 seconds (matches 5 seconds * 4 iterations * 2, i.e. train/eval sequential).
- Old API: 25 seconds (matches 5 seconds * 4 iterations * 1, i.e. train/eval in parallel, plus some overhead; see the quick arithmetic below).
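As a quick sanity check on those totals, here is the arithmetic spelled out (a sketch; the 5-second constant is the sleep in the reproduction script, and the extra ~5 seconds on the old stack are assumed to be setup overhead):

STEP_TIME = 5    # seconds slept in MyTestEnv.step
ITERATIONS = 4   # --stop-iters

# New API stack: eval waits for train, so each iteration pays for two env steps.
sequential_total = STEP_TIME * ITERATIONS * 2   # 40 s -> matches the 40.1 s above

# Old API stack: eval overlaps with train, so each iteration pays for one env step.
parallel_total = STEP_TIME * ITERATIONS * 1     # 20 s (+ ~5 s overhead) -> ~25 s above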
Some digging through the stack leads to the following function, which blocks until the 'train' step is complete before the 'eval' step gets executed: https://github.com/ray-project/ray/blob/ray-2.49.1/rllib/env/env_runner_group.py#L646
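For illustration only, here is a minimal, self-contained sketch (not RLlib code; the 'rollout' task and its 5-second sleep are hypothetical stand-ins for the slow environment) of how a blocking ray.get on the first result before the second task is launched serializes the two rollouts, whereas launching both up front keeps them parallel:

import time

import ray

ray.init()


@ray.remote
def rollout(tag):
    # Stand-in for one slow env.step() call.
    time.sleep(5)
    return tag


# Sequential pattern: block on the training rollout before launching eval.
start = time.time()
ray.get(rollout.remote("train"))
ray.get(rollout.remote("eval"))
print("sequential:", round(time.time() - start, 1))  # ~10 s

# Parallel pattern: launch both rollouts first, then gather.
start = time.time()
refs = [rollout.remote("train"), rollout.remote("eval")]
ray.get(refs)
print("parallel:", round(time.time() - start, 1))  # ~5 s

ray.shutdown()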
The script below reproduces the problem; it can be run as-is against the latest release.
Versions / Dependencies
Ray 2.49.1 (and earlier).
Tested on Linux / Python 3.12
Reproduction script
"""Example showing how one can set up evaluation running in parallel to training.
Reproduction example of parallel evaluation not working on the new API stack.
When running with the new API stack:
RAY_DEDUP_LOGS=0 python evaluation_parallel_to_training.py --num-agents 1 --stop-iters=4 \
--evaluation-parallel-to-training --evaluation-num-env-runners=1 --evaluation-duration=1 \
--num-env-runners 1 --evaluation-duration-unit "timesteps" --evaluation-interval 1 \
--num-gpus-per-learner 0
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return | return p0 |
|---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------|
| PPO_env_1a881_00000 | TERMINATED | 172.21.137.227:13070 | 4 | 40.0991 | 4 | 1 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
When running with the old API stack:
RAY_DEDUP_LOGS=0 python evaluation_parallel_to_training.py --num-agents 1 --stop-iters=4 \
--evaluation-parallel-to-training --evaluation-num-env-runners=1 --evaluation-duration=1 \
--num-env-runners 1 --evaluation-duration-unit "timesteps" --evaluation-interval 1 \
--num-gpus-per-learner 0 --old-api-stack
Number of trials: 1/1 (1 TERMINATED)
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return |
|---------------------+------------+----------------------+--------+------------------+------+-------------------|
| PPO_env_537d5_00000 | TERMINATED | 172.21.137.227:19015 | 4 | 25.0681 | 4 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
Code appears to hang in this ray.get call:
https://github.com/ray-project/ray/blob/ray-2.49.1/rllib/env/env_runner_group.py#L646
"""
import time
from datetime import datetime
from typing import Optional, Union
import gymnasium as gym
import numpy as np
from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.utils.test_utils import (
add_rllib_example_script_args,
run_rllib_example_script_experiment,
)
from ray.tune.registry import get_trainable_cls, register_env
from ray.tune.result import TRAINING_ITERATION
parser = add_rllib_example_script_args(
default_timesteps=20000000,
default_reward=500000.0,
)
parser.set_defaults(
evaluation_num_env_runners=2,
evaluation_interval=1,
)
class MyTestEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):
    def __init__(self, config):
        print("INIT called: ", config)
        high = np.array([np.inf, np.inf, np.inf, np.inf], dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(-high, high, dtype=np.float32)
        self.step_count = 0
        self.mode = config.get("mode", "TRAIN")

    def step(self, action):
        print(f"STEP - {self.step_count} - {self.mode} ! {datetime.now()}")
        # Simulate a slow environment step.
        time.sleep(5)
        self.step_count += 1
        return self.observation_space.sample(), 1, True, False, {}

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        print(f"RESET - {self.step_count} - {self.mode} ! {datetime.now()}")
        return self.observation_space.sample(), {}


MultiAgentMyTestEnv = make_multi_agent(lambda c: MyTestEnv(c))
if __name__ == "__main__":
    args = parser.parse_args()

    # Register our environment with tune.
    if args.num_agents > 0:
        register_env("env", lambda c: MultiAgentMyTestEnv(c))
    else:
        raise ValueError("Test run requires at least 1 agent.")

    base_config = (
        get_trainable_cls(args.algo)
        .get_default_config()
        .environment("env" if args.num_agents > 0 else "CartPole-v1")
        .env_runners(create_env_on_local_worker=True)
        .evaluation(
            evaluation_parallel_to_training=args.evaluation_parallel_to_training,
            evaluation_num_env_runners=args.evaluation_num_env_runners,
            evaluation_interval=args.evaluation_interval,
            evaluation_duration=args.evaluation_duration,
            evaluation_duration_unit=args.evaluation_duration_unit,
            evaluation_config={
                "explore": False,
                "metrics_num_episodes_for_smoothing": 1,
                "evaluation_duration": 1,
                "env_config": {"mode": "eval"},
            },
        )
    )

    # Add a simple multi-agent setup.
    if args.num_agents > 0:
        base_config.multi_agent(
            policies={f"p{i}" for i in range(args.num_agents)},
            policy_mapping_fn=lambda aid, *a, **kw: f"p{aid}",
        )

    # Set some PPO-specific tuning settings to learn better in the env (assumed to be
    # CartPole-v1).
    if args.algo == "PPO":
        base_config.training(
            lr=0.0003,
            num_epochs=1,
            vf_loss_coeff=0.01,
            minibatch_size=1,
            train_batch_size=1,
        )

    run_rllib_example_script_experiment(
        base_config,
        args,
        stop={TRAINING_ITERATION: args.stop_iters},
    )

Issue Severity
High: It blocks me from completing my task.