[RLLib] 'evaluation_parallel_to_training' does not work properly in new API stack #56271

@jbedorf

Description

What happened + What you expected to happen

The 'evaluation_parallel_to_training' feature does not work properly when the environment's 'step' call takes a long time. Training and evaluation then run sequentially (train -> eval -> train -> eval) instead of in parallel ([train, eval] -> [train, eval]).

This feature still works properly on the old API stack.
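
For reference, the feature is enabled through the evaluation settings on the AlgorithmConfig. A minimal sketch mirroring the .evaluation() call in the reproduction script below (PPOConfig and the CartPole environment are used here only to make the snippet standalone):

from ray.rllib.algorithms.ppo import PPOConfig

config = (
    PPOConfig()
    .environment("CartPole-v1")
    .evaluation(
        # Run the evaluation EnvRunners concurrently with the training step.
        evaluation_parallel_to_training=True,
        evaluation_num_env_runners=1,
        evaluation_interval=1,
        evaluation_duration=1,
        evaluation_duration_unit="timesteps",
    )
)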

The reproduction script below shows this behavior. It is based on the evaluation_parallel_to_training.py example script, but the environment is replaced with a simple one that sleeps for 5 seconds in its step call. Running it for 4 iterations gives the following outputs:

New API Stack

+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
| Trial name          | status     | loc                  |   iter |   total time (s) |   ts |   combined return |   return p0 |
|---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------|
| PPO_env_1a881_00000 | TERMINATED | 172.21.137.227:13070 |      4 |          40.0991 |    4 |                 1 |           1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+

Old API Stack

+---------------------+------------+----------------------+--------+------------------+------+-------------------+
| Trial name          | status     | loc                  |   iter |   total time (s) |   ts |   combined return |
|---------------------+------------+----------------------+--------+------------------+------+-------------------|
| PPO_env_537d5_00000 | TERMINATED | 172.21.137.227:19015 |      4 |          25.0681 |    4 |                 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+

We can see that:

  • New API: ~40 seconds (matches 5 seconds * 4 iterations * 2, since train and eval run sequentially).
  • Old API: ~25 seconds (matches 5 seconds * 4 iterations * 1, since train and eval run in parallel, plus some overhead).

Digging through the stack leads to the following function, which blocks until the 'train' step is complete before the 'eval' step gets executed: https://github.com/ray-project/ray/blob/ray-2.49.1/rllib/env/env_runner_group.py#L646
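
To illustrate the difference, here is a minimal standalone sketch (not RLlib code; train_step and eval_step are hypothetical stand-ins for the remote rollout work) of the blocking pattern observed versus the intended parallel pattern:

import time
import ray

ray.init()

@ray.remote
def train_step():
    time.sleep(5)  # stands in for a slow env.step() during training rollouts
    return "train"

@ray.remote
def eval_step():
    time.sleep(5)  # stands in for a slow env.step() during evaluation rollouts
    return "eval"

# Observed behavior: block on the training result before launching evaluation,
# so each iteration costs ~10 s.
start = time.time()
ray.get(train_step.remote())
ray.get(eval_step.remote())
print(f"sequential: {time.time() - start:.1f}s")

# Intended behavior with evaluation_parallel_to_training: launch both first,
# then wait on both together, so each iteration costs ~5 s.
start = time.time()
refs = [train_step.remote(), eval_step.remote()]
ray.get(refs)
print(f"parallel: {time.time() - start:.1f}s")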

The script below reproduces the problem and can be run as-is against the latest release.

Versions / Dependencies

Ray 2.49.1 (and earlier).
Tested on Linux / Python 3.12

Reproduction script

"""Example showing how one can set up evaluation running in parallel to training.

Reproduction example of parallel evaluation not working on the new API stack.

When running with the new API stack:

RAY_DEDUP_LOGS=0 python evaluation_parallel_to_training.py --num-agents 1 --stop-iters=4  \
    --evaluation-parallel-to-training --evaluation-num-env-runners=1 --evaluation-duration=1 \
    --num-env-runners 1 --evaluation-duration-unit "timesteps"  --evaluation-interval 1 \
    --num-gpus-per-learner 0

+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
| Trial name          | status     | loc                  |   iter |   total time (s) |   ts |   combined return |   return p0 |
|---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------|
| PPO_env_1a881_00000 | TERMINATED | 172.21.137.227:13070 |      4 |          40.0991 |    4 |                 1 |           1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+

When running with the old API stack:

RAY_DEDUP_LOGS=0 python evaluation_parallel_to_training.py --num-agents 1 --stop-iters=4  \
    --evaluation-parallel-to-training --evaluation-num-env-runners=1 --evaluation-duration=1 \
    --num-env-runners 1 --evaluation-duration-unit "timesteps"  --evaluation-interval 1 \
    --num-gpus-per-learner 0 --old-api-stack

Number of trials: 1/1 (1 TERMINATED)
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
| Trial name          | status     | loc                  |   iter |   total time (s) |   ts |   combined return |
|---------------------+------------+----------------------+--------+------------------+------+-------------------|
| PPO_env_537d5_00000 | TERMINATED | 172.21.137.227:19015 |      4 |          25.0681 |    4 |                 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+

Code appears to hang in this ray.get call:
https://github.com/ray-project/ray/blob/ray-2.49.1/rllib/env/env_runner_group.py#L646
"""

import time
from datetime import datetime
from typing import Optional, Union

import gymnasium as gym
import numpy as np
from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.utils.test_utils import (
    add_rllib_example_script_args,
    run_rllib_example_script_experiment,
)
from ray.tune.registry import get_trainable_cls, register_env
from ray.tune.result import TRAINING_ITERATION

parser = add_rllib_example_script_args(
    default_timesteps=20000000,
    default_reward=500000.0,
)
parser.set_defaults(
    evaluation_num_env_runners=2,
    evaluation_interval=1,
)


class MyTestEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):
    def __init__(self, config):
        print("INIT called: ", config)
        high = np.array([np.inf, np.inf, np.inf, np.inf], dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(-high, high, dtype=np.float32)
        self.step_count = 0
        self.mode = config.get("mode", "TRAIN")

    def step(self, action):
        print(f"STEP - {self.step_count} - {self.mode} ! {datetime.now()}")
        # Simulate a slow environment: every step takes 5 seconds and terminates
        # the episode immediately.
        time.sleep(5)
        self.step_count += 1
        return self.observation_space.sample(), 1, True, False, {}

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        print(f"RESET - {self.step_count} - {self.mode} ! {datetime.now()}")
        return self.observation_space.sample(), {}


MultiAgentMyTestEnv = make_multi_agent(lambda c: MyTestEnv(c))


if __name__ == "__main__":
    args = parser.parse_args()

    # Register our environment with tune.
    if args.num_agents > 0:
        register_env("env", lambda c: MultiAgentMyTestEnv(c))
    else:
        raise ValueError("Test run requires at least 1 agent.")

    base_config = (
        get_trainable_cls(args.algo)
        .get_default_config()
        .environment("env" if args.num_agents > 0 else "CartPole-v1")
        .env_runners(create_env_on_local_worker=True)
        .evaluation(
            evaluation_parallel_to_training=args.evaluation_parallel_to_training,
            evaluation_num_env_runners=args.evaluation_num_env_runners,
            evaluation_interval=args.evaluation_interval,
            evaluation_duration=args.evaluation_duration,
            evaluation_duration_unit=args.evaluation_duration_unit,
            evaluation_config={
                "explore": False,
                "metrics_num_episodes_for_smoothing": 1,
                "evaluation_duration": 1,
                "env_config": {"mode": "eval"},
            },
        )
    )

    # Add a simple multi-agent setup.
    if args.num_agents > 0:
        base_config.multi_agent(
            policies={f"p{i}" for i in range(args.num_agents)},
            policy_mapping_fn=lambda aid, *a, **kw: f"p{aid}",
        )
    # Set some PPO-specific settings; tiny batch sizes keep each training
    # iteration to a single environment step, so the timing is easy to reason about.
    if args.algo == "PPO":
        base_config.training(
            lr=0.0003, num_epochs=1, vf_loss_coeff=0.01, minibatch_size=1, train_batch_size=1
        )

    run_rllib_example_script_experiment(
        base_config,
        args,
        stop={TRAINING_ITERATION: args.stop_iters}
    )

Issue Severity

High: It blocks me from completing my task.

Metadata

Labels

  • P1: Issue that should be fixed within a few weeks
  • bug: Something that is supposed to be working, but isn't
  • community-backlog
  • performance
  • rllib: RLlib related issues
