Description
What happened + What you expected to happen
The 'evaluation_parallel_to_training' feature does not work properly when the 'step' call of an environment takes a long time. Training and evaluation then run sequentially, train -> eval -> train -> eval, rather than in parallel, [train, eval] -> [train, eval].
The feature still works properly on the old API stack.
The reproduction script below shows this behavior. It is based on the evaluation_parallel_to_training.py example script, but the environment is replaced with a simple one that sleeps for 5 seconds in its 'step' call. Running it for 4 iterations gives the following outputs:
New API Stack
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return | return p0 |
|---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------|
| PPO_env_1a881_00000 | TERMINATED | 172.21.137.227:13070 | 4 | 40.0991 | 4 | 1 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
Old API Stack
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return |
|---------------------+------------+----------------------+--------+------------------+------+-------------------|
| PPO_env_537d5_00000 | TERMINATED | 172.21.137.227:19015 | 4 | 25.0681 | 4 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
We can see that:
- New API: 40 seconds (matches 5 seconds * 4 iterations * 2, i.e. train/eval sequential).
- Old API: 25 seconds (matches 5 seconds * 4 iterations * 1, i.e. train/eval in parallel, plus some overhead; see the quick arithmetic below).
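As a quick sanity check on those totals, here is the arithmetic spelled out (a sketch; the 5-second constant is the sleep in the reproduction script, and the extra ~5 seconds on the old stack are assumed to be setup overhead):

STEP_TIME = 5    # seconds slept in MyTestEnv.step
ITERATIONS = 4   # --stop-iters

# New API stack: eval waits for train, so each iteration pays for two env steps.
sequential_total = STEP_TIME * ITERATIONS * 2   # 40 s -> matches the 40.1 s above

# Old API stack: eval overlaps with train, so each iteration pays for one env step.
parallel_total = STEP_TIME * ITERATIONS * 1     # 20 s (+ ~5 s overhead) -> ~25 s above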
Some digging through the stack leads to the following function, which blocks until the 'train' step is complete before the 'eval' step gets executed: https://github.com/ray-project/ray/blob/ray-2.49.1/rllib/env/env_runner_group.py#L646
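For illustration only, here is a minimal, self-contained sketch (not RLlib code; the 'rollout' task and its 5-second sleep are hypothetical stand-ins for the slow environment) of how a blocking ray.get on the first result before the second task is launched serializes the two rollouts, whereas launching both up front keeps them parallel:

import time

import ray

ray.init()


@ray.remote
def rollout(tag):
    # Stand-in for one slow env.step() call.
    time.sleep(5)
    return tag


# Sequential pattern: block on the training rollout before launching eval.
start = time.time()
ray.get(rollout.remote("train"))
ray.get(rollout.remote("eval"))
print("sequential:", round(time.time() - start, 1))  # ~10 s

# Parallel pattern: launch both rollouts first, then gather.
start = time.time()
refs = [rollout.remote("train"), rollout.remote("eval")]
ray.get(refs)
print("parallel:", round(time.time() - start, 1))  # ~5 s

ray.shutdown()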
The script below reproduces the problem; it can be run as-is against the latest release.
Versions / Dependencies
Ray 2.49.1 (and earlier).
Tested on Linux / Python 3.12
Reproduction script
"""Example showing how one can set up evaluation running in parallel to training.
Reproduction example of parallel evaluation not working on the new API stack.
When running with the new API stack:
RAY_DEDUP_LOGS=0 python evaluation_parallel_to_training.py --num-agents 1 --stop-iters=4 \
--evaluation-parallel-to-training --evaluation-num-env-runners=1 --evaluation-duration=1 \
--num-env-runners 1 --evaluation-duration-unit "timesteps" --evaluation-interval 1 \
--num-gpus-per-learner 0
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return | return p0 |
|---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------|
| PPO_env_1a881_00000 | TERMINATED | 172.21.137.227:13070 | 4 | 40.0991 | 4 | 1 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+-------------+
When running with the old API stack:
RAY_DEDUP_LOGS=0 python evaluation_parallel_to_training.py --num-agents 1 --stop-iters=4 \
--evaluation-parallel-to-training --evaluation-num-env-runners=1 --evaluation-duration=1 \
--num-env-runners 1 --evaluation-duration-unit "timesteps" --evaluation-interval 1 \
--num-gpus-per-learner 0 --old-api-stack
Number of trials: 1/1 (1 TERMINATED)
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
| Trial name | status | loc | iter | total time (s) | ts | combined return |
|---------------------+------------+----------------------+--------+------------------+------+-------------------|
| PPO_env_537d5_00000 | TERMINATED | 172.21.137.227:19015 | 4 | 25.0681 | 4 | 1 |
+---------------------+------------+----------------------+--------+------------------+------+-------------------+
Code appears to hang in this ray.get call:
https://github.com/ray-project/ray/blob/ray-2.49.1/rllib/env/env_runner_group.py#L646
"""
import time
from datetime import datetime
from typing import Optional, Union
import gymnasium as gym
import numpy as np
from ray.rllib.env.multi_agent_env import make_multi_agent
from ray.rllib.utils.test_utils import (
add_rllib_example_script_args,
run_rllib_example_script_experiment,
)
from ray.tune.registry import get_trainable_cls, register_env
from ray.tune.result import TRAINING_ITERATION
parser = add_rllib_example_script_args(
default_timesteps=20000000,
default_reward=500000.0,
)
parser.set_defaults(
evaluation_num_env_runners=2,
evaluation_interval=1,
)
class MyTestEnv(gym.Env[np.ndarray, Union[int, np.ndarray]]):
    def __init__(self, config):
        print("INIT called: ", config)
        high = np.array([np.inf, np.inf, np.inf, np.inf], dtype=np.float32)
        self.action_space = gym.spaces.Discrete(2)
        self.observation_space = gym.spaces.Box(-high, high, dtype=np.float32)
        self.step_count = 0
        self.mode = config.get("mode", "TRAIN")

    def step(self, action):
        print(f"STEP - {self.step_count} - {self.mode} ! {datetime.now()}")
        # Simulate a slow environment step.
        time.sleep(5)
        self.step_count += 1
        return self.observation_space.sample(), 1, True, False, {}

    def reset(self, *, seed: Optional[int] = None, options: Optional[dict] = None):
        super().reset(seed=seed)
        print(f"RESET - {self.step_count} - {self.mode} ! {datetime.now()}")
        return self.observation_space.sample(), {}


MultiAgentMyTestEnv = make_multi_agent(lambda c: MyTestEnv(c))
if __name__ == "__main__":
    args = parser.parse_args()

    # Register our environment with tune.
    if args.num_agents > 0:
        register_env("env", lambda c: MultiAgentMyTestEnv(c))
    else:
        raise ValueError("Test run requires at least 1 agent.")

    base_config = (
        get_trainable_cls(args.algo)
        .get_default_config()
        .environment("env" if args.num_agents > 0 else "CartPole-v1")
        .env_runners(create_env_on_local_worker=True)
        .evaluation(
            evaluation_parallel_to_training=args.evaluation_parallel_to_training,
            evaluation_num_env_runners=args.evaluation_num_env_runners,
            evaluation_interval=args.evaluation_interval,
            evaluation_duration=args.evaluation_duration,
            evaluation_duration_unit=args.evaluation_duration_unit,
            evaluation_config={
                "explore": False,
                "metrics_num_episodes_for_smoothing": 1,
                "evaluation_duration": 1,
                "env_config": {"mode": "eval"},
            },
        )
    )

    # Add a simple multi-agent setup.
    if args.num_agents > 0:
        base_config.multi_agent(
            policies={f"p{i}" for i in range(args.num_agents)},
            policy_mapping_fn=lambda aid, *a, **kw: f"p{aid}",
        )

    # Set some PPO-specific tuning settings to learn better in the env (assumed to be
    # CartPole-v1).
    if args.algo == "PPO":
        base_config.training(
            lr=0.0003,
            num_epochs=1,
            vf_loss_coeff=0.01,
            minibatch_size=1,
            train_batch_size=1,
        )

    run_rllib_example_script_experiment(
        base_config,
        args,
        stop={TRAINING_ITERATION: args.stop_iters},
    )

Issue Severity
High: It blocks me from completing my task.