This repo contains examples of solving reinforcement learning scenarios from Unity ML-Agents with TorchRL.
1. Installation
Clone the repo and run the auto-install script. This handles the complex dependency conflicts (mlagents vs numpy) for you. Requires Conda.
```bash
git clone https://github.com/notDroid/unity-rl.git
cd unity-rl
git checkout v1
bash install.sh
conda activate mlagents
```

2. CLI Usage
You can list available models and run them immediately from the command line.
```bash
# List all available environments and models
python play.py ls

# Run a specific environment (auto-downloads model from HF)
python play.py Crawler ppo conf1 run9 --graphics
```

3. Python Usage
Minimal example to load an agent and run a rollout:
```python
import torch

from utils import PPOAgent
from rlkit.envs import UnityEnv

# 1. Load Agent (Auto-downloads from Hugging Face)
agent = PPOAgent('Crawler', 'conf1', 'run9')
policy = agent.get_policy_operator()

# 2. Run Environment (Auto-downloads from mlagents registry)
env = UnityEnv(name='Crawler', graphics=True)
with torch.no_grad():
    env.rollout(1000, policy=policy, break_when_any_done=False)
```

Check quickstart.ipynb for a complete walkthrough.
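The rollout above returns a TensorDict; if you want to inspect what was collected, here is a minimal sketch. The key names follow TorchRL's defaults (`("next", "reward")`, `("next", "done")`) and are an assumption about the wrapper:

```python
# Sketch: collect a rollout and inspect it. Adjust the keys if the wrapper renames them.
with torch.no_grad():
    data = env.rollout(1000, policy=policy, break_when_any_done=False)

print(data)                                                    # batch shape, keys, and dtypes
print("total reward:", data["next", "reward"].sum().item())    # sum of rewards over the rollout
print("episodes finished:", data["next", "done"].sum().item()) # number of done flags seen
```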
There are 2 main components:
1. rlkit
- rlkit contains algorithms (like ppo, sac), Unity environments (with torchrl transforms), and other utilities.
```python
env = UnityEnv(name='Crawler', path=None, graphics=True, time_scale=1, seed=1)
agent = PPOAgent('Crawler', 'conf1', 'run9')
```
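If you add or modify an environment wrapper, TorchRL's built-in spec checker is a quick sanity test. A minimal sketch (the choice of 3DBall and `graphics=False` is just an example; the checker itself is standard TorchRL):

```python
from torchrl.envs.utils import check_env_specs
from rlkit.envs import UnityEnv

# Runs a few random steps and verifies that observations, actions, rewards,
# and done flags match the environment's declared TensorSpecs.
env = UnityEnv(name='3DBall', graphics=False)
check_env_specs(env)
env.close()
```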
2. experiment runner
- I use hydra to manage configs for (environment, algorithm, config) tuples.
- Experiment results are under experiments/, configs under configs/, and the code for the experiment runner has its entry point at run_experiment.py.
python run_experiment.py -cn "config_name" +verbose=True +continue_=False run_name="run_name"Both have huggingface integration to upload/download models, checkpoints, logs automatically at https://huggingface.co/notnotDroid/unity-rl (default).
You can either use the built-in Unity environments or download them manually. The manually downloaded ones look better and may be necessary if the Unity registry is down.
Manual Download
- Download the repo containing the environments (the Unity ml-agents repo).
- Open the project in the Unity editor (select the Project/ folder from ml-agents), select a scene from an environment, and build it for whatever platform you're on.
- Create an env/ folder at the root of this repo and place the compiled environments in it (see the sketch below for pointing UnityEnv at a local build).
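To use a manually built binary, point UnityEnv at it via the path argument shown earlier. The file name below is purely illustrative and depends on the scene and platform you built for:

```python
from rlkit.envs import UnityEnv

# Illustrative path only: the binary name/extension depends on your build
# (.x86_64 on Linux, .app on macOS, .exe on Windows).
env = UnityEnv(name='Crawler', path='env/Crawler/Crawler.x86_64', graphics=True)
```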
Either run install.sh or manually install the dependencies.
Manual Install
First of all, Conda is required (the grpcio wheel won't build via pip), so make sure it's properly set up. Then run these commands at the project root:
```bash
# Create conda environment
conda create -n mlagents python=3.10.12
conda activate mlagents

# Install mlagents python interface
conda install "grpcio=1.48.2" -c conda-forge
python -m pip install mlagents==1.1.0
python -m pip install numpy==2.2.6

# Install toolkit
python -m pip install pandas matplotlib ipykernel hydra-core seaborn huggingface_hub torchinfo
python -m pip install torch torchrl
python -m pip install -e rlkit
```

Note that the numpy version conflicts with mlagents because of gym (deprecated), but we don't use gym anyway, so we are safe to use the latest version of numpy. This also means everything has to be installed manually (no requirements.txt).
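As a quick sanity check after installing (not part of the official setup), you can confirm the pinned packages import together:

```bash
# Should print the numpy and torch versions without import errors.
python -c "import numpy, torch, torchrl, mlagents_envs; print(numpy.__version__, torch.__version__)"
```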
This package contains reusable resources:
- ml-agents environments (with torchrl transforms)
- training templates (ppo/sac)
  - The training templates are meant to be used as templates rather than robust algorithms (customize them).
- utils (checkpointer/logger)
- models (mlp/cnn)
Roadmap
- Finish models for vector environments (3DBall, Crawler, PushBlock, Walker, WallJump, Worm)
- Add support for visual environments (GridWorld, Match 3)
- Add support for multi-agent environments (Food Collector, Soccer Twos, Striker vs. Goalie, Co-op PushBlock, Dungeon Escape)
- Add SAC
- Add support for sparse-reward environments (Hallway, Pyramids)
- Add support for variable-length observation environments (Sorter)
- Add Docker support
This package can be used to train RL agents on Unity ML-Agents environments. It's meant to be highly modular and customizable. There are 3 main steps:
- TorchRL Compatible Environment
- Algorithm Template
- Config File
You can use existing (env, algo, config) tuples or create your own as needed.
```bash
python run_experiment.py -cn <config_name> +verbose=True +continue_=False run_name=<run_name> repo_id=<huggingface_repo_id> hf_sync_interval=<sync_interval>
```

Arguments:
- config_name: Name of the config file under configs/ (without .yaml)
- run_name: Name of the run (used for logging/checkpointing); should be unique within the scope of a config.
- If using the Hugging Face integration, make sure to authenticate your account with hf auth; otherwise, don't specify hf_sync_interval.
- The provided configs use the tensorboard logger by default; you can change it, or view the logs at experiments/<env>/<algo>/<config>/logs/<run_name>/. The nested structure lets you compare many runs under different algorithms/configs by specifying a more general path.
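For example, pointing tensorboard at a higher-level directory aggregates all runs beneath it (this assumes the tensorboard package is available in your environment, and the path below is illustrative):

```bash
# Compare all runs under one environment/algorithm in a single dashboard.
tensorboard --logdir experiments/Crawler/ppo/
```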
Environments and Configs:
- You can create your own environments and configs by using the existing ones as templates (should be relatively straightforward).
Algorithms:
- The main feature of this project is the training templates (ppo/sac), which you can copy and modify as needed (for instance, adapting PPO for diffusion policies).
- Each training template also has a corresponding runner that handles config files.
My configs are likely not optimal, but they work reasonably well. Feel free to open an issue or PR if you have better hyperparameters or training tricks.
Reproducibility
Models and training logs (with plots) are available on Hugging Face. Training runs can be reproduced with:
```bash
python run_experiment.py -cn <config_name> +verbose=True +continue_=False run_name=<run_name>
```
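If you only want the published models and logs without retraining, they can also be pulled directly with huggingface_hub (a sketch; the folder layout inside the repo may differ from what the agents expect):

```python
from huggingface_hub import snapshot_download

# Downloads the whole HF repo (models, checkpoints, logs) into the local
# Hugging Face cache and returns the path to it.
local_path = snapshot_download(repo_id="notnotDroid/unity-rl")
print(local_path)
```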
Results

The following table summarizes the average returns achieved by each (environment, algorithm, config) tuple.
- For environments with truncation, the truncation window is 1000 timesteps.
| Environment | Algorithm | Config File | Average Return | Episode Length | Timesteps Trained |
|---|---|---|---|---|---|
| 3DBall | PPO | 3dball_ppo | 100 | 1000 | 400k |
| PushBlock | PPO | pushblock_ppo | 4.9 | 48.2 | 50M |
| WallJump | PPO | walljump_ppo | 0.96 | 29.7 | 500M |
| Crawler | PPO | crawler_ppo | 360 | 1000 | 400M |
| Worm | PPO | worm_ppo | 100 | 1000 | 100M |
| Walker | PPO | walker_ppo | 25 | 1000 | 1.6B |





