Legacy code for the Predator-Prey-Grass PettingZoo environment. Centralized learning, decentralized execution.

doesburg11/PredPreyGrass-pettingzoo-legacy

Legacy framework: PettingZoo & Stable Baselines3

Centralized training, decentralized evaluation

The MARL environment predpregrass_base.py is implemented using PettingZoo, and the agents are trained using Stable-Baselines3 (SB3) PPO. Essentially, this solution demonstrates how SB3 can be adapted for MARL using parallel environments and centralized training. Rewards (for stepping, eating, dying and reproducing) are aggregated and can be adjusted in the environment configuration file. Since Stable Baselines3 is originally designed for single-agent training, this solution trains only one unified network, shared by Predators as well as Prey. See further below how SB3 PPO is used in this centrally trained Predator-Prey-Grass multi-agent setting.

Random policy in the Predator-Prey-Grass PettingZoo environment

Random policy with the PettingZoo framework
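A minimal sketch of a random-policy rollout with the PettingZoo AEC API is shown below. The module path, class name and constructor arguments of the environment are assumptions; the agent loop itself is the standard PettingZoo pattern.

    # Random-policy rollout sketch; the import path, class name and constructor
    # arguments are assumptions, the AEC loop is standard PettingZoo.
    from predpregrass_aec import PredPreyGrassAECEnv  # hypothetical import

    env = PredPreyGrassAECEnv(render_mode="human")  # hypothetical constructor
    env.reset(seed=42)

    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        if termination or truncation:
            action = None  # finished agents must step with None
        else:
            action = env.action_space(agent).sample()  # random action
        env.step(action)

    env.close()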

Training the model using PPO from Stable Baselines3

Configuration of environment parameters
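The reward options mentioned above (stepping, eating, dying, reproducing) live in the environment configuration file. As a purely illustrative sketch of what such a configuration could look like (all names and values below are hypothetical; the actual file in the repository may differ):

    # Hypothetical configuration sketch; the real parameter names, structure
    # and values are defined in the repository's environment configuration.
    env_kwargs = dict(
        max_cycles=10000,            # episode length
        x_grid_size=25,              # grid width
        y_grid_size=25,              # grid height
        n_initial_predators=6,       # number of Predators at reset
        n_initial_prey=8,            # number of Prey at reset
        # Reward options; in the experiment shown further below only
        # reproduction is rewarded and all other rewards are set to zero.
        reproduction_reward=10.0,
        step_reward=0.0,
        catch_prey_reward=0.0,
        eat_grass_reward=0.0,
        death_reward=0.0,
    )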

Evaluate and visualize the trained model
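A minimal evaluation sketch is given below: every agent queries the same trained policy, but only with its own local observation (decentralized execution). The file name, import path and constructor are assumptions.

    from stable_baselines3 import PPO
    from predpregrass_aec import PredPreyGrassAECEnv  # hypothetical import

    model = PPO.load("trained_predpreygrass_model")  # hypothetical file name
    env = PredPreyGrassAECEnv(render_mode="human")   # hypothetical constructor
    env.reset(seed=0)

    for agent in env.agent_iter():
        observation, reward, termination, truncation, info = env.last()
        if termination or truncation:
            action = None
        else:
            # Shared policy, local observation only: decentralized execution.
            action, _ = model.predict(observation, deterministic=True)
        env.step(action)

    env.close()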

Batch training and evaluating in one go

How SB3 PPO is used in the Predator-Prey-Grass Multi-Agent Setting

1. PettingZoo AEC to Parallel Conversion

  • The environment is initially implemented as an Agent-Environment-Cycle (AEC) environment using PettingZoo (predpregrass_aec.py which inherits from predpregrass_base.py).
  • It is wrapped and converted into a Parallel Environment using aec_to_parallel() inside trainer.py (see the conversion sketch after this list).
  • This conversion enables multiple agents to take actions simultaneously rather than sequentially.
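A minimal sketch of this conversion step; only the environment import and constructor are assumptions, while aec_to_parallel() is a standard PettingZoo utility.

    from pettingzoo.utils.conversions import aec_to_parallel
    from predpregrass_aec import PredPreyGrassAECEnv  # hypothetical import

    aec_env = PredPreyGrassAECEnv()          # hypothetical constructor
    parallel_env = aec_to_parallel(aec_env)  # agents now act simultaneously

    observations, infos = parallel_env.reset(seed=42)
    # In the parallel API every live agent submits an action in the same step.
    actions = {agent: parallel_env.action_space(agent).sample()
               for agent in parallel_env.agents}
    observations, rewards, terminations, truncations, infos = parallel_env.step(actions)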

2. Treating Multi-Agent Learning as a Single-Agent Problem

  • SB3 PPO expects a single-agent Gymnasium-style environment.
  • The per-agent observations and actions of the converted parallel environment are stacked along a batch dimension, so from SB3's perspective it looks like a single vectorized environment with one shared observation and action space.
  • PPO then treats the multi-agent problem as a centralized learning problem, where all agents share one policy.

3. Performance Optimization with Vectorized Environments

  • The environment is further wrapped using SuperSuit (a fuller training sketch follows after this list):

    import supersuit as ss

    # Each agent becomes one sub-environment of an SB3-compatible vector env.
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, num_vec_envs, num_cpus=num_cores, base_class="stable_baselines3")
  • This enables running multiple instances of the environment in parallel, significantly improving training efficiency.
  • The training process treats the multi-agent setup as a single centralized policy, where PPO learns from the collective experiences of all agents.
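Putting the three steps together, a minimal end-to-end training sketch could look like the following. The environment import and constructor, the choice of "MlpPolicy" and the hyperparameter values are assumptions; the SuperSuit and SB3 calls are the ones referenced above, and the repository's trainer.py may differ in its details.

    import supersuit as ss
    from pettingzoo.utils.conversions import aec_to_parallel
    from stable_baselines3 import PPO
    from predpregrass_aec import PredPreyGrassAECEnv  # hypothetical import

    num_vec_envs = 8   # parallel copies of the environment (illustrative)
    num_cores = 8      # CPU processes used by SuperSuit (illustrative)

    # 1. AEC -> parallel conversion.
    env = aec_to_parallel(PredPreyGrassAECEnv())  # hypothetical constructor

    # 2. Vectorize: each agent becomes one sub-environment of an SB3 VecEnv.
    env = ss.pettingzoo_env_to_vec_env_v1(env)
    env = ss.concat_vec_envs_v1(env, num_vec_envs, num_cpus=num_cores,
                                base_class="stable_baselines3")

    # 3. One shared PPO policy for Predators and Prey (centralized training).
    model = PPO("MlpPolicy", env, verbose=1)   # policy type is an assumption
    model.learn(total_timesteps=1_000_000)
    model.save("trained_predpreygrass_model")  # hypothetical file name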

Centralized training and decentralized evaluation

Predator-Prey-Grass PettingZoo environment, centrally trained using SB3's PPO

Emergent Behaviors

Training the single-objective environment predpregrass_base.py with the SB3 PPO algorithm is an example of how elaborate behaviors can emerge from simple rules in agent-based models. In the MARL example displayed above, learning agents obtain rewards solely through reproduction; all other reward options are set to zero in the environment configuration. Despite this relatively sparse reward structure, maximizing these rewards results in elaborate emergent behaviors such as:

  • Predators hunting Prey
  • Prey finding and eating grass
  • Predators hovering around grass to catch Prey
  • Prey trying to escape Predators

Moreover, these learned behaviors lead to more complex emergent dynamics at the ecosystem level: the trained agents display a classic Lotka–Volterra pattern in the Predator and Prey populations over time.
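For background (this is standard theory, not something computed in the repository), the classic Lotka–Volterra predator–prey model describes such population oscillations with the coupled equations below, where x is the Prey population, y the Predator population, and α, β, γ, δ are positive rate constants:

    % Classic Lotka–Volterra predator–prey equations (background reference only)
    \begin{aligned}
    \frac{dx}{dt} &= \alpha x - \beta x y \\
    \frac{dy}{dt} &= \delta x y - \gamma y
    \end{aligned}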
