The MARL environment `predpregrass_base.py` is implemented using PettingZoo, and the agents are trained with Stable-Baselines3 (SB3) PPO. Essentially, this solution demonstrates how SB3 can be adapted for MARL using parallel environments and centralized training. Rewards (for stepping, eating, dying, and reproducing) are aggregated and can be adjusted in the environment configuration file. Stable-Baselines3 is originally designed for single-agent training, which means that in this solution training uses one unified network for both Predators and Prey. See further below how SB3 PPO is used in this centrally trained Predator-Prey-Grass multi-agent setting.
Random policy Predator-Prey-Grass PettingZoo environment
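For illustration, a minimal random-policy rollout of the AEC environment might look like the sketch below. The constructor name `raw_env` and the `render_mode` argument are assumptions about the module's interface; the loop itself is the standard PettingZoo AEC pattern.

```python
# Random-policy rollout sketch using the standard PettingZoo AEC loop (names are assumptions).
from predpregrass_aec import raw_env  # hypothetical constructor name

env = raw_env(render_mode="human")
env.reset(seed=42)

for agent in env.agent_iter():
    observation, reward, termination, truncation, info = env.last()
    if termination or truncation:
        action = None                                 # terminated agents must step with None
    else:
        action = env.action_space(agent).sample()     # random policy
    env.step(action)

env.close()
```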
- The environment is initially implemented as an Agent-Environment-Cycle (AEC) environment using PettingZoo (`predpregrass_aec.py`, which inherits from `predpregrass_base.py`).
- It is wrapped and converted into a Parallel environment using `aec_to_parallel()` inside `trainer.py` (see the sketch after this list).
- This conversion enables multiple agents to take actions simultaneously rather than sequentially.
- SB3 PPO expects a single-agent Gymnasium-style environment.
- The converted parallel environment stacks observations and actions for all agents, making it appear as a single large observation-action space.
- PPO then treats the multi-agent problem as a centralized learning problem, where all agents share one policy.
- The environment is further wrapped using SuperSuit:
  `env = ss.pettingzoo_env_to_vec_env_v1(env)`
  `env = ss.concat_vec_envs_v1(env, num_vec_envs, num_cpus=num_cores, base_class="stable_baselines3")`
- This enables running multiple instances of the environment in parallel, significantly improving training efficiency.
- The training process treats the multi-agent setup as a single centralized policy, where PPO learns from the collective experiences of all agents.
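A minimal end-to-end sketch of this pipeline could look as follows. The constructor name `raw_env`, the policy type, and the hyperparameter values are assumptions for illustration; the actual `trainer.py` may differ.

```python
# Sketch of the centralized-training pipeline described above (names and values are assumptions).
import supersuit as ss
from pettingzoo.utils.conversions import aec_to_parallel
from stable_baselines3 import PPO

from predpregrass_aec import raw_env  # hypothetical constructor for the AEC environment

num_vec_envs, num_cores = 8, 4  # illustrative values

env = raw_env()                                   # AEC environment
env = aec_to_parallel(env)                        # agents act simultaneously
env = ss.pettingzoo_env_to_vec_env_v1(env)        # stack all agents into one vectorized env
env = ss.concat_vec_envs_v1(env, num_vec_envs, num_cpus=num_cores,
                            base_class="stable_baselines3")

# One shared PPO policy learns from the collective experiences of all agents.
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=1_000_000)
model.save("ppo_predpregrass")
```

Because every agent's transitions are fed into the same network, this is centralized training with a single shared policy rather than a set of independent learners.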
Predator-Prey-Grass PettingZoo environment centrally trained using SB3's PPO
Training the single-objective environment `predpregrass_base.py` with the SB3 PPO algorithm is an example of how elaborate behaviors can emerge from simple rules in agent-based models. In the MARL example displayed above, learning agents obtain rewards solely through reproduction; all other reward options are set to zero in the environment configuration (see the sketch after the list below). Despite this relatively sparse reward structure, maximizing these rewards results in elaborate emergent behaviors such as:
- Predators hunting Prey
- Prey finding and eating grass
- Predators hovering around grass to catch Prey
- Prey trying to escape Predators
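A sparse reward configuration of this kind might look like the sketch below. The key names and the reproduction reward value are hypothetical; they only illustrate the idea that every reward except reproduction is set to zero.

```python
# Hypothetical reward configuration: only reproduction is rewarded, all other rewards are zero.
reward_config = {
    "step_reward_predator": 0.0,
    "step_reward_prey": 0.0,
    "catch_reward_predator": 0.0,        # catching/eating prey
    "eat_reward_prey": 0.0,              # eating grass
    "death_reward_predator": 0.0,
    "death_reward_prey": 0.0,
    "reproduction_reward_predator": 10.0,
    "reproduction_reward_prey": 10.0,
}
```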
Moreover, these learned behaviors lead to more complex emergent dynamics at the ecosystem level: over time, the trained agents display a classic Lotka–Volterra pattern.
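For reference, the classic Lotka–Volterra model describes such predator–prey cycling with a pair of coupled differential equations, where $x$ is prey density, $y$ is predator density, and $\alpha, \beta, \gamma, \delta$ are positive interaction parameters:

$$
\frac{dx}{dt} = \alpha x - \beta x y, \qquad
\frac{dy}{dt} = \delta x y - \gamma y
$$

The oscillating Predator and Prey population counts produced by the trained agents resemble the phase-shifted cycles predicted by this model.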



