AlphaZero-OmniFive applies the AlphaZero algorithm to Gomoku (Five in a Row), training a policy-value network purely from self-play data and using Monte Carlo Tree Search (MCTS) for decision-making. Since Gomoku's state space is much smaller than that of Go or Chess, a competitive AI can be trained in just a few hours on a PC with a CUDA-enabled GPU.
This repo is based on AlphaZero_Gomoku and makes the following modifications:
- Changed the network architecture from a plain CNN to a ResNet
- Optimized the MCTS and self-play modules by leveraging PyTorch CUDA acceleration
- Added a config.json file for centralized parameter management
- Added a tkinter-based GUI
- Added zero padding for better performance
- Added dynamic training parameters (dynamic_training_parameters) for better performance
- Tuned training parameters specifically for large boards (9x9 and above)
- Added models trained with these parameters
- AlphaGo: Combines expert game records, hand-crafted features, and move prediction with MCTS, further enhanced through self-play.
- AlphaZero: Starts from scratch using only the game rules for self-play, and employs a residual convolutional network that outputs both policy and value to guide MCTS; by abandoning hand-crafted features and human game records, it achieves a simpler architecture with more efficient training and inference, surpassing AlphaGo in strength.
- Python >= 3.13
- PyTorch >= 2.9 (CUDA 12.8 required)
- numpy >= 2.2
Additionally, WSL2 was used as the development platform for this project.
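To confirm that the environment meets these requirements, a quick check such as the following can be run (a minimal sketch that only reports library versions and CUDA availability):

```python
# check_env.py -- illustrative environment check, not part of the repo
import numpy as np
import torch

print("numpy :", np.__version__)
print("torch :", torch.__version__)

# A CUDA-enabled GPU is assumed for training; fall back to CPU otherwise.
if torch.cuda.is_available():
    print("CUDA device:", torch.cuda.get_device_name(0))
else:
    print("CUDA not available -- set use_gpu to false in config.json.")
```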
```bash
git clone https://github.com/Suyw-0123/AlphaZero-OmniFive.git
cd AlphaZero-OmniFive
```

To play against a trained model:

```bash
python human_play.py
```

Before playing, adjust the parameters in config.json to the appropriate board size and to the channel and block counts used when training the corresponding ResNet model. A flowchart giving the full description is included in the repository.
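The exact layout of config.json is defined by the repository; the sketch below only illustrates the idea of reading the board and network settings before play (flat key names are assumed here for illustration and may not match the real schema):

```python
# Illustrative only: read board/network settings from config.json before playing.
# Flat key names are an assumption; check the repo's config.json for the real layout.
import json

with open("config.json") as f:
    cfg = json.load(f)

board_width = cfg["board_width"]        # must match the board the model was trained on
board_height = cfg["board_height"]
num_channels = cfg["num_channels"]      # ResNet width used when training the model
num_res_blocks = cfg["num_res_blocks"]  # ResNet depth used when training the model

print(f"Board {board_width}x{board_height}, "
      f"ResNet: {num_channels} channels x {num_res_blocks} blocks")
```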
```bash
python train.py
```

The training workflow includes:
- Self-play to collect game records, with rotation and flipping augmentation (sketched after this list).
- Mini-batch updates to the policy-value network.
- Periodic evaluation against a pure MCTS opponent; if the win rate improves, `best_policy.model` is overwritten.
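Because Gomoku positions are equivalent under rotation and reflection, each self-play position can be expanded into eight training samples. A minimal sketch of this kind of augmentation (the repo's actual implementation may differ in data layout):

```python
# Illustrative 8-fold symmetry augmentation of a self-play sample.
# state: (C, H, W) feature planes; pi: (H, W) move-probability map; z: game outcome.
import numpy as np

def augment(state, pi, z):
    """Return the 8 rotations/reflections of a square-board training sample."""
    samples = []
    for k in range(4):                       # four 90-degree rotations
        s = np.rot90(state, k, axes=(1, 2))  # rotate every feature plane
        p = np.rot90(pi, k)                  # rotate the policy target the same way
        samples.append((s.copy(), p.copy(), z))
        samples.append((np.flip(s, axis=2).copy(),  # horizontal flip of each rotation
                        np.fliplr(p).copy(), z))
    return samples
```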
Output models:
- `current_policy.model`: the most recently trained network.
- `best_policy.model`: the network with the best evaluation performance so far.
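For illustration, the periodic evaluation step can be thought of as the loop below; `play_vs_pure_mcts` and `policy.save` are hypothetical names used only for this sketch, not the repo's API:

```python
# Sketch of the "evaluate and keep the best model" step (hypothetical helpers).
best_win_rate = 0.0

def evaluate_and_save(policy, batch_idx, check_freq=50, n_games=10):
    """Every check_freq batches, play vs. pure MCTS and keep the best checkpoint."""
    global best_win_rate
    if (batch_idx + 1) % check_freq != 0:
        return
    # play_vs_pure_mcts() is assumed to return 1 for a win, 0.5 for a draw, 0 for a loss
    wins = sum(play_vs_pure_mcts(policy) for _ in range(n_games))
    win_rate = wins / n_games
    policy.save("current_policy.model")      # always keep the latest network
    if win_rate > best_win_rate:             # overwrite the best model on improvement
        best_win_rate = win_rate
        policy.save("best_policy.model")
```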
| Parameter | Description |
|---|---|
| `board_width` / `board_height` | Board dimensions; adjust `n_in_row` accordingly when changing them. |
| `n_in_row` | Win condition (five in a row). Determines game difficulty together with the board size. |
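As an illustration of what `n_in_row` controls, a straightforward win check counts consecutive stones in the four line directions from the last move (a standalone sketch, not the repo's implementation):

```python
# Standalone sketch of an n-in-row win check (not the repo's implementation).
def has_n_in_row(board, row, col, player, n_in_row=5):
    """Return True if player's stone at (row, col) completes n_in_row in any direction.
    board is a 2D list with entries such as 0 (empty), 1 and 2 (players)."""
    height, width = len(board), len(board[0])
    for dr, dc in ((0, 1), (1, 0), (1, 1), (1, -1)):  # horizontal, vertical, two diagonals
        count = 1
        for sign in (1, -1):                          # walk both ways from (row, col)
            r, c = row + sign * dr, col + sign * dc
            while 0 <= r < height and 0 <= c < width and board[r][c] == player:
                count += 1
                r, c = r + sign * dr, c + sign * dc
        if count >= n_in_row:
            return True
    return False
```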
| Parameter | Description |
|---|---|
| `num_channels` | Number of feature channels in the residual blocks. Higher values increase model capacity. |
| `num_res_blocks` | Number of residual blocks in the tower. More blocks enable deeper feature extraction. |
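The sketch below shows how `num_channels` and `num_res_blocks` typically shape a ResNet-style policy-value network with zero-padded convolutions; it is an illustration under those assumptions, not the repo's exact architecture:

```python
# Illustrative ResNet-style policy-value network (not the repo's exact code).
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        # padding=1 zero-pads the 3x3 convolutions so the board size is preserved
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)                          # residual connection

class PolicyValueNet(nn.Module):
    def __init__(self, board_size, in_planes=4, num_channels=128, num_res_blocks=5):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_planes, num_channels, 3, padding=1),
            nn.BatchNorm2d(num_channels), nn.ReLU())
        self.tower = nn.Sequential(
            *[ResBlock(num_channels) for _ in range(num_res_blocks)])
        self.policy_head = nn.Sequential(               # move probabilities over the board
            nn.Conv2d(num_channels, 2, 1), nn.Flatten(),
            nn.Linear(2 * board_size * board_size, board_size * board_size))
        self.value_head = nn.Sequential(                # scalar evaluation in [-1, 1]
            nn.Conv2d(num_channels, 1, 1), nn.Flatten(),
            nn.Linear(board_size * board_size, 64), nn.ReLU(),
            nn.Linear(64, 1), nn.Tanh())

    def forward(self, x):
        x = self.tower(self.stem(x))
        return F.log_softmax(self.policy_head(x), dim=1), self.value_head(x)
```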
| Parameter | Description |
|---|---|
| `learn_rate` | Initial Adam learning rate. Dynamically scaled by `lr_multiplier` based on KL divergence. |
| `lr_multiplier` | Multiplicatively adjusted when the KL divergence exceeds or falls below thresholds, controlling learning-rate decay or recovery. |
| `temp` | Temperature during self-play, controlling move exploration; can be lowered later in training to reduce randomness. |
| `n_playout` | Number of MCTS simulations per move. Higher values increase strength but also inference time. |
| `c_puct` | MCTS exploration coefficient, balancing high visit counts against high-scoring nodes. |
| `buffer_size` | Self-play data buffer capacity; larger values retain more historical games for training. |
| `batch_size` | Number of samples per gradient update. Adjust based on GPU memory. |
| `play_batch_size` | Number of games generated per self-play round. |
| `epochs` | Number of mini-batch iterations per update, improving convergence speed. |
| `kl_targ` | Target KL divergence, limiting the policy change between old and new networks; works with `lr_multiplier` to control the step size. |
| `check_freq` | Frequency (in batches) of MCTS evaluation and model saving. |
| `game_batch_num` | Upper limit of the training loop; Ctrl+C saves the current best model. |
| `pure_mcts_playout_num` | Number of simulations for the pure MCTS opponent during evaluation. Higher values make evaluation stricter. |
| `use_gpu` | Whether to use GPU acceleration for training and inference. |
| `init_model` | Path to the initial model file used to resume training from a checkpoint. |
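The interaction of `learn_rate`, `lr_multiplier`, and `kl_targ` follows the usual AlphaZero_Gomoku-style rule: the multiplier shrinks when the measured KL divergence is well above the target and grows when it is well below it. A sketch of that rule (threshold factors and clamping values are illustrative, not necessarily the repo's exact numbers):

```python
# Illustrative KL-driven learning-rate adjustment; thresholds/limits are assumptions.
def adjust_lr_multiplier(kl, kl_targ, lr_multiplier):
    """Shrink the multiplier when the policy moved too far, grow it when it barely moved."""
    if kl > kl_targ * 2 and lr_multiplier > 0.1:
        lr_multiplier /= 1.5     # policy changed too much -> decay the effective step size
    elif kl < kl_targ / 2 and lr_multiplier < 10:
        lr_multiplier *= 1.5     # policy barely changed -> recover the effective step size
    return lr_multiplier

# The effective Adam learning rate for the next update is then learn_rate * lr_multiplier,
# e.g. optimizer.param_groups[0]["lr"] = learn_rate * lr_multiplier
```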
| Parameter | Description |
|---|---|
| `model_file` | Path to the model file used for human vs. AI games. |
| `start_player` | Set to 0 for the human to move first, 1 for the AI to move first. |
| `n_playout` | Number of MCTS simulations per move for the AI during human play. |
| `c_puct` | MCTS exploration coefficient for human play. |
| `use_gpu` | Whether to use GPU acceleration for inference during human play. |
GPU Memory Optimization: Adjust `batch_size`, `num_channels`, and `num_res_blocks` according to your GPU memory. Lower values reduce model size and memory usage.
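If you are unsure how much GPU memory is available, PyTorch can report it; the snippet below only prints the totals so that `batch_size`, `num_channels`, and `num_res_blocks` can be sized accordingly:

```python
# Report GPU memory so batch_size / num_channels / num_res_blocks can be sized to fit.
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    total_gb = props.total_memory / 1024**3
    allocated_gb = torch.cuda.memory_allocated(0) / 1024**3
    print(f"{props.name}: {total_gb:.1f} GiB total, {allocated_gb:.2f} GiB allocated")
else:
    print("No CUDA device found; set use_gpu to false to train on the CPU.")
```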
To evaluate the strength of your trained model, you can pit it against a pure MCTS (Monte Carlo Tree Search) opponent using the battle.py script.
```bash
python battle.py
```

This script will:
- Load the trained model specified in `config.json` under `human_play.model_file`.
- Create a pure MCTS player to act as the opponent.
- Start a game and print the board state to the console after each move.
You can adjust the battle parameters directly within the battle.py file:
- Opponent Strength: Modify the `pure_mcts_playout` variable to increase or decrease the thinking time and strength of the pure MCTS player.
- First Move: Change the `start_player` argument in the `game.start_play()` call. Set it to `0` for your trained model to go first, or `1` for the pure MCTS player to go first.
- Special thanks to AlphaZero_Gomoku for providing the core codebase.
- Play, Learn, Conquer: The Journey to a Self-Improving Game AI (https://benlu.substack.com/p/play-learn-conquer-the-journey-to)
- Silver et al., Mastering the game of Go with deep neural networks and tree search (Nature, 2016)
- Silver et al., Mastering the game of Go without human knowledge (Nature, 2017)
- Special thanks to the reference materials above, which were of great help in planning our training.
