This project provides tools to create training datasets for NitroGen using the BizHawk emulator.
It consists of two parts:
- Lua Script (`export_dataset.lua`): Runs inside BizHawk to export gameplay frames and controller inputs.
- Python Script (`convert_dataset.py`): Converts the exported data into a Parquet file compatible with NitroGen training and pre-processes the images (saved as the Hugging Face `datasets` Image type).
Requirements:
- BizHawk Emulator (Version 2.9+ recommended)
- Python 3.8+
- Git (optional, for cloning)
- Clone this repository or download the files.
- Install Python dependencies:

```bash
pip install -r requirements.txt
```

Exporting the dataset in BizHawk:
- Open BizHawk.
- Load your ROM (NES or SNES recommended).
- Load a Movie file (`.bk2`) that you want to convert to a dataset.
  - Tip: Ensure the movie mode is set to "Play".
- Open the Lua Console (`Tools > Lua Console`).
- Click `Script > Open Script` and select `export_dataset.lua`.
- The script will automatically create a `nitrogen_dataset/` folder and start exporting.
- The script will automatically stop when the movie finishes.
Note: The script creates three items in your output directory:
- `frames/`: Folder containing raw `frame_XXXXXX.png` images.
- `actions.csv`: Raw CSV file with input data.
- `dataset_config.json`: Configuration file containing the detected logic (e.g., resize mode based on console).
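If you want to sanity-check the raw export before converting it, a minimal sketch like the following works (it assumes the default `nitrogen_dataset/` output folder; the exact CSV columns and JSON keys depend on what the Lua script writes):

```python
import json

import pandas as pd

# Peek at the raw Lua export (default output folder assumed).
actions = pd.read_csv("nitrogen_dataset/actions.csv")
print(actions.columns.tolist())      # recorded input columns, one row per frame
print(len(actions), "frames exported")

with open("nitrogen_dataset/dataset_config.json") as f:
    config = json.load(f)
print(config)                        # detected settings, e.g. the resize mode for the console
```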
Once the Lua export is complete, use the Python script to package the data and process the images.
- Open a terminal in the project directory.
- Run the converter:
```bash
# Default usage
# Reads from 'nitrogen_dataset/'
# Saves parquet to 'nitrogen_dataset/train.parquet' (images embedded)
python convert_dataset.py

# Specify a custom input directory
python convert_dataset.py --input /path/to/my_export

# Skip image processing (only convert the CSV)
python convert_dataset.py --skip-images
```

- The output will contain:
  - `train.parquet`: The single-file dataset containing both actions and embedded images (Hugging Face `datasets`-compatible format).
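To verify the result, you can load the converted Parquet file locally with the `datasets` library (a quick check, assuming the default output path):

```python
from datasets import load_dataset

# Load the freshly converted file straight from disk.
dataset = load_dataset("parquet", data_files="nitrogen_dataset/train.parquet", split="train")
print(dataset)             # column names and number of rows
print(dataset[0].keys())   # first sample: embedded image plus the recorded inputs
```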
You can also run the converter using Docker, which handles all dependencies (including OpenCV) for you.
- Build the image:

```bash
docker build -t nitrogen-converter .
```
- Run the container: You need to mount your local dataset folder into the container.

```bash
# Run against the 'nitrogen_dataset' folder in your current directory
docker run --rm -v $(pwd)/nitrogen_dataset:/app/dataset nitrogen-converter --input /app/dataset --output /app/dataset/train.parquet
```
The scripts automatically detect the best resize mode based on the console:
- NES: Uses Crop mode (centers and crops to 256x256) to remove overscan borders.
- SNES: Uses Pad mode (adds black borders) to maintain aspect ratio within 256x256.
This configuration is saved in `dataset_config.json` by the Lua script and applied by the Python script.
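For reference, the two modes roughly correspond to the following logic (a simplified sketch using OpenCV, not the exact code in `convert_dataset.py`):

```python
import cv2
import numpy as np

def crop_to_square(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Crop mode: center-crop to a square, then scale to size x size (NES-style)."""
    h, w = img.shape[:2]
    side = min(h, w)
    top, left = (h - side) // 2, (w - side) // 2
    square = img[top:top + side, left:left + side]
    return cv2.resize(square, (size, size), interpolation=cv2.INTER_AREA)

def pad_to_square(img: np.ndarray, size: int = 256) -> np.ndarray:
    """Pad mode: scale to fit inside size x size, then add black borders (SNES-style)."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_w, new_h = int(round(w * scale)), int(round(h * scale))
    resized = cv2.resize(img, (new_w, new_h), interpolation=cv2.INTER_AREA)
    canvas = np.zeros((size, size, 3), dtype=resized.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = resized
    return canvas
```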
This project includes tests for both the Python and Lua components.
The Python tests cover image preprocessing and dataset conversion logic.
- The project dependencies are required (installed via `requirements.txt`), plus `pytest`:

```bash
pip install pytest
```
- Run the tests:

```bash
pytest tests/
```
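As an illustration of the kind of property these tests check, here is a minimal, self-contained pytest sketch for the pad-resize behaviour (`pad_to_square` here is a stand-in helper, not necessarily the function name used in `convert_dataset.py`):

```python
import numpy as np

def pad_to_square(img, size=256):
    """Scale to fit within size x size and pad the rest with black pixels."""
    h, w = img.shape[:2]
    scale = size / max(h, w)
    new_h, new_w = int(h * scale), int(w * scale)
    ys = (np.arange(new_h) / scale).astype(int)   # nearest-neighbour row indices
    xs = (np.arange(new_w) / scale).astype(int)   # nearest-neighbour column indices
    canvas = np.zeros((size, size, 3), dtype=img.dtype)
    top, left = (size - new_h) // 2, (size - new_w) // 2
    canvas[top:top + new_h, left:left + new_w] = img[ys][:, xs]
    return canvas

def test_pad_output_is_square_with_black_borders():
    frame = np.full((224, 256, 3), 200, dtype=np.uint8)      # SNES-like frame
    out = pad_to_square(frame)
    assert out.shape == (256, 256, 3)
    assert (out[:16] == 0).all() and (out[-16:] == 0).all()  # padding rows stay black
    assert (out[16:240] == 200).all()                        # image content preserved
```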
The Lua tests validate the input mapping logic and ensure the script structure is correct.
- Requires a standard Lua 5.4 interpreter.
- Run the tests:

```bash
lua tests/test_export_dataset.lua
```
Check out a real-world example of a dataset created with this tool:
A complete gameplay dataset of World 1, formatted for training vision-to-action models like NitroGen.
- Game: Felix the Cat (NES)
- Format: Parquet (images + controller inputs)
- Size: ~25,000 frames
- Source: Recorded via BizHawk, processed with this generator.
```python
from datasets import load_dataset

# Load the dataset directly from Hugging Face
dataset = load_dataset("artryazanov/nitrogen-bizhawk-nes-felix-the-cat-world-1", split="train")
```

This project is licensed under the MIT License - see the LICENSE file for details.