This project provides tools to create training datasets for NitroGen using the BizHawk emulator.

artryazanov/nitrogen-bizhawk-dataset-generator

🎮 NitroGen BizHawk Dataset Generator

This project provides tools to create training datasets for NitroGen using the BizHawk emulator.

It consists of two parts:

  1. Lua Script (export_dataset.lua): Runs inside BizHawk to export gameplay frames and controller input.
  2. Python Script (convert_dataset.py): Pre-processes the exported frames and packages them, together with the controller inputs, into a single Parquet file compatible with NitroGen training. Images are stored as the Hugging Face datasets Image type.

📋 Prerequisites

  • BizHawk Emulator (Version 2.9+ recommended)
  • Python 3.8+
  • Git (optional, for cloning)

📦 Installation

  1. Clone this repository or download the files.
  2. Install Python dependencies:
    pip install -r requirements.txt

🚀 Usage

Phase 1: 🎞️ Exporting from BizHawk

  1. Open BizHawk.
  2. Load your ROM (NES or SNES recommended).
  3. Load a Movie file (.bk2) that you want to convert to a dataset.
    • Tip: Ensure the movie mode is set to "Play".
  4. Open the Lua Console (Tools > Lua Console).
  5. Click Script > Open Script and select export_dataset.lua.
  6. The script will automatically create a nitrogen_dataset/ folder and start exporting.
  7. The script will automatically stop when the movie finishes.

Note: The script creates three items in your output directory:

  • frames/: Folder containing raw frame_XXXXXX.png images.
  • actions.csv: Raw CSV file with input data.
  • dataset_config.json: Configuration file containing the detected settings (e.g., the resize mode chosen for the console).
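
Before running the converter, it can help to confirm that the export directory actually contains these three items. The following is a minimal sketch, not part of this repository: the function name check_export is hypothetical, and it assumes that frame_XXXXXX.png means a six-digit, zero-padded frame index.

```python
import json
import re
from pathlib import Path

# Assumption: "frame_XXXXXX.png" means a six-digit, zero-padded frame index.
FRAME_PATTERN = re.compile(r"^frame_\d{6}\.png$")


def check_export(export_dir):
    """Return a list of problems found in an export directory (empty if none)."""
    root = Path(export_dir)
    problems = []

    # frames/ must exist and hold at least one exported frame.
    frames_dir = root / "frames"
    if not frames_dir.is_dir():
        problems.append("missing frames/ directory")
    else:
        frames = [p for p in frames_dir.iterdir() if FRAME_PATTERN.match(p.name)]
        if not frames:
            problems.append("frames/ contains no frame_XXXXXX.png files")

    # actions.csv holds the raw controller input.
    if not (root / "actions.csv").is_file():
        problems.append("missing actions.csv")

    # dataset_config.json must exist and parse as JSON; its keys are not checked.
    config_path = root / "dataset_config.json"
    if not config_path.is_file():
        problems.append("missing dataset_config.json")
    else:
        try:
            json.loads(config_path.read_text(encoding="utf-8"))
        except json.JSONDecodeError as exc:
            problems.append(f"dataset_config.json is not valid JSON: {exc}")

    return problems
```

Calling check_export("nitrogen_dataset") returns an empty list when all three items are present and the configuration file parses.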

Phase 2: 🖼️ Converting and Processing

Once the Lua export is complete, use the Python script to package the data and process the images.

  1. Open a terminal in the project directory.
  2. Run the converter:
    # Default usage
    # Reads from 'nitrogen_dataset/'
    # Saves parquet to 'nitrogen_dataset/train.parquet' (images embedded)
    python convert_dataset.py

    # Specify custom input directory
    python convert_dataset.py --input /path/to/my_export

    # Skip image processing (only convert CSV)
    python convert_dataset.py --skip-images
  3. The output will contain:
    • train.parquet: The single-file dataset containing both actions and embedded images (Hugging Face datasets compatible format).
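
To make the defaults above concrete, here is a hedged sketch of how a command-line interface with these flags could be declared. It is an illustration only and not the actual source of convert_dataset.py: the function names build_parser and resolve_paths are hypothetical, and deriving the output path from a custom --input directory is an assumption.

```python
import argparse
from pathlib import Path


def build_parser():
    """Declare the flags this README mentions (illustrative, not the real script)."""
    parser = argparse.ArgumentParser(
        description="Convert a BizHawk export into a Parquet dataset."
    )
    parser.add_argument("--input", default="nitrogen_dataset",
                        help="directory produced by export_dataset.lua")
    parser.add_argument("--output", default=None,
                        help="path of the Parquet file to write")
    parser.add_argument("--skip-images", action="store_true",
                        help="only convert the CSV, skip image processing")
    return parser


def resolve_paths(args):
    """Return (input_dir, output_file).

    Assumption: when --output is omitted, train.parquet is written inside
    the input directory, matching the documented default.
    """
    input_dir = Path(args.input)
    output = Path(args.output) if args.output else input_dir / "train.parquet"
    return input_dir, output
```

With no arguments, this sketch resolves to nitrogen_dataset/ and nitrogen_dataset/train.parquet, the same defaults as the commands above.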

🐳 Running via Docker

You can also run the converter using Docker, which handles all dependencies (including OpenCV) for you.

  1. Build the Image:

    docker build -t nitrogen-converter .
  2. Run the Container: You need to mount your local dataset folder into the container.

    # Run against the 'nitrogen_dataset' folder in your current directory
    docker run --rm -v "$(pwd)/nitrogen_dataset:/app/dataset" nitrogen-converter --input /app/dataset --output /app/dataset/train.parquet
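
If you launch the container from a script rather than a shell, the same command can be assembled in Python, which avoids relying on $(pwd). This is a small sketch, not part of this repository: the helper name docker_run_command is hypothetical, while the image name and container paths are taken from the commands above.

```python
from pathlib import Path


def docker_run_command(dataset_dir="nitrogen_dataset", image="nitrogen-converter"):
    """Build the argument list for the 'docker run' invocation shown above."""
    # Docker bind mounts need an absolute host path.
    host_dir = Path(dataset_dir).resolve()
    return [
        "docker", "run", "--rm",
        "-v", f"{host_dir}:/app/dataset",
        image,
        "--input", "/app/dataset",
        "--output", "/app/dataset/train.parquet",
    ]
```

The resulting list can be passed to subprocess.run(docker_run_command(), check=True).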

🧩 Image Processing Logic

The resize mode is chosen automatically based on the console:

  • NES: Uses Crop mode (centers and crops to 256x256) to remove overscan borders.
  • SNES: Uses Pad mode (adds black borders) to maintain aspect ratio within 256x256.

This configuration is saved in dataset_config.json by the Lua script and applied by the Python script.
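
The geometry behind the two modes can be sketched with plain arithmetic. This is an illustration under stated assumptions, not the code used by convert_dataset.py: it assumes Crop scales the shorter side to 256 and trims the overflow from the center, and Pad scales the longer side to 256 and centers the result on a black canvas. The name plan_resize and the mode strings "crop" and "pad" are hypothetical.

```python
from typing import NamedTuple

TARGET = 256  # output side length stated in this README


class ResizePlan(NamedTuple):
    scaled_w: int  # frame width after scaling
    scaled_h: int  # frame height after scaling
    offset_x: int  # pixels trimmed (crop) or added (pad) on the left
    offset_y: int  # pixels trimmed (crop) or added (pad) on the top


def plan_resize(width, height, mode, target=TARGET):
    """Compute scaled size and centering offsets for a square output."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    if mode == "crop":
        # Assumption: the shorter side fills the target, the longer side overflows.
        reference = min(width, height)
    elif mode == "pad":
        # Assumption: the longer side fills the target, the shorter side is padded.
        reference = max(width, height)
    else:
        raise ValueError(f"unknown mode: {mode!r}")

    scaled_w = max(1, round(width * target / reference))
    scaled_h = max(1, round(height * target / reference))
    # Center the frame: half of the overflow (crop) or of the gap (pad).
    offset_x = abs(scaled_w - target) // 2
    offset_y = abs(scaled_h - target) // 2
    return ResizePlan(scaled_w, scaled_h, offset_x, offset_y)
```

For a typical 256x240 NES frame, this Crop plan scales to 273x256 and trims 8 pixels from the left edge (the remaining overflow comes off the right). For a typical 256x224 SNES frame, the Pad plan keeps the frame at 256x224 and adds 16 black rows above and below.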

🧪 Testing

This project includes tests for both the Python and Lua components.

🐍 Python Tests

The Python tests cover image preprocessing and dataset conversion logic.

  1. Install the project dependencies (via requirements.txt), plus pytest:
    pip install pytest
  2. Run the tests:
    pytest tests/

🌙 Lua Tests

The Lua tests validate the input mapping logic and ensure the script structure is correct.

  1. Requires a standard Lua 5.4 interpreter.
  2. Run the tests:
    lua tests/test_export_dataset.lua

🌍 Generated Datasets

Check out a real-world example of a dataset created with this tool:

A complete gameplay dataset of World 1, formatted for training vision-to-action models like NitroGen.

  • Game: Felix the Cat (NES)
  • Format: Parquet (images + controller inputs)
  • Size: ~25,000 frames
  • Source: Recorded via BizHawk, processed with this generator.

    from datasets import load_dataset

    # Load the dataset directly from Hugging Face
    dataset = load_dataset("artryazanov/nitrogen-bizhawk-nes-felix-the-cat-world-1", split="train")

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.
