
[ICLR 2026] GIR-Bench: Versatile Benchmark for Generating Images with Reasoning


GIR-Bench is a reasoning-centric benchmark for multimodal unified models, evaluating them across Understanding–Generation Consistency (UGC), Text-to-Image, and Editing, and revealing the persistent gap between reasoning and faithful generation.

GIR-Bench Overview

📣 News

  • 2026/01/26: 🎉 GIR-Bench has been accepted to ICLR 2026!
  • 2025/10/14: We released the evaluation code and the dataset for GIR-Bench.

🔧 Preparations

Environment Setup

conda create -n GIR-Bench python=3.10
conda activate GIR-Bench
pip install -r requirement.txt
git clone https://github.com/facebookresearch/dinov3.git

Dataset Download

huggingface-cli download --resume-download --repo-type dataset lihxxx/GIR-Bench --local-dir ./dataset
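
If you prefer Python, the snippet below is a minimal equivalent using the huggingface_hub library (a sketch, assuming huggingface_hub is installed in the environment; interrupted downloads resume automatically):

from huggingface_hub import snapshot_download

# Fetch the GIR-Bench dataset snapshot into ./dataset, mirroring the CLI command above.
snapshot_download(
    repo_id="lihxxx/GIR-Bench",
    repo_type="dataset",
    local_dir="./dataset",
)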

Pre-Trained Weights Download

mkdir weights
huggingface-cli download --resume-download OpenGVLab/InternVL3_5-38B-HF --local-dir ./weights/InternVL3_5-38B-HF

Please download dinov3_vit7b16_pretrain_lvd1689m-a955f4ea.pth from the Meta DINOv3 Downloads page and place it under weights/.
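
Before running the evaluation, you can sanity-check that both weights are in place with a short script (a hypothetical helper, not part of the repo; the paths follow the download steps above):

from pathlib import Path

# Expected locations from the download steps above.
expected = [
    Path("weights/InternVL3_5-38B-HF"),                             # HF snapshot directory
    Path("weights/dinov3_vit7b16_pretrain_lvd1689m-a955f4ea.pth"),  # manually downloaded checkpoint
]
for p in expected:
    print(f"[{'ok' if p.exists() else 'MISSING'}] {p}")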

🔥 Evaluation

GIR-Bench-UGC and GIR-Bench-T2I

bash run_evaluation_gen.sh

GIR-Bench-Edit

bash run_evaluation_edit.sh

Evaluate Your Own Model

Please organize your model outputs as shown below and place them in the corresponding MODELS_DIR. Default locations:

  • t2i: MODELS_DIR=./dataset/generation/t2i
  • editing: MODELS_DIR=./dataset/generation/editing

Recommended directory and naming conventions (filenames must match the task IDs in the dataset); a small validation sketch follows the tree below:

dataset/
└── generation/
    ├── t2i/
    │   └── <YourModel>/
    │       ├── SpatialLayout/
    │       │   └── <image_id>.png
    │       ├── NumericalReasoning/
    │       │   └── <image_id>.png
    │       ├── TextRendering/
    │       │   └── <image_id>.png
    │       ├── Zoology/
    │       │   └── <image_id>.png
    │       ├── Botany/
    │       │   └── <image_id>.png
    │       └── Geography/
    │           └── <image_id>.png
    └── editing/
        └── <YourModel>/
            ├── ReasoningPerception/
            │   └── <image_id>.png
            ├── VisualLogic/
            │   └── <image_id>.png
            └── VisualPuzzle/
                └── <image_id>.png
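
The following is a hypothetical validation helper (not shipped with GIR-Bench) that checks a model's output folders against the layout above before you launch the evaluation scripts; the model name "YourModel" is a placeholder:

from pathlib import Path

T2I_TASKS = ["SpatialLayout", "NumericalReasoning", "TextRendering",
             "Zoology", "Botany", "Geography"]
EDIT_TASKS = ["ReasoningPerception", "VisualLogic", "VisualPuzzle"]

def check_layout(models_dir, tasks, model):
    # Report how many <image_id>.png files each task folder contains.
    root = Path(models_dir) / model
    for task in tasks:
        pngs = list((root / task).glob("*.png"))
        status = f"{len(pngs)} png(s)" if pngs else "MISSING or empty"
        print(f"{root / task}: {status}")

check_layout("./dataset/generation/t2i", T2I_TASKS, "YourModel")
check_layout("./dataset/generation/editing", EDIT_TASKS, "YourModel")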

By default, the scripts evaluate all subfolders under the configured MODELS_DIR. To evaluate only specific models:

  1. Set the MODELS array in the shell script:
MODELS=("YourModel1" "YourModel2")
  2. Enable the --models flag by uncommenting it in each Python call within the script:
# ... inside each python command block
--models "${MODELS[@]}"
  3. Run the scripts:
bash run_evaluation_gen.sh
bash run_evaluation_edit.sh

🔍 Citation

@article{li2025gir-bench,
  title={GIR-Bench: Versatile Benchmark for Generating Images with Reasoning},
  author={Hongxiang Li and Yaowei Li and Bin Lin and Yuwei Niu and Yuhang Yang and Xiaoshuang Huang and Jiayin Cai and Xiaolong Jiang and Yao Hu and Long Chen},
  journal={arXiv preprint arXiv:2510.11026},
  year={2025}
}
