A workspace to pretrain, finetune, and evaluate DEJIMA VLM. It includes simple environment tooling, training scripts, and evaluation runners for Japanese and general benchmarks.
Related resources can be found below:
- Project page: mil-tokyo/DEJIMA-dataset
- Dataset construction code: mil-tokyo/DEJIMA-construct
- Training / inference code: mil-tokyo/DEJIMA-VLM
- Dataset (Hugging Face): MIL-UT/DEJIMA-dataset
A Japanese version of this README is available here: Japanese README
Repository layout:

```
env/
  env.sh                         # Environment variables and helper sourcing
  image.def                      # Container image recipe (Apptainer/Singularity definition)
libraries/
  flash_attn-*.whl               # Prebuilt wheels (e.g., FlashAttention)
LLaVA-NeXT/                      # Submodule
scripts/
  pretrain.sh                    # Pretraining entry script
  finetune.sh                    # Finetuning entry script
  eval/
    heron-bench.sh               # Heron Bench evaluation
    ja-vlm-bench-in-the-wild.sh  # Japanese VLM-in-the-wild evaluation
src/
  eval/
    heron-bench.py               # Python runner for Heron Bench
    ja-vlm-bench-in-the-wild.py  # Python runner for Japanese VLM-in-the-wild
```
Setup steps:

- Build the container image from `env/image.def` with Apptainer/Singularity.
- Clone the repository with submodules.
- Download the FlashAttention wheel into `libraries/`.
- Run the environment script `env/env.sh`.
Clone the repository:

```bash
git clone --recurse-submodules https://github.com/mil-tokyo/DEJIMA-VLM
# If already cloned without submodules
git submodule update --init --recursive
```

For GPU training on Linux with CUDA 12 and PyTorch 2.1, download the prebuilt FlashAttention wheel into `libraries/`:
```bash
# Create the libraries directory if missing
mkdir -p libraries
# Download from the official FlashAttention releases
curl -L \
  -o libraries/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
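Whether this wheel gets installed automatically depends on `env/env.sh` and the container recipe; if your setup does not handle it, a minimal sketch of installing it manually (assumes the active environment matches the wheel: Python 3.10, CUDA 12, PyTorch 2.1):

```bash
# Install the prebuilt FlashAttention wheel into the current Python environment
pip install libraries/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```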
Then source the environment script from the repository root:

```bash
source env/env.sh
```

Requirement: Apptainer (Singularity) must be available in your environment to build and run the provided container definition.
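A minimal sketch of building and entering the container, assuming Apptainer is installed; the image name `dejima.sif` is arbitrary and not fixed by this repository:

```bash
# Build the container image from the provided definition file
apptainer build dejima.sif env/image.def

# Open an interactive shell in the container with NVIDIA GPU support
apptainer shell --nv dejima.sif
```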
Use the shell scripts for consistent runs.
- Overview: Training data is a JSON array; each element is one sample.
- Required fields: `id` (integer), `image` (relative path string), `conversations` (array).
- Conversation schema: `conversations` is an array of messages; each message contains `from` and `value`.
  - `from`: either `"human"` or `"gpt"`.
  - `value`: the message text. For prompts that reference an image, include `<image>` at the beginning.
- Image path: `image` is relative to the dataset root, e.g., `00871/008713014.jpg`.
Provide data in the following format:
```json
[
  {
    "id": 8713014,
    "image": "00871/008713014.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nねぶた山車のイラストに描かれている武器は何ですか?"
      },
      {
        "from": "gpt",
        "value": "イラストには、ねぶた山車の装飾として太刀(つるぎ)が描かれています。この太刀は画面右側に大きく配置され、伝統的な武具のデザインが詳細に表現されています。"
      }
    ]
  },
  { ... }
]
```
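As a quick sanity check on a dataset file (the filename `train.json` and the use of `jq` are only illustrative, not required by this repository):

```bash
# Exit code 0 only if every sample has the required id, image, and conversations fields
jq -e 'all(.[]; has("id") and has("image") and has("conversations"))' train.json
```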
Run pretraining:

```bash
./scripts/pretrain.sh
```

Run finetuning:

```bash
./scripts/finetune.sh
```

Notes:
- The scripts assume the LLaVA-NeXT code is present in `LLaVA-NeXT/` (a quick check is sketched below).
- Customize training arguments inside `scripts/pretrain.sh` and `scripts/finetune.sh`.
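Before launching a run, you can confirm the submodule is populated, assuming its path is `LLaVA-NeXT/` as in the layout above:

```bash
# A leading "-" in the output means the LLaVA-NeXT submodule has not been initialized yet
git submodule status LLaVA-NeXT
```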
Run evaluations via the provided bash scripts or the Python runners under `src/eval/`.
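The bash wrappers below are the intended entry points. To call a Python runner directly instead, first check its options; this sketch assumes an argparse-style CLI, which the runners may or may not expose, so consult the sources under `src/eval/` for the actual arguments:

```bash
# List the options accepted by the Heron Bench runner (assumes an argparse-style --help)
python src/eval/heron-bench.py --help
```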
Heron Bench:

```bash
# Bash wrapper
./scripts/eval/heron-bench.sh
```

Japanese VLM-in-the-wild:

```bash
# Bash wrapper
./scripts/eval/ja-vlm-bench-in-the-wild.sh
```

License: Apache License 2.0
If you use DEJIMA in your research, please cite our paper (to appear).
```bibtex
@misc{katsube2025dejimanovellargescalejapanese,
  title={DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering},
  author={Toshiki Katsube and Taiga Fukuhara and Kenichiro Ando and Yusuke Mukuta and Kohei Uehara and Tatsuya Harada},
  year={2025},
  eprint={2512.00773},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.00773},
}
```