A workspace to pretrain, finetune, and evaluate DEJIMA VLM. It includes simple environment tooling, training scripts, and evaluation runners for Japanese and general benchmarks.
Related resources can be found below:
- Project page: mil-tokyo/DEJIMA-dataset
- Dataset construction code: mil-tokyo/DEJIMA-construct
- Training / inference code: mil-tokyo/DEJIMA-VLM
- Dataset (Hugging Face): MIL-UT/DEJIMA-dataset
A Japanese version of this README is available here: Japanese README
Repository layout:

```
env/
  env.sh                         # Environment variables and helper sourcing
  image.def                      # Container image recipe (Apptainer/Singularity definition)
libraries/
  flash_attn-*.whl               # Prebuilt wheels (e.g., FlashAttention)
LLaVA-NeXT/                      # Submodule
scripts/
  pretrain.sh                    # Pretraining entry script
  finetune.sh                    # Finetuning entry script
  eval/
    heron-bench.sh               # Heron Bench evaluation
    ja-vlm-bench-in-the-wild.sh  # Japanese VLM-in-the-wild evaluation
src/
  eval/
    heron-bench.py               # Python runner for Heron Bench
    ja-vlm-bench-in-the-wild.py  # Python runner for Japanese VLM-in-the-wild
```
Setup steps:

- Build the container image from `env/image.def` with Apptainer/Singularity.
- Clone the repository with submodules.
- Download the FlashAttention wheel into `libraries/`.
- Run the environment script `env/env.sh`.
Clone the repository:

```bash
git clone --recurse-submodules https://github.com/mil-tokyo/DEJIMA-VLM
# If already cloned without submodules
git submodule update --init --recursive
```

For GPU training on Linux with CUDA 12 and PyTorch 2.1, download the prebuilt FlashAttention wheel into `libraries/`:
```bash
# Create the libraries directory if missing
mkdir -p libraries
# Download from the official FlashAttention releases
curl -L \
  -o libraries/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```
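Whether this wheel gets installed automatically depends on `env/env.sh` and the container recipe; if your setup does not handle it, a minimal sketch of installing it manually (assumes the active environment matches the wheel: Python 3.10, CUDA 12, PyTorch 2.1):

```bash
# Install the prebuilt FlashAttention wheel into the current Python environment
pip install libraries/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
```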
Then source the environment script from the repository root:

```bash
source env/env.sh
```

Requirement: Apptainer (Singularity) must be available in your environment to build and run the provided container definition.
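A minimal sketch of building and entering the container, assuming Apptainer is installed; the image name `dejima.sif` is arbitrary and not fixed by this repository:

```bash
# Build the container image from the provided definition file
apptainer build dejima.sif env/image.def

# Open an interactive shell in the container with NVIDIA GPU support
apptainer shell --nv dejima.sif
```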
Use the shell scripts for consistent runs.
- Overview: Training data is a JSON array; each element is one sample.
- Required fields: `id` (integer), `image` (relative path string), `conversations` (array).
- Conversation schema: `conversations` is an array of messages; each message contains `from` and `value`.
  - `from`: either `"human"` or `"gpt"`.
  - `value`: the message text. For prompts that reference an image, include `<image>` at the beginning.
- Image path: `image` is relative to the dataset root, e.g., `00871/008713014.jpg`.
Provide data in the following format:
```json
[
  {
    "id": 8713014,
    "image": "00871/008713014.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nねぶた山車のイラストに描かれている武器は何ですか?"
      },
      {
        "from": "gpt",
        "value": "イラストには、ねぶた山車の装飾として太刀(つるぎ)が描かれています。この太刀は画面右側に大きく配置され、伝統的な武具のデザインが詳細に表現されています。"
      }
    ]
  },
  { ... }
]
```
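As a quick sanity check on a dataset file (the filename `train.json` and the use of `jq` are only illustrative, not required by this repository):

```bash
# Exit code 0 only if every sample has the required id, image, and conversations fields
jq -e 'all(.[]; has("id") and has("image") and has("conversations"))' train.json
```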
Run pretraining:

```bash
./scripts/pretrain.sh
```

Run finetuning:

```bash
./scripts/finetune.sh
```

Notes:
- The scripts assume the LLaVA-NeXT code is present in `LLaVA-NeXT/` (a quick check is sketched below).
- Customize training arguments inside `scripts/pretrain.sh` and `scripts/finetune.sh`.
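Before launching a run, you can confirm the submodule is populated, assuming its path is `LLaVA-NeXT/` as in the layout above:

```bash
# A leading "-" in the output means the LLaVA-NeXT submodule has not been initialized yet
git submodule status LLaVA-NeXT
```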
Run evaluations via the provided bash scripts or the Python runners under `src/eval/`.
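The bash wrappers below are the intended entry points. To call a Python runner directly instead, first check its options; this sketch assumes an argparse-style CLI, which the runners may or may not expose, so consult the sources under `src/eval/` for the actual arguments:

```bash
# List the options accepted by the Heron Bench runner (assumes an argparse-style --help)
python src/eval/heron-bench.py --help
```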
Heron Bench:

```bash
# Bash wrapper
./scripts/eval/heron-bench.sh
```

Japanese VLM-in-the-wild:

```bash
# Bash wrapper
./scripts/eval/ja-vlm-bench-in-the-wild.sh
```

License: Apache License 2.0
If you use DEJIMA in your research, please cite our paper (to appear).
```bibtex
@misc{katsube2025dejimanovellargescalejapanese,
  title={DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering},
  author={Toshiki Katsube and Taiga Fukuhara and Kenichiro Ando and Yusuke Mukuta and Kohei Uehara and Tatsuya Harada},
  year={2025},
  eprint={2512.00773},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.00773},
}
```