mil-tokyo/DEJIMA-VLM


DEJIMA VLM

A workspace to pretrain, finetune, and evaluate DEJIMA VLM. It includes simple environment tooling, training scripts, and evaluation runners for Japanese and general benchmarks.

Related resources can be found below:

A Japanese version of this README is also available (Japanese README).

Repository Layout

env/
  env.sh           # Environment variables and helper sourcing
  image.def        # Container image recipe (OCI/Singularity/Apptainer style)
libraries/
  flash_attn-*.whl # Prebuilt wheels (e.g., FlashAttention)
LLaVA-NeXT/        # submodule
scripts/
  pretrain.sh      # Pretraining entry script
  finetune.sh      # Finetuning entry script
  eval/
    heron-bench.sh                 # Heron Bench evaluation
    ja-vlm-bench-in-the-wild.sh    # Japanese VLM-in-the-wild evaluation
src/
  eval/
    heron-bench.py                 # Python runner for Heron Bench
    ja-vlm-bench-in-the-wild.py    # Python runner for Japanese VLM-in-the-wild

Requirements

  • Apptainer/Singularity to build env/image.def

Setup

  1. Clone the repository with submodules.
  2. Download the FlashAttention wheel into libraries/.
  3. Run the environment script env/env.sh.

Clone (with submodules)

git clone --recurse-submodules https://github.com/mil-tokyo/DEJIMA-VLM
# If already cloned without submodules
git submodule update --init --recursive

Download FlashAttention wheel

For GPU training on Linux (x86_64) with CUDA 12, PyTorch 2.1, and Python 3.10, download the matching prebuilt wheel into libraries/:

# Create libraries directory if missing
mkdir -p libraries

# Download from official FlashAttention releases
curl -L \
  -o libraries/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl \
  https://github.com/Dao-AILab/flash-attention/releases/download/v2.7.3/flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
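The wheel filename encodes the build it targets (Python tag, ABI tag, platform). As a minimal sketch, the tags can be read back from the filename and compared with the running interpreter before installing. The helper names `parse_wheel_tags` and `matches_interpreter` are hypothetical, and the parser assumes the simple five-part filename form without an optional build tag.

```python
import sys

WHEEL = "flash_attn-2.7.3+cu12torch2.1cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"


def parse_wheel_tags(filename):
    """Split a wheel filename into name, version, and compatibility tags.

    Assumption: the filename has exactly five dash-separated parts
    (no optional build tag).
    """
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def matches_interpreter(tags, version_info=sys.version_info):
    """Check whether the wheel's CPython tag matches the given Python version."""
    expected = f"cp{version_info[0]}{version_info[1]}"
    return tags["python"] == expected


tags = parse_wheel_tags(WHEEL)
print(tags)
print("matches this interpreter:", matches_interpreter(tags))
```

This only compares the Python tag; the CUDA and PyTorch versions embedded in the version string still need to match the container environment.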

Initialize environment

# Run from the repository root
source env/env.sh

Requirement: Apptainer (Singularity) must be available in your environment to build and run the provided container definition.

Training

Use the shell scripts for consistent runs.

Dataset Format (JSON)

  • Overview: Training data is a JSON array; each element is one sample.
  • Required fields: id (integer), image (relative path string), conversations (array).
  • Conversation schema: conversations is an array of messages; each message contains from and value.
    • from: either "human" or "gpt"
    • value: the message text. For prompts that reference an image, begin the text with the <image> placeholder followed by a newline (\n), as in the example below.
  • Image path: image is relative to the dataset root, e.g., 00871/008713014.jpg.
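Because image paths are stored relative to the dataset root, it can help to confirm that every referenced file exists before launching a run. The following is a minimal sketch, not part of this repository: `resolve_image` and `find_missing_images` are hypothetical helpers, and the second sample is invented for illustration.

```python
import tempfile
from pathlib import Path


def resolve_image(dataset_root, relative_path):
    """Join a sample's relative image path onto the dataset root."""
    return Path(dataset_root) / relative_path


def find_missing_images(samples, dataset_root):
    """Return the ids of samples whose image file does not exist on disk."""
    return [
        sample["id"]
        for sample in samples
        if not resolve_image(dataset_root, sample["image"]).is_file()
    ]


# Demonstration against a temporary directory standing in for the dataset root.
with tempfile.TemporaryDirectory() as root:
    image = resolve_image(root, "00871/008713014.jpg")
    image.parent.mkdir(parents=True)
    image.touch()

    samples = [
        {"id": 8713014, "image": "00871/008713014.jpg"},
        {"id": 1, "image": "00000/000000001.jpg"},  # hypothetical; file not created
    ]
    missing = find_missing_images(samples, root)

print(missing)
```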

Example

Provide data in the following format:

[
  {
    "id": 8713014,
    "image": "00871/008713014.jpg",
    "conversations": [
      {
        "from": "human",
        "value": "<image>\nねぶた山車のイラストに描かれている武器は何ですか?"
      },
      {
        "from": "gpt",
        "value": "イラストには、ねぶた山車の装飾として太刀(つるぎ)が描かれています。この太刀は画面右側に大きく配置され、伝統的な武具のデザインが詳細に表現されています。"
      }
    ]
  },
  { ... }
]
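The format above can be checked mechanically before training. The sketch below validates only the fields described in this README (id, image, conversations, from, value); `validate_sample` is a hypothetical helper, not a function from this repository or from LLaVA-NeXT, and the sample contents are invented for illustration. It does not check the <image> placeholder, since whether a prompt references an image cannot be inferred from the record alone.

```python
ALLOWED_SPEAKERS = {"human", "gpt"}


def validate_sample(sample):
    """Return a list of problems found in one sample (empty if it looks valid)."""
    problems = []

    sample_id = sample.get("id")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(sample_id, int) or isinstance(sample_id, bool):
        problems.append("id must be an integer")

    if not isinstance(sample.get("image"), str):
        problems.append("image must be a relative path string")

    conversations = sample.get("conversations")
    if not isinstance(conversations, list):
        problems.append("conversations must be an array")
        return problems

    for i, message in enumerate(conversations):
        if not isinstance(message, dict):
            problems.append(f"conversations[{i}] must be an object")
            continue
        if message.get("from") not in ALLOWED_SPEAKERS:
            problems.append(f"conversations[{i}].from must be 'human' or 'gpt'")
        if not isinstance(message.get("value"), str):
            problems.append(f"conversations[{i}].value must be a string")

    return problems


good = {
    "id": 8713014,
    "image": "00871/008713014.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown?"},
        {"from": "gpt", "value": "A festival float."},
    ],
}
bad = {
    "id": "8713014",  # string instead of integer
    "image": "00871/008713014.jpg",
    "conversations": [{"from": "bot", "value": "hello"}],
}

print(validate_sample(good))
print(validate_sample(bad))
```

In practice the samples would come from `json.load` on the training file, with `validate_sample` applied to each element of the resulting list.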

Pretraining

# Run from the repository root
./scripts/pretrain.sh

Finetuning

# Run from the repository root
./scripts/finetune.sh

Notes:

  • The scripts assume LLaVA-NeXT code is present in LLaVA-NeXT/.
  • Customize training arguments inside scripts/pretrain.sh and scripts/finetune.sh.

Evaluation

Run evaluations via the provided Bash scripts or the Python runners in src/eval/.

Heron Bench

# Bash wrapper
./scripts/eval/heron-bench.sh

Japanese VLM-in-the-wild

# Bash wrapper
./scripts/eval/ja-vlm-bench-in-the-wild.sh

License

Apache License 2.0

Citation

If you use DEJIMA in your research, please cite our paper (to appear).

@misc{katsube2025dejimanovellargescalejapanese,
      title={DEJIMA: A Novel Large-scale Japanese Dataset for Image Captioning and Visual Question Answering}, 
      author={Toshiki Katsube and Taiga Fukuhara and Kenichiro Ando and Yusuke Mukuta and Kohei Uehara and Tatsuya Harada},
      year={2025},
      eprint={2512.00773},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.00773}, 
}
