This is the official implementation of the paper BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding, accepted to CVPR 2025.
Large video-language models (VLMs) have demonstrated promising progress in various video understanding tasks. However, their effectiveness in long-form video analysis is constrained by limited context windows. Traditional approaches, such as uniform frame sampling, inevitably allocate resources to irrelevant content, diminishing their effectiveness in real-world scenarios. In this paper, we introduce BOLT, a method to BOost Large VLMs without additional Training through a comprehensive study of frame selection strategies.
First, to enable a more realistic evaluation of VLMs in long-form video understanding, we propose a multi-source retrieval evaluation setting. Our findings reveal that uniform sampling performs poorly in noisy contexts, underscoring the importance of selecting the right frames. Second, we explore several frame selection strategies based on query-frame similarity and analyze their effectiveness at inference time. Our results show that inverse transform sampling yields the most significant performance improvement, increasing accuracy on the Video-MME benchmark from 53.8% to 56.1% and on the MLVU benchmark from 58.9% to 63.4%.
We use lmms-eval as our evaluation framework. Follow these steps to set up the environment:
```bash
# Create and activate conda environment
conda create -n bolt python=3.12
conda activate bolt

# Clone the repository with submodules
git clone --recurse-submodules https://github.com/sming256/BOLT.git
cd BOLT

# Install dependencies
pip install -e third_party/lmms-eval
pip install -e third_party/LLaVA-NeXT/
pip install -e third_party/qwen-vl-utils/
```

The inverse transform sampling in BOLT is implemented here.
The frame selection for LLaVA-OneVision, Qwen2.5-VL, and Qwen3-VL under the lmms-eval framework is implemented here.
We use the CLIP ViT-L/14 image encoder to extract frame features for frame selection. Before running the script, please download the raw video datasets and place them in the corresponding paths as specified in the bash file.
```bash
bash scripts/feature_extraction.sh
```

Note: You can also directly download our pre-extracted video features from HuggingFace.
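For reference, the snippet below is a minimal sketch of per-frame CLIP ViT-L/14 feature extraction using the Hugging Face `transformers` and `decord` libraries; the frame count, batching, and helper name are illustrative assumptions rather than the exact implementation in `scripts/feature_extraction.sh`.

```python
# Minimal sketch of per-frame CLIP feature extraction (illustrative only;
# see scripts/feature_extraction.sh for the actual pipeline).
import numpy as np
import torch
from PIL import Image
from decord import VideoReader            # assumed video decoding backend
from transformers import CLIPModel, CLIPProcessor

device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

@torch.no_grad()
def extract_frame_features(video_path, num_frames=256, batch_size=64):
    """Uniformly decode frames and return L2-normalized CLIP image features."""
    vr = VideoReader(video_path)
    idx = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    frames = [Image.fromarray(vr[int(i)].asnumpy()) for i in idx]
    feats = []
    for start in range(0, len(frames), batch_size):
        batch = processor(images=frames[start:start + batch_size], return_tensors="pt").to(device)
        feats.append(model.get_image_features(**batch))
    feats = torch.cat(feats)               # (num_frames, 768) for ViT-L/14
    return torch.nn.functional.normalize(feats, dim=-1).cpu()
```

The query text can be embedded with `model.get_text_features` in the same way, so that query-frame similarities reduce to dot products between the normalized features.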
By using inverse transform sampling based on query-frame similarity, we select the most informative frames for each video.
```bash
bash scripts/frame_selection.sh
```

Note: You can also directly download our pre-computed keyframe indices from HuggingFace.
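As a rough illustration of the idea (not the linked implementation itself), inverse transform sampling works by turning the query-frame similarity scores into a probability distribution, building its cumulative distribution function, and reading off frame indices at evenly spaced quantiles, so that regions with higher query relevance contribute more of the selected frames. The softmax normalization and temperature below are illustrative assumptions.

```python
import numpy as np

def inverse_transform_sample(similarities, num_frames, temperature=0.1):
    """Select frame indices by inverse transform sampling over query-frame similarities.

    Illustrative sketch only; see the linked code for BOLT's actual implementation.
    """
    sims = np.asarray(similarities, dtype=np.float64)
    # Turn similarity scores into a probability distribution over frames.
    probs = np.exp((sims - sims.max()) / temperature)
    probs /= probs.sum()
    cdf = np.cumsum(probs)
    # Evaluate the inverse CDF at evenly spaced quantiles in (0, 1):
    # high-probability (query-relevant) regions receive more samples.
    quantiles = (np.arange(num_frames) + 0.5) / num_frames
    indices = np.searchsorted(cdf, quantiles)
    return np.clip(indices, 0, len(sims) - 1)

# Example usage with CLIP features:
#   sims = frame_features @ text_feature        # cosine similarities
#   keyframe_idx = inverse_transform_sample(sims, num_frames=32)
```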
Baseline Evaluation (Uniform Sampling)
```bash
# LLaVA-OneVision with uniform sampling (8/16/32 frames)
bash scripts/eval/llava_ov_baseline.sh

# Qwen2.5-VL with uniform sampling (8/16/32 frames)
bash scripts/eval/qwen2_5_vl_baseline.sh

# Qwen3-VL with uniform sampling (8/16/32 frames)
bash scripts/eval/qwen3_vl_baseline.sh
```

BOLT Evaluation (Our Method)
```bash
# LLaVA-OneVision with BOLT (8/16/32 frames)
bash scripts/eval/llava_ov_bolt.sh

# Qwen2.5-VL with BOLT (8/16/32 frames)
bash scripts/eval/qwen2_5_vl_bolt.sh

# Qwen3-VL with BOLT (8/16/32 frames)
bash scripts/eval/qwen3_vl_bolt.sh
```

Video-MME Benchmark
| Model | Sampling Method | Acc (8 frames) | Acc (16 frames) | Acc (32 frames) |
|---|---|---|---|---|
| LLaVA-OneVision-7B | Uniform | 54.0 | 56.7 | 58.5 |
| LLaVA-OneVision-7B | BOLT | 56.1 (+2.1) | 58.3 (+1.6) | 59.5 (+1.0) |
| Qwen2.5-VL-7B | Uniform | 53.8 | 58.8 | 62.2 |
| Qwen2.5-VL-7B | BOLT | 57.4 (+3.6) | 60.9 (+2.1) | 64.0 (+1.8) |
| Qwen3-VL-8B | Uniform | 56.0 | 60.5 | 64.3 |
| Qwen3-VL-8B | BOLT | 58.9 (+2.9) | 62.7 (+2.2) | 65.7 (+1.4) |
LongVideoBench Benchmark
| Model | Sampling Method | Acc (8 frames) | Acc (16 frames) | Acc (32 frames) |
|---|---|---|---|---|
| LLaVA-OneVision-7B | Uniform | 54.2 | 56.0 | 56.6 |
| LLaVA-OneVision-7B | BOLT | 54.5 (+0.3) | 56.7 (+0.7) | 58.1 (+1.5) |
| Qwen2.5-VL-7B | Uniform | 53.2 | 56.1 | 58.6 |
| Qwen2.5-VL-7B | BOLT | 55.1 (+1.9) | 57.5 (+1.4) | 60.0 (+1.4) |
| Qwen3-VL-8B | Uniform | 54.8 | 57.7 | 60.4 |
| Qwen3-VL-8B | BOLT | 57.4 (+2.6) | 58.9 (+1.2) | 62.2 (+1.8) |
MLVU Benchmark
| Model | Sampling Method | Acc (8 frames) | Acc (16 frames) | Acc (32 frames) |
|---|---|---|---|---|
| LLaVA-OneVision-7B | Uniform | 58.4 | 60.9 | 63.1 |
| LLaVA-OneVision-7B | BOLT | 63.6 (+5.2) | 66.1 (+5.2) | 66.8 (+3.7) |
| Qwen2.5-VL-7B | Uniform | 57.1 | 59.9 | 62.5 |
| Qwen2.5-VL-7B | BOLT | 63.1 (+6.0) | 66.2 (+6.3) | 68.4 (+5.9) |
| Qwen3-VL-8B | Uniform | 55.3 | 58.9 | 63.8 |
| Qwen3-VL-8B | BOLT | 62.9 (+7.6) | 67.7 (+8.8) | 70.3 (+6.5) |
We provide a simple demo to showcase the inference process of BOLT, including the visualization of selected frames. You can run the demo with the following commands:
```bash
# download demo video
hf download --repo-type dataset MLVU/MVLU \
    --include "MLVU/video/1_plotQA/movie101_58.mp4" \
    --local-dir assets

# run demo
python demo.py \
    --video_path assets/MLVU/video/1_plotQA/movie101_58.mp4 \
    --query "At the end of the video, what happens to the van?"
```
Please refer to demo.py for more details.
If you find this work helpful, please consider citing our paper:
```bibtex
@inproceedings{liu2025bolt,
  title     = {BOLT: Boost Large Vision-Language Model Without Training for Long-form Video Understanding},
  author    = {Liu, Shuming and Zhao, Chen and Xu, Tianqi and Ghanem, Bernard},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  month     = {June},
  year      = {2025},
}
```

If you have any questions or suggestions, please feel free to contact us at shuming.liu@kaust.edu.sa.
