Official implementation of:
📄 Low-Rank Prompt Adaptation for Open-Vocabulary Object Detection
👨💻 Zekun Zhang*, Vu Quang Truong*, Minh Hoai (*equal contribution)
🎯 Accepted at ICCV 2025 MMFM Workshop
Method overview.
We propose a low-rank prompt enhancer module to adapt open-vocabulary object detectors (OVDs) such as GroundingDINO without changing their backbone or head (a minimal sketch of the idea follows the list below). The enhancer:
- Is lightweight and parameter-efficient
- Learns to improve prompts from only a few labeled images
- Integrates easily into Grounded SAM 2 for unseen object instance segmentation (UOIS)
- ✅ Improves GroundingDINO across multiple OVD datasets
- ✅ Outperforms LoRA, LoSA, BitFit, Prompt Tuning, Res-Tuning and full fine-tuning
- ✅ Enables Grounded SAM 2 to achieve SOTA on UOIS with only 50 box-labeled images
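The sketch below illustrates the general idea in PyTorch: a rank-r residual update applied to the frozen detector's prompt embeddings, optionally conditioned on image features. The class, argument names, and shapes (`LowRankPromptEnhancer`, `dim`, `image_feat`) are illustrative assumptions, not the actual modules in this repository.

```python
import torch
import torch.nn as nn

class LowRankPromptEnhancer(nn.Module):
    """Illustrative low-rank prompt enhancer (not the repository's exact module).

    Adds a rank-r residual update to the frozen detector's prompt embeddings,
    optionally conditioned on image features; the `type` argument of the scripts
    (both / image / text) selects which features the enhancer attends to.
    """

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        # The down/up projection pair is the only set of trainable parameters.
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # start as an identity mapping of the prompt

    def forward(self, prompt_emb, image_feat=None):
        # prompt_emb: (batch, num_prompt_tokens, dim) text-prompt embeddings.
        # image_feat: (batch, num_patches, dim) image features, or None for text-only.
        ctx = prompt_emb
        if image_feat is not None:
            ctx = ctx + image_feat.mean(dim=1, keepdim=True)  # simple image conditioning
        return prompt_emb + self.up(self.down(ctx))  # residual low-rank update

# The frozen OVD's prompt embeddings would pass through the enhancer before
# reaching the (unchanged) detection head.
enhancer = LowRankPromptEnhancer(dim=256, rank=16)
prompts = torch.randn(2, 20, 256)   # dummy prompt tokens
print(enhancer(prompts).shape)      # torch.Size([2, 20, 256])
```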
- GroundingDINO: https://github.com/IDEA-Research/GroundingDINO
- SAM2: https://github.com/facebookresearch/sam2
You will need to manually download all datasets, extract them, and place them at the same directory level as this repository. The expected structure looks like this:
```
root_dir/
├── PromptAdaptOVD/
├── EgoPER/
├── MSCOCO2017/
├── RarePlanes/
├── PTG/         # EgoPER
├── OIH_VIS/     # HOIST
├── odinw_13/
├── OCID/
├── HouseCat6D/
└── ...
```
This repository uses only the annotated subset of Scenes100. Make sure that the folder
`PromptAdaptOVD/images/annotated/`
contains all the annotated images and their metadata. If this folder is missing, the Scenes100 experiments will not run correctly.
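As a quick sanity check of the layout above, a small script along these lines can verify that the expected sibling dataset folders exist. The folder names are taken from the tree above; this helper is not part of the repository, so adjust the list to the datasets you actually downloaded and save it at the top level of `PromptAdaptOVD/`.

```python
from pathlib import Path

# Assumed layout: this repository (PromptAdaptOVD) and the datasets are siblings
# under the same root directory, as shown in the tree above.
root_dir = Path(__file__).resolve().parent.parent  # adjust if your root differs

expected = [
    "PromptAdaptOVD/images/annotated",  # annotated Scenes100 subset
    "EgoPER",
    "MSCOCO2017",
    "RarePlanes",
    "PTG",        # EgoPER
    "OIH_VIS",    # HOIST
    "odinw_13",
    "OCID",
    "HouseCat6D",
]

for rel in expected:
    path = root_dir / rel
    status = "ok" if path.is_dir() else "MISSING"
    print(f"[{status}] {path}")
```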
You can download our pretrained enhancer weights here:
➡️ Download Model Weights (Hugging Face)
Place all of the content of the extracted weights folder into `PromptAdaptOVD/scripts/groundingdino_baseline`.
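If you prefer downloading from the command line, something like the following should work with `huggingface_hub`; the repository id below is only a placeholder, so substitute the actual id from the weights link above.

```python
from huggingface_hub import snapshot_download

# Placeholder repo id: replace with the actual Hugging Face repository
# referenced by the download link above.
snapshot_download(
    repo_id="<org>/<prompt-adapt-ovd-weights>",
    local_dir="scripts/groundingdino_baseline",
)
```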
To train the enhancer:
```bash
cd scripts/groundingdino_baseline
bash train_enhancer.sh rank type
```
To evaluate the enhancer:
```bash
cd scripts/groundingdino_baseline
bash eval_enhancer.sh rank type
```
Here `rank` is the rank of the enhancer and `type` is the feature attention method, which can be `both`, `image`, or `text`.
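For example, `bash train_enhancer.sh 16 both` would train a rank-16 enhancer attending to both image and text features, assuming the scripts take the rank and attention type as positional arguments in that order.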
Please check the `scripts/groundingdino_baseline` folder for the scripts of the other methods (e.g., LoRA, LoSA, Res-Tuning).
| Method | Params % | Scenes100 | EgoPER | HOIST | OV-COCO | RarePlanes | Avg. |
|---|---|---|---|---|---|---|---|
| Base Model | 0% | 30.84 | 24.83 | 17.47 | 19.97 | 41.54 | 26.04 |
| Res-Tuning | 0.06% | 48.59 | 68.05 | 39.61 | 38.04 | 57.36 | 50.33 |
| BitFit | 0.06% | 55.55 | 67.00 | 37.37 | 45.00 | 49.09 | 50.80 |
| LoRA | 0.68% | 55.74 | 67.36 | 37.66 | 44.76 | 52.02 | 51.51 |
| Ours (r=16) | 0.04% | 56.16 | 68.05 | 38.69 | 42.61 | 52.92 | 51.68 |
👉 Our enhancer outperforms all parameter-efficient baselines on average while training the fewest parameters (0.04%).
| Method | Training Images | Overlap F | Boundary F | % ≥ 75 |
|---|---|---|---|---|
| UCN | 280,000 | 59.4 | 36.5 | 48.0 |
| UOAIS-Net | 53,450 | 67.9 | 62.3 | 73.1 |
| MSMFormer | 53,450 | 70.5 | 64.9 | 75.3 |
| MSMFormer + Refinement | 53,450 | 66.3 | 54.8 | 52.8 |
| UOIS-SAM | 5,345 | 79.9 | 72.5 | 78.3 |
| Ours (r=16) | 50 | 77.2 | 73.7 | 74.0 |
| Method | Input | Training Images | Overlap F | Boundary F | % ≥ 75 |
|---|---|---|---|---|---|
| UCN | RGB | 280,000 | 45.0 | 22.5 | 48.4 |
| UOAIS-Net | RGB | 53,450 | 60.3 | 52.8 | 81.2 |
| MSMFormer | RGB | 53,450 | 67.3 | 57.6 | 80.4 |
| MSMFormer + Refinement | RGB | 53,450 | 66.7 | 54.9 | 71.3 |
| UOIS-SAM | RGB | 5,345 | 70.0 | 66.2 | 84.8 |
| Ours (r=16) | RGB | 50 | 82.7 | 78.9 | 89.7 |
📌 All methods above use RGB-only input. Our approach uses only 50 images with box annotations, yet achieves performance competitive with methods trained on thousands of images with mask annotations.
If you find our work useful, please cite:
```bibtex
@inproceedings{zhang2025lowrank,
  title     = {Low-Rank Prompt Adaptation for Open-Vocabulary Object Detection},
  author    = {Zekun Zhang and Vu Quang Truong and Minh Hoai},
  booktitle = {ICCV Workshop on Multi-modal Foundation Models (MMFM)},
  year      = {2025}
}
```
- 📧 Vu Quang Truong: vuquang27102001@gmail.com
- 📧 Zekun Zhang: zekzhang@cs.stonybrook.edu
