Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment
The evaluation dataset of our technical paper "A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering" is available in the Huggingface Knowledge-intensive Dataset.
We release a two-million-entry Wikipedia knowledge dataset, Wikipedia-Knowledge-2M. The dataset includes a JSON file and a compressed archive containing all the image files; the image attributes in the JSON file correspond to the image files in the archive.
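If you want to sanity-check the download, a quick way is to confirm that each JSON record's image attribute matches a file in the archive. The snippet below is only a minimal sketch under assumed names: the JSON file name, the archive name and format (tar.gz), and the image field are placeholders, so adjust them to the released files.

```python
import json
import os
import tarfile

# Minimal sketch with assumed file and field names: verify that the image
# attribute of each record in the Wikipedia-Knowledge-2M JSON matches an
# image file shipped inside the compressed archive.
with open("wikipedia_knowledge_2m.json", "r", encoding="utf-8") as f:
    records = json.load(f)
print(len(records), "records; example record:", records[0])

with tarfile.open("wikipedia_images.tar.gz", "r:gz") as tar:
    archive_files = {os.path.basename(m.name) for m in tar.getmembers() if m.isfile()}

missing = [r for r in records[:1000]
           if os.path.basename(str(r.get("image", ""))) not in archive_files]
print(f"{len(missing)} of the first 1000 records lack a matching image file")
```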
We also provide the JSON file for the 504K KnowledgeQA dataset in LLaVA-KnowledgeQA-504K. The dataset mainly consists of the training sets of OK-VQA, A-OKVQA, and TextVQA. The images in this dataset come from COCO Caption and TextVQA; you will need to download them yourself.
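After downloading the COCO Caption and TextVQA images, you may want to check that every sample in the 504K JSON resolves to a local file. This is only a sketch; the JSON file name, the image field, and the image root directory are assumptions.

```python
import json
import os

# Minimal sketch with assumed paths: check that each sample in the 504K
# KnowledgeQA JSON points to an image you have downloaded locally.
IMAGE_ROOT = "LLaVA/playground/knowledge_qa"   # assumed image root
JSON_PATH = "llava_knowledgeqa_504k.json"      # assumed file name

with open(JSON_PATH, "r", encoding="utf-8") as f:
    samples = json.load(f)

missing = [s["image"] for s in samples
           if not os.path.exists(os.path.join(IMAGE_ROOT, s["image"]))]
print(f"{len(missing)} / {len(samples)} referenced images are missing")
```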
- PyTorch 2.0.1
conda env create -n CVLM python=3.8
conda activate CVLM
pip install -r requirement.txt
Before you start pretraining the visual knowledge aligner (VKA), place the downloaded Wikipedia-Knowledge-2M dataset in the LLaVA/playground/knowledge_data directory.
Then you can use the following scripts for pretraining.
cd LLaVA
export PYTHONPATH=path_to_current_dir
bash scripts/decoder_model/pretrain_knowledge.sh
Replace pretrain_opt_adapter with the save path of your pretrained VKA.
bash scripts/knowledge/pretrain.sh
Use the provided code to extract the trainable parameters from the saved checkpoint files and store them as inputs for the next stage of training.
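Conceptually, this extraction step filters the saved state dict down to the newly trained aligner weights. The repository's own extraction code is authoritative; the sketch below only illustrates the idea under two assumptions: the checkpoint is a plain PyTorch state dict, and the aligner parameters share a recognizable key prefix (here "vka" as a placeholder).

```python
import torch

# Illustrative sketch only: keep the tensors whose keys contain an assumed
# prefix for the trainable aligner modules and save them for the next stage.
CKPT_PATH = "checkpoints/vka_pretrain/pytorch_model.bin"    # assumed path
OUT_PATH = "checkpoints/vka_pretrain/trainable_params.bin"  # assumed path

ckpt = torch.load(CKPT_PATH, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)  # unwrap if the dict is nested

trainable = {k: v for k, v in state_dict.items() if "vka" in k}
torch.save(trainable, OUT_PATH)
print(f"saved {len(trainable)} tensors to {OUT_PATH}")
```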
Change the attribute pretrain_knowledge_params_path to the path where the parameters extracted in the previous stage are stored.
bash scripts/knowledge_qa/llava_vka_qa.sh
After completing this training, use the provided code to extract both the trainable non-LoRA parameters and the LoRA parameters from the checkpoints.
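As a rough illustration of what that extraction does, LoRA weights can be told apart from the other trainable weights by their key names (PEFT-style checkpoints mark them with "lora_"). The paths and key pattern below are assumptions; use the repository's extraction code for the real run.

```python
import torch

# Illustrative sketch only: split a checkpoint into LoRA weights and
# trainable non-LoRA weights by key name. Paths and patterns are assumed.
CKPT_PATH = "checkpoints/vka_qa/pytorch_model.bin"

ckpt = torch.load(CKPT_PATH, map_location="cpu")
state_dict = ckpt.get("state_dict", ckpt)

lora = {k: v for k, v in state_dict.items() if "lora_" in k}
non_lora = {k: v for k, v in state_dict.items() if "lora_" not in k}

torch.save(lora, "checkpoints/vka_qa/lora_params.bin")
torch.save(non_lora, "checkpoints/vka_qa/non_lora_trainable_params.bin")
print(f"LoRA tensors: {len(lora)}, non-LoRA tensors: {len(non_lora)}")
```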
Finally, we adopt a two-stage training method when fine-tuning the FKA.
bash scripts/knowledge_qa/llava_fka_qa.sh
bash scripts/knowledge_qa/llava_fka_qa_stage2.sh
Note that in each training stage, the parameters from the previous stage must be provided via the pretrain_knowledge_params_path attribute, and these parameters should first be extracted with the code described above.
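For reference, consuming such an extracted parameter file typically comes down to a non-strict state-dict update; the training scripts do this internally through the attributes above, so the toy example below is only an illustration with a placeholder module and an assumed file path.

```python
import torch
import torch.nn as nn

# Toy illustration only: the training scripts load the extracted parameters
# themselves via pretrain_knowledge_params_path. The module and path here
# are placeholders, not the real model or checkpoint.
model = nn.Linear(8, 8)  # stand-in for the actual multimodal model
params = torch.load("checkpoints/vka_pretrain/trainable_params.bin", map_location="cpu")
missing, unexpected = model.load_state_dict(params, strict=False)
print(f"missing keys: {len(missing)}, unexpected keys: {len(unexpected)}")
```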
This stage of training also requires loading the parameters trained during Visual Knowledge Aligner pretraining.
You need to set the pretrain_opt_adapter attribute to your save path.
cd Qwen
bash finetune/pretrain_ds.sh
bash finetune/finetune_lora_ds.sh
The sam_images on GitHub are incomplete; you need to re-download them from Hugging Face.
We release the best model based on LLaVA at CVLM-LLaVA, the best model based on Qwen-VL at CVLM-Qwen, and the pretrained OPT at CVLM-Opt.
After downloading checkpoints, organize the weights as follows.
├── LLaVA
│   └── checkpoints
│       └── CVLM-LLaVA
└── Qwen
    └── checkpoints
        └── CVLM-Qwen
            ├── qwen-pretrain
            └── qwen-vka
The evaluation scripts for LLaVA are in scripts/knowledge_qa/eval.
We mainly evaluated six benchmark datasets: OK-VQA, VQAv2, A-OKVQA, TextVQA, InfoSeek, and SEED-Bench.
**Before your evaluation, you should unzip the images generated by SAM.**
cd LLaVA/playground/knowledge_qa/sam
tar -xzvf images_all.tar.gz
Note that the saved result files will be placed in the answers_upload folder within the corresponding directory.
Evaluation on OK-VQA.
bash scripts/knowledge_qa/eval/okvqa.sh
cd LLaVA/playground/knowledge_qa/eval/okvqa
python okvqa_eval.py --pred_file your_save_path
Evaluation on VQAv2.
bash scripts/knowledge_qa/eval/vqav2.sh
cd LLaVA/playground/knowledge_qa/eval/vqav2
python vqa_eval.py --pred_file your_save_path
Evaluation on open-ended A-OKVQA. The following scripts also perform the evaluation themselves.
bash scripts/knowledge_qa/eval/aokvqa_oe.sh
Evaluation on multiple-choice A-OKVQA.
bash scripts/knowledge_qa/eval/aokvqa.sh
Evaluation on TextVQA.
bash scripts/knowledge_qa/eval/textvqa.sh
Evaluation on InfoSeek.
bash scripts/knowledge_qa/eval/infoseek.sh
Evaluation on SEED-Bench.
bash scripts/knowledge_qa/eval/seedbench.sh
The Qwen model is evaluated on the same datasets as the LLaVA model.
Before evaluating the Qwen-VL model, you need to download the Qwen-VL model from Qwen-VL and replace the original files with the two Python files under the provided path.
Evaluation on OK-VQA.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset okvqa --few-shot 0
Evaluation on VQAv2.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset vqav2 --few-shot 0
Evaluation on open-ended A-OKVQA.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset aokvqa --few-shot 0
Evaluation on multiple-choice A-OKVQA.
python eval_mm/evaluate_multiple_choice_generated.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset aokvqa --few-shot 0
Evaluation on TextVQA.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset textvqa --few-shot 0
Evaluation on InfoSeek.
python eval_mm/evaluate_vqa.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset infoseek --few-shot 0
Evaluation on SEED-Bench.
python eval_mm/evaluate_multiple_choice_generated.py --checkpoint checkpoints/CVLM-Qwen/qwen-pretrain --adapter checkpoints/CVLM-Qwen/qwen-vka --dataset seedbench --few-shot 0
If you find our paper and code useful in your research, please consider giving us a star and a citation.
@inproceedings{li-etal-2024-cognitive,
title = "Cognitive Visual-Language Mapper: Advancing Multimodal Comprehension with Enhanced Visual Knowledge Alignment",
author = "Li, Yunxin and
Chen, Xinyu and
Hu, Baotian and
Shi, Haoyuan and
Zhang, Min",
booktitle = "Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = aug,
year = "2024",
address = "Bangkok, Thailand",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2024.acl-long.411/",
doi = "10.18653/v1/2024.acl-long.411",
pages = "7615--7626"
}
@article{li2023comprehensive,
title={A Comprehensive Evaluation of GPT-4V on Knowledge-Intensive Visual Question Answering},
author={Li, Yunxin and Wang, Longyue and Hu, Baotian and Chen, Xinyu and Zhong, Wanqi and Lyu, Chenyang and Zhang, Min},
journal={arXiv preprint arXiv:2311.07536},
year={2023}
}