This repo lists recent advances in vision-language models (VLMs), mainly contributed by Weihan Wang and Ji Qi.
| Model | Vision Enc. | Textual Enc. | Dec. | Multimodal Fusion | Pretraining Objectives | Pretraining Dataset | Published Year |
|---|---|---|---|---|---|---|---|
| ViLBERT | OD->Xformer | Xformer | / | Co-attn | MLM+ITM+MIM | CC3M | 2019 (NIPS) |
| LXMERT | OD+Xformer | Xformer | / | Co-attn | MLM+ITM+MIM+VQA | COCO+VG+VQA | 2019 (Arxiv) |
| VisualBERT | OD | Emb. | / | Merged-attn | MLM+ITM | COCO | 2019 (Arxiv) |
| UNITER | OD | Emb. | / | Merged-attn | MLM+ITM+MIM+WRA | COCO+VG+CC3M+SBU | 2020 (ECCV) |
| VL-BERT | OD | Emb. | / | Merged-attn | MLM+ITM | CC3M | 2020 (ICLR) |
| OSCAR | OD | Emb. | / | Merged-attn | MLM+ITM | 4.1M | 2020 (ECCV) |
| PixelBERT | CNN | Xformer | / | Merged-attn | MLM+ITM | COCO+VG | 2020 (Arxiv) |
| VILLA | OD | Emb. | / | Merged-attn | Adversarial Training+MLM+MIM+ITM | COCO+VG+CC3M+SBU | 2020 (NIPS) |
| ViLBERT-12in1 | OD->Xformer | Xformer | / | Co-attn | Multi Tasks | Multi Datasets | 2020 (CVPR) |
| CLIP | CNN/Xformer | Xformer | / | / | ITC | 400M | 2021 (ICML) |
| ALIGN | CNN | Xformer | / | / | ITC | 1800M | 2021 (ICML) |
| VinVL | OD | Emb. | / | Merged-attn | MLM+ITM | COCO+VG+OI+OBJ365 | 2021 (CVPR) |
| MDETR | CNN | Xformer | √ | Merged-attn | OD+Token Prediction+Contrastive Alignment | COCO+VG+Flickr | 2021 (ICCV) |
| VL-T5 | OD | Emb. | √ | Merged-attn | MLM+ITM+VQA+Grounding+Captioning | COCO+VG | 2021 (ICML) |
| CLIP-VIL | CNN | Emb. | / | Merged-attn | MLM+ITM+VQA | COCO+VG+VQA | 2021 (Arxiv) |
| SOHO | CNN | Emb. | / | Merged-attn | MLM+ITM+MIM | COCO+VG | 2021 (CVPR) |
| VILT | Patch Emb. | Emb. | / | Merged-attn | MLM+ITM | COCO+VG+CC3M+SBU | 2021 (ICCV) |
| ALBEF | Xformer | Xformer | / | Co-attn | MLM+ITM+ITC | COCO+VG+CC12M+SBU | 2021 (NIPS) |
| VLMO | Xformer | Xformer | / | Multiway-attn | MLM+ITM+ITC | 4M/1000M | 2021 (Arxiv) |
| Florence | Xformer | Xformer | / | / | ITC | 900M | 2021 (Arxiv) |
| OFA | CNN | Emb. | √ | Co-attn | Multi Tasks | 20M | 2022 (ICML) |
| METER | Xformer | Xformer | / | Co-attn | MLM+ITM | COCO+VG+CC3M+SBU | 2022 (CVPR) |
| GLIP | Xformer | Xformer | / | Co-attn | OD+Token Prediction+Contrastive Alignment | FourODs+GoldG+Cap24M | 2022 (CVPR) |
| GLIP-v2 | Xformer | Xformer | / | Co-attn | MLM+OD+Token Prediction+Contrastive Alignment | FourODs+GoldG+Cap24M | 2022 (NIPS) |
| SimVLM | CNN | Emb. | / | Merged-attn | PrefixLM | 1800M | 2022 (ICLR) |
| Flamingo | Xformer | Xformer | √ | Co-attn | ITC+Captioning+.. | 1.8B+LTIP+VTP | 2022 (Arxiv) |
| PALI | Xformer | Xformer | √ | Co-attn | Multi Tasks | 10b image+12b text+29b image-ocr | 2022 (Arxiv) |
| FIBER | Xformer | Xformer | / | Co-attn | MLM+ITM+ITC | COCO+VG+CC3M+SBU | 2022 (Arxiv) |
| COCA | Xformer | Xformer | √ | Co-attn | ITC+Captioning | JFT-3B+Align | 2022 (Arxiv) |
| BEIT-3 | Xformer | Xformer | / | Co-attn | MLM | COCO+VG+CC3M+CC12M+SBU | 2022 (Arxiv) |
- I : image inputs
- T : text inputs
- OD : object detector
- Xformer : transformer
- Emb. : embedding
- MLM : masked language modeling
- MIM : masked image modeling
- ITM : image-text matching
- WRA : word-region alignment
- ITC : image-text contrastive learning
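To make the ITC entry in the legend above concrete, here is a minimal sketch of a symmetric contrastive loss over a batch of paired image and text embeddings. It is an illustration only and is not taken from the code of any model in the table: the function names (`itc_loss`, `_normalize`, `_cross_entropy_diag`) and the default `temperature=0.07` are assumptions chosen for this example, and real implementations typically use a tensor library and a learnable temperature.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def itc_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-text contrastive loss (hypothetical helper).

    image_embs[i] and text_embs[i] are assumed to form a matching pair;
    every other combination in the batch is treated as a negative.
    """
    imgs = [_normalize(v) for v in image_embs]
    txts = [_normalize(v) for v in text_embs]
    # Pairwise cosine similarities, scaled by the temperature.
    logits = [
        [sum(a * b for a, b in zip(img, txt)) / temperature for txt in txts]
        for img in imgs
    ]
    # Transpose to obtain the text-to-image direction.
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


if __name__ == "__main__":
    aligned = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
    shuffled = itc_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < shuffled)
```

In this sketch the loss is low when each image embedding is closest to its own caption and high when the pairs are mismatched, which is the behavior the ITC objective is meant to encourage.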
1. CARETS: A Consistency And Robustness Evaluative Test Suite for VQA (ACL 2022) [paper]
2. VALSE: A Task-Independent Benchmark for Vision and Language Models Centered on Linguistic Phenomena (ACL 2022) [paper]