MM-LLMs: Recent Advances in MultiModal Large Language Models
Duzhen Zhang1*, Yahan Yu2*, Chenxing Li1, Jiahua Dong3†, Dan Su1, Chenhui Chu2† and Dong Yu1
1 Tencent AI Lab   2 Kyoto University   3 Shenyang Institute of Automation, Chinese Academy of Sciences
[email protected], [email protected]
[Figure residue: timeline of 2023 MM-LLM releases (e.g., BLIP-2, Kosmos-1, …, NExT-GPT, MiniGPT-5, LLaVA-1.5, MiniGPT-v2, Fuyu-8B) and Figure 2 component labels (C-Former, Q-Former, P-Former, HuBERT, BEATs, ImageBind, AudioLDM, PaLM, LLaMA, LLaMA-2, Vicuna; ❄ = frozen).]
Figure 2: The general model architecture of MM-LLMs and the implementation choices for each component.
Table 1: The summary of 26 mainstream MM-LLMs. I→O: Input to Output Modalities, I: Image, V: Video, A:
Audio, 3D: Point Cloud, and T: Text. In Modality Encoder, “-L” represents Large, “-G” represents Giant, “/14”
indicates a patch size of 14, and “@224” signifies an image resolution of 224 × 224. #.PT and #.IT represent the
scale of the dataset during MM PT and MM IT, respectively. † includes in-house data that is not publicly accessible.
2023d) introduces a simple and unified pre-trained MM-LLM tailored for Referential Dialogue, a task involving discussions about regions and objects in images. This model demonstrates commendable generalization ability, effectively handling unseen settings. (14) DLP (Jian et al., 2023) proposes the P-Former to predict the ideal prompt, trained on a dataset of single-modal sentences. This showcases the feasibility of single-modal training to enhance MM learning. (15) BuboGPT (Zhao et al., 2023d) is a model constructed by learning a shared semantic space for a comprehensive understanding of MM content. It explores fine-grained relationships among different modalities such as image, text, and audio. (16) ChatSpot (Zhao et al., 2023b) introduces a simple yet potent method for finely adjusting precise referring instructions for MM-LLMs, facilitating fine-grained interactions. The incorporation of precise referring instructions, consisting of image- and region-level instructions, enhances the integration of multi-grained VL task descriptions. (17) Qwen-VL (Bai et al., 2023b) is a multi-lingual MM-LLM that supports both English and Chinese. Qwen-VL also allows the input of multiple images during the training phase, improving its ability to understand the visual context. (18) NExT-GPT (Wu et al., 2023d) is an end-to-end, general-purpose any-to-any MM-LLM that supports the free input and output of image, video, audio, and text. It employs a lightweight alignment strategy, utilizing LLM-centric alignment in the encoding phase and instruction-following alignment in the decoding phase. (19) MiniGPT-5 (Zheng et al., 2023b) is an MM-LLM that uses inversion to generative vokens and integrates with Stable Diffusion. It excels at generating interleaved VL outputs for MM generation, and the inclusion of classifier-free guidance during the training phase enhances the quality of generation.
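For context, classifier-free guidance is a standard diffusion-model technique rather than a detail specific to MiniGPT-5's implementation: the conditioning signal is randomly dropped during training, and at sampling time the conditional and unconditional predictions are blended as

ε̂_θ(x_t, c) = ε_θ(x_t, ∅) + s · (ε_θ(x_t, c) − ε_θ(x_t, ∅)),

where c is the condition (here derived from the generative vokens), ∅ denotes the dropped condition, and s ≥ 1 is the guidance scale.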
For an introduction to the remaining seven MM-LLMs, please refer to Appendix D, which includes (20) LLaVA-1.5 (Liu et al., 2023d), (21) MiniGPT-v2 (Chen et al., 2023c), (22) CogVLM (Wang et al., 2023), (23) DRESS (Chen et al., 2023h), (24) X-InstructBLIP (Panagopoulou et al., 2023), (25) CoDi-2 (Tang et al., 2023a), and (26) VILA (Lin et al., 2023).

Trends in Existing MM-LLMs: (1) Progressing from a dedicated emphasis on MM understanding to the generation of specific modalities, and further evolving into any-to-any modality conversion (e.g., MiniGPT-4 → MiniGPT-5 → NExT-GPT); (2) Advancing from MM PT to SFT and then to RLHF, the training pipeline undergoes continuous refinement, striving to better align with human intent and enhance the model's conversational interaction capabilities (e.g., BLIP-2 → InstructBLIP → DRESS); (3) Embracing Diversified Modal Extensions (e.g., BLIP-2 → X-LLM and InstructBLIP → X-InstructBLIP); (4) Incorporating a Higher-Quality Training Dataset (e.g., LLaVA → LLaVA-1.5); (5) Adopting a More Efficient Model Architecture, transitioning from the complex Q- and P-Former input projector modules in BLIP-2 and DLP to a simpler yet effective linear projector in VILA (sketched below).
5 Benchmarks and Performance

To offer a comprehensive performance comparison, we have compiled a table featuring major MM-LLMs across 18 VL benchmarks gathered from various papers (Li et al., 2023c; Chen et al., 2023c,e; Lin et al., 2023), as shown in Table 2.
Model LLM Backbone OKVQA IconVQA VQAv2 GQA VizWiz SQA-I VQA-T POPE MME-P MME-C MMB MMB-CN SEED-I LLaVA-W MM-Vet QBench HM VSR
Flamingo Chinchilla-7B 44.7 – – – 28.8 – – – – – – – – – – – 57.0 31.8
BLIP-2 Flan-T5XXL (13B) 45.9 40.6 65.0 44.7 19.6 61.0 42.5 85.3 1293.8 290.0 – – 46.4 38.1 22.4 – 53.7 50.9
LLaVA Vicuna-13B 54.4 43.0 – 41.3 – – 38.9 – – – – – – – – – – 51.2
MiniGPT-4 Vicuna-13B 37.5 37.6 – 30.8 – – 19.4 – – – – – – – – – – 41.6
InstructBLIP Vicuna-7B – – – 49.2 34.5 60.5 50.1 – – – 36.0 23.7 53.4 60.9 26.2 56.7 – –
InstructBLIP Vicuna-13B – 44.8 – 49.5 33.4 63.1 50.7 78.9 1212.8 291.8 – – – 58.2 25.6 – 57.5 52.1
Shikra Vicuna-13B 47.2 – 77.4∗ – – – – – – – 58.8 – – – – 54.7 – –
IDEFICS-9B LLaMA-7B – – 50.9 38.4 35.5 – 25.9 – – – 48.2 25.2 – – – – – –
IDEFICS-80B LLaMA-65B – – 60.0 45.2 36.0 – 30.9 – – – 54.5 38.1 – – – – – –
Qwen-VL Qwen-7B – – 78.8∗ 59.3∗ 35.2 67.1 63.8 – – – 38.2 7.4 56.3 – – 59.4 – –
Qwen-VL-Chat Qwen-7B – – 78.2∗ 57.5∗ 38.9 68.2 61.5 – 1487.5 360.7 60.6 56.7 58.2 – – – – –
LLaVA-1.5 Vicuna-1.5-7B – – 78.5∗ 62.0∗ 50.0 66.8 58.2 85.9 1510.7 316.1‡ 64.3 58.3 58.6 63.4 30.5 58.7 – –
+ShareGPT4V Vicuna-1.5-7B – – 80.6 – 57.2 68.4 – – 1567.4 376.4 68.8 62.2 69.7 72.6 37.6 63.4 – –
LLaVA-1.5 Vicuna-1.5-13B – – 80.0∗ 63.3∗ 53.6 71.6 61.3 85.9 1531.3 295.4‡ 67.7 63.6 61.6 70.7 35.4 62.1 – –
MiniGPT-v2 LLaMA-2-Chat-7B 56.9 47.7 – 60.3 30.3 – 51.9 – – – – – – – – – 58.2 60.6
MiniGPT-v2-Chat LLaMA-2-Chat-7B 55.9 49.4 – 58.8 42.4 – 52.3 – – – – – – – – – 59.5 63.3
VILA-7B LLaMA-2-7B – – 79.9∗ 62.3∗ 57.8 68.2 64.4 85.5 1533.0 – 68.9 61.7 61.1 69.7 34.9 – – –
VILA-13B LLaMA-2-13B – – 80.8∗ 63.3∗ 60.6 73.7 66.6 84.2 1570.1 – 70.3 64.3 62.8 73.0 38.8 – – –
+ShareGPT4V LLaMA-2-13B – – 80.6∗ 63.2∗ 62.4 73.1 65.3 84.8 1556.5 – 70.8 65.4 61.4 78.4 45.7 – – –
Table 2: Comparison of mainstream MM-LLMs on 18 VL benchmarks. Red denotes the highest result and blue the second-highest result. ‡ indicates test results re-implemented by ShareGPT4V (Chen et al., 2023e) that are missing from the benchmarks or original papers. ∗ indicates that the training images of the corresponding datasets were observed during training.
The information of these benchmarks can be found in Appendix E. Next, we will extract essential training recipes that boost the effectiveness of MM-LLMs, drawing insights from SOTA models.

Training Recipes Firstly, higher image resolution can incorporate more visual details for the model, benefiting tasks that require fine-grained details. For example, LLaVA-1.5 and VILA employ a resolution of 336 × 336, while Qwen-VL and MiniGPT-v2 utilize 448 × 448. However, higher resolutions lead to longer token sequences, incurring additional training and inference costs. MiniGPT-v2 addresses this by concatenating four adjacent visual tokens in the embedding space to reduce the sequence length (see the sketch below). Recently, Monkey (Li et al., 2023h) proposed a solution to enhance the resolution of input images without retraining a high-resolution visual encoder, utilizing only a low-resolution visual encoder and supporting resolutions up to 1300 × 800. To enhance the understanding of rich-text images, tables, and document content, DocPedia (Feng et al., 2023) introduced a method to increase the visual encoder resolution to 2560 × 2560, overcoming the limitations of poorly performing low resolutions in open-sourced ViT.
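The following is a minimal PyTorch sketch of this token-merging idea: groups of four adjacent visual tokens are concatenated along the feature dimension and projected back to the model width, shortening the sequence fourfold. The grouping factor of four follows the description above, while the layer names and dimensions are illustrative assumptions rather than MiniGPT-v2's actual implementation.

    import torch
    import torch.nn as nn

    class TokenMerger(nn.Module):
        """Concatenates groups of 4 adjacent visual tokens to cut the sequence length by 4x."""
        def __init__(self, dim: int = 1024, group: int = 4):
            super().__init__()
            self.group = group
            self.proj = nn.Linear(dim * group, dim)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: [batch, seq_len, dim]; seq_len is assumed to be divisible by the group size
            b, n, d = tokens.shape
            merged = tokens.reshape(b, n // self.group, self.group * d)  # join neighbouring tokens
            return self.proj(merged)  # [batch, seq_len / 4, dim]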
Secondly, the incorporation of high-quality SFT data can significantly improve performance on specific tasks, as evidenced by the addition of ShareGPT4V data to LLaVA-1.5 and VILA-13B, as shown in Table 2. Moreover, VILA reveals several key findings: (1) performing PEFT on the LLM backbone promotes deep embedding alignment, which is crucial for ICL; (2) interleaved image-text data proves beneficial, whereas image-text pairs alone are sub-optimal; and (3) re-blending text-only instruction data (e.g., Unnatural Instructions (Honovich et al., 2022)) with image-text data during SFT not only addresses the degradation of text-only tasks but also enhances VL task accuracy.
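Regarding finding (1), the sketch below shows one common way of applying LoRA-style PEFT to an LLM backbone with the Hugging Face peft library; the model identifier and hyperparameters are placeholders for illustration and do not reflect VILA's training configuration.

    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    # Placeholder backbone; any causal LM works for this sketch.
    llm = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")

    lora_cfg = LoraConfig(
        r=8, lora_alpha=16, lora_dropout=0.05,
        target_modules=["q_proj", "v_proj"],  # attention projections of the transformer blocks
        task_type="CAUSAL_LM",
    )
    llm = get_peft_model(llm, lora_cfg)  # only the low-rank adapter weights receive gradients
    llm.print_trainable_parameters()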
6 Future Directions

In this section, we explore promising future directions for MM-LLMs across the following aspects:

More Powerful Models We can enhance the strength of MM-LLMs from the following four key avenues: (1) Expanding Modalities: current MM-LLMs typically support the following modalities: image, video, audio, 3D, and text. However, the real world involves a broader range of modalities. Extending MM-LLMs to accommodate additional modalities (e.g., web pages, heat maps, and figures & tables) will increase the model's versatility, making it more universally applicable; (2) Diversifying LLMs: incorporating various types and sizes of LLMs provides practitioners with the flexibility to select the most appropriate one based on their specific requirements; (3) Improving MM IT Dataset Quality: current MM IT datasets have ample room for improvement and expansion. Diversifying the range of instructions can enhance the effectiveness of MM-LLMs in understanding and executing user commands; (4) Strengthening MM Generation Capabilities: most current MM-LLMs are predominantly oriented towards MM understanding. Although some models have incorporated MM generation capabilities, the quality of generated responses may be constrained by the capacities of the LDMs. Exploring the integration of retrieval-based approaches (Asai et al., 2023) holds significant promise for complementing the generative process, potentially enhancing the overall performance of the model.
More Challenging Benchmarks Existing benchmarks might not adequately challenge the capabilities of MM-LLMs, given that many datasets have previously appeared to varying degrees in the PT or IT sets. This implies that the models may have learned these tasks during training. Moreover, current benchmarks predominantly concentrate on the VL sub-field. Thus, it is crucial for the development of MM-LLMs to construct a more challenging, larger-scale benchmark that includes more modalities and uses a unified evaluation standard. Concurrently, benchmarks can be tailored to assess the MM-LLMs' proficiency in practical applications. For instance, the introduction of GOAT-Bench (Lin et al., 2024) aims to evaluate various MM-LLMs' capacity to discern and respond to nuanced aspects of social abuse presented in memes.

Mobile/Lightweight Deployment To deploy MM-LLMs on resource-constrained platforms, such as low-power mobile and IoT devices, while achieving optimal performance, lightweight implementations are of paramount importance. A notable advancement in this realm is MobileVLM (Chu et al., 2023a). This approach strategically downscales LLaMA, allowing for seamless off-the-shelf deployment. MobileVLM further introduces a Lightweight Downsample Projector, consisting of fewer than 20 million parameters, contributing to improved computational speed (a rough sketch of the downsampling idea follows). Nevertheless, this avenue necessitates additional exploration for further advancements in development.
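As an illustration of the downsampling idea (reducing the number of visual tokens with a strided depthwise convolution before they reach the LLM), consider the PyTorch sketch below. It is a generic sketch under assumed dimensions, not MobileVLM's released LDP implementation, whose exact layer composition may differ.

    import torch
    import torch.nn as nn

    class DownsampleProjector(nn.Module):
        """Projects visual tokens to the LLM width and halves each spatial side (4x fewer tokens)."""
        def __init__(self, in_dim: int = 1024, out_dim: int = 2048):
            super().__init__()
            self.pointwise = nn.Linear(in_dim, out_dim)
            # Depthwise convolution with stride 2 shrinks an HxW token grid to (H/2)x(W/2).
            self.depthwise = nn.Conv2d(out_dim, out_dim, kernel_size=3, stride=2,
                                       padding=1, groups=out_dim)

        def forward(self, tokens: torch.Tensor, grid: int) -> torch.Tensor:
            # tokens: [batch, grid*grid, in_dim]; grid is the (even) side length of the patch grid
            b = tokens.shape[0]
            x = self.pointwise(tokens)                        # [b, grid*grid, out_dim]
            x = x.transpose(1, 2).reshape(b, -1, grid, grid)  # [b, out_dim, grid, grid]
            x = self.depthwise(x)                             # [b, out_dim, grid/2, grid/2]
            return x.flatten(2).transpose(1, 2)               # [b, (grid/2)^2, out_dim]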
Embodied Intelligence Embodied intelligence aims to replicate human-like perception and interaction with the surroundings by effectively understanding the environment, recognizing pertinent objects, assessing their spatial relationships, and devising a comprehensive task plan (Firoozi et al., 2023). Embodied AI tasks, such as embodied planning, embodied visual question answering, and embodied control, equip robots to autonomously implement extended plans by leveraging real-time observations. Some typical works in this area are PaLM-E (Driess et al., 2023) and EmbodiedGPT (Mu et al., 2023). PaLM-E introduces a multi-embodiment agent through the training of an MM-LLM. Beyond functioning solely as an embodied decision maker, PaLM-E also demonstrates proficiency in handling general VL tasks. EmbodiedGPT introduces an economically efficient method characterized by a CoT approach, enhancing the capability of embodied agents to engage with the real world and establishing a closed loop that connects high-level planning with low-level control. While MM-LLM-based Embodied Intelligence has made advancements in integrating with robots, further exploration is needed to enhance the autonomy of robots.

Continual IT In practical applications, MM-LLMs are expected to adapt to new MM tasks to support additional functionalities. Nevertheless, current MM-LLMs remain static and are unable to adjust to continuously emerging requirements. Therefore, an approach is needed to make the model flexible enough to efficiently and continually leverage emerging data, while avoiding the substantial cost of retraining MM-LLMs. This aligns with the principles of continual learning, where models are designed to incrementally learn new tasks, similar to human learning. Continual IT aims to continuously fine-tune MM-LLMs for new MM tasks while maintaining superior performance on tasks learned during the original MM IT stage. It introduces two primary challenges: (1) catastrophic forgetting, where models forget previous knowledge when learning new tasks (Robins, 1995; McCloskey and Cohen, 1989; Goodfellow et al., 2013; Zhang et al., 2023d,c,b; Zheng et al., 2023a), and (2) negative forward transfer, where the performance on unseen tasks declines when learning new ones (Zheng et al., 2024; Dong et al., 2023b,a). Recently, He et al. (2023) established a benchmark to facilitate the development of continual IT for MM-LLMs. Despite these advancements, there is still significant opportunity and room for improvement in developing better methods to address the challenges of catastrophic forgetting and negative forward transfer (a minimal rehearsal sketch follows).
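A common mitigation for catastrophic forgetting is rehearsal, i.e., mixing a small replay buffer of earlier instruction data into each new-task batch. The sketch below illustrates this in Python under stated assumptions: the buffer is a plain list of instruction-response examples, the mixing ratio is arbitrary, and model.loss() is an assumed interface; it is not the recipe of any benchmark or method cited above.

    import random

    def continual_it_step(model, optimizer, new_batch, replay_buffer, replay_ratio=0.25):
        """One fine-tuning step that rehearses old instructions alongside the new task."""
        k = max(1, int(len(new_batch) * replay_ratio))
        old = random.sample(replay_buffer, k) if len(replay_buffer) >= k else list(replay_buffer)
        mixed = list(new_batch) + old                 # old + new instruction-response pairs
        loss = model.loss(mixed)                      # assumed: the model exposes a loss() over a batch
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        replay_buffer.extend(random.sample(list(new_batch), k))  # grow the buffer with new samples
        return loss.item()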
7 Conclusion

In this paper, we have presented a comprehensive survey of MM-LLMs with a focus on recent advancements. Initially, we categorize the model architecture into five components, providing a detailed overview of general design formulations and training pipelines. Subsequently, we introduce various SOTA MM-LLMs, each distinguished by its specific formulations. Our survey also sheds light on their capabilities across diverse MM benchmarks and envisions future developments in this rapidly evolving field. We hope this survey can provide insights for researchers, contributing to the ongoing advancements in the MM-LLMs domain.
Limitations

In this paper, we embark on a comprehensive exploration of the current MM-LLMs landscape, presenting a synthesis from diverse perspectives enriched by our insights. Acknowledging the dynamic nature of this field, it is plausible that certain aspects may have eluded our scrutiny, and recent advances might not be entirely encapsulated. To tackle this inherent challenge, we've established a dedicated website for real-time tracking, using crowdsourcing to capture the latest advancements. Our goal is for this platform to evolve into a continuous source of contributions propelling ongoing development in the field. Given the constraints of page limits, we are unable to delve into all technical details and have provided concise overviews of the core contributions of mainstream MM-LLMs. Looking ahead, we commit to vigilant monitoring and continual enhancement of relevant details on our website, incorporating fresh insights as they emerge.

References

Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, and Barlas Oguz. 2023. Jointly Training Large Autoregressive Multimodal Models. arXiv preprint arXiv:2309.15564.
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206–24221.
Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, and Niket Tandon. 2023. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. arXiv preprint arXiv:2305.08844.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. 2023. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41–46.
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023a. Qwen technical report. arXiv preprint arXiv:2309.16609.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. CoRR, abs/2308.12966.
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738.
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. 2023. Introducing our Multimodal Models.
Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, and R Manmatha. 2022. Latr: Layout-aware transformer for scene-text vqa. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16548–16558.
Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. 2021. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901.
Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. 2022. Coyo-700m: Image-text pair dataset.
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 961–970.
Cerspense. 2023. Zeroscope: Diffusion-based text-to-video synthesis.
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. 2021. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
Fei-Long Chen, Du-Zhen Zhang, Ming-Lun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. 2023a. Vlp: A survey on vision-language pre-training. Machine Intelligence Research, 20(1):38–56.
Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. 2023b. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160.
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023c. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023d. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint arXiv:2306.15195.
Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023e. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. arXiv preprint arXiv:2311.12793.
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. 2023f. BEATs: Audio Pre-Training with Acoustic Tokenizers. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 5178–5193.
Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022a. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678.
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. 2023g. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv preprint arXiv:2305.18565.
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022b. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2023h. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. 2023a. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886.
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023b. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. 2024. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 958–979.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Thirty-seventh Conference on Neural Information Processing Systems.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
Jiahua Dong, Wenqi Liang, Yang Cong, and Gan Sun. 2023a. Heterogeneous forgetting compensation for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11742–11751.
Jiahua Dong, Duzhen Zhang, Yang Cong, Wei Cong, Henghui Ding, and Dengxin Dai. 2023b. Federated Incremental Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3934–3943.
Linhao Dong and Bo Xu. 2020. Cif: Continuous integrate-and-fire for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6079–6083. IEEE.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. 2022a. A Survey of Vision-Language Pre-Trained Models. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5436–5443.
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022b. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369.
Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. 2023. DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding. arXiv preprint arXiv:2311.11810.
Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. 2023. Foundation Models in Robotics: Applications, Challenges, and the Future. arXiv preprint arXiv:2312.07843.
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. 2023. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
Chin-Lun Fu, Zih-Ching Chen, Yun-Ru Lee, and Hung-Yi Lee. 2022. AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2608–2621.
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. 2023. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108.
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2013. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913.
Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. 2022. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431.
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3608–3617.
Jinghan He, Haiyun Guo, Ming Tang, and Jinqiao Wang. 2023. Continual instruction tuning for large multimodal models. arXiv preprint arXiv:2311.16206.
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2021. Towards a Unified View of Parameter-Efficient Transfer Learning. In International Conference on Learning Representations.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, and Shijian Lu. 2023a. Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey. arXiv preprint arXiv:2312.16602.
Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. 2023b. Audiogpt: Understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. 2023c. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045.
Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709.
IDEFICS. 2023. Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model.
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. 2022. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR.
Yiren Jian, Chongyang Gao, and Soroush Vosoughi. 2023. Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. In Thirty-seventh Conference on Neural Information Processing Systems.
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5648–5656.
Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035.
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pages 787–798.
Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in neural information processing systems, 33:2611–2624.
Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123:32–73.
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Liu. 2023a. Mimic-it: Multi-modal in-context instruction tuning. arXiv preprint arXiv:2306.05425.
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. 2023b. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125.
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023c. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 19730–19742.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. 2022. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR.
Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. 2021. Align before fuse: Vision and language representation learning with momentum distillation. Advances in neural information processing systems, 34:9694–9705.
KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wenhai Wang, Ping Luo, Yali Wang, Limin Wang, and Yu Qiao. 2023d. Videochat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355.
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, Jingjing Xu, Xu Sun, et al. 2023e. M3IT: A Large-Scale Dataset towards Multi-Modal Multilingual Instruction Tuning. arXiv preprint arXiv:2306.04387.
Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning: Optimizing Continuous Prompts for Generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597.
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang, Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object-semantics aligned pre-training for vision-language tasks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16, pages 121–137. Springer.
Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Fu, Guosheng Lin, Chunhua Shen, Ling Chen, and Yunchao Wei. 2023f. Stablellava: Enhanced visual instruction tuning with synthesized image-dialogue data. arXiv preprint arXiv:2308.10253.
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. 2023g. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355.
Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and Xiang Bai. 2023h. Monkey: Image Resolution and Text Label Are Important Things for Large Multi-modal Models. arXiv preprint arXiv:2311.06607.
Hongzhan Lin, Ziyang Luo, Bo Wang, Ruichao Yang, and Jing Ma. 2024. GOAT-Bench: Safety Insights to Large Multimodal Models through Meme-Based Social Abuse. arXiv preprint arXiv:2401.01523.
Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo Molchanov, Andrew Tao, Huizi Mao, Jan Kautz, Mohammad Shoeybi, and Song Han. 2023. VILA: On Pre-training for Visual Language Models. arXiv preprint arXiv:2312.07533.
Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. 2014. Microsoft coco: Common objects in context. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13, pages 740–755. Springer.
Fangyu Liu, Guy Emerson, and Nigel Collier. 2023a. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 11:635–651.
Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo Liu, Danilo P. Mandic, Wenwu Wang, and Mark D. Plumbley. 2023b. AudioLDM: Text-to-Audio Generation with Latent Diffusion Models. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 21450–21474.
Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang, Yuxuan Wang, and Mark D. Plumbley. 2023c. AudioLDM 2: Learning Holistic Audio Generation with Self-supervised Pretraining. CoRR, abs/2308.05734.
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. 2023d. Improved Baselines with Visual Instruction Tuning. In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following.
Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2023e. Visual Instruction Tuning. In Thirty-seventh Conference on Neural Information Processing Systems.
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. 2023f. Mmbench: Is your multi-modal model an all-around player? arXiv preprint arXiv:2307.06281.
Siqu Long, Feiqi Cao, Soyeon Caren Han, and Haiqin Yang. 2022. Vision-and-Language Pretrained Models: A Survey. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5530–5537.
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai-Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter Clark, and Ashwin Kalyan. 2022. Learn to explain: Multimodal reasoning via thought chains for science question answering. Advances in Neural Information Processing Systems, 35:2507–2521.
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Zhu. 2021. Iconqa: A new benchmark for abstract diagram understanding and visual language reasoning. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2).
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qingwei Lin, and Daxin Jiang. 2023. WizardCoder: Empowering Code Large Language Models with Evol-Instruct. arXiv preprint arXiv:2306.08568.
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. 2023. Video-ChatGPT: Towards Detailed Video Understanding via Large Vision and Language Models. arXiv preprint arXiv:2306.05424.
Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. 2021. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209.
Michael McCloskey and Neal J Cohen. 1989. Catastrophic interference in connectionist networks: The sequential learning problem. In Psychology of learning and motivation, volume 24, pages 109–165. Elsevier.
Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, Yuexian Zou, and Wenwu Wang. 2023. Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research. arXiv preprint arXiv:2303.17395.
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, and Anirban Chakraborty. 2019. Ocr-vqa: Visual question answering by reading text in images. In 2019 international conference on document analysis and recognition (ICDAR), pages 947–952. IEEE.
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, Yu Qiao, and Ping Luo. 2023. Embodiedgpt: Vision-language pre-training via embodied chain of thought. In Thirty-seventh Conference on Neural Information Processing Systems.
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muhammad Saqib, Saeed Anwar, Muhammad Usman, Nick Barnes, and Ajmal Mian. 2023. A comprehensive overview of large language models. arXiv preprint arXiv:2307.06435.
OpenAI. 2022. OpenAI: Introducing ChatGPT.
OpenAI. 2023. GPT-4 Technical Report.
Vicente Ordonez, Girish Kulkarni, and Tamara Berg. 2011. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24.
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. 2022. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744.
Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li, Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese, Caiming Xiong, and Juan Carlos Niebles. 2023. X-InstructBLIP: A Framework for aligning X-Modal instruction-aware representations to LLMs and Emergent Cross-modal Reasoning. arXiv preprint arXiv:2311.18799.
Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao, Shaohan Huang, Shuming Ma, and Furu Wei. 2023. Kosmos-2: Grounding Multimodal Large Language Models to the World. arXiv preprint arXiv:2306.14824.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR.
Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust Speech Recognition via Large-Scale Weak Supervision. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 28492–28518.
Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, and Peter J Liu. 2020. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551.
Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea Vedaldi. 2017. Learning multiple visual domains with residual adapters. Advances in neural information processing systems, 30.
Anthony Robins. 1995. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7(2):123–146.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. 2022. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. 2015. U-net: Convolutional networks for biomedical image segmentation. In Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pages 234–241. Springer.
Ludan Ruan and Qin Jin. 2022. Survey: Transformer based video-language pre-training. AI Open, 3:1–13.
Salesforce. 2022. Ulip.
Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. 2022. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in Neural Information Processing Systems, 35:25278–25294.
Christoph Schuhmann, Andreas Köpf, Richard Vencu, Theo Coombes, and Romain Beaumont. 2022b. Laion coco: 600m synthetic captions from laion2b-en.
Christoph Schuhmann, Richard Vencu, Romain Beaumont, Robert Kaczmarczyk, Clayton Mullis, Aarush Katta, Theo Coombes, Jenia Jitsev, and Aran Komatsuzaki. 2021. Laion-400m: Open dataset of clip-filtered 400 million image-text pairs. arXiv preprint arXiv:2111.02114.
Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer.
Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. 2018. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565.
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in huggingface. arXiv preprint arXiv:2303.17580.
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and Amanpreet Singh. 2020. Textcaps: a dataset for image captioning with reading comprehension. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part II 16, pages 742–758. Springer.
Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. 2019. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326.
Shezheng Song, Xiaopeng Li, and Shasha Li. 2023. How to Bridge the Gap between Modalities: A Comprehensive Survey on Multimodal Large Language Model. arXiv preprint arXiv:2311.07594.
Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan Wang, and Deng Cai. 2023. Pandagpt: One model to instruction-follow them all. arXiv preprint arXiv:2305.16355.
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu, Chunyuan Li, Yikang Shen, Chuang Gan, Liang-Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. 2023. Aligning large multimodal models with factually augmented rlhf. arXiv preprint arXiv:2309.14525.
Dídac Surís, Sachit Menon, and Carl Vondrick. 2023. Vipergpt: Visual inference via python execution for reasoning. arXiv preprint arXiv:2303.08128.
Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu, Chenguang Zhu, and Mohit Bansal. 2023a. CoDi-2: In-Context, Interleaved, and Interactive Any-to-Any Generation. arXiv preprint arXiv:2311.18775.
Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng, and Mohit Bansal. 2023b. Any-to-Any Generation via Composable Diffusion. In Thirty-seventh Conference on Neural Information Processing Systems.
Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Garcia, Jason Wei, Xuezhi Wang, Hyung Won Chung, Dara Bahri, Tal Schuster, Steven Zheng, et al. 2022. Ul2: Unifying language learning paradigms. In The Eleventh International Conference on Learning Representations.
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. 2023. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971.
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. Advances in neural information processing systems, 30.
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Zhou, and Hongxia Yang. 2022a. Ofa: Unifying architectures, tasks, and modalities through a simple sequence-to-sequence learning framework. In International Conference on Machine Learning, pages 23318–23340. PMLR.
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. 2023. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079.
Wenhui Wang, Hangbo Bao, Li Dong, Johan Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, Owais Khan Mohammed, Saksham Singhal, Subhojit Som, et al. 2022b. Image as a foreign language: Beit pretraining for all vision and vision-language tasks. arXiv preprint arXiv:2208.10442.
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, Adams Wei Yu, Brian Lester, Nan Du, Andrew M Dai, and Quoc V Le. 2021. Finetuned Language Models are Zero-Shot Learners. In International Conference on Learning Representations.
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong Wang, Zecheng Tang, and Nan Duan. 2023a. Visual chatgpt: Talking, drawing and editing with visual foundation models. arXiv preprint arXiv:2303.04671.
Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu Sun, Qiong Yan, Guangtao Zhai, et al. 2023b. Q-bench: A benchmark for general-purpose foundation models on low-level vision. arXiv preprint arXiv:2309.14181.
Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen Lin, Yanwei Fu, et al. 2017. Ai challenger: A large-scale dataset for going deeper in image understanding. arXiv preprint arXiv:1711.06475.
Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng Wan, and Philip S Yu. 2023c. Multimodal large language models: A survey. arXiv preprint arXiv:2311.13165.
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and Tat-Seng Chua. 2023d. Next-gpt: Any-to-any multimodal llm. arXiv preprint arXiv:2309.05519.
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-vtt: A large video description dataset for bridging video and language. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5288–5296.
Rongtao Xu, Changwei Wang, Jiaxi Sun, Shibiao Xu, Weiliang Meng, and Xiaopeng Zhang. 2023a. Self Correspondence Distillation For End-to-End Weakly-Supervised Semantic Segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence.
Rongtao Xu, Changwei Wang, Jiguang Zhang, Shibiao Xu, Weiliang Meng, and Xiaopeng Zhang. 2023b. Rssformer: Foreground saliency enhancement for remote sensing land-cover segmentation. IEEE Transactions on Image Processing, 32:1052–1064.
Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng Wang, Xudong Lin, Guanyu Cai, and Jinhui Tang. 2021. Video-text pre-training with learned regions. arXiv preprint arXiv:2112.01194.
Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi, and Junzhou Huang. 2022. Vision-language pre-training with triple contrastive learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15671–15680.
Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu, Ce Liu, Michael Zeng, and Lijuan Wang. 2023. Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381.
Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye, Ming Yan, Yiyang Zhou, Junyang Wang, Anwen Hu, Pengcheng Shi, Yaya Shi, et al. 2023. mplug-owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178.
Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, and Enhong Chen. 2023a. A Survey on Multimodal Large Language Models. arXiv preprint arXiv:2306.13549.
Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi, Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xiaoshui Huang, Zhiyong Wang, et al. 2023b. Lamm: Language-assisted multi-modal instruction-tuning dataset, framework, and benchmark. arXiv preprint arXiv:2306.06687.
Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. 2014. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions. Transactions of the Association for Computational Linguistics, 2:67–78.
Licheng Yu, Patrick Poirson, Shan Yang, Alexander C Berg, and Tamara L Berg. 2016. Modeling context in referring expressions. In Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14, pages 69–85. Springer.
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. 2023. Mm-vet: Evaluating large multimodal models for integrated capabilities. arXiv preprint arXiv:2308.02490.
Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang, Jie Zhou, and Jiwen Lu. 2022. Point-bert: Pre-training 3d point cloud transformers with masked point modeling. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19313–19322.
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusupati, Jack Hessel, Ali Farhadi, and Yejin Choi. 2022. Merlot reserve: Neural script knowledge through vision and language and sound. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16375–16387.
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu, Wendi Zheng, Xiao Xia, et al. 2022a. GLM-130B: An Open Bilingual Pre-trained Model. In The Eleventh International Conference on Learning Representations.
Yan Zeng, Xinsong Zhang, and Hang Li. 2022b. Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts. In International Conference on Machine Learning, pages 25994–26009. PMLR.
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023a. SpeechGPT: Empowering Large Language Models with Intrinsic Cross-Modal Conversational Abilities. In Findings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6-10, 2023, pages 15757–15773.
Duzhen Zhang, Wei Cong, Jiahua Dong, Yahan Yu, Xiuyi Chen, Yonggang Zhang, and Zhen Fang. 2023b. Continual Named Entity Recognition without Catastrophic Forgetting. In The 2023 Conference on Empirical Methods in Natural Language Processing.
Duzhen Zhang, Hongliu Li, Wei Cong, Rongtao Xu, Jiahua Dong, and Xiuyi Chen. 2023c. Task relation distillation and prototypical pseudo label for incremental named entity recognition. In Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 3319–3329.
Duzhen Zhang, Yahan Yu, Feilong Chen, and Xiuyi Chen. 2023d. Decomposing Logits Distillation for Incremental Named Entity Recognition. In Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 1919–1923.
Duzhen Zhang, Tielin Zhang, Shuncheng Jia, Qingyu Wang, and Bo Xu. 2022a. Recent Advances and New Frontiers in Spiking Neural Networks. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5670–5677.
Hang Zhang, Xin Li, and Lidong Bing. 2023e. Video-LLaMA: An Instruction-tuned Audio-Visual Language Model for Video Understanding. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023 - System Demonstrations, Singapore, December 6-10, 2023, pages 543–553.
Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas Guibas, and Jitendra Malik. 2020. Side-tuning: a baseline for network adaptation via additive side networks. In Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part III 16, pages 698–714. Springer.
Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. 2022b. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068.
Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan Zhou, Nedim Lipka, Diyi Yang, and Tong Sun. 2023f. Llavar: Enhanced visual instruction tuning for text-rich image understanding. arXiv preprint arXiv:2306.17107.
Bo Zhao, Boya Wu, and Tiejun Huang. 2023a. Svit: Scaling up visual instruction tuning. arXiv preprint arXiv:2307.04087.
Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Haoran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng, Runpei Dong, Chunrui Han, et al. 2023b. Chatspot: Bootstrapping multimodal llms via precise referring instruction tuning. arXiv preprint arXiv:2307.09474.
Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. 2022. EGSDE: Unpaired Image-to-Image Translation via Energy-Guided Stochastic Differential Equations. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, et al. 2023c. A survey of large language models. arXiv preprint arXiv:2303.18223.
Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang, Jiashi Feng, and Bingyi Kang. 2023d. Bubogpt: Enabling visual grounding in multi-modal llms. arXiv preprint arXiv:2307.08581.
Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, and Huawen Feng. 2024. Beyond Anti-Forgetting: Multimodal Continual Instruction Tuning with Positive Forward Transfer. arXiv preprint arXiv:2401.09181.
Junhao Zheng, Shengjie Qiu, and Qianli Ma. 2023a. Learn or Recall? Revisiting Incremental Learning with Pre-trained Language Models. arXiv preprint arXiv:2312.07887.
Kaizhi Zheng, Xuehai He, and Xin Eric Wang. 2023b. Minigpt-5: Interleaved vision-and-language generation via generative vokens. arXiv preprint arXiv:2310.02239.
Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and Mohamed Elhoseiny. 2023a. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592.
Wanrong Zhu, Jack Hessel, Anas Awadalla, A Related Surveys
Samir Yitzhak Gadre, Jesse Dodge, Alex Fang,
Youngjae Yu, Ludwig Schmidt, William Yang Wang, Prior to the emergence of LLMs, several surveys
and Yejin Choi. 2023b. Multimodal c4: An open, on traditional MM PT have been conducted (Ruan
billion-scale corpus of images interleaved with text. and Jin, 2022; Du et al., 2022a; Long et al., 2022;
arXiv preprint arXiv:2304.06939.
Chen et al., 2023a). Most of these models entail a
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei- substantial computational cost during the PT phase,
Fei. 2016. Visual7w: Grounded question answering attributable to end-to-end training using large-scale
in images. In Proceedings of the IEEE conference
models and datasets. As a consequence of not incor-
on computer vision and pattern recognition, pages
4995–5004. porating LLMs, these models suffer from deficien-
cies in instruction following, ICL, CoT, and inter-
active capabilities. Moreover, the training pipeline
solely encompasses the PT phase without the inclu-
sion of an IT stage.
In recent times, several surveys have emerged
on MM-LLMs. Yin et al. and Wu et al. exclu-
sively delve into early VL understanding models.
Huang et al. place a primary emphasis on visual IT,
while Song et al. focus on modal alignment meth-
ods. Lastly, Cui et al. provide a comprehensive
review of the applications of MM-LLMs within the
realm of autonomous driving.
Compared with these works, the main distinctions of our survey are as follows:
• We have comprehensively covered nearly all
MM-LLMs over the past year, including not
only understanding models but also generative
models. Our coverage extends beyond VL to encompass additional modalities such as audio and 3D;
• To offer readers a comprehensive understand-
ing of MM-LLMs, we have introduced a gen-
eral model architecture that incorporates any-
to-any modality transformations, providing a
detailed overview of the functional roles and
implementation choices for each component;
• We have summarized the developmental
trends of existing MM-LLMs and provided
some training recipes that can enhance effec-
tiveness;
• We have established an open-source website
for MM-LLMs researchers, supporting crowd-
sourced updates and aiming to facilitate col-
laboration in the MM-LLMs field. We antic-
ipate that this survey will illuminate future
research in the MM-LLMs domain.
Table 3: The statistics for MM PT datasets. #.X represents the quantity of X, #.T represents the quantity of Text, and #.X-T represents the quantity of X-Text pairs, where X can be Image, Video, or Audio.
Table 4: The statistics for MM IT datasets. I→O: Input to Output Modalities, T: Text, I: Image, V: Video, A: Audio, B: Bounding box, 3D: Point Cloud, Tab: Table, and Web: Web page.
Dataset Name | Type | I→O | Source | Method | Multi-Turn | #.I/V/A | #.Dialog Turn | #.Instance
MiniGPT-4's IT (Zhu et al., 2023a) | SFT | I+T→T | CC3M, CC12M | Auto. | ✗ | 134M/–/– | 1 | 5K
StableLLaVA (Li et al., 2023f) | SFT | I+T→T | SD (Rombach et al., 2022) | Auto.+Manu. | ✗ | 126K/–/– | 1 | 126K
LLaVA's IT (Zhang et al., 2023f) | SFT | I+T→T | MS-COCO | Auto. | ✓ | 81K/–/– | 2.29 | 150K
SVIT (Zhao et al., 2023a) | SFT | I+T→T | MS-COCO, Visual Genome | Auto. | ✓ | 108K/–/– | 5 | 3.2M
LLaVAR (Zhang et al., 2023f) | SFT | I+T→T | MS-COCO, CC3M, LAION | LLaVA+Auto. | ✓ | 20K/–/– | 2.27 | 174K
ShareGPT4V (Chen et al., 2023e) | SFT | I+T→T | LCS, COCO, SAM, TextCaps, WikiArt | Auto.+Manu. | ✗ | 100K/–/– | – | –
DRESS's IT (Chen et al., 2023h) | SFT | I+T→T | LLaVA's IT, VLSafe | Auto.+Manu. | ✓ | 193K/–/– | ∼4 | –
VideoChat's IT (Li et al., 2023d) | SFT | V+T→T | WebVid | Auto. | ✓ | –/8K/– | 1.82 | 11K
Video-ChatGPT's IT (Maaz et al., 2023) | SFT | V+T→T | ActivityNet (Caba Heilbron et al., 2015) | Inherit | ✓ | –/100K/– | 1 | 100K
Video-LLaMA's IT (Zhang et al., 2023e) | SFT | I/V+T→T | MiniGPT-4, LLaVA, and VideoChat's IT | Auto. | ✓ | 81K/8K/– | 2.22 | 171K
InstructBLIP's IT (Dai et al., 2023) | SFT | I/V+T→T | Multiple (InstructBLIP's Figure 2) | Auto. | ✗ | – | – | ∼1.6M
X-InstructBLIP's IT (Panagopoulou et al., 2023) | SFT | I/V/A/3D+T→T | Multiple (X-InstructBLIP's Figure 4) | Auto. | ✗ | – | – | ∼1.8M
MIMIC-IT (Li et al., 2023a) | SFT | I/V+T→T | Multiple | Auto. | ✗ | 8.1M/502K/– | 1 | 2.8M
PandaGPT's IT (Su et al., 2023) | SFT | I+T→T | MiniGPT-4 and LLaVA's IT | Inherit | ✓ | 81K/–/– | 2.29 | 160K
MGVLID (Zhao et al., 2023b) | SFT | I+B+T→T | Multiple | Auto.+Manu. | ✗ | 108K/–/– | – | 108K
M3IT (Li et al., 2023e) | SFT | I/V/B+T→T | Multiple | Auto.+Manu. | ✗ | –/–/– | 1 | 2.4M
LAMM (Yin et al., 2023b) | SFT | I+3D+T→T | Multiple | Auto.+Manu. | ✓ | 91K/–/– | 3.27 | 196K
BuboGPT's IT (Zhao et al., 2023d) | SFT | (I+A)/A+T→T | Clotho, VGGSS | Auto. | ✗ | 5K/–/9K | – | 9K
mPLUG-DocOwl's IT (Ye et al., 2023) | SFT | I/Tab/Web+T→T | Multiple | Inherit | ✗ | – | – | –
T2M (Wu et al., 2023d) | SFT | T→I/V/A+T | WebVid, CC3M, AudioCap | Auto. | ✗ | 4.9K/4.9K/4.9K | 1 | 14.7K
MosIT (Wu et al., 2023d) | SFT | I+V+A+T→I+V+A+T | Youtube, Google, Flickr30k, Midjourney, etc. | Auto.+Manu. | ✓ | 4K/4K/4K | 4.8 | 5K
DRESS's IT (Chen et al., 2023h) | RLHF | I+T→T | LLaVA's IT, VLSafe | Auto.+Manu. | ✓ | 33K/–/– | ∼4 | –
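To make the bookkeeping behind Table 4 concrete, the sketch below shows how statistics such as #.I, #.Dialog Turn, and #.Instance could be tallied from a generic instruction-tuning corpus. It is a minimal sketch under stated assumptions: the record layout (keys "images" and "conversations" with alternating human/model messages) and the file name are hypothetical, loosely modeled on common LLaVA-style JSON releases rather than the exact format of any dataset listed above.

# Minimal sketch (not from the paper): tallying Table-4-style statistics for a
# generic multimodal instruction-tuning (IT) corpus. The record layout below
# ("images", "conversations") and the file name are hypothetical assumptions.
import json
from statistics import mean

def corpus_stats(path: str) -> dict:
    """Return #.I (unique images), #.Dialog Turn (average turns per instance),
    and #.Instance for one JSON corpus file."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)  # assumed: a list of instance dicts

    image_ids = set()
    turns = []
    for rec in records:
        image_ids.update(rec.get("images", []))
        # One dialogue turn = one human message plus one model response.
        turns.append(len(rec.get("conversations", [])) // 2)

    return {
        "#.I": len(image_ids),
        "#.Dialog Turn": round(mean(turns), 2) if turns else 0,
        "#.Instance": len(records),
    }

print(corpus_stats("llava_style_it_data.json"))  # hypothetical file name

Averaging per-instance turn counts in this way corresponds to the per-dataset averages reported in the #.Dialog Turn column (e.g., 2.29 for LLaVA's IT).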