
MM-LLMs: Recent Advances in MultiModal Large Language Models

Duzhen Zhang1*, Yahan Yu2*, Chenxing Li1, Jiahua Dong3†, Dan Su1, Chenhui Chu2† and Dong Yu1
1 Tencent AI Lab   2 Kyoto University   3 Shenyang Institute of Automation, Chinese Academy of Sciences
* Equal contributions.  † Corresponding authors.
[email protected], [email protected]

arXiv:2401.13601v2 [cs.CL] 25 Jan 2024

Abstract

In the past year, MultiModal Large Language Models (MM-LLMs) have undergone substantial advancements, augmenting off-the-shelf LLMs to support MM inputs or outputs via cost-effective training strategies. The resulting models not only preserve the inherent reasoning and decision-making capabilities of LLMs but also empower a diverse range of MM tasks. In this paper, we provide a comprehensive survey aimed at facilitating further research of MM-LLMs. Specifically, we first outline general design formulations for model architecture and training pipeline. Subsequently, we provide brief introductions of 26 existing MM-LLMs, each characterized by its specific formulations. Additionally, we review the performance of MM-LLMs on mainstream benchmarks and summarize key training recipes to enhance the potency of MM-LLMs. Lastly, we explore promising directions for MM-LLMs while concurrently maintaining a real-time tracking website1 for the latest developments in the field. We hope that this survey contributes to the ongoing advancement of the MM-LLMs domain.

1 https://mm-llms.github.io

Figure 1: The timeline of MM-LLMs.

1 Introduction

MultiModal (MM) pre-training research has witnessed significant advancements in recent years, consistently pushing the performance boundaries across a spectrum of downstream tasks (Li et al., 2020; Akbari et al., 2021; Fang et al., 2021; Yan et al., 2021; Li et al., 2021; Radford et al., 2021; Li et al., 2022; Zellers et al., 2022; Zeng et al., 2022b; Yang et al., 2022; Wang et al., 2022a,b). However, as the scale of models and datasets continues to expand, traditional MM models incur substantial computational costs, particularly when trained from scratch. Recognizing that MM research operates at the intersection of various modalities, a logical approach is to capitalize on readily available pre-trained unimodal foundation models, with a special emphasis on powerful Large Language Models (LLMs) (OpenAI, 2022). This strategy aims to mitigate computational expenses and enhance the efficacy of MM pre-training, leading to the emergence of a novel field: MM-LLMs.

MM-LLMs harness LLMs as the cognitive powerhouse to empower various MM tasks. LLMs contribute desirable properties like robust language generation, zero-shot transfer capabilities, and In-Context Learning (ICL). Concurrently, foundation models in other modalities provide high-quality representations. Considering foundation models from different modalities are individually pre-trained, the core challenge facing MM-LLMs is how to effectively connect the LLM with models in other modalities to enable collaborative inference. The predominant focus within this field has been on refining alignment between modalities and aligning with human intent via a MM Pre-Training (PT) + MM Instruction-Tuning (IT) pipeline.

With the debut of GPT-4(Vision) (OpenAI, 2023) and Gemini (Team et al., 2023), showcasing impressive MM understanding and generation capabilities, a research fervor on MM-LLMs has been sparked.
Initial research primarily focuses on MM content comprehension and text generation, like (Open)Flamingo (Alayrac et al., 2022; Awadalla et al., 2023), BLIP-2 (Li et al., 2023c), Kosmos-1 (Huang et al., 2023c), LLaVA/LLaVA-1.5 (Liu et al., 2023e,d), MiniGPT-4 (Zhu et al., 2023a), MultiModal-GPT (Gong et al., 2023), VideoChat (Li et al., 2023d), Video-LLaMA (Zhang et al., 2023e), IDEFICS (IDEFICS, 2023), Fuyu-8B (Bavishi et al., 2023), and Qwen-Audio (Chu et al., 2023b). In pursuit of MM-LLMs capable of both MM input and output (Aiello et al., 2023), some studies additionally explore the generation of specific modalities, such as Kosmos-2 (Peng et al., 2023) and MiniGPT-5 (Zheng et al., 2023b) introducing image generation, and SpeechGPT (Zhang et al., 2023a) introducing speech generation. Recent research endeavors have focused on mimicking human-like any-to-any modality conversion, shedding light on the path to artificial general intelligence. Some efforts aim to amalgamate LLMs with external tools to reach an approaching 'any-to-any' MM comprehension and generation, such as Visual-ChatGPT (Wu et al., 2023a), ViperGPT (Surís et al., 2023), MM-REACT (Yang et al., 2023), HuggingGPT (Shen et al., 2023), and AudioGPT (Huang et al., 2023b). Conversely, to mitigate propagated errors in the cascade system, initiatives like NExT-GPT (Wu et al., 2023d) and CoDi-2 (Tang et al., 2023b) have developed end-to-end MM-LLMs of arbitrary modalities. The timeline of MM-LLMs is depicted in Figure 1.

In this paper, we present a comprehensive survey aimed at facilitating further research of MM-LLMs. To provide readers with a holistic understanding of MM-LLMs, we initially delineate general design formulations from model architecture (Section 2) and training pipeline (Section 3). We break down the general model architecture into five components: Modality Encoder (Section 2.1), Input Projector (Section 2.2), LLM Backbone (Section 2.3), Output Projector (Section 2.4), and Modality Generator (Section 2.5). The training pipeline elucidates how to enhance a pre-trained text-only LLM to support MM input or output, primarily consisting of two stages: MM PT (Section 3.1) and MM IT (Section 3.2). In this section, we also provide a summary of mainstream datasets for MM PT and MM IT. Next, we engage in discussions of 26 State-of-the-Art (SOTA) MM-LLMs, each characterized by specific formulations, and summarize their development trends in Section 4. In Section 5, we comprehensively review the performance of major MM-LLMs on mainstream benchmarks and distill key training recipes to enhance the efficacy of MM-LLMs. In Section 6, we offer promising directions for MM-LLMs research. Moreover, we have established a website (https://mm-llms.github.io) to track the latest progress of MM-LLMs and facilitate crowd-sourcing updates. Finally, we summarize the entire paper in Section 7 and discuss related surveys on MM-LLMs in Appendix A. We aspire for our survey to aid researchers in gaining a deeper understanding of this field and to inspire the design of more effective MM-LLMs.

2 Model Architecture

In this section, we provide a detailed overview of the five components comprising the general model architecture, along with the implementation choices for each component, as illustrated in Figure 2. MM-LLMs that emphasize MM understanding only include the first three components. During training, the Modality Encoder, LLM Backbone, and Modality Generator are generally maintained in a frozen state. The primary optimization emphasis is on the Input and Output Projectors. Given that Projectors are lightweight components, the proportion of trainable parameters in MM-LLMs is notably small compared to the total parameter count (typically around 2%). The overall parameter count is contingent on the scale of the core LLM utilized in the MM-LLMs. As a result, MM-LLMs can be efficiently trained to empower various MM tasks.

2.1 Modality Encoder

The Modality Encoder (ME) is tasked with encoding inputs from diverse modalities I_X to obtain corresponding features F_X, formulated as follows:

F_X = ME_X(I_X).   (1)

Various pre-trained encoder options ME_X exist for handling different modalities, where X can be image, video, audio, 3D, etc. Next, we will offer a concise introduction organized by modality.
Figure 2: The general model architecture of MM-LLMs and the implementation choices for each component.
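To make Equation (1) concrete, the following is a minimal, illustrative PyTorch sketch that treats a frozen CLIP ViT (loaded through the Hugging Face transformers library) as the image-modality encoder ME_I; the checkpoint name, function name, and batch handling are assumptions for illustration rather than the configuration of any specific MM-LLM surveyed here.

```python
import torch
from PIL import Image
from transformers import CLIPImageProcessor, CLIPVisionModel

# Frozen image encoder ME_I: a CLIP ViT-L/14, one of the visual encoder
# options discussed below (the checkpoint id is an illustrative choice).
processor = CLIPImageProcessor.from_pretrained("openai/clip-vit-large-patch14")
encoder = CLIPVisionModel.from_pretrained("openai/clip-vit-large-patch14")
encoder.requires_grad_(False).eval()  # the Modality Encoder stays frozen

@torch.no_grad()
def encode_image(image: Image.Image) -> torch.Tensor:
    """F_I = ME_I(I_I): map a raw image to a sequence of patch features."""
    pixel_values = processor(images=image, return_tensors="pt").pixel_values
    outputs = encoder(pixel_values=pixel_values)
    # Shape (batch, num_patches + 1, hidden_dim); these features F_X are
    # later aligned with the LLM's text space by the Input Projector.
    return outputs.last_hidden_state
```

Swapping in a different visual, audio, or 3D encoder changes only this component; the downstream Input Projector consumes the resulting feature sequence F_X regardless of its source.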

Visual Modality  For images, there are generally four optional encoders: NFNet-F6 (Brock et al., 2021), ViT (Dosovitskiy et al., 2020), CLIP ViT (Radford et al., 2021), and Eva-CLIP ViT (Fang et al., 2023). NFNet-F6 is a normalizer-free ResNet (He et al., 2016), showcasing an adaptive gradient clipping technique that allows training on extensively augmented datasets while achieving SOTA levels of image recognition. ViT applies the Transformer (Vaswani et al., 2017) to images by first dividing the image into patches. It then undergoes linear projection to flatten the patches, followed by encoding via multiple Transformer blocks. CLIP ViT builds connections between text and images, comprising a ViT and a text encoder. Utilizing a vast amount of text-image pairs, it optimizes ViT by contrastive learning, treating paired text and images as positive samples and others as negative ones. Its Eva version stabilizes the training and optimization process of the massive CLIP, offering new directions in expanding and accelerating the expensive training of MM base models. For videos, they can be uniformly sampled to 5 frames, undergoing the same pre-processing as images.

Audio Modality is typically encoded by C-Former (Chen et al., 2023b), HuBERT (Hsu et al., 2021), BEATs (Chen et al., 2023f), and Whisper (Radford et al., 2023). C-Former employs the CIF alignment mechanism (Dong and Xu, 2020; Zhang et al., 2022a) for sequence transduction and a Transformer to extract audio features. HuBERT is a self-supervised speech representation learning framework based on BERT (Kenton and Toutanova, 2019), achieved by the masked prediction of discrete hidden units. BEATs is an iterative audio pre-training framework designed to learn Bidirectional Encoder representations from Audio Transformers.

3D Point Cloud Modality is typically encoded by ULIP-2 (Salesforce, 2022; Xu et al., 2023a,b) with a PointBERT (Yu et al., 2022) backbone. Moreover, to handle numerous heterogeneous modal encoders, some MM-LLMs, particularly any-to-any ones, use ImageBind (Girdhar et al., 2023), a unified encoder covering six modalities, including image, video, text, audio, heat map, etc.

2.2 Input Projector

The Input Projector Θ_X→T is tasked with aligning the encoded features of other modalities F_X with the text feature space T. The aligned features, as prompts P_X, are then fed into the LLM Backbone alongside the textual features F_T. Given the X-text dataset {I_X, t}, the goal is to minimize the X-conditioned text generation loss L_txt-gen:

arg min_{Θ_X→T} L_txt-gen(LLM(P_X, F_T), t),   (2)

where P_X = Θ_X→T(F_X).

The Input Projector can be achieved directly by a Linear Projector or Multi-Layer Perceptron (MLP), i.e., several linear projectors interleaved with non-linear activation functions. There are also more complex implementations like Cross-attention, Q-Former (Li et al., 2023c), or P-Former (Jian et al., 2023). Cross-attention uses a set of trainable vectors as queries and the encoded features F_X as keys to compress the feature sequence to a fixed length. The compressed representation is then fed directly into the LLM (Bai et al., 2023b) or further used for X-text cross-attention fusion (Alayrac et al., 2022). Q-Former extracts relevant features from F_X, and the selected features are then used as prompts P_X. Meanwhile, P-Former generates 'reference prompts', imposing an alignment constraint on the prompts produced by Q-Former. However, both the Q-Former and P-Former require a separate PT process for initialization.
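To illustrate the lighter-weight Input Projector variants described above, here is a minimal PyTorch sketch of an MLP projector and of a cross-attention resampler that compresses F_X into a fixed-length prompt P_X using trainable queries; the hidden sizes, query count, and class names are illustrative assumptions, and a full Q-Former or P-Former would add self-attention layers plus the separate pre-training stage noted above.

```python
import torch
import torch.nn as nn

class MLPProjector(nn.Module):
    """Linear/MLP variant: maps encoder features F_X into the LLM token space T."""
    def __init__(self, d_enc: int = 1024, d_llm: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(d_enc, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm)
        )

    def forward(self, f_x: torch.Tensor) -> torch.Tensor:  # (B, N, d_enc)
        return self.net(f_x)                                # P_X: (B, N, d_llm)

class QueryResampler(nn.Module):
    """Cross-attention variant: trainable queries compress F_X to a fixed length."""
    def __init__(self, d_enc: int = 1024, d_llm: int = 4096, num_queries: int = 32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, d_enc) * 0.02)
        self.attn = nn.MultiheadAttention(d_enc, num_heads=8, batch_first=True)
        self.out = nn.Linear(d_enc, d_llm)

    def forward(self, f_x: torch.Tensor) -> torch.Tensor:   # (B, N, d_enc)
        q = self.queries.unsqueeze(0).expand(f_x.size(0), -1, -1)
        pooled, _ = self.attn(query=q, key=f_x, value=f_x)   # (B, num_queries, d_enc)
        return self.out(pooled)                              # P_X: (B, num_queries, d_llm)
```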
2.3 LLM Backbone

Taking LLMs (Zhao et al., 2023c; Naveed et al., 2023; Luo et al., 2023) as the core agents, MM-LLMs can inherit some notable properties like zero-shot generalization, few-shot ICL, Chain-of-Thought (CoT), and instruction following. The LLM Backbone processes representations from various modalities, engaging in semantic understanding, reasoning, and decision-making regarding the inputs. It produces (1) direct textual outputs t, and (2) signal tokens S_X from other modalities (if any). These signal tokens act as instructions to guide the generator on whether to produce MM contents and, if affirmative, specifying the content to produce:

t, S_X = LLM(P_X, F_T),   (3)

where the aligned representations of other modalities P_X can be considered as soft Prompt-tuning for the LLM Backbone. Moreover, some research works have introduced Parameter-Efficient Fine-Tuning (PEFT) methods, such as Prefix-tuning (Li and Liang, 2021), Adapter (Houlsby et al., 2019), and LoRA (Hu et al., 2021). In these cases, the number of additional trainable parameters is exceptionally minimal, even less than 0.1% of the total LLM parameter count. We provide an introduction to mainstream PEFT methods in Appendix B.
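As a concrete illustration of how little is typically trained when PEFT is applied to the LLM Backbone, the following is a self-contained LoRA-style sketch written from scratch (it does not follow any particular PEFT library's API); the rank, scaling factor, and layer size are illustrative assumptions.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # frozen LLM weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)

# Rough parameter accounting for a single 4096x4096 projection:
layer = LoRALinear(nn.Linear(4096, 4096))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable fraction: {trainable / total:.4%}")  # ~0.4% for this layer; far lower across a full LLM
```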
The commonly used LLMs in MM-LLMs include Flan-T5 (Chung et al., 2022), ChatGLM (Zeng et al., 2022a), UL2 (Tay et al., 2022), Qwen (Bai et al., 2023a), Chinchilla (Hoffmann et al., 2022), OPT (Zhang et al., 2022b), PaLM (Chowdhery et al., 2023), LLaMA (Touvron et al., 2023a), LLaMA-2 (Touvron et al., 2023b), and Vicuna (Chiang et al., 2023). We provide a brief introduction to these models in Appendix C.

2.4 Output Projector

The Output Projector Θ_T→X maps the signal token representations S_X from the LLM Backbone into features H_X understandable to the following Modality Generator MG_X. Given the X-text dataset {I_X, t}, t is first fed into the LLM to generate the corresponding S_X, which is then mapped into H_X. To facilitate alignment of the mapped features H_X, the goal is to minimize the distance between H_X and the conditional text representations of MG_X:

arg min_{Θ_T→X} L_mse(H_X, τ_X(t)).   (4)

The optimization only relies on captioning texts, without utilizing any audio or visual resources X, where H_X = Θ_T→X(S_X) and τ_X is the textual condition encoder in MG_X. The Output Projector is implemented by a Tiny Transformer or MLP.

2.5 Modality Generator

The Modality Generator MG_X is tasked with producing outputs in distinct modalities. Commonly, existing works use off-the-shelf Latent Diffusion Models (LDMs) (Zhao et al., 2022), i.e., Stable Diffusion (Rombach et al., 2022) for image synthesis, Zeroscope (Cerspense, 2023) for video synthesis, and AudioLDM-2 (Liu et al., 2023b,c) for audio synthesis. The features H_X mapped by the Output Projector serve as conditional inputs in the denoising process to generate MM content. During training, the ground-truth content is first transformed into a latent feature z_0 by the pre-trained VAE (Kingma and Welling, 2013). Then, noise ε is added to z_0 to obtain the noisy latent feature z_t. A pre-trained UNet (Ronneberger et al., 2015) ε_X is used to compute the conditional LDM loss L_X-gen as follows:

L_X-gen := E_{ε∼N(0,1), t} ||ε − ε_X(z_t, t, H_X)||_2^2,   (5)

which optimizes the parameters Θ_X→T and Θ_T→X by minimizing L_X-gen.
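The two objectives in Equations (4) and (5) can be sketched as follows in PyTorch; here output_projector, unet, and the pre-computed conditioning features are stand-ins for the trainable Output Projector and the frozen components of the Modality Generator, and the noising schedule is a deliberately simplified toy version of what an LDM actually uses.

```python
import torch
import torch.nn.functional as F

def output_alignment_loss(s_x, t_caption_feats, output_projector):
    """Eq. (4): L_mse(H_X, tau_X(t)), aligning projected signal tokens with the
    generator's frozen text-conditioning features, using captions only."""
    h_x = output_projector(s_x)               # H_X = Theta_{T->X}(S_X)
    return F.mse_loss(h_x, t_caption_feats)   # t_caption_feats = tau_X(t)

def ldm_generation_loss(z0, h_x, unet, num_train_timesteps=1000):
    """Eq. (5): predict the added noise eps from the noisy latent z_t,
    conditioned on H_X (toy cosine-style noising schedule for illustration)."""
    b = z0.size(0)
    t = torch.randint(0, num_train_timesteps, (b,), device=z0.device)
    eps = torch.randn_like(z0)
    alpha_bar = torch.cos(t.float() / num_train_timesteps * torch.pi / 2) ** 2
    while alpha_bar.dim() < z0.dim():
        alpha_bar = alpha_bar.unsqueeze(-1)
    z_t = alpha_bar.sqrt() * z0 + (1 - alpha_bar).sqrt() * eps
    eps_pred = unet(z_t, t, h_x)              # epsilon_X(z_t, t, H_X), frozen UNet
    return F.mse_loss(eps_pred, eps)
```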
3 Training Pipeline

MM-LLMs' training pipeline can be delineated into two principal stages: MM PT and MM IT.

3.1 MM PT

During the PT stage, typically leveraging the X-Text datasets, Input and Output Projectors are trained to achieve alignment among various modalities by optimizing predefined objectives (PEFT is sometimes applied to the LLM Backbone). For MM understanding models, optimization focuses solely on Equation (2), while for MM generation models, optimization involves Equations (2), (4), and (5). In the latter case, Equation (2) also includes the ground-truth signal token sequence.

The X-Text datasets encompass Image-Text, Video-Text, and Audio-Text, with Image-Text having two types: Image-Text pairs (i.e., <img1><txt1>) and interleaved Image-Text corpus (i.e., <txt1><img1><txt2><txt3><img2><txt4>). The detailed statistics for these X-Text datasets are presented in Table 3 of Appendix F.
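A minimal sketch of one MM PT optimization step for an understanding-only model is given below, assuming a generic llm handle that accepts soft prompts and returns the loss of Equation (2); the function signatures and batch keys are illustrative assumptions, not a specific framework API.

```python
import torch

def mm_pt_step(batch, modality_encoder, input_projector, llm, optimizer):
    """One MM PT step on an X-text pair {I_X, t} (understanding-only setting)."""
    modality_encoder.requires_grad_(False)   # frozen Modality Encoder
    llm.requires_grad_(False)                # frozen LLM Backbone (or PEFT-wrapped)

    with torch.no_grad():
        f_x = modality_encoder(batch["inputs_x"])     # F_X = ME_X(I_X)
    p_x = input_projector(f_x)                        # P_X, the only trainable path here

    # Eq. (2): next-token prediction on the caption t, conditioned on P_X.
    # `llm` is a stand-in that consumes soft prompts and returns a LM loss.
    loss = llm(prompts=p_x, text_ids=batch["text_ids"], labels=batch["text_ids"])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```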
3.2 MM IT

MM IT is a methodology that entails the fine-tuning of pre-trained MM-LLMs using a set of instruction-formatted datasets (Wei et al., 2021). Through this tuning process, MM-LLMs can generalize to unseen tasks by adhering to new instructions, thereby enhancing zero-shot performance. This straightforward yet impactful concept has catalyzed the success of subsequent endeavors in the field of NLP, exemplified by works such as InstructGPT (Ouyang et al., 2022), OPT-IML (Iyer et al., 2022), and InstructBLIP (Dai et al., 2023).

MM IT comprises Supervised Fine-Tuning (SFT) and Reinforcement Learning from Human Feedback (RLHF), aiming to align with human intents or preferences and enhance the interaction capabilities of MM-LLMs. SFT converts part of the PT stage data into an instruction-aware format. Using visual Question-Answer (QA) as an example, various templates may be employed, like (1) "<Image>{Question} A short answer to the question is"; (2) "<Image>Examine the image and respond to the following question with a brief answer: {Question}. Answer:"; and so on. Next, it fine-tunes the pre-trained MM-LLMs using the same optimization objectives. The SFT dataset can be structured as either single-turn QA or multi-turn dialogues.

After SFT, RLHF involves further fine-tuning of the model, relying on feedback regarding the MM-LLMs' responses (e.g., Natural Language Feedback (NLF) labeled manually or automatically) (Sun et al., 2023). This process employs a reinforcement learning algorithm to effectively integrate the non-differentiable NLF. The model is trained to generate corresponding responses conditioned on the NLF (Chen et al., 2023h; Akyürek et al., 2023). The statistics for SFT and RLHF datasets are presented in Table 4 of Appendix F.

The datasets used by existing MM-LLMs in the MM PT and MM IT stages are diverse, but they are all subsets of the datasets in Tables 3 and 4.

4 SOTA MM-LLMs

Based on the previously defined design formulations, we conduct a comprehensive comparison of the architectures and training dataset scales for 26 SOTA MM-LLMs, as illustrated in Table 1. Subsequently, we will provide a concise introduction to the core contributions of these models and summarize their developmental trends.

(1) Flamingo (Alayrac et al., 2022) represents a series of Visual Language (VL) Models designed for processing interleaved visual data and text, generating free-form text as the output. (2) BLIP-2 (Li et al., 2023c) introduces a more resource-efficient framework, comprising the lightweight Q-Former to bridge modality gaps and the utilization of frozen LLMs. Leveraging LLMs, BLIP-2 can be guided for zero-shot image-to-text generation using natural language prompts. (3) LLaVA (Liu et al., 2023e) pioneers the transfer of IT techniques to the MM domain. Addressing data scarcity, LLaVA introduces a novel open-source MM instruction-following dataset created using ChatGPT/GPT-4, alongside the MM instruction-following benchmark, LLaVA-Bench. (4) MiniGPT-4 (Zhu et al., 2023a) proposes a streamlined approach where training only one linear layer aligns the pre-trained vision encoder with the LLM. This efficient method enables the replication of the exhibited capabilities of GPT-4. (5) mPLUG-Owl (Ye et al., 2023) presents a novel modularized training framework for MM-LLMs, incorporating the visual context. To assess different models' performance in MM tasks, the framework includes an instructional evaluation dataset called OwlEval. (6) X-LLM (Chen et al., 2023b) is expanded to various modalities, including audio, and demonstrates strong scalability. Leveraging the language transferability of the Q-Former, X-LLM is successfully applied in the context of Sino-Tibetan Chinese. (7) VideoChat (Li et al., 2023d) pioneers an efficient chat-centric MM-LLM for video understanding dialogue, setting standards for future research in this domain and offering protocols for both academia and industry. (8) InstructBLIP (Dai et al., 2023) is trained based on the pre-trained BLIP-2 model, updating only the Q-Former during MM IT. By introducing instruction-aware visual feature extraction and corresponding instructions, the model enables the extraction of flexible and diverse features. (9) PandaGPT (Su et al., 2023) is a pioneering general-purpose model with the capability to comprehend and act upon instructions across 6 different modalities: text, image/video, audio, thermal, depth, and inertial measurement units. (10) PaLI-X (Chen et al., 2023g) is trained using mixed VL objectives and unimodal objectives, including prefix completion and masked-token completion. This approach proves effective for both downstream task results and achieving the Pareto frontier in the fine-tuning setting. (11) Video-LLaMA (Zhang et al., 2023e) introduces a multi-branch cross-modal PT framework, enabling LLMs to simultaneously process the vision and audio content of a given video while engaging in conversations with humans. This framework aligns vision with language as well as audio with language. (12) Video-ChatGPT (Maaz et al., 2023) is a model specifically designed for video conversations, capable of generating discussions about videos by integrating spatiotemporal vision representations.
Model I→O Modality Encoder Input Projector LLM Backbone Output Projector Modality Generator #.PT #.IT
Flamingo I+V+T→T I/V: NFNet-F6 Cross-attention Chinchilla-1.4B/7B/70B (Frozen) – – – –
BLIP-2 I+T→T I: CLIP/Eva-CLIP ViT@224 Q-Former w/ Linear Projector Flan-T5/OPT (Frozen) – – 129M –
LLaVA I+T→T I: CLIP ViT-L/14 Linear Projector Vicuna-7B/13B (PT: Frozen; IT: PEFT) – – – –
MiniGPT-4 I+T→T I: Eva-CLIP ViT-G/14 Q-Former w/ Linear Projector Vicuna-13B (PT: Frozen; IT: PEFT) – – – –
mPLUG-Owl I+T→T I: CLIP ViT-L/14 Cross-attention LLaMA-7B(PT: Frozen; IT: PEFT) – – – –
X-LLM I+V+A+T→T I/V: ViT-G; A: C-Former Q-Former w/ Linear Projector ChatGLM-6B (Frozen) – – – –
VideoChat V+T→T I: ViT-G Q-Former w/ Linear Projector Vicuna (Frozen) – – – –
InstructBLIP I+V+T→T I/V: ViT-G/14@224 Q-Former w/ Linear Projector Flan-T5/Vicuna (Frozen) – – 129M 1.2M
PandaGPT I+T→T I: ImageBind Linear Projector Vicuna-13B (PEFT) – – – –
PaLI-X I+T→T I: ViT Linear Projector UL2-32B (PEFT) – – – –
Video-LLaMA I+V+A+T→T I/V: EVA-CLIP ViT-G/14; A: ImageBind Q-Former w/ Linear Projector Vicuna/LLaMA (Frozen) – – – –
Video-ChatGPT V+T→T I: CLIP ViT-L/14 Linear Projector Vicuna-v1.1 (Initialized with LLaVA, Frozen) – – – –
Shikra I+T→T I: CLIP ViT-L/14@224 Linear Projector Vicuna-7B/13B (PEFT) – – 600K 5.5M
DLP I+T→T I: CLIP/Eva-CLIP ViT Q-Former+P-Former w/ Linear Projector OPT/Flan-T5 (Frozen) – – – –
BuboGPT I+A+T→T I: CLIP/Eva-CLIP ViT; A: ImageBind Q-Former w/ Linear Projector Vicuna (Frozen) – – – –
ChatSpot I+T→T I: CLIP ViT-L/14 Linear Projector Vicuna-7B/LLaMA (PT: Frozen; IT: PEFT) – – – –
Qwen-VL-(Chat) I+T→T I: ViT@448 initialized from OpenClip’s ViT-bigG Cross-attention Qwen-7B (PT: Frozen; IT: PEFT) – – 1.4B† 50M†
NExT-GPT I+V+A+T→I+V+A+T I/V/A: ImageBind Linear Projector Vicuna-7B (PEFT) Tiny Transformer I: Stable Diffusion; V: Zeroscope; A: AudioLDM – –
MiniGPT-5 I+T→I+T I: Eva-CLIP ViT-G/14 Q-Former w/ Linear Projector Vicuna-7B (PEFT) Tiny Transformer w/ MLP I: StableDiffusion-2 – –
LLaVA-1.5 I+T→T I: CLIP ViT-L@336 MLP Vicuna-v1.5-7B/13B (PT: Frozen; IT: PEFT) – – 0.6M 0.7M
MiniGPT-v2 I+T→T I: Eva-CLIP ViT@448 Linear Projector LLaMA-2-Chat-7B (PEFT) – – – –
CogVLM I+T→T I: Eva-2-CLIP ViT MLP Vicuna-v1.5-7B (PEFT) – – – –
DRESS I+T→T I:Eva-CLIP ViT-G/14 Linear Projector Vicuna-v1.5-13B (PEFT) – – – –
X-InstructBLIP I+V+A+3D+T→T I/V: Eva-CLIP ViT-G/14; A: BEATs; 3D: ULIP-2 Q-Former w/ Linear Projector Vicuna-v1.1-7B/13B (Frozen) – – – –
CoDi-2 I+V+A+T→I+V+A+T I/V/A: ImageBind MLP LLaMA-2-Chat-7B (PT: Frozen; IT: PEFT) MLP I: Stable Diffusion-2.1; V: Zeroscope-v2; A: AudioLDM-2 – –
VILA I+T→T I: ViT@336 Linear Projector LLaMA-2-7B/13B (PEFT) – – 50M 1M

Table 1: The summary of 26 mainstream MM-LLMs. I→O: Input to Output Modalities, I: Image, V: Video, A:
Audio, 3D: Point Cloud, and T: Text. In Modality Encoder, “-L” represents Large, “-G” represents Giant, “/14”
indicates a patch size of 14, and “@224” signifies an image resolution of 224 × 224. #.PT and #.IT represent the
scale of the dataset during MM PT and MM IT, respectively. † includes in-house data that is not publicly accessible.

(13) Shikra (Chen et al., 2023d) introduces a simple and unified pre-trained MM-LLM tailored for Referential Dialogue, a task involving discussions about regions and objects in images. This model demonstrates commendable generalization ability, effectively handling unseen settings. (14) DLP (Jian et al., 2023) proposes the P-Former to predict the ideal prompt, trained on a dataset of single-modal sentences. This showcases the feasibility of single-modal training to enhance MM learning. (15) BuboGPT (Zhao et al., 2023d) is a model constructed by learning a shared semantic space for a comprehensive understanding of MM content. It explores fine-grained relationships among different modalities such as image, text, and audio. (16) ChatSpot (Zhao et al., 2023b) introduces a simple yet potent method for finely adjusting precise referring instructions for MM-LLMs, facilitating fine-grained interactions. The incorporation of precise referring instructions, consisting of image- and region-level instructions, enhances the integration of multi-grained VL task descriptions. (17) Qwen-VL (Bai et al., 2023b) is a multi-lingual MM-LLM that supports both English and Chinese. Qwen-VL also allows the input of multiple images during the training phase, improving its ability to understand the vision context. (18) NExT-GPT (Wu et al., 2023d) is an end-to-end, general-purpose any-to-any MM-LLM that supports the free input and output of image, video, audio, and text. It employs a lightweight alignment strategy, utilizing LLM-centric alignment in the encoding phase and instruction-following alignment in the decoding phase. (19) MiniGPT-5 (Zheng et al., 2023b) is an MM-LLM that combines inversion to generative vokens with integration into Stable Diffusion. It excels at producing interleaved VL outputs for MM generation. The inclusion of classifier-free guidance during the training phase enhances the quality of generation.

For an introduction to the remaining seven MM-LLMs, please refer to Appendix D, which includes (20) LLaVA-1.5 (Liu et al., 2023d), (21) MiniGPT-v2 (Chen et al., 2023c), (22) CogVLM (Wang et al., 2023), (23) DRESS (Chen et al., 2023h), (24) X-InstructBLIP (Panagopoulou et al., 2023), (25) CoDi-2 (Tang et al., 2023a), and (26) VILA (Lin et al., 2023).

Trends in Existing MM-LLMs: (1) Progressing from a dedicated emphasis on MM understanding to the generation of specific modalities, and further evolving into any-to-any modality conversion (e.g., MiniGPT-4 → MiniGPT-5 → NExT-GPT); (2) Advancing from MM PT to SFT and then to RLHF, the training pipeline undergoes continuous refinement, striving to better align with human intent and enhance the model's conversational interaction capabilities (e.g., BLIP-2 → InstructBLIP → DRESS); (3) Embracing Diversified Modal Extensions (e.g., BLIP-2 → X-LLM and InstructBLIP → X-InstructBLIP); (4) Incorporating a Higher-Quality Training Dataset (e.g., LLaVA → LLaVA-1.5); (5) Adopting a More Efficient Model Architecture, transitioning from complex Q- and P-Former input projector modules in BLIP-2 and DLP to a simpler yet effective linear projector in VILA.

5 Benchmarks and Performance

To offer a comprehensive performance comparison, we have compiled a table featuring major MM-LLMs across 18 VL benchmarks gathered from various papers (Li et al., 2023c; Chen et al., 2023c,e; Lin et al., 2023), as shown in Table 2.
Model LLM Backbone OKVQA IconVQA VQAv2 GQA VizWiz SQAI VQAT POPE MMEP MMEC MMB MMBCN SEEDI LLaVAW MM-Vet QBench HM VSR
Flamingo Chinchilla-7B 44.7 – – – 28.8 – – – – – – – – – – – 57.0 31.8
BLIP-2 Flan-T5XXL (13B) 45.9 40.6 65.0 44.7 19.6 61.0 42.5 85.3 1293.8 290.0 – – 46.4 38.1 22.4 – 53.7 50.9
LLaVA Vicuna-13B 54.4 43.0 – 41.3 – – 38.9 – – – – – – – – – – 51.2
MiniGPT-4 Vicuna-13B 37.5 37.6 – 30.8 – – 19.4 – – – – – – – – – – 41.6
InstructBLIP Vicuna-7B – – – 49.2 34.5 60.5 50.1 – – – 36.0 23.7 53.4 60.9 26.2 56.7 – –
InstructBLIP Vicuna-13B – 44.8 – 49.5 33.4 63.1 50.7 78.9 1212.8 291.8 – – – 58.2 25.6 – 57.5 52.1
Shikra Vicuna-13B 47.2 – 77.4∗ – – – – – – – 58.8 – – – – 54.7 – –
IDEFICS-9B LLaMA-7B – – 50.9 38.4 35.5 – 25.9 – – – 48.2 25.2 – – – – – –
IDEFICS-80B LLaMA-65B – – 60.0 45.2 36.0 – 30.9 – – – 54.5 38.1 – – – – – –
Qwen-VL Qwen-7B – – 78.8∗ 59.3∗ 35.2 67.1 63.8 – – – 38.2 7.4 56.3 – – 59.4 – –
Qwen-VL-Chat Qwen-7B – – 78.2∗ 57.5∗ 38.9 68.2 61.5 – 1487.5 360.7 60.6 56.7 58.2 – – – – –
LLaVA-1.5 Vicuna-1.5-7B – – 78.5∗ 62.0∗ 50.0 66.8 58.2 85.9 1510.7 316.1‡ 64.3 58.3 58.6 63.4 30.5 58.7 – –
+ShareGPT4V Vicuna-1.5-7B – – 80.6 – 57.2 68.4 – – 1567.4 376.4 68.8 62.2 69.7 72.6 37.6 63.4 – –
LLaVA-1.5 Vicuna-1.5-13B – – 80.0∗ 63.3∗ 53.6 71.6 61.3 85.9 1531.3 295.4‡ 67.7 63.6 61.6 70.7 35.4 62.1 – –
MiniGPT-v2 LLaMA-2-Chat-7B 56.9 47.7 – 60.3 30.3 – 51.9 – – – – – – – – – 58.2 60.6
MiniGPT-v2-Chat LLaMA-2-Chat-7B 55.9 49.4 – 58.8 42.4 – 52.3 – – – – – – – – – 59.5 63.3
VILA-7B LLaMA-2-7B – – 79.9∗ 62.3∗ 57.8 68.2 64.4 85.5 1533.0 – 68.9 61.7 61.1 69.7 34.9 – – –
VILA-13B LLaMA-2-13B – – 80.8∗ 63.3∗ 60.6 73.7 66.6 84.2 1570.1 – 70.3 64.3 62.8 73.0 38.8 – – –
+ShareGPT4V LLaMA-2-13B – – 80.6∗ 63.2∗ 62.4 73.1 65.3 84.8 1556.5 – 70.8 65.4 61.4 78.4 45.7 – – –

Table 2: Comparison of mainstream MM-LLMs on 18 VL benchmarks. The red denotes the highest result, and the
blue denotes the second highest result. ‡ indicates ShareGPT4V’s (Chen et al., 2023e) re-implemented test results
missed in benchmarks or origin papers. ∗ The training images of the datasets are observed during training.

The information on these benchmarks can be found in Appendix E. Next, we will extract essential training recipes that boost the effectiveness of MM-LLMs, drawing insights from SOTA models.

Training Recipes  Firstly, higher image resolution can incorporate more visual details for the model, benefiting tasks that require fine-grained details. For example, LLaVA-1.5 and VILA employ a resolution of 336 × 336, while Qwen-VL and MiniGPT-v2 utilize 448 × 448. However, higher resolutions lead to longer token sequences, incurring additional training and inference costs. MiniGPT-v2 addresses this by concatenating 4 adjacent visual tokens in the embedding space to reduce the length. Recently, Monkey (Li et al., 2023h) proposed a solution to enhance the resolution of input images without retraining a high-resolution visual encoder, utilizing only a low-resolution visual encoder and supporting resolutions up to 1300 × 800. To enhance the understanding of rich-text images, tables, and document content, DocPedia (Feng et al., 2023) introduced a method to increase the visual encoder resolution to 2560 × 2560, overcoming the limitations of poorly performing low resolutions in open-sourced ViTs. Secondly, the incorporation of high-quality SFT data can significantly improve performance in specific tasks, as evidenced by the addition of ShareGPT4V data to LLaVA-1.5 and VILA-13B, as shown in Table 2. Moreover, VILA reveals several key findings: (1) Performing PEFT on the LLM Backbone promotes deep embedding alignment, crucial for ICL; (2) Interleaved Image-Text data proves beneficial, whereas Image-Text pairs alone are sub-optimal; (3) Re-blending text-only instruction data (e.g., Unnatural Instructions (Honovich et al., 2022)) with image-text data during SFT not only addresses the degradation of text-only tasks but also enhances VL task accuracy.
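The token-reduction trick mentioned above can be sketched as follows, in the spirit of MiniGPT-v2's concatenation of adjacent visual tokens in the embedding space; the grouping factor, dimensions, and class name are illustrative assumptions rather than the exact published implementation.

```python
import torch
import torch.nn as nn

class TokenMerger(nn.Module):
    """Concatenate k adjacent visual tokens and project them back to d_llm,
    shrinking the visual sequence (and thus training/inference cost) by a factor of k."""
    def __init__(self, d_llm: int = 4096, k: int = 4):
        super().__init__()
        self.k = k
        self.proj = nn.Linear(k * d_llm, d_llm)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:  # (B, N, d_llm)
        b, n, d = tokens.shape
        n = (n // self.k) * self.k                 # drop any remainder for simplicity
        grouped = tokens[:, :n, :].reshape(b, n // self.k, self.k * d)
        return self.proj(grouped)                  # (B, N/k, d_llm)

# e.g., a high-resolution image yielding 1024 visual tokens is reduced to 256 tokens with k=4.
```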
6 Future Directions

In this section, we explore promising future directions for MM-LLMs across the following aspects:

More Powerful Models  We can enhance the MM-LLMs' strength from the following four key avenues: (1) Expanding Modalities: Current MM-LLMs typically support the following modalities: image, video, audio, 3D, and text. However, the real world involves a broader range of modalities. Extending MM-LLMs to accommodate additional modalities (e.g., web pages, heat maps, and figures&tables) will increase the model's versatility, making it more universally applicable; (2) Diversifying LLMs: Incorporating various types and sizes of LLMs provides practitioners with the flexibility to select the most appropriate one based on their specific requirements; (3) Improving MM IT Dataset Quality: Current MM IT datasets have ample room for improvement and expansion. Diversifying the range of instructions can enhance the effectiveness of MM-LLMs in understanding and executing user commands; (4) Strengthening MM Generation Capabilities: Most current MM-LLMs are predominantly oriented towards MM understanding. Although some models have incorporated MM generation capabilities, the quality of generated responses may be constrained by the capacities of the LDMs. Exploring the integration of retrieval-based approaches (Asai et al., 2023) holds significant promise in complementing the generative process, potentially enhancing the overall performance of the model.

More Challenging Benchmarks  Existing benchmarks might not adequately challenge the capabilities of MM-LLMs, given that many datasets have previously appeared to varying degrees in the PT or IT sets. This implies that the models may have learned these tasks during training. Moreover, current benchmarks predominantly concentrate on the VL sub-field. Thus, it is crucial for the development of MM-LLMs to construct a more challenging, larger-scale benchmark that includes more modalities and uses a unified evaluation standard. Concurrently, benchmarks can be tailored to assess the MM-LLMs' proficiency in practical applications. For instance, the introduction of GOAT-Bench (Lin et al., 2024) aims to evaluate various MM-LLMs' capacity to discern and respond to nuanced aspects of social abuse presented in memes.

Mobile/Lightweight Deployment  To deploy MM-LLMs on resource-constrained platforms, such as low-power mobile and IoT devices, while achieving optimal performance, lightweight implementations are of paramount importance. A notable advancement in this realm is MobileVLM (Chu et al., 2023a). This approach strategically downscales LLaMA, allowing for seamless off-the-shelf deployment. MobileVLM further introduces a Lightweight Downsample Projector, consisting of fewer than 20 million parameters, contributing to improved computational speed. Nevertheless, this avenue necessitates additional exploration for further advancements in development.

Embodied Intelligence  Embodied intelligence aims to replicate human-like perception and interaction with the surroundings by effectively understanding the environment, recognizing pertinent objects, assessing their spatial relationships, and devising a comprehensive task plan (Firoozi et al., 2023). Embodied AI tasks, such as embodied planning, embodied visual question answering, and embodied control, equip robots to autonomously implement extended plans by leveraging real-time observations. Typical works in this area are PaLM-E (Driess et al., 2023) and EmbodiedGPT (Mu et al., 2023). PaLM-E introduces a multi-embodiment agent through the training of an MM-LLM. Beyond functioning solely as an embodied decision maker, PaLM-E also demonstrates proficiency in handling general VL tasks. EmbodiedGPT introduces an economically efficient method characterized by a CoT approach, enhancing the capability of embodied agents to engage with the real world and establishing a closed loop that connects high-level planning with low-level control. While MM-LLM-based Embodied Intelligence has made advancements in integrating with robots, further exploration is needed to enhance the autonomy of robots.

Continual IT  In practical applications, MM-LLMs are expected to adapt to new MM tasks to support additional functionalities. Nevertheless, current MM-LLMs remain static and are unable to adjust to continuously emerging requirements. Therefore, an approach is needed to make the model flexible enough to efficiently and continually leverage emerging data, while avoiding the substantial cost of retraining MM-LLMs. This aligns with the principles of continual learning, where models are designed to incrementally learn new tasks, similar to human learning. Continual IT aims to continuously fine-tune MM-LLMs for new MM tasks while maintaining superior performance on tasks learned during the original MM IT stage. It introduces two primary challenges: (1) catastrophic forgetting, where models forget previous knowledge when learning new tasks (Robins, 1995; McCloskey and Cohen, 1989; Goodfellow et al., 2013; Zhang et al., 2023d,c,b; Zheng et al., 2023a), and (2) negative forward transfer, indicating that the performance on unseen tasks declines when learning new ones (Zheng et al., 2024; Dong et al., 2023b,a). Recently, He et al. (2023) established a benchmark to facilitate the development of continual IT for MM-LLMs. Despite these advancements, there is still significant opportunity and room for improvement in developing better methods to address the challenges of catastrophic forgetting and negative forward transfer.

7 Conclusion

In this paper, we have presented a comprehensive survey of MM-LLMs with a focus on recent advancements. Initially, we categorize the model architecture into five components, providing a detailed overview of general design formulations and training pipelines. Subsequently, we introduce various SOTA MM-LLMs, each distinguished by its specific formulations. Our survey also sheds light on their capabilities across diverse MM benchmarks and envisions future developments in this rapidly evolving field. We hope this survey can provide insights for researchers, contributing to the ongoing advancements in the MM-LLMs domain.
Limitations

In this paper, we embark on a comprehensive exploration of the current MM-LLMs landscape, presenting a synthesis from diverse perspectives enriched by our insights. Acknowledging the dynamic nature of this field, it is plausible that certain aspects may have eluded our scrutiny, and recent advances might not be entirely encapsulated. To tackle this inherent challenge, we've established a dedicated website for real-time tracking, using crowdsourcing to capture the latest advancements. Our goal is for this platform to evolve into a continuous source of contributions propelling ongoing development in the field. Given the constraints of page limits, we are unable to delve into all technical details and have provided concise overviews of the core contributions of mainstream MM-LLMs. Looking ahead, we commit to vigilant monitoring and continual enhancement of relevant details on our website, incorporating fresh insights as they emerge.

References

Emanuele Aiello, Lili Yu, Yixin Nie, Armen Aghajanyan, and Barlas Oguz. 2023. Jointly Training Large Autoregressive Multimodal Models. arXiv preprint arXiv:2309.15564.
Hassan Akbari, Liangzhe Yuan, Rui Qian, Wei-Hong Chuang, Shih-Fu Chang, Yin Cui, and Boqing Gong. 2021. Vatt: Transformers for multimodal self-supervised learning from raw video, audio and text. Advances in Neural Information Processing Systems, 34:24206–24221.
Afra Feyza Akyürek, Ekin Akyürek, Aman Madaan, Ashwin Kalyan, Peter Clark, Derry Wijaya, and Niket Tandon. 2023. RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs. arXiv preprint arXiv:2305.08844.
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. 2022. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.
Akari Asai, Sewon Min, Zexuan Zhong, and Danqi Chen. 2023. Retrieval-based language models and applications. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 6: Tutorial Abstracts), pages 41–46.
Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. 2023. Openflamingo: An open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390.
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. 2023a. Qwen technical report. arXiv preprint arXiv:2309.16609.
Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023b. Qwen-VL: A Frontier Large Vision-Language Model with Versatile Abilities. CoRR, abs/2308.12966.
Max Bain, Arsha Nagrani, Gül Varol, and Andrew Zisserman. 2021. Frozen in time: A joint video and image encoder for end-to-end retrieval. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1728–1738.
Rohan Bavishi, Erich Elsen, Curtis Hawthorne, Maxwell Nye, Augustus Odena, Arushi Somani, and Sağnak Taşırlar. 2023. Introducing our Multimodal Models.
Ali Furkan Biten, Ron Litman, Yusheng Xie, Srikar Appalaraju, and R Manmatha. 2022. Latr: Layout-aware transformer for scene-text vqa. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16548–16558.
Andy Brock, Soham De, Samuel L Smith, and Karen Simonyan. 2021. High-performance large-scale image recognition without normalization. In International Conference on Machine Learning, pages 1059–1071. PMLR.
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901.
Minwoo Byeon, Beomhee Park, Haecheon Kim, Sungjun Lee, Woonhyuk Baek, and Saehoon Kim. 2022. Coyo-700m: Image-text pair dataset.
Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. 2015. Activitynet: A large-scale video benchmark for human activity understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 961–970.
Cerspense. 2023. Zeroscope: Diffusion-based text-to-video synthesis.
Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. 2021. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3558–3568.
Fei-Long Chen, Du-Zhen Zhang, Ming-Lun Han, Xiu-Yi Chen, Jing Shi, Shuang Xu, and Bo Xu. 2023a. Vlp: A survey on vision-language pre-training. Machine Intelligence Research, 20(1):38–56.
Feilong Chen, Minglun Han, Haozhi Zhao, Qingyang Zhang, Jing Shi, Shuang Xu, and Bo Xu. 2023b. X-llm: Bootstrapping advanced large language models by treating multi-modalities as foreign languages. arXiv preprint arXiv:2305.04160.
Jun Chen, Deyao Zhu, Xiaoqian Shen, Xiang Li, Zechun Liu, Pengchuan Zhang, Raghuraman Krishnamoorthi, Vikas Chandra, Yunyang Xiong, and Mohamed Elhoseiny. 2023c. Minigpt-v2: large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478.
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. 2023d. Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic. arXiv preprint arXiv:2306.15195.
Lin Chen, Jisong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, and Dahua Lin. 2023e. ShareGPT4V: Improving Large Multi-Modal Models with Better Captions. arXiv preprint arXiv:2311.12793.
Sanyuan Chen, Yu Wu, Chengyi Wang, Shujie Liu, Daniel Tompkins, Zhuo Chen, Wanxiang Che, Xiangzhan Yu, and Furu Wei. 2023f. BEATs: Audio Pre-Training with Acoustic Tokenizers. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, pages 5178–5193.
Shoufa Chen, Chongjian Ge, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping Luo. 2022a. Adaptformer: Adapting vision transformers for scalable visual recognition. Advances in Neural Information Processing Systems, 35:16664–16678.
Xi Chen, Josip Djolonga, Piotr Padlewski, Basil Mustafa, Soravit Changpinyo, Jialin Wu, Carlos Riquelme Ruiz, Sebastian Goodman, Xiao Wang, Yi Tay, et al. 2023g. PaLI-X: On Scaling up a Multilingual Vision and Language Model. arXiv preprint arXiv:2305.18565.
Xi Chen, Xiao Wang, Soravit Changpinyo, AJ Piergiovanni, Piotr Padlewski, Daniel Salz, Sebastian Goodman, Adam Grycner, Basil Mustafa, Lucas Beyer, et al. 2022b. Pali: A jointly-scaled multilingual language-image model. arXiv preprint arXiv:2209.06794.
Xinlei Chen, Hao Fang, Tsung-Yi Lin, Ramakrishna Vedantam, Saurabh Gupta, Piotr Dollár, and C Lawrence Zitnick. 2015. Microsoft coco captions: Data collection and evaluation server. arXiv preprint arXiv:1504.00325.
Yangyi Chen, Karan Sikka, Michael Cogswell, Heng Ji, and Ajay Divakaran. 2023h. Dress: Instructing large vision-language models to align and interact with humans via natural language feedback. arXiv preprint arXiv:2311.10081.
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. 2023. Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality.
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. 2023. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113.
Xiangxiang Chu, Limeng Qiao, Xinyang Lin, Shuang Xu, Yang Yang, Yiming Hu, Fei Wei, Xinyu Zhang, Bo Zhang, Xiaolin Wei, et al. 2023a. Mobilevlm: A fast, reproducible and strong vision language assistant for mobile devices. arXiv preprint arXiv:2312.16886.
Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023b. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919.
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. 2022. Scaling instruction-finetuned language models. arXiv preprint arXiv:2210.11416.
Can Cui, Yunsheng Ma, Xu Cao, Wenqian Ye, Yang Zhou, Kaizhao Liang, Jintai Chen, Juanwu Lu, Zichong Yang, Kuei-Da Liao, et al. 2024. A survey on multimodal large language models for autonomous driving. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 958–979.
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, and Steven C. H. Hoi. 2023. InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning. In Thirty-seventh Conference on Neural Information Processing Systems.
Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, and Luke Zettlemoyer. 2023. Qlora: Efficient finetuning of quantized llms. arXiv preprint arXiv:2305.14314.
Jiahua Dong, Wenqi Liang, Yang Cong, and Gan Sun. 2023a. Heterogeneous forgetting compensation for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11742–11751.
Jiahua Dong, Duzhen Zhang, Yang Cong, Wei Cong, Henghui Ding, and Dengxin Dai. 2023b. Federated Incremental Semantic Segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3934–3943.
Linhao Dong and Bo Xu. 2020. Cif: Continuous integrate-and-fire for end-to-end speech recognition. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6079–6083. IEEE.
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. 2020. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale. In International Conference on Learning Representations.
Danny Driess, Fei Xia, Mehdi SM Sajjadi, Corey Lynch, Aakanksha Chowdhery, Brian Ichter, Ayzaan Wahid, Jonathan Tompson, Quan Vuong, Tianhe Yu, et al. 2023. Palm-e: An embodied multimodal language model. arXiv preprint arXiv:2303.03378.
Yifan Du, Zikang Liu, Junyi Li, and Wayne Xin Zhao. 2022a. A Survey of Vision-Language Pre-Trained Models. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI 2022, Vienna, Austria, 23-29 July 2022, pages 5436–5443.
Zhengxiao Du, Yujie Qian, Xiao Liu, Ming Ding, Jiezhong Qiu, Zhilin Yang, and Jie Tang. 2022b. GLM: General Language Model Pretraining with Autoregressive Blank Infilling. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 320–335.
Han Fang, Pengfei Xiong, Luhui Xu, and Yu Chen. 2021. Clip2video: Mastering video-text retrieval via image clip. arXiv preprint arXiv:2106.11097.
Yuxin Fang, Wen Wang, Binhui Xie, Quan Sun, Ledell Wu, Xinggang Wang, Tiejun Huang, Xinlong Wang, and Yue Cao. 2023. Eva: Exploring the limits of masked visual representation learning at scale. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19358–19369.
Hao Feng, Qi Liu, Hao Liu, Wengang Zhou, Houqiang Li, and Can Huang. 2023. DocPedia: Unleashing the Power of Large Multimodal Model in the Frequency Domain for Versatile Document Understanding. arXiv preprint arXiv:2311.11810.
Roya Firoozi, Johnathan Tucker, Stephen Tian, Anirudha Majumdar, Jiankai Sun, Weiyu Liu, Yuke Zhu, Shuran Song, Ashish Kapoor, Karol Hausman, et al. 2023. Foundation Models in Robotics: Applications, Challenges, and the Future. arXiv preprint arXiv:2312.07843.
Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, et al. 2023. Mme: A comprehensive evaluation benchmark for multimodal large language models. arXiv preprint arXiv:2306.13394.
Chin-Lun Fu, Zih-Ching Chen, Yun-Ru Lee, and Hung-Yi Lee. 2022. AdapterBias: Parameter-efficient Token-dependent Representation Shift for Adapters in NLP Tasks. In Findings of the Association for Computational Linguistics: NAACL 2022, pages 2608–2621.
Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. 2023. Datacomp: In search of the next generation of multimodal datasets. arXiv preprint arXiv:2304.14108.
Rohit Girdhar, Alaaeldin El-Nouby, Zhuang Liu, Mannat Singh, Kalyan Vasudev Alwala, Armand Joulin, and Ishan Misra. 2023. Imagebind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190.
Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, and Kai Chen. 2023. Multimodal-gpt: A vision and language model for dialogue with humans. arXiv preprint arXiv:2305.04790.
Ian J Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. 2013. An empirical investigation of catastrophic forgetting in gradient-based neural networks. arXiv preprint arXiv:1312.6211.
Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. 2017. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6904–6913.
Jiaxi Gu, Xiaojun Meng, Guansong Lu, Lu Hou, Niu Minzhe, Xiaodan Liang, Lewei Yao, Runhui Huang, Wei Zhang, Xin Jiang, et al. 2022. Wukong: A 100 million large-scale chinese cross-modal pre-training benchmark. Advances in Neural Information Processing Systems, 35:26418–26431.
Danna Gurari, Qing Li, Abigale J Stangl, Anhong Guo, Chi Lin, Kristen Grauman, Jiebo Luo, and Jeffrey P Bigham. 2018. Vizwiz grand challenge: Answering visual questions from blind people. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3617.
Jinghan He, Haiyun Guo, Ming Tang, and Jinqiao Wang. 2023. Continual instruction tuning for large multimodal models. arXiv preprint arXiv:2311.16206.
Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. 2021. Towards a Unified View of Parameter-Efficient Transfer Learning. In International Conference on Learning Representations.
Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778.
Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, et al. 2022. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556.
Or Honovich, Thomas Scialom, Omer Levy, and Timo Schick. 2022. Unnatural instructions: Tuning language models with (almost) no human labor. arXiv preprint arXiv:2212.09689.
Neil Houlsby, Andrei Giurgiu, Stanislaw Jastrzebski, Bruna Morrone, Quentin De Laroussilhe, Andrea Gesmundo, Mona Attariyan, and Sylvain Gelly. 2019. Parameter-efficient transfer learning for NLP. In International Conference on Machine Learning, pages 2790–2799. PMLR.
Wei-Ning Hsu, Benjamin Bolte, Yao-Hung Hubert Tsai, Kushal Lakhotia, Ruslan Salakhutdinov, and Abdelrahman Mohamed. 2021. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460.
Edward J Hu, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, et al. 2021. LoRA: Low-Rank Adaptation of Large Language Models. In International Conference on Learning Representations.
Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, and Shijian Lu. 2023a. Visual Instruction Tuning towards General-Purpose Multimodal Model: A Survey. arXiv preprint arXiv:2312.16602.
Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. 2023b. Audiogpt: Understanding and generating speech, music, sound, and talking head. arXiv preprint arXiv:2304.12995.
Shaohan Huang, Li Dong, Wenhui Wang, Yaru Hao, Saksham Singhal, Shuming Ma, Tengchao Lv, Lei Cui, Owais Khan Mohammed, Qiang Liu, et al. 2023c. Language is not all you need: Aligning perception with language models. arXiv preprint arXiv:2302.14045.
Drew A Hudson and Christopher D Manning. 2019. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6700–6709.
IDEFICS. 2023. Introducing IDEFICS: An Open Reproduction of State-of-the-Art Visual Language Model.
Srinivasan Iyer, Xi Victoria Lin, Ramakanth Pasunuru, Todor Mihaylov, Daniel Simig, Ping Yu, Kurt Shuster, Tianlu Wang, Qing Liu, Punit Singh Koura, et al. 2022. Opt-iml: Scaling language model instruction meta learning through the lens of generalization. arXiv preprint arXiv:2212.12017.
Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. 2021. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.
Yiren Jian, Chongyang Gao, and Soroush Vosoughi. 2023. Bootstrapping Vision-Language Learning with Decoupled Language Pre-training. In Thirty-seventh Conference on Neural Information Processing Systems.
Kushal Kafle, Brian Price, Scott Cohen, and Christopher Kanan. 2018. Dvqa: Understanding data visualizations via question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5648–5656.
Rabeeh Karimi Mahabadi, James Henderson, and Sebastian Ruder. 2021. Compacter: Efficient low-rank hypercomplex adapter layers. Advances in Neural Information Processing Systems, 34:1022–1035.
Sahar Kazemzadeh, Vicente Ordonez, Mark Matten, and Tamara Berg. 2014. Referitgame: Referring to objects in photographs of natural scenes. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 787–798.
Jacob Devlin Ming-Wei Chang Kenton and Lee Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of NAACL-HLT, pages 4171–4186.
Douwe Kiela, Hamed Firooz, Aravind Mohan, Vedanuj Goswami, Amanpreet Singh, Pratik Ringshia, and Davide Testuggine. 2020. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33:2611–2624.
Diederik P Kingma and Max Welling. 2013. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114.
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A Shamma, et al. 2017. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 123:32–73.
Brian Lester, Rami Al-Rfou, and Noah Constant. 2021. The Power of Scale for Parameter-Efficient Prompt Tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,
duction of State-of-the-Art Visual Language Model. pages 3045–3059.
Bo Li, Yuanhan Zhang, Liangyu Chen, Jinghao Wang, Zhang Li, Biao Yang, Qiang Liu, Zhiyin Ma, Shuo
Fanyi Pu, Jingkang Yang, Chunyuan Li, and Ziwei Zhang, Jingxu Yang, Yabo Sun, Yuliang Liu, and
Liu. 2023a. Mimic-it: Multi-modal in-context in- Xiang Bai. 2023h. Monkey: Image Resolution and
struction tuning. arXiv preprint arXiv:2306.05425. Text Label Are Important Things for Large Multi-
modal Models. arXiv preprint arXiv:2311.06607.
Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yix-
iao Ge, and Ying Shan. 2023b. Seed-bench: Bench- Hongzhan Lin, Ziyang Luo, Bo Wang, Ruichao Yang,
marking multimodal llms with generative compre- and Jing Ma. 2024. GOAT-Bench: Safety Insights
hension. arXiv preprint arXiv:2307.16125. to Large Multimodal Models through Meme-Based
Social Abuse. arXiv preprint arXiv:2401.01523.
Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H.
Hoi. 2023c. BLIP-2: Bootstrapping Language-Image Ji Lin, Hongxu Yin, Wei Ping, Yao Lu, Pavlo
Pre-training with Frozen Image Encoders and Large Molchanov, Andrew Tao, Huizi Mao, Jan Kautz,
Language Models. In International Conference on Mohammad Shoeybi, and Song Han. 2023. VILA:
Machine Learning, ICML 2023, 23-29 July 2023, On Pre-training for Visual Language Models. arXiv
Honolulu, Hawaii, USA, pages 19730–19742. preprint arXiv:2312.07533.
Junnan Li, Dongxu Li, Caiming Xiong, and Steven
Hoi. 2022. Blip: Bootstrapping language-image pre- Tsung-Yi Lin, Michael Maire, Serge Belongie, James
training for unified vision-language understanding Hays, Pietro Perona, Deva Ramanan, Piotr Dollár,
and generation. In International Conference on Ma- and C Lawrence Zitnick. 2014. Microsoft coco:
chine Learning, pages 12888–12900. PMLR. Common objects in context. In Computer Vision–
ECCV 2014: 13th European Conference, Zurich,
Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Switzerland, September 6-12, 2014, Proceedings,
Shafiq Joty, Caiming Xiong, and Steven Chu Hong Part V 13, pages 740–755. Springer.
Hoi. 2021. Align before fuse: Vision and language
representation learning with momentum distillation. Fangyu Liu, Guy Emerson, and Nigel Collier. 2023a.
Advances in neural information processing systems, Visual spatial reasoning. Transactions of the Associ-
34:9694–9705. ation for Computational Linguistics, 11:635–651.

KunChang Li, Yinan He, Yi Wang, Yizhuo Li, Wen- Haohe Liu, Zehua Chen, Yi Yuan, Xinhao Mei, Xubo
hai Wang, Ping Luo, Yali Wang, Limin Wang, and Liu, Danilo P. Mandic, Wenwu Wang, and Mark D.
Yu Qiao. 2023d. Videochat: Chat-centric video un- Plumbley. 2023b. AudioLDM: Text-to-Audio Gener-
derstanding. arXiv preprint arXiv:2305.06355. ation with Latent Diffusion Models. In International
Conference on Machine Learning, ICML 2023, 23-
Lei Li, Yuwei Yin, Shicheng Li, Liang Chen, Peiyi 29 July 2023, Honolulu, Hawaii, USA, pages 21450–
Wang, Shuhuai Ren, Mukai Li, Yazheng Yang, 21474.
Jingjing Xu, Xu Sun, et al. 2023e. M3 IT: A Large-
Scale Dataset towards Multi-Modal Multilingual In- Haohe Liu, Qiao Tian, Yi Yuan, Xubo Liu, Xinhao
struction Tuning. arXiv preprint arXiv:2306.04387. Mei, Qiuqiang Kong, Yuping Wang, Wenwu Wang,
Yuxuan Wang, and Mark D. Plumbley. 2023c. Audi-
Xiang Lisa Li and Percy Liang. 2021. Prefix-Tuning:
oLDM 2: Learning Holistic Audio Generation with
Optimizing Continuous Prompts for Generation. In
Self-supervised Pretraining. CoRR, abs/2308.05734.
Proceedings of the 59th Annual Meeting of the Asso-
ciation for Computational Linguistics and the 11th Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae
International Joint Conference on Natural Language Lee. 2023d. Improved Baselines with Visual Instruc-
Processing (Volume 1: Long Papers), pages 4582– tion Tuning. In NeurIPS 2023 Workshop on Instruc-
4597. tion Tuning and Instruction Following.
Xiujun Li, Xi Yin, Chunyuan Li, Pengchuan Zhang,
Xiaowei Hu, Lei Zhang, Lijuan Wang, Houdong Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae
Hu, Li Dong, Furu Wei, et al. 2020. Oscar: Object- Lee. 2023e. Visual Instruction Tuning. In Thirty-
semantics aligned pre-training for vision-language seventh Conference on Neural Information Process-
tasks. In Computer Vision–ECCV 2020: 16th Euro- ing Systems.
pean Conference, Glasgow, UK, August 23–28, 2020,
Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li,
Proceedings, Part XXX 16, pages 121–137. Springer.
Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi
Yanda Li, Chi Zhang, Gang Yu, Zhibin Wang, Bin Wang, Conghui He, Ziwei Liu, et al. 2023f. Mm-
Fu, Guosheng Lin, Chunhua Shen, Ling Chen, and bench: Is your multi-modal model an all-around
Yunchao Wei. 2023f. Stablellava: Enhanced visual player? arXiv preprint arXiv:2307.06281.
instruction tuning with synthesized image-dialogue
data. arXiv preprint arXiv:2308.10253. Siqu Long, Feiqi Cao, Soyeon Caren Han, and Haiqin
Yang. 2022. Vision-and-Language Pretrained Mod-
Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, els: A Survey. In Proceedings of the Thirty-First
Wayne Xin Zhao, and Ji-Rong Wen. 2023g. Eval- International Joint Conference on Artificial Intelli-
uating object hallucination in large vision-language gence, IJCAI 2022, Vienna, Austria, 23-29 July 2022,
models. arXiv preprint arXiv:2305.10355. pages 5530–5537.
Pan Lu, Swaroop Mishra, Tanglin Xia, Liang Qiu, Kai- OpenAI. 2023. GPT-4 Technical Report.
Wei Chang, Song-Chun Zhu, Oyvind Tafjord, Peter
Clark, and Ashwin Kalyan. 2022. Learn to explain: Vicente Ordonez, Girish Kulkarni, and Tamara Berg.
Multimodal reasoning via thought chains for science 2011. Im2text: Describing images using 1 million
question answering. Advances in Neural Information captioned photographs. Advances in neural informa-
Processing Systems, 35:2507–2521. tion processing systems, 24.
Pan Lu, Liang Qiu, Jiaqi Chen, Tony Xia, Yizhou Zhao, Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida,
Wei Zhang, Zhou Yu, Xiaodan Liang, and Song-Chun Carroll Wainwright, Pamela Mishkin, Chong Zhang,
Zhu. 2021. Iconqa: A new benchmark for abstract Sandhini Agarwal, Katarina Slama, Alex Ray, et al.
diagram understanding and visual language reason- 2022. Training language models to follow instruc-
ing. In Thirty-fifth Conference on Neural Information tions with human feedback. Advances in Neural
Processing Systems Datasets and Benchmarks Track Information Processing Systems, 35:27730–27744.
(Round 2).
Artemis Panagopoulou, Le Xue, Ning Yu, Junnan Li,
Ziyang Luo, Can Xu, Pu Zhao, Qingfeng Sun, Xiubo Dongxu Li, Shafiq Joty, Ran Xu, Silvio Savarese,
Geng, Wenxiang Hu, Chongyang Tao, Jing Ma, Qing- Caiming Xiong, and Juan Carlos Niebles. 2023. X-
wei Lin, and Daxin Jiang. 2023. WizardCoder: Em- InstructBLIP: A Framework for aligning X-Modal
powering Code Large Language Models with Evol- instruction-aware representations to LLMs and Emer-
Instruct. arXiv preprint arXiv:2306.08568. gent Cross-modal Reasoning. arXiv preprint
Muhammad Maaz, Hanoona Rasheed, Salman Khan, arXiv:2311.18799.
and Fahad Shahbaz Khan. 2023. Video-ChatGPT:
Towards Detailed Video Understanding via Large Zhiliang Peng, Wenhui Wang, Li Dong, Yaru Hao,
Vision and Language Models. arXiv preprint Shaohan Huang, Shuming Ma, and Furu Wei.
arXiv:2306.05424. 2023. Kosmos-2: Grounding Multimodal Large
Language Models to the World. arXiv preprint
Minesh Mathew, Dimosthenis Karatzas, and CV Jawa- arXiv:2306.14824.
har. 2021. Docvqa: A dataset for vqa on document
images. In Proceedings of the IEEE/CVF winter con- Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya
ference on applications of computer vision, pages Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sas-
2200–2209. try, Amanda Askell, Pamela Mishkin, Jack Clark,
et al. 2021. Learning transferable visual models from
Michael McCloskey and Neal J Cohen. 1989. Catas- natural language supervision. In International confer-
trophic interference in connectionist networks: The ence on machine learning, pages 8748–8763. PMLR.
sequential learning problem. In Psychology of learn-
ing and motivation, volume 24, pages 109–165. Else- Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock-
vier. man, Christine McLeavey, and Ilya Sutskever. 2023.
Robust Speech Recognition via Large-Scale Weak
Xinhao Mei, Chutong Meng, Haohe Liu, Qiuqiang Supervision. In International Conference on Ma-
Kong, Tom Ko, Chengqi Zhao, Mark D Plumbley, chine Learning, ICML 2023, 23-29 July 2023, Hon-
Yuexian Zou, and Wenwu Wang. 2023. Wavcaps: olulu, Hawaii, USA, pages 28492–28518.
A chatgpt-assisted weakly-labelled audio caption-
ing dataset for audio-language multimodal research. Colin Raffel, Noam Shazeer, Adam Roberts, Katherine
arXiv preprint arXiv:2303.17395. Lee, Sharan Narang, Michael Matena, Yanqi Zhou,
Wei Li, and Peter J Liu. 2020. Exploring the limits
Anand Mishra, Shashank Shekhar, Ajeet Kumar Singh, of transfer learning with a unified text-to-text trans-
and Anirban Chakraborty. 2019. Ocr-vqa: Visual former. The Journal of Machine Learning Research,
question answering by reading text in images. In 21(1):5485–5551.
2019 international conference on document analysis
and recognition (ICDAR), pages 947–952. IEEE. Sylvestre-Alvise Rebuffi, Hakan Bilen, and Andrea
Yao Mu, Qinglong Zhang, Mengkang Hu, Wenhai Vedaldi. 2017. Learning multiple visual domains
Wang, Mingyu Ding, Jun Jin, Bin Wang, Jifeng Dai, with residual adapters. Advances in neural informa-
Yu Qiao, and Ping Luo. 2023. Embodiedgpt: Vision- tion processing systems, 30.
language pre-training via embodied chain of thought.
In Thirty-seventh Conference on Neural Information Anthony Robins. 1995. Catastrophic forgetting, re-
Processing Systems. hearsal and pseudorehearsal. Connection Science,
7(2):123–146.
Humza Naveed, Asad Ullah Khan, Shi Qiu, Muham-
mad Saqib, Saeed Anwar, Muhammad Usman, Nick Robin Rombach, Andreas Blattmann, Dominik Lorenz,
Barnes, and Ajmal Mian. 2023. A comprehensive Patrick Esser, and Björn Ommer. 2022. High-
overview of large language models. arXiv preprint resolution image synthesis with latent diffusion mod-
arXiv:2307.06435. els. In Proceedings of the IEEE/CVF conference
on computer vision and pattern recognition, pages
OpenAI. 2022. OpenAI: Introducing ChatGPT. 10684–10695.
Olaf Ronneberger, Philipp Fischer, and Thomas Brox. Shezheng Song, Xiaopeng Li, and Shasha Li. 2023.
2015. U-net: Convolutional networks for biomedical How to Bridge the Gap between Modalities: A Com-
image segmentation. In Medical Image Computing prehensive Survey on Multimodal Large Language
and Computer-Assisted Intervention–MICCAI 2015: Model. arXiv preprint arXiv:2311.07594.
18th International Conference, Munich, Germany,
October 5-9, 2015, Proceedings, Part III 18, pages Yixuan Su, Tian Lan, Huayang Li, Jialu Xu, Yan
234–241. Springer. Wang, and Deng Cai. 2023. Pandagpt: One
model to instruction-follow them all. arXiv preprint
Ludan Ruan and Qin Jin. 2022. Survey: Transformer arXiv:2305.16355.
based video-language pre-training. AI Open, 3:1–13.
Zhiqing Sun, Sheng Shen, Shengcao Cao, Haotian Liu,
Salesforce. 2022. Ulip. Chunyuan Li, Yikang Shen, Chuang Gan, Liang-
Yan Gui, Yu-Xiong Wang, Yiming Yang, et al. 2023.
Christoph Schuhmann, Romain Beaumont, Richard Aligning large multimodal models with factually aug-
Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, mented rlhf. arXiv preprint arXiv:2309.14525.
Theo Coombes, Aarush Katta, Clayton Mullis,
Mitchell Wortsman, et al. 2022. Laion-5b: An open Dídac Surís, Sachit Menon, and Carl Vondrick. 2023.
large-scale dataset for training next generation image- Vipergpt: Visual inference via python execution for
text models. Advances in Neural Information Pro- reasoning. arXiv preprint arXiv:2303.08128.
cessing Systems, 35:25278–25294.
Zineng Tang, Ziyi Yang, Mahmoud Khademi, Yang Liu,
Christoph Schuhmann, Andreas Köpf, Richard Vencu, Chenguang Zhu, and Mohit Bansal. 2023a. CoDi-2:
Theo Coombes, and Romain Beaumont. 2022b. In-Context, Interleaved, and Interactive Any-to-Any
Laion coco: 600m synthetic captions from laion2b- Generation. arXiv preprint arXiv:2311.18775.
en.
Zineng Tang, Ziyi Yang, Chenguang Zhu, Michael Zeng,
Christoph Schuhmann, Richard Vencu, Romain Beau-
and Mohit Bansal. 2023b. Any-to-Any Generation
mont, Robert Kaczmarczyk, Clayton Mullis, Aarush
via Composable Diffusion. In Thirty-seventh Confer-
Katta, Theo Coombes, Jenia Jitsev, and Aran Komat-
ence on Neural Information Processing Systems.
suzaki. 2021. Laion-400m: Open dataset of clip-
filtered 400 million image-text pairs. arXiv preprint
arXiv:2111.02114. Yi Tay, Mostafa Dehghani, Vinh Q Tran, Xavier Gar-
cia, Jason Wei, Xuezhi Wang, Hyung Won Chung,
Dustin Schwenk, Apoorv Khandelwal, Christopher Dara Bahri, Tal Schuster, Steven Zheng, et al. 2022.
Clark, Kenneth Marino, and Roozbeh Mottaghi. 2022. Ul2: Unifying language learning paradigms. In The
A-okvqa: A benchmark for visual question answer- Eleventh International Conference on Learning Rep-
ing using world knowledge. In European Conference resentations.
on Computer Vision, pages 146–162. Springer.
Gemini Team, Rohan Anil, Sebastian Borgeaud,
Piyush Sharma, Nan Ding, Sebastian Goodman, and Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu,
Radu Soricut. 2018. Conceptual captions: A cleaned, Radu Soricut, Johan Schalkwyk, Andrew M Dai,
hypernymed, image alt-text dataset for automatic im- Anja Hauth, et al. 2023. Gemini: a family of
age captioning. In Proceedings of the 56th Annual highly capable multimodal models. arXiv preprint
Meeting of the Association for Computational Lin- arXiv:2312.11805.
guistics (Volume 1: Long Papers), pages 2556–2565.
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Martinet, Marie-Anne Lachaux, Timothée Lacroix,
Weiming Lu, and Yueting Zhuang. 2023. Hugging- Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal
gpt: Solving ai tasks with chatgpt and its friends in Azhar, et al. 2023a. Llama: Open and effi-
huggingface. arXiv preprint arXiv:2303.17580. cient foundation language models. arXiv preprint
arXiv:2302.13971.
Oleksii Sidorov, Ronghang Hu, Marcus Rohrbach, and
Amanpreet Singh. 2020. Textcaps: a dataset for im- Hugo Touvron, Louis Martin, Kevin Stone, Peter Al-
age captioning with reading comprehension. In Com- bert, Amjad Almahairi, Yasmine Babaei, Nikolay
puter Vision–ECCV 2020: 16th European Confer- Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti
ence, Glasgow, UK, August 23–28, 2020, Proceed- Bhosale, et al. 2023b. Llama 2: Open founda-
ings, Part II 16, pages 742–758. Springer. tion and fine-tuned chat models. arXiv preprint
arXiv:2307.09288.
Amanpreet Singh, Vivek Natarajan, Meet Shah,
Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob
and Marcus Rohrbach. 2019. Towards vqa models Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz
that can read. In Proceedings of the IEEE/CVF con- Kaiser, and Illia Polosukhin. 2017. Attention is all
ference on computer vision and pattern recognition, you need. Advances in neural information processing
pages 8317–8326. systems, 30.
Peng Wang, An Yang, Rui Men, Junyang Lin, Shuai Rongtao Xu, Changwei Wang, Jiguang Zhang, Shibiao
Bai, Zhikang Li, Jianxin Ma, Chang Zhou, Jingren Xu, Weiliang Meng, and Xiaopeng Zhang. 2023b.
Zhou, and Hongxia Yang. 2022a. Ofa: Unifying ar- Rssformer: Foreground saliency enhancement for re-
chitectures, tasks, and modalities through a simple mote sensing land-cover segmentation. IEEE Trans-
sequence-to-sequence learning framework. In Inter- actions on Image Processing, 32:1052–1064.
national Conference on Machine Learning, pages
23318–23340. PMLR. Rui Yan, Mike Zheng Shou, Yixiao Ge, Alex Jinpeng
Wang, Xudong Lin, Guanyu Cai, and Jinhui Tang.
Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi 2021. Video-text pre-training with learned regions.
Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei arXiv preprint arXiv:2112.01194.
Zhao, Xixuan Song, et al. 2023. Cogvlm: Visual ex-
pert for pretrained language models. arXiv preprint Jinyu Yang, Jiali Duan, Son Tran, Yi Xu, Sampath
arXiv:2311.03079. Chanda, Liqun Chen, Belinda Zeng, Trishul Chilimbi,
and Junzhou Huang. 2022. Vision-language pre-
Wenhui Wang, Hangbo Bao, Li Dong, Johan training with triple contrastive learning. In Proceed-
Bjorck, Zhiliang Peng, Qiang Liu, Kriti Aggarwal, ings of the IEEE/CVF Conference on Computer Vi-
Owais Khan Mohammed, Saksham Singhal, Subhojit sion and Pattern Recognition, pages 15671–15680.
Som, et al. 2022b. Image as a foreign language: Beit
pretraining for all vision and vision-language tasks. Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin
arXiv preprint arXiv:2208.10442. Lin, Ehsan Azarnasab, Faisal Ahmed, Zicheng Liu,
Ce Liu, Michael Zeng, and Lijuan Wang. 2023. Mm-
Jason Wei, Maarten Bosma, Vincent Zhao, Kelvin Guu, react: Prompting chatgpt for multimodal reasoning
Adams Wei Yu, Brian Lester, Nan Du, Andrew M and action. arXiv preprint arXiv:2303.11381.
Dai, and Quoc V Le. 2021. Finetuned Language
Models are Zero-Shot Learners. In International Qinghao Ye, Haiyang Xu, Guohai Xu, Jiabo Ye,
Conference on Learning Representations. Ming Yan, Yiyang Zhou, Junyang Wang, An-
wen Hu, Pengcheng Shi, Yaya Shi, et al. 2023.
Chenfei Wu, Shengming Yin, Weizhen Qi, Xiaodong mplug-owl: Modularization empowers large lan-
Wang, Zecheng Tang, and Nan Duan. 2023a. guage models with multimodality. arXiv preprint
Visual chatgpt: Talking, drawing and editing arXiv:2304.14178.
with visual foundation models. arXiv preprint
arXiv:2303.04671. Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun,
Tong Xu, and Enhong Chen. 2023a. A Survey on
Haoning Wu, Zicheng Zhang, Erli Zhang, Chaofeng Multimodal Large Language Models. arXiv preprint
Chen, Liang Liao, Annan Wang, Chunyi Li, Wenxiu arXiv:2306.13549.
Sun, Qiong Yan, Guangtao Zhai, et al. 2023b. Q-
bench: A benchmark for general-purpose founda- Zhenfei Yin, Jiong Wang, Jianjian Cao, Zhelun Shi,
tion models on low-level vision. arXiv preprint Dingning Liu, Mukai Li, Lu Sheng, Lei Bai, Xi-
arXiv:2309.14181. aoshui Huang, Zhiyong Wang, et al. 2023b. Lamm:
Language-assisted multi-modal instruction-tuning
Jiahong Wu, He Zheng, Bo Zhao, Yixin Li, Baoming dataset, framework, and benchmark. arXiv preprint
Yan, Rui Liang, Wenjia Wang, Shipei Zhou, Guosen arXiv:2306.06687.
Lin, Yanwei Fu, et al. 2017. Ai challenger: A large-
scale dataset for going deeper in image understanding. Peter Young, Alice Lai, Micah Hodosh, and Julia Hock-
arXiv preprint arXiv:1711.06475. enmaier. 2014. From image descriptions to visual
denotations: New similarity metrics for semantic in-
Jiayang Wu, Wensheng Gan, Zefeng Chen, Shicheng ference over event descriptions. Transactions of the
Wan, and Philip S Yu. 2023c. Multimodal large Association for Computational Linguistics, 2:67–78.
language models: A survey. arXiv preprint
arXiv:2311.13165. Licheng Yu, Patrick Poirson, Shan Yang, Alexander C
Berg, and Tamara L Berg. 2016. Modeling context
Shengqiong Wu, Hao Fei, Leigang Qu, Wei Ji, and in referring expressions. In Computer Vision–ECCV
Tat-Seng Chua. 2023d. Next-gpt: Any-to-any multi- 2016: 14th European Conference, Amsterdam, The
modal llm. arXiv preprint arXiv:2309.05519. Netherlands, October 11-14, 2016, Proceedings, Part
II 14, pages 69–85. Springer.
Jun Xu, Tao Mei, Ting Yao, and Yong Rui. 2016. Msr-
vtt: A large video description dataset for bridging Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang,
video and language. In Proceedings of the IEEE con- Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan
ference on computer vision and pattern recognition, Wang. 2023. Mm-vet: Evaluating large multimodal
pages 5288–5296. models for integrated capabilities. arXiv preprint
arXiv:2308.02490.
Rongtao Xu, Changwei Wang, Jiaxi Sun, Shibiao Xu,
Weiliang Meng, and Xiaopeng Zhang. 2023a. Self Xumin Yu, Lulu Tang, Yongming Rao, Tiejun Huang,
Correspondence Distillation For End-to-End Weakly- Jie Zhou, and Jiwen Lu. 2022. Point-bert: Pre-
Supervised Semantic Segmentation. In Proceedings training 3d point cloud transformers with masked
of the AAAI Conference on Artificial Intelligence. point modeling. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recog- Jeffrey O Zhang, Alexander Sax, Amir Zamir, Leonidas
nition, pages 19313–19322. Guibas, and Jitendra Malik. 2020. Side-tuning: a
baseline for network adaptation via additive side net-
Rowan Zellers, Jiasen Lu, Ximing Lu, Youngjae Yu, works. In Computer Vision–ECCV 2020: 16th Euro-
Yanpeng Zhao, Mohammadreza Salehi, Aditya Kusu- pean Conference, Glasgow, UK, August 23–28, 2020,
pati, Jack Hessel, Ali Farhadi, and Yejin Choi. 2022. Proceedings, Part III 16, pages 698–714. Springer.
Merlot reserve: Neural script knowledge through
vision and language and sound. In Proceedings of Susan Zhang, Stephen Roller, Naman Goyal, Mikel
the IEEE/CVF Conference on Computer Vision and Artetxe, Moya Chen, Shuohui Chen, Christopher De-
Pattern Recognition, pages 16375–16387. wan, Mona Diab, Xian Li, Xi Victoria Lin, et al.
2022b. Opt: Open pre-trained transformer language
Aohan Zeng, Xiao Liu, Zhengxiao Du, Zihan Wang, models. arXiv preprint arXiv:2205.01068.
Hanyu Lai, Ming Ding, Zhuoyi Yang, Yifan Xu,
Wendi Zheng, Xiao Xia, et al. 2022a. GLM-130B: Yanzhe Zhang, Ruiyi Zhang, Jiuxiang Gu, Yufan
An Open Bilingual Pre-trained Model. In The Zhou, Nedim Lipka, Diyi Yang, and Tong Sun.
Eleventh International Conference on Learning Rep- 2023f. Llavar: Enhanced visual instruction tuning
resentations. for text-rich image understanding. arXiv preprint
arXiv:2306.17107.
Yan Zeng, Xinsong Zhang, and Hang Li. 2022b. Multi-
Grained Vision Language Pre-Training: Aligning Bo Zhao, Boya Wu, and Tiejun Huang. 2023a. Svit:
Texts with Visual Concepts. In International Con- Scaling up visual instruction tuning. arXiv preprint
ference on Machine Learning, pages 25994–26009. arXiv:2307.04087.
PMLR.
Liang Zhao, En Yu, Zheng Ge, Jinrong Yang, Hao-
Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, ran Wei, Hongyu Zhou, Jianjian Sun, Yuang Peng,
Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. 2023a. Runpei Dong, Chunrui Han, et al. 2023b. Chatspot:
SpeechGPT: Empowering Large Language Models Bootstrapping multimodal llms via precise referring
with Intrinsic Cross-Modal Conversational Abilities. instruction tuning. arXiv preprint arXiv:2307.09474.
In Findings of the Association for Computational Lin-
Min Zhao, Fan Bao, Chongxuan Li, and Jun Zhu. 2022.
guistics: EMNLP 2023, Singapore, December 6-10,
EGSDE: Unpaired Image-to-Image Translation via
2023, pages 15757–15773.
Energy-Guided Stochastic Differential Equations. In
Duzhen Zhang, Wei Cong, Jiahua Dong, Yahan Yu, Xi- Advances in Neural Information Processing Systems
uyi Chen, Yonggang Zhang, and Zhen Fang. 2023b. 35: Annual Conference on Neural Information Pro-
Continual Named Entity Recognition without Catas- cessing Systems 2022, NeurIPS 2022, New Orleans,
trophic Forgetting. In The 2023 Conference on Em- LA, USA, November 28 - December 9, 2022.
pirical Methods in Natural Language Processing.
Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang,
Duzhen Zhang, Hongliu Li, Wei Cong, Rongtao Xu, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen
Jiahua Dong, and Xiuyi Chen. 2023c. Task relation Zhang, Junjie Zhang, Zican Dong, et al. 2023c. A
distillation and prototypical pseudo label for incre- survey of large language models. arXiv preprint
mental named entity recognition. In Proceedings of arXiv:2303.18223.
the 32nd ACM International Conference on Informa- Yang Zhao, Zhijie Lin, Daquan Zhou, Zilong Huang,
tion and Knowledge Management, pages 3319–3329. Jiashi Feng, and Bingyi Kang. 2023d. Bubogpt: En-
Duzhen Zhang, Yahan Yu, Feilong Chen, and Xiuyi abling visual grounding in multi-modal llms. arXiv
Chen. 2023d. Decomposing Logits Distillation for preprint arXiv:2307.08581.
Incremental Named Entity Recognition. In Proceed- Junhao Zheng, Qianli Ma, Zhen Liu, Binquan Wu, and
ings of the 46th International ACM SIGIR Confer- Huawen Feng. 2024. Beyond Anti-Forgetting: Mul-
ence on Research and Development in Information timodal Continual Instruction Tuning with Positive
Retrieval, pages 1919–1923. Forward Transfer. arXiv preprint arXiv:2401.09181.
Duzhen Zhang, Tielin Zhang, Shuncheng Jia, Qingyu Junhao Zheng, Shengjie Qiu, and Qianli Ma. 2023a.
Wang, and Bo Xu. 2022a. Recent Advances and New Learn or Recall? Revisiting Incremental Learning
Frontiers in Spiking Neural Networks. In Proceed- with Pre-trained Language Models. arXiv preprint
ings of the Thirty-First International Joint Confer- arXiv:2312.07887.
ence on Artificial Intelligence, IJCAI 2022, Vienna,
Austria, 23-29 July 2022, pages 5670–5677. Kaizhi Zheng, Xuehai He, and Xin Eric Wang. 2023b.
Minigpt-5: Interleaved vision-and-language gen-
Hang Zhang, Xin Li, and Lidong Bing. 2023e. Video- eration via generative vokens. arXiv preprint
LLaMA: An Instruction-tuned Audio-Visual Lan- arXiv:2310.02239.
guage Model for Video Understanding. In Proceed-
ings of the 2023 Conference on Empirical Methods Deyao Zhu, Jun Chen, Xiaoqian Shen, Xiang Li, and
in Natural Language Processing, EMNLP 2023 - Mohamed Elhoseiny. 2023a. Minigpt-4: Enhancing
System Demonstrations, Singapore, December 6-10, vision-language understanding with advanced large
2023, pages 543–553. language models. arXiv preprint arXiv:2304.10592.
Wanrong Zhu, Jack Hessel, Anas Awadalla, Samir Yitzhak Gadre, Jesse Dodge, Alex Fang, Youngjae Yu, Ludwig Schmidt, William Yang Wang, and Yejin Choi. 2023b. Multimodal c4: An open, billion-scale corpus of images interleaved with text. arXiv preprint arXiv:2304.06939.
Yuke Zhu, Oliver Groth, Michael Bernstein, and Li Fei-Fei. 2016. Visual7w: Grounded question answering in images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4995–5004.

A Related Surveys

Prior to the emergence of LLMs, several surveys on traditional MM PT have been conducted (Ruan and Jin, 2022; Du et al., 2022a; Long et al., 2022; Chen et al., 2023a). Most of these models entail a substantial computational cost during the PT phase, attributable to end-to-end training using large-scale models and datasets. As a consequence of not incorporating LLMs, these models suffer from deficiencies in instruction following, ICL, CoT, and interactive capabilities. Moreover, the training pipeline
4995–5004. porating LLMs, these models suffer from deficien-
cies in instruction following, ICL, CoT, and inter-
active capabilities. Moreover, the training pipeline
solely encompasses the PT phase without the inclu-
sion of an IT stage.
In recent times, several surveys have emerged
on MM-LLMs. Yin et al. and Wu et al. exclu-
sively delve into early VL understanding models.
Huang et al. place a primary emphasis on visual IT,
while Song et al. focus on modal alignment meth-
ods. Lastly, Cui et al. provide a comprehensive
review of the applications of MM-LLMs within the
realm of autonomous driving.
Compared with their works, the main distinc-
tions are outlined as follows:
• We have comprehensively covered nearly all
MM-LLMs over the past year, including not
only understanding models but also generative
models. Our coverage extends beyond VL
modalities to encompass various modes such
as audio and 3D;
• To offer readers a comprehensive understand-
ing of MM-LLMs, we have introduced a gen-
eral model architecture that incorporates any-
to-any modality transformations, offering a
detailed overview of the functional roles and
implementation choices for each component;
• We have summarized the developmental
trends of existing MM-LLMs and provided
some training recipes that can enhance effec-
tiveness;
• We have established an open-source website
for MM-LLMs researchers, supporting crowd-
sourced updates and aiming to facilitate col-
laboration in the MM-LLMs field. We antic-
ipate that this survey will illuminate future
research in the MM-LLMs domain.

B Mainstream PEFT Methods

PEFT entails maintaining the pre-trained LLM in a frozen state while adjusting only a small number of additional trainable parameters. In the following section, we revisit several representative PEFT methods, where x and h represent the input and output of the original module, and h′ signifies the output of this module when attached with PEFT.

Prefix-tuning (Li and Liang, 2021; Lester et al., 2021) involves the addition of learnable tokens to the keys and values of the attention module. This process is formulated as follows:

h′ = Attn(xWq, [Pk, xWk], [Pv, xWv]),   (6)

with Pk, Pv ∈ R^{l×d} representing two sets of prefix tokens. [·, ·] denotes concatenation, and Attn is defined as:

Attn(Q, K, V) := softmax(QKᵀ / √d) V.

Adapter (Houlsby et al., 2019; He et al., 2021; Rebuffi et al., 2017; Zhang et al., 2020) is typically a residual block consisting of a down-projection matrix A, a nonlinear activation function σ(·), and an up-projection matrix B. It can be inserted into any layer of the pre-trained LLM, formulated as follows:

h′ = h + σ(xA)B.   (7)

LoRA (Hu et al., 2021) is the most commonly used PEFT method. It assumes that the change in parameters occurs within a low-rank space. Given a pre-trained matrix W ∈ R^{c×d}, LoRA learns an incremental update ∆W and decomposes ∆W into a matrix multiplication between two low-rank matrices A ∈ R^{c×r} and B ∈ R^{r×d}, where r ≪ min(c, d). LoRA follows the forward process outlined below:

h = Wx + ∆Wx = Wx + ABx.   (8)

QLoRA (Dettmers et al., 2023) is a quantized LoRA. The underlying principle of QLoRA is the quantization of pre-trained weights to 4 bits, followed by the execution of PEFT using LoRA.

In addition to the aforementioned PEFT methods, there are several others, including AdaptBias (Fu et al., 2022), Compacter (Karimi Mahabadi et al., 2021), and AdapterFormer (Chen et al., 2022a).

C Commonly Used LLMs

The commonly used LLM backbones in existing MM-LLMs research are as follows:

• Flan-T5 (Chung et al., 2022) investigates IT for T5 (Raffel et al., 2020), an encoder-decoder architecture using unified text-to-text training for all natural language processing tasks, exhibiting robust zero-shot and CoT capabilities.

• ChatGLM2 (https://github.com/THUDM/ChatGLM-6B) is a Chinese-English bilingual dialogue model, optimized by an autoregressive mask infilling objective. It is based on the GLM (Du et al., 2022b; Zeng et al., 2022a) architecture and is optimized for Chinese question answering and dialogues.

• UL2 (Tay et al., 2022) is an encoder-decoder model trained with a mixture-of-denoisers objective, surpassing T5 on numerous benchmarks.

• Qwen (Bai et al., 2023a) is trained on large-scale and diverse datasets, with a primary focus on Chinese and English. It employs SFT and RLHF techniques for alignment, resulting in dialogue models such as Qwen-Chat.

• Chinchilla (Hoffmann et al., 2022) is a causal decoder trained on extensive text data. It posits that model size should double for every doubling of training tokens.

• OPT (Zhang et al., 2022b) is a GPT-3 (Brown et al., 2020) clone, striving to release an open-source model that replicates the performance of GPT-3.

• PaLM (Chowdhery et al., 2023) is a causal decoder with parallel attention and feed-forward layers, enabling training speeds up to 15 times faster. Notable changes include RoPE embeddings, SwiGLU activation, and multi-query attention.

• LLaMA (Touvron et al., 2023a) comprises decoder-only models with efficient causal attention.

• LLaMA-2 (Touvron et al., 2023b) focuses on fine-tuning a superior and safer LLaMA-2-Chat model for conversation generation, incorporating 40% more training data, grouped-query attention, and a larger context length.

• Vicuna (Chiang et al., 2023) is a model built on top of LLaMA, utilizing user dialogue data obtained from ShareGPT.com and trained by SFT.
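To make the prefix-tuning and adapter formulations of Appendix B (Eq. 6 and 7) concrete, the following PyTorch sketch implements both as standalone modules. It is an illustrative sketch under our own assumptions; the class names, dimensions, and initialization choices are not taken from any surveyed MM-LLM or its released code.

```python
# Minimal sketches of prefix-tuning (Eq. 6) and a bottleneck adapter (Eq. 7).
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrefixSelfAttention(nn.Module):
    """Single-head self-attention whose keys/values are prepended with
    learnable prefix tokens P_k, P_v; the pre-trained projections stay frozen."""

    def __init__(self, d_model: int, prefix_len: int):
        super().__init__()
        self.d = d_model
        self.w_q = nn.Linear(d_model, d_model, bias=False)
        self.w_k = nn.Linear(d_model, d_model, bias=False)
        self.w_v = nn.Linear(d_model, d_model, bias=False)
        # Only the prefix parameters are trainable.
        self.p_k = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        self.p_v = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)
        for proj in (self.w_q, self.w_k, self.w_v):
            proj.weight.requires_grad = False

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        b = x.size(0)
        q = self.w_q(x)
        k = torch.cat([self.p_k.expand(b, -1, -1), self.w_k(x)], dim=1)
        v = torch.cat([self.p_v.expand(b, -1, -1), self.w_v(x)], dim=1)
        attn = F.softmax(q @ k.transpose(-2, -1) / self.d ** 0.5, dim=-1)
        return attn @ v


class BottleneckAdapter(nn.Module):
    """Residual adapter: h' = h + sigma(h A) B, inserted after a frozen module."""

    def __init__(self, d_model: int, bottleneck: int = 64):
        super().__init__()
        self.down = nn.Linear(d_model, bottleneck)  # down-projection A
        self.up = nn.Linear(bottleneck, d_model)    # up-projection B
        nn.init.zeros_(self.up.weight)              # start as an identity mapping
        nn.init.zeros_(self.up.bias)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return h + self.up(F.relu(self.down(h)))


if __name__ == "__main__":
    x = torch.randn(2, 16, 512)
    print(PrefixSelfAttention(512, prefix_len=8)(x).shape)  # torch.Size([2, 16, 512])
    print(BottleneckAdapter(512)(x).shape)                  # torch.Size([2, 16, 512])
```

Zero-initializing the adapter's up-projection is a common choice so that, at the start of training, the frozen backbone's behavior is unchanged.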
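Similarly, the LoRA update of Eq. (8) can be sketched as a drop-in linear layer in which only the low-rank factors are trainable; QLoRA additionally stores the frozen weight in 4-bit precision, which is only indicated here by a comment. This is a hedged illustration with hypothetical names and hyper-parameters (rank, alpha), not the reference implementation of Hu et al. (2021) or Dettmers et al. (2023).

```python
# Minimal LoRA linear layer (Eq. 8): frozen W plus a trainable rank-r update.
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    def __init__(self, in_features: int, out_features: int, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        # Frozen pre-trained weight W; in QLoRA it would be kept in 4-bit form
        # and dequantized on the fly before this matmul.
        self.weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02,
                                   requires_grad=False)
        # Trainable low-rank factors whose product forms the rank-r update delta_W.
        self.lora_a = nn.Parameter(torch.randn(rank, in_features) * 0.02)
        self.lora_b = nn.Parameter(torch.zeros(out_features, rank))  # delta_W = 0 at init
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base = x @ self.weight.t()                        # W x
        update = (x @ self.lora_a.t()) @ self.lora_b.t()  # rank-r path: A then B
        return base + self.scaling * update               # W x + delta_W x


if __name__ == "__main__":
    layer = LoRALinear(4096, 4096, rank=8)
    trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
    print(trainable)  # 2 * 8 * 4096 trainable parameters instead of 4096 * 4096
    print(layer(torch.randn(2, 10, 4096)).shape)  # torch.Size([2, 10, 4096])
```

The parameter count in the usage example illustrates why LoRA-style updates dominate the MM-LLM training recipes discussed in the main text: the trainable footprint scales with r rather than with the full weight matrix.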
D SOTA MM-LLMs (continued)

(20) LLaVA-1.5 (Liu et al., 2023d) reports simple modifications to the LLaVA framework, including applying an MLP projection and introducing VQA data tailored for academic tasks, along with simple response formatting prompts. These adjustments result in enhanced capabilities for MM understanding.
(21) MiniGPT-v2 (Chen et al., 2023c) is an MM-LLM designed as a unified interface for diverse VL multi-task learning. To create a single model proficient in handling multiple VL tasks, identifiers are incorporated for each task during both training and inference. This facilitates clear task distinction, ultimately enhancing learning efficiency.
(22) CogVLM (Wang et al., 2023) is an open-source MM-LLM that bridges the gap between modalities via a trainable visual expert module within the attention and feed-forward layers. This allows for a deep fusion of MM features without compromising performance on NLP downstream tasks.
(23) DRESS (Chen et al., 2023h) introduces a method using natural language feedback to enhance alignment with human preferences. DRESS extends the conditional reinforcement learning algorithm to integrate non-differentiable natural language feedback, training the model to generate appropriate responses based on feedback.
(24) X-InstructBLIP (Panagopoulou et al., 2023) introduces a cross-modal framework with instruction-aware representations, scalable enough to empower LLMs to handle diverse tasks across multiple modalities, including image/video, audio, and 3D. Notably, it achieves this without the need for modality-specific PT.
(25) CoDi-2 (Tang et al., 2023a) is an MM generation model excelling in modality-interleaved instruction following, in-context generation, and multi-turn user-model interaction. It enhances CoDi (Tang et al., 2023b) to process intricate modality-interleaved inputs and instructions, generating latent features autoregressively.
(26) VILA (Lin et al., 2023) outperforms on vision tasks and shows remarkable reasoning ability while maintaining text-only capabilities. It achieves this by harnessing the full capabilities of LLM learning, using the interleaved attributes of image-text pairs, and implementing meticulous text data re-blending.

E VL Benchmarks

The 18 VL benchmarks presented in Table 2 include OKVQA (Schwenk et al., 2022), IconVQA (Lu et al., 2021), VQAv2 (Goyal et al., 2017), GQA (Hudson and Manning, 2019), VizWiz (Gurari et al., 2018), SQAI: ScienceQA-IMG (Lu et al., 2022), VQAT: TextVQA (Singh et al., 2019), POPE (Li et al., 2023g), MMEP: MME Perception (Fu et al., 2023), MMEC: MME Cognition (Fu et al., 2023), MMB: MMBenchmark (Liu et al., 2023f), MMBCN: MMBench-Chinese (Liu et al., 2023f), SEEDI: SEED-Bench (Image) (Li et al., 2023b), LLaVAW: LLaVA-Bench (In-the-Wild) (Liu et al., 2023a), MM-Vet (Yu et al., 2023), QBench (Wu et al., 2023b), HM: HatefulMemes (Kiela et al., 2020), and VSR (Liu et al., 2023a).

F Training Dataset

The statistics for the MM PT and MM IT datasets are presented in Table 3 and Table 4, respectively.
Dataset Name X Modality #.X #.T #.X-T
ALIGN (Jia et al., 2021) Image 1.8B 1.8B 1.8B
LTIP (Alayrac et al., 2022) Image 312M 312M 312M
MS-COCO (Lin et al., 2014) Image 124K 620K 620K
Visual Genome (Krishna et al., 2017) Image 108K 4.5M 4.5M
CC3M (Sharma et al., 2018) Image 3.3M 3.3M 3.3M
CC12M (Changpinyo et al., 2021) Image 12.4M 12.4M 12.4M
SBU (Ordonez et al., 2011) Image 1M 1M 1M
LAION-5B (Schuhmann et al., 2022) Image 5.9B 5.9B 5.9B
LAION-400M (Schuhmann et al., 2021) Image 400M 400M 400M
LAION-en (Schuhmann et al., 2022) Image 2.3B 2.3B 2.3B
LAION-zh (Schuhmann et al., 2022) Image 142M 142M 142M
LAION-COCO (Schuhmann et al., 2022b) Image 600M 600M 600M
Flickr30k (Young et al., 2014) Image 31K 158K 158K
AI Challenger Captions (Wu et al., 2017) Image 300K 1.5M 1.5M
COYO (Byeon et al., 2022) Image 747M 747M 747M
Wukong (Gu et al., 2022) Image 101M 101M 101M
COCO Caption (Chen et al., 2015) Image 164K 1M 1M
WebLI (Chen et al., 2022b) Image 10B 12B 12B
Episodic WebLI (Chen et al., 2023g) Image 400M 400M 400M
CC595k (Liu et al., 2023e) Image 595K 595K 595K
RefCOCO (Kazemzadeh et al., 2014) Image 20K 142K 142K
RefCOCO+ (Yu et al., 2016) Image 20K 142K 142K
Visual-7W (Zhu et al., 2016) Image 47.3K 328K 328K
OCR-VQA (Mishra et al., 2019) Image 207K 1M 1M
ST-VQA (Biten et al., 2022) Image 23K 32K 32K
DocVQA (Mathew et al., 2021) Image 12K 50K 50K
TextVQA (Singh et al., 2019) Image 28.4K 45.3K 45.3K
DataComp (Gadre et al., 2023) Image 1.4B 1.4B 1.4B
GQA (Hudson and Manning, 2019) Image 113K 22M 22M
VGQA (Krishna et al., 2017) Image 108K 1.7M 1.7M
VQAv2 (Goyal et al., 2017) Image 265K 1.4M 1.4M
DVQA (Kafle et al., 2018) Image 300K 3.5M 3.5M
OK-VQA (Schwenk et al., 2022) Image 14K 14K 14K
A-OKVQA (Schwenk et al., 2022) Image 23.7K 24.9K 24.9K
Text Captions (Sidorov et al., 2020) Image 28K 145K 145K
M3W (Interleaved) (Alayrac et al., 2022) Image 185M 182GB 43.3M (Instances)
MMC4 (Interleaved) (Zhu et al., 2023b) Image 571M 43B 101.2M (Instances)
MSRVTT (Xu et al., 2016) Video 10K 200K 200K
WebVid (Bain et al., 2021) Video 10M 10M 10M
VTP (Alayrac et al., 2022) Video 27M 27M 27M
AISHELL-2 (Chen et al., 2023b) Audio – – 128K
AISHELL-2 (Chen et al., 2023b) Audio – – 1M
WaveCaps (Mei et al., 2023) Audio 403K 403K 403K
VSDial-CN (Chen et al., 2023b) Image, Audio 120K (Image), 1.2M(Audio) 120K 1.2M

Table 3: The statistics for MM PT datasets. #.X represents the quantity of X, #.T represents the quantity of Text,
and #.X-T represents the quantity of X-Text pairs, where X can be Image, Video, or Audio.
Dataset Name Type I→O Source Method Multi-Turn #.I/V/A #.Dialog Turn #.Instance
MiniGPT-4’s IT (Zhu et al., 2023a) SFT I+T→T CC3M, CC12M Auto. % 134M/–/– 1 5K
StableLLaVA (Li et al., 2023f) SFT I+T→T SD (Rombach et al., 2022) Auto.+Manu. % 126K/–/– 1 126K
LLaVA’s IT (Zhang et al., 2023f) SFT I+T→T MS-COCO Auto. " 81K/–/– 2.29 150K
SVIT (Zhao et al., 2023a) SFT I+T→T MS-COCO, Visual Genome Auto. " 108K/–/– 5 3.2M
LLaVAR (Zhang et al., 2023f) SFT I+T→T MS-COCO, CC3M, LAION LLaVA+Auto. " 20K/–/– 2.27 174K
ShareGPT4V (Chen et al., 2023e) SFT I+T→T LCS, COCO, SAM, TextCaps, WikiArt Auto.+Manu. % 100K/–/– – –
DRESS’s IT (Chen et al., 2023h) SFT I+T→T LLaVA’s IT, VLSafe Auto.+Manu. " 193K/–/– ∼4 –
VideoChat’s IT (Li et al., 2023d) SFT V+T→T WebVid Auto. " –/8K/– 1.82 11K
Video-ChatGPT’s IT (Maaz et al., 2023) SFT V+T→T ActivityNet (Caba Heilbron et al., 2015) Inherit " –/100K/– 1 100K
Video-LLaMA’s IT (Zhang et al., 2023e) SFT I/V+T→T MiniGPT-4, LLaVA, and VideoChat’s IT Auto. " 81K/8K/– 2.22 171K
InstructBLIP’s IT (Dai et al., 2023) SFT I/V+T→T Multiple (InstructBLIP’s Figure 2) Auto. % – – ∼1.6M
X-InstructBLIP’s IT (Panagopoulou et al., 2023) SFT I/V/A/3D+T→T Multiple (X-InstructBLIP’s Figure 4) Auto. % – – ∼1.8M
MIMIC-IT (Li et al., 2023a) SFT I/V+T→T Multiple Auto. % 8.1M/502K/– 1 2.8M
PandaGPT’s IT (Su et al., 2023) SFT I+T→T MiniGPT-4 and LLaVA’s IT Inherit " 81K/–/– 2.29 160K
MGVLID (Zhao et al., 2023b) SFT I+B+T→T Multiple Auto.+Manu. % 108K/–/– – 108K
M3 IT (Li et al., 2023e) SFT I/V/B+T→T Multiple Auto.+Manu. % –/–/– 1 2.4M
LAMM (Yin et al., 2023b) SFT I+3D+T→T Multiple Auto.+Manu. " 91K/–/– 3.27 196K
BuboGPT’s IT (Zhao et al., 2023d) SFT (I+A)/A+T→T Clotho, VGGSS Auto. % 5K/–/9K – 9K
mPLUG-DocOwl’s IT (Ye et al., 2023) SFT I/Tab/Web+T→T Multiple Inherit % – – –
T2M (Wu et al., 2023d) SFT T→I/V/A+T WebVid, CC3M, AudioCap Auto. % 4.9K/4.9K/4.9K 1 14.7K
MosIT (Wu et al., 2023d) SFT I+V+A+T→I+V+A+T Youtube, Google, Flickr30k, Midjourney, etc. Auto.+Manu. " 4K/4K/4K 4.8 5K
DRESS’s IT (Chen et al., 2023h) RLHF I+T→T LLaVA’s IT, VLSafe Auto.+Manu. " 33K/–/– ∼4 –

Table 4: The statistics for MM IT datasets. I→O: Input to Output Modalities, T: Text, I: Image, V: Video, A: Audio,
B: Bounding box, 3D: Point Cloud, Tab: Table, and Web: Web page.
