Recent Advances in Vision Foundation Models
Lijuan Wang
Microsoft
6/17/2024
Recent Advances in Vision Foundation Models
• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
• Prelude of LMMs (Early Vision-Language Models, Flamingo, GIT)
• The era of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
CLIP is a Big Paradigm Shift
Rather than needing handcrafted labels to train a good classifier for a given domain, we can leverage free-form text from the internet to learn a model that is a good classifier for all domains.
Alec Radford, et al. Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020, 26 Feb. 2021
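To make the paradigm shift concrete, here is a minimal zero-shot classification sketch using a public CLIP checkpoint through the Hugging Face transformers API; the image path and the candidate label prompts are illustrative, and any set of free-form text prompts can play the role of the class labels.

```python
# Minimal sketch: CLIP as a zero-shot classifier (Hugging Face transformers API).
# The checkpoint, image path, and label prompts are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any image on disk (hypothetical path)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Embed the image and the free-form text prompts, then compare them in
# CLIP's shared embedding space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# per-label probabilities -- a classifier with no task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Swapping in a different prompt list turns the same frozen model into a classifier for a different domain, with no retraining.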
How to extract intelligence from the visual world?
• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
• Prelude of LMMs (Early Vision-Language Models, Flamingo, CoCa,
GIT)
• The dawn of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
Prelude of LMMs (Early Vision-Language Models)
• Early VLP models depend on pre-trained object detectors to extract visual
features offline.
• Newer end-to-end VLP models achieve stronger performance with model
and data scaling.
• Scaled-up VLP models demonstrate new capabilities such as in-context
learning and multimodal few-shot learning.
Timeline (chronological): ViLBERT (Aug. 6, 2019), LXMERT (Aug. 20, 2019), UNITER (Sep. 25, 2019), OSCAR (Apr. 13, 2020), VinVL (Jan. 2, 2021), ALIGN (Feb. 11, 2021), CLIP (Feb. 26, 2021), Frozen (Jun. 25, 2021), ALBEF (Jul. 16, 2021), SimVLM (Aug. 24, 2021), PICa (Sep. 10, 2021), METER (Nov. 3, 2021), UNICORN (Nov. 23, 2021), LEMON (Nov. 24, 2021), GLIP (Dec. 7, 2021), OFA (Feb. 7, 2022), Flamingo (Apr. 29, 2022), CoCa (May 4, 2022), GIT (May 27, 2022)
Credit: CVPR 2022 Tutorial on Vision Foundation Models
GIT: A Generative Image-to-text Transformer for Vision and Language
Flamingo: a Visual Language Model for Few-Shot Learning
The Era of LMMs starts from GPT-4V
“The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”, a report from our colleagues at Microsoft, covers a plethora of practical observations and strategies for using GPT-4V.
https://openai.com/contributions/gpt-4v/
GPT-4V Emerging Capability Highlights
• LMM emergent capabilities
Examples: visual pointing, spot the difference, interleaved image-text sequences
[1] "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)."
GPT-4V Emerging Capability Highlights
• Genericity: Table, GUI, Coding, Video, Grocery, Embodied, etc.
[1] "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)."
Evolution of LMMs (A Surge of Open-Source LMMs since GPT-4V)
Credit: CVPR 2023 Tutorial on Vision Foundation Models
LLaVA: Visual Instruction Tuning
Landscape of LMMs (Open-Source LMMs and Proprietary LMMs)
On MMMU leaderboard
• 30 Open-Source LMMs
• 21 Proprietary LMMs
-- as of 6/13/2024
X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. CVPR 2024.
Rapid Progress in LMMs
• MM-Vet: Evaluating integrated vision-language capabilities
GPT-4o: 69.3
GPT-4V: 67.7
[1] Yu, Weihao, et al. "MM-Vet: Evaluating large multimodal models for integrated capabilities." ICML 2024. https://github.com/yuweihao/MM-Vet
I/O Modality of GPT-4o
GPT-4o (“o” for “omni”) is a step towards much more natural
human-computer interaction—it accepts as input any combination
of text, audio, image, and video and generates any combination of
text, audio, and image outputs.
It would be interesting to investigate what additional capabilities a
model combining all these modalities could achieve beyond current
capabilities and how to make the native integration efficient so that
different modalities enhance each other.
https://openai.com/index/hello-gpt-4o/
LMM-inspired new research area: Visual Prompting
[1] Yang, Jianwei, et al. "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V." https://som-gpt4v.github.io/
LMM-inspired new research area: Multimodal Agent
• Agents with multimodal memory: MM-Narrator (audio description)
• Actionable agents: MM-Navigator (GUI navigation)
• Agents with feedback: visual design & creation
https://multimodal-vid.github.io/
Recent Advances in Vision Foundation Models
• Looking back, CLIP was a big paradigm shift
• The past year was the year of LMMs
• Prelude of LMMs (Early Vision-Language Models, Flamingo, CoCa,
GIT)
• The era of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
Your Diffusion Model is Secretly a Zero-Shot Classifier
Alexander C. Li, et al. Your Diffusion Model is Secretly a Zero-Shot Classifier. https://arxiv.org/abs/2303.16203, 28 Mar 2023
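As a rough illustration of the paper's idea, the sketch below scores each candidate label by how well a Stable Diffusion denoiser, conditioned on that label's caption, predicts the noise added to the image latent; the label with the lowest average denoising error is chosen. The checkpoint name, prompt template, and the tiny number of sampled (timestep, noise) pairs are illustrative; the paper evaluates the same noise samples under every class and averages over many more trials.

```python
# Minimal sketch of "diffusion model as zero-shot classifier" (Li et al., 2023),
# built from Stable Diffusion components in the diffusers library.
# Checkpoint, prompt template, and n_trials are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
vae, unet, scheduler = pipe.vae, pipe.unet, pipe.scheduler
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder

@torch.no_grad()
def classify(image, labels, n_trials=4):
    """image: (1, 3, 512, 512) tensor scaled to [-1, 1]; labels: list of class names."""
    # Encode the image into the latent space the denoiser operates on.
    latents = vae.encode(image.to(device)).latent_dist.mean * vae.config.scaling_factor
    errors = []
    for label in labels:
        # Condition the denoiser on a caption describing this candidate class.
        tok = tokenizer(f"a photo of a {label}", padding="max_length",
                        max_length=tokenizer.model_max_length, return_tensors="pt")
        text_emb = text_encoder(tok.input_ids.to(device))[0]
        err = 0.0
        for _ in range(n_trials):
            # Add noise at a random timestep and ask the UNet to predict it.
            t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=device)
            noise = torch.randn_like(latents)
            noisy = scheduler.add_noise(latents, noise, t)
            pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
            err += F.mse_loss(pred, noise).item()
        errors.append(err / n_trials)
    # The best-explained label (lowest conditional denoising error) wins.
    return labels[int(torch.tensor(errors).argmin())]
```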
SDXL (July 2023), DALL-E 3 (Sept. 2023)
DALL-E 3: Reconstruct Image from Ultra-descriptive Caption
“Golden hour illuminates blooming cherry blossom
trees around a pond.
In the distance, a building with Japanese-inspired
architecture is perched on the lake.
In the pond, a group of people enjoying the serenity
of the sunset in a rowboat.
A woman underneath a cherry blossom tree is setting up a picnic on a yellow checkered blanket.”
• Takeaway from DALL-E 3: training on ultra-descriptive captions makes the model more compute-efficient, even when we measure the quality of samples produced with shorter captions
• Suggests that we can get better unconditional
models by using UD captions as scaffolding, even if
we don’t use UD captions at inference
-- Aditya Ramesh
https://openai.com/index/dall-e-3/
Video Generation Models as World Simulators
Timeline: VDM (2022), Emu Video (2023), Sora (2024)
SORA: Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
https://openai.com/index/video-generation-models-as-world-simulators/
Recent Advances in Vision Foundation Models
• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
• Prelude of LMMs (Early Vision-Language Models, Flamingo, GIT)
• The era of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
Discussion: World Model
Thank You!