Lijuan Slides Cvpr2024 Fundationmodels

The document discusses recent advancements in Vision Foundation Models, highlighting the significant paradigm shift introduced by CLIP and the emergence of Large Multimodal Models (LMMs) starting with GPT-4V. It covers the evolution of LMMs, the landscape of open-source and proprietary models, and new research areas such as visual prompting and multimodal agents. Additionally, it explores the role of diffusion models in vision representation learning and the capabilities of models like DALL-E 3 and SORA in generating images and videos from descriptive inputs.


Recent Advances in Vision Foundation Models

Lijuan Wang
Microsoft
6/17/2024
Recent Advances in Vision Foundation Models

• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
  • Prelude of LMMs (early vision-language models, Flamingo, GIT)
  • The era of LMMs starts from GPT-4V
  • Landscape of open-source and proprietary LMMs
  • New research areas: grounding LMMs, visual prompting, multimodal agents
• Diffusion models as vision-centered representation learners
  • Your diffusion model is secretly a zero-shot classifier
  • DALL-E 3: reconstructing images from ultra-descriptive captions
  • SORA: video generation models as world simulators
CLIP is a Big Paradigm Shift

Rather than needing handcrafted labels to train a good classifier for a given domain, we can leverage free-form text from the internet to learn a model that is a good classifier for all domains.

Alec Radford, et al. Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020, 26 Feb 2021
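The zero-shot recipe CLIP introduced can be sketched in a few lines: embed the image and a set of natural-language class prompts into a shared space, then classify by cosine similarity. The embeddings below are random stand-ins for what CLIP's actual image and text encoders would produce, so the sketch shows only the decision rule, not the encoders:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs):
    """CLIP-style zero-shot rule: L2-normalize both sides so the dot
    product is cosine similarity, then softmax over classes."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    text_embs = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    z = 100.0 * (text_embs @ image_emb)   # 100 is roughly CLIP's learned temperature
    z -= z.max()                          # numerical stability before exp
    probs = np.exp(z) / np.exp(z).sum()
    return int(np.argmax(probs)), probs

# Stand-in embeddings; a real pipeline would call CLIP's encoders here.
rng = np.random.default_rng(0)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]
text_embs = rng.normal(size=(3, 512))
image_emb = text_embs[1] + 0.1 * rng.normal(size=512)  # near the "cat" prompt
pred, probs = zero_shot_classify(image_emb, text_embs)
print(labels[pred])  # → a photo of a cat
```

Because the class set is just a list of prompts, the same model classifies any domain by swapping the label strings, which is the paradigm shift the slide describes.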
How to Extract Intelligence from the Visual World?

• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
  • Prelude of LMMs (early vision-language models, Flamingo, CoCa, GIT)
  • The dawn of LMMs starts from GPT-4V
  • Landscape of open-source and proprietary LMMs
  • New research areas: grounding LMMs, visual prompting, multimodal agents
• Diffusion models as vision-centered representation learners
  • Your diffusion model is secretly a zero-shot classifier
  • DALL-E 3: reconstructing images from ultra-descriptive captions
  • SORA: video generation models as world simulators
Prelude of LMMs (Early Vision-Language Models)

• Early VLP models depend on pre-trained object detectors to extract visual features offline.
• Newer end-to-end VLP models achieve stronger performance with model and data scaling.
• Upscaled VLP models demonstrate new capabilities such as in-context learning and multimodal few-shot learning.

Timeline: ViLBERT (Aug 6, 2019), LXMERT (Aug 20, 2019), UNITER (Sep 25, 2019), OSCAR (Apr 13, 2020), VinVL (Jan 2, 2021), ALIGN (Feb 11, 2021), CLIP (Feb 26, 2021), Frozen (Jun 25, 2021), ALBEF (Jul 16, 2021), SimVLM (Aug 24, 2021), PICa (Sep 10, 2021), METER (Nov 3, 2021), UNICORN (Nov 23, 2021), LEMON (Nov 24, 2021), GLIP (Dec 7, 2021), OFA (Feb 7, 2022), Flamingo (Apr 29, 2022), CoCa (May 4, 2022), GIT (May 27, 2022)

Credit: CVPR 2022 Tutorial on Vision Foundation Models


GIT: A Generative Image-to-text Transformer for Vision and Language
Flamingo: a Visual Language Model for Few-Shot Learning
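GIT's design (an image encoder feeding a text decoder trained with a plain language-modeling loss) reduces captioning to autoregressive decoding conditioned on image tokens. The sketch below shows that decoding loop with a toy scorer standing in for the transformer decoder; `toy_scorer`, the vocabulary, and the target caption are all made up for illustration:

```python
import numpy as np

def greedy_caption(image_tokens, score_fn, vocab, max_len=8, eos="</s>"):
    """Greedy autoregressive decoding in the GIT style: generate the caption
    one token at a time, conditioned on image tokens plus the text so far.
    `score_fn(image_tokens, text) -> logits over vocab` stands in for the
    transformer decoder."""
    text = []
    for _ in range(max_len):
        logits = score_fn(image_tokens, text)
        tok = vocab[int(np.argmax(logits))]
        if tok == eos:
            break
        text.append(tok)
    return " ".join(text)

# Toy scorer: always prefers the next word of a fixed caption, then EOS.
vocab = ["a", "dog", "on", "grass", "</s>"]
target = ["a", "dog", "on", "grass"]
def toy_scorer(image_tokens, text):
    logits = np.zeros(len(vocab))
    nxt = target[len(text)] if len(text) < len(target) else "</s>"
    logits[vocab.index(nxt)] = 1.0
    return logits

caption = greedy_caption(image_tokens=None, score_fn=toy_scorer, vocab=vocab)
print(caption)  # → a dog on grass
```

In the real model, `score_fn` would be a transformer whose input is projected image features concatenated with the caption tokens generated so far.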
The Era of LMMs Starts from GPT-4V

"The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)," work from our colleagues at Microsoft, covers a plethora of practical observations and strategies for using GPT-4V.

https://openai.com/contributions/gpt-4v/
GPT-4V Emerging Capability Highlights

• LMM emergent capabilities: visual pointing, spot the difference, interleaved image-text sequences

[1] "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)."
GPT-4V Emerging Capability Highlights

• Genericity: table, GUI, coding, video, grocery, embodied, etc.

[1] "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)."
Evolution of LMMs (A Surge of Open-Source LMMs since GPT-4V)

Credit: CVPR 2023 Tutorial on Vision Foundation Models
LLaVA: Visual Instruction Tuning
Landscape of LMMs (Open-Source LMMs and Proprietary LMMs)

On the MMMU leaderboard (as of 6/13/2024):
• 30 open-source LMMs
• 21 proprietary LMMs

X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI.
Rapid Progress in LMMs

• MM-Vet: evaluating integrated vision-language capabilities
  • GPT-4o: 69.3
  • GPT-4V: 67.7

[1] Yu, Weihao, et al. "MM-Vet: Evaluating large multimodal models for integrated capabilities." ICML 2024. https://github.com/yuweihao/MM-Vet
I/O Modality of GPT-4o

GPT-4o ("o" for "omni") is a step towards much more natural human-computer interaction—it accepts as input any combination of text, audio, image, and video and generates any combination of text, audio, and image outputs.

It would be interesting to investigate what additional capabilities a model combining all these modalities could achieve beyond current capabilities, and how to make the native integration efficient so that different modalities enhance each other.

https://openai.com/index/hello-gpt-4o/
LMM-inspired new research area -- Visual Prompting

[1] Yang, Jianwei, et al. "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V." https://som-gpt4v.github.io/
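Set-of-Mark prompting overlays numbered marks on segmented image regions so the LMM can ground its answer by referring to a mark number. The sketch below captures only the prompt-construction half in text form; the real method draws the numbers onto the pixels using a segmentation model, and the region descriptions, boxes, and function names here are all illustrative stand-ins:

```python
def build_som_prompt(regions, question):
    """Assign a numeric mark to each region and build a prompt that asks
    the model to answer by mark number. `regions` is a list of
    (description, box) pairs standing in for pixel-level overlays."""
    marks = {i + 1: desc for i, (desc, _box) in enumerate(regions)}
    legend = "\n".join(f"[{i}] {desc}" for i, desc in marks.items())
    prompt = (
        "The image is annotated with numbered marks:\n"
        f"{legend}\n{question} Answer with a mark number."
    )
    return marks, prompt

regions = [("red mug", (40, 60, 120, 140)), ("laptop", (200, 50, 520, 300))]
marks, prompt = build_som_prompt(regions, "Which object can hold coffee?")
print(prompt)
```

The point of the numbering is that the model's answer ("mark 1") maps unambiguously back to a region, which is what makes the grounding verifiable.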
LMM-inspired new research area -- Multimodal Agent

• Agents with multimodal memory: MM-Narrator (audio description)
• Actionable agents: MM-Navigator (GUI navigation)
• Agents with feedback: visual design & creation

https://multimodal-vid.github.io/
Recent Advances in Vision Foundation Models

• Looking back, CLIP was a big paradigm shift
• The past year was the year of LMMs
  • Prelude of LMMs (early vision-language models, Flamingo, CoCa, GIT)
  • The era of LMMs starts from GPT-4V
  • Landscape of open-source and proprietary LMMs
  • New research areas: grounding LMMs, visual prompting, multimodal agents
• Diffusion models as vision-centered representation learners
  • Your diffusion model is secretly a zero-shot classifier
  • DALL-E 3: reconstructing images from ultra-descriptive captions
  • SORA: video generation models as world simulators
Your Diffusion Model is Secretly a Zero-Shot Classifier

Alexander C. Li, et al. Your Diffusion Model is Secretly a Zero-Shot Classifier. https://arxiv.org/abs/2303.16203, 28 Mar 2023
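The paper's core idea is that a conditional diffusion model already defines a classifier: pick the class c that minimizes the expected noise-prediction error over timesteps and noise draws, argmin_c E[||ε − ε_θ(x_t, t, c)||²]. The toy below replaces the learned denoiser with its closed-form optimum for single-prototype classes, so it demonstrates the decision rule rather than Stable Diffusion itself; all names, prototypes, and schedule values are illustrative:

```python
import numpy as np

def diffusion_classify(x0, prototypes, alphabars, rng, n_samples=64):
    """Diffusion Classifier rule: choose the class whose conditional
    denoiser best predicts the added noise. Toy denoiser: if class c's
    data were exactly `prototypes[c]`, the ideal noise prediction is
    (x_t - sqrt(ab) * prototypes[c]) / sqrt(1 - ab)."""
    errs = np.zeros(len(prototypes))
    for _ in range(n_samples):
        ab = rng.choice(alphabars)           # alpha-bar at a random timestep
        eps = rng.normal(size=x0.shape)      # forward-process noise
        x_t = np.sqrt(ab) * x0 + np.sqrt(1 - ab) * eps
        for c, mu in enumerate(prototypes):
            eps_hat = (x_t - np.sqrt(ab) * mu) / np.sqrt(1 - ab)
            errs[c] += np.mean((eps - eps_hat) ** 2)
    return int(np.argmin(errs))

rng = np.random.default_rng(0)
prototypes = [np.array([1.0, 0.0]), np.array([0.0, 1.0])]  # class 0, class 1
alphabars = np.linspace(0.1, 0.9, 9)
pred = diffusion_classify(np.array([0.05, 0.95]), prototypes, alphabars, rng)
print(pred)  # → 1
```

In the actual method, `eps_hat` comes from running the text-conditioned denoiser once per candidate class prompt, which is why generation quality translates directly into zero-shot classification accuracy.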
DALL-E 3: Reconstruct Image from Ultra-descriptive Caption

Timeline: SDXL (July 2023), DALL-E 3 (Sept. 2023)

"Golden hour illuminates blooming cherry blossom trees around a pond. In the distance, a building with Japanese-inspired architecture is perched on the lake. In the pond, a group of people enjoying the serenity of the sunset in a rowboat. A woman underneath a cherry blossom tree is setting up a picnic on a yellow checkered blanket."

• Takeaway from DALL-E 3: training on ultra-descriptive captions makes the model more compute-efficient, even when we measure the quality of samples produced with shorter captions.
• This suggests that we can get better unconditional models by using ultra-descriptive captions as scaffolding, even if we don't use them at inference.

-- Aditya Ramesh
https://openai.com/index/dall-e-3/
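The takeaway above is about the training recipe: blending original alt text with synthetic ultra-descriptive captions rather than committing to one source. A minimal sketch of such a blend, where the 95% ratio, field names, and captions are illustrative parameters, not values taken from this deck:

```python
import random

def pick_caption(example, p_synthetic=0.95, rng=random):
    """Caption-blending sketch: with probability p_synthetic, train on the
    ultra-descriptive synthetic caption; otherwise fall back to the
    original alt text. The ratio is an illustrative knob."""
    if example.get("synthetic") and rng.random() < p_synthetic:
        return example["synthetic"]
    return example["original"]

rng = random.Random(0)
ex = {
    "original": "cherry blossoms by a pond",
    "synthetic": "Golden hour illuminates blooming cherry blossom trees around a pond...",
}
frac = sum(
    pick_caption(ex, 0.95, rng).startswith("Golden") for _ in range(10_000)
) / 10_000
print(frac)  # close to 0.95
```

Keeping a slice of original captions prevents the model from forgetting the short, casual caption distribution users actually type at inference time.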
Video Generation Models as World Simulators

Timeline: VDM (2022), Emu Video (2023), Sora (2024)

SORA: "Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world."

https://openai.com/index/video-generation-models-as-world-simulators/
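Sora's report describes representing videos as spacetime patches that a transformer operates on, the video analogue of ViT's image patches. A minimal patchify sketch, where the patch sizes are arbitrary choices rather than Sora's actual configuration:

```python
import numpy as np

def spacetime_patches(video, pt=2, ph=4, pw=4):
    """Split a video of shape (T, H, W, C) into flattened spacetime patches
    of pt frames x ph x pw pixels, the token representation the report
    describes. Returns an array of shape (num_patches, pt*ph*pw*C)."""
    T, H, W, C = video.shape
    assert T % pt == 0 and H % ph == 0 and W % pw == 0
    v = video.reshape(T // pt, pt, H // ph, ph, W // pw, pw, C)
    v = v.transpose(0, 2, 4, 1, 3, 5, 6)  # bring the three patch axes together
    return v.reshape(-1, pt * ph * pw * C)

video = np.zeros((8, 16, 16, 3))   # 8 frames of 16x16 RGB
tokens = spacetime_patches(video)
print(tokens.shape)  # → (64, 96)
```

Because tokenization is just a reshape, the same transformer can consume videos of varying duration and resolution by varying the number of patches, which is one of the flexibility arguments the report makes.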
Recent Advances in Vision Foundation Models

• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
  • Prelude of LMMs (early vision-language models, Flamingo, GIT)
  • The era of LMMs starts from GPT-4V
  • Landscape of open-source and proprietary LMMs
  • New research areas: grounding LMMs, visual prompting, multimodal agents
• Diffusion models as vision-centered representation learners
  • Your diffusion model is secretly a zero-shot classifier
  • DALL-E 3: reconstructing images from ultra-descriptive captions
  • SORA: video generation models as world simulators
Discussion: World Model

Thank You!
