Recent Advances in Vision Foundation Models
Lijuan Wang
Microsoft
6/17/2024
Recent Advances in Vision Foundation Models
• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
• Prelude of LMMs (Early Vision-Language Models, Flamingo, GIT)
• The era of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
CLIP is a Big Paradigm Shift
Rather than needing handcrafted labels to train a good classifier for a given domain, we can leverage free-form text from the internet to learn a model that is a good classifier for all domains.
Alec Radford, et al. Learning Transferable Visual Models From Natural Language Supervision. https://arxiv.org/abs/2103.00020, 26 Feb. 2021
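To make the paradigm shift concrete, here is a minimal zero-shot classification sketch using a public CLIP checkpoint through the Hugging Face transformers API; the image path and the candidate label prompts are illustrative, and any set of free-form text prompts can play the role of the class labels.

```python
# Minimal sketch: CLIP as a zero-shot classifier (Hugging Face transformers API).
# The checkpoint, image path, and label prompts are illustrative.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("photo.jpg")  # any image on disk (hypothetical path)
labels = ["a photo of a dog", "a photo of a cat", "a photo of a car"]

# Embed the image and the free-form text prompts, then compare them in
# CLIP's shared embedding space.
inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds image-text similarity scores; softmax turns them into
# per-label probabilities -- a classifier with no task-specific training.
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```

Swapping in a different prompt list turns the same frozen model into a classifier for a different domain, with no retraining.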
How to extract intelligence from the visual world?
• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
• Prelude of LMMs (Early Vision-Language Models, Flamingo, CoCa,
GIT)
• The dawn of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
Prelude of LMMs (Early Vision-Language Models)
• Early VLP models depend on pre-trained object detectors to extract visual
features offline.
• Newer end-to-end VLP models achieve stronger performance with model
and data scaling.
• Scaled-up VLP models demonstrate new capabilities such as in-context
learning and multimodal few-shot learning.
Timeline (chronological): ViLBERT (Aug. 6, 2019), LXMERT (Aug. 20, 2019), UNITER (Sep. 25, 2019), OSCAR (Apr. 13, 2020), VinVL (Jan. 2, 2021), ALIGN (Feb. 11, 2021), CLIP (Feb. 26, 2021), Frozen (Jun. 25, 2021), ALBEF (Jul. 16, 2021), SimVLM (Aug. 24, 2021), PICa (Sep. 10, 2021), METER (Nov. 3, 2021), UNICORN (Nov. 23, 2021), LEMON (Nov. 24, 2021), GLIP (Dec. 7, 2021), OFA (Feb. 7, 2022), Flamingo (Apr. 29, 2022), CoCa (May 4, 2022), GIT (May 27, 2022)
Credit: CVPR 2022 Tutorial on Vision Foundation Models
GIT: A Generative Image-to-text Transformer for Vision and Language
Flamingo: a Visual Language Model for Few-Shot Learning
The Era of LMMs starts from GPT-4V
“The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)”, a report from our colleagues at Microsoft, covers a plethora of practical observations and strategies for using GPT-4V.
https://openai.com/contributions/gpt-4v/
GPT-4V Emerging Capability Highlights
• LMM emergent capabilities
Examples: visual pointing, spot the difference, interleaved image-text sequences
[1] "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)."
GPT-4V Emerging Capability Highlights
• Genericity: Table, GUI, Coding, Video, Grocery, Embodied, etc.
[1] "The Dawn of LMMs: Preliminary Explorations with GPT-4V(ision)."
Evolution of LMMs (A Surge of Open-Source LMMs since GPT-4V)
Credit: CVPR 2023 Tutorial on Vision Foundation Models
LLaVA: Visual Instruction Tuning
Landscape of LMMs (Open-Source LMMs and Proprietary LMMs)
On MMMU leaderboard
• 30 Open-Source LMMs
• 21 Proprietary LMMs
-- as of 6/13/2024
X. Yue, Y. Ni, K. Zhang, T. Zheng, R. Liu, G. Zhang, S. Stevens, et al. MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark for Expert AGI. CVPR 2024.
Rapid Progress in LMMs
• MM-Vet: Evaluating integrated vision-language capabilities
GPT-4o: 69.3
GPT-4V: 67.7
[1] Yu, Weihao, et al. "MM-Vet: Evaluating large multimodal models for integrated capabilities." ICML 2024. https://github.com/yuweihao/MM-Vet
I/O Modality of GPT-4o
GPT-4o (“o” for “omni”) is a step towards much more natural
human-computer interaction—it accepts as input any combination
of text, audio, image, and video and generates any combination of
text, audio, and image outputs.
It would be interesting to investigate what additional capabilities a
model combining all these modalities could achieve beyond current
capabilities and how to make the native integration efficient so that
different modalities enhance each other.
https://openai.com/index/hello-gpt-4o/
LMM-inspired new research area: Visual Prompting
[1] Yang, Jianwei, et al. "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V." https://som-gpt4v.github.io/
LMM-inspired new research area: Multimodal Agent
• Agents with multimodal memory: MM-Narrator (audio description)
• Actionable agents: MM-Navigator (GUI navigation)
• Agents with feedback: visual design & creation
https://multimodal-vid.github.io/
Recent Advances in Vision Foundation Models
• Looking back, CLIP was a big paradigm shift
• The past year was the year of LMMs
• Prelude of LMMs (Early Vision-Language Models, Flamingo, CoCa,
GIT)
• The era of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
Your Diffusion Model is Secretly a Zero-Shot Classifier
Alexander C. Li, et al. Your Diffusion Model is Secretly a Zero-Shot Classifier. https://arxiv.org/abs/2303.16203, 28 Mar 2023
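As a rough illustration of the paper's idea, the sketch below scores each candidate label by how well a Stable Diffusion denoiser, conditioned on that label's caption, predicts the noise added to the image latent; the label with the lowest average denoising error is chosen. The checkpoint name, prompt template, and the tiny number of sampled (timestep, noise) pairs are illustrative; the paper evaluates the same noise samples under every class and averages over many more trials.

```python
# Minimal sketch of "diffusion model as zero-shot classifier" (Li et al., 2023),
# built from Stable Diffusion components in the diffusers library.
# Checkpoint, prompt template, and n_trials are illustrative, not the paper's exact setup.
import torch
import torch.nn.functional as F
from diffusers import StableDiffusionPipeline

device = "cuda" if torch.cuda.is_available() else "cpu"
pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5").to(device)
vae, unet, scheduler = pipe.vae, pipe.unet, pipe.scheduler
tokenizer, text_encoder = pipe.tokenizer, pipe.text_encoder

@torch.no_grad()
def classify(image, labels, n_trials=4):
    """image: (1, 3, 512, 512) tensor scaled to [-1, 1]; labels: list of class names."""
    # Encode the image into the latent space the denoiser operates on.
    latents = vae.encode(image.to(device)).latent_dist.mean * vae.config.scaling_factor
    errors = []
    for label in labels:
        # Condition the denoiser on a caption describing this candidate class.
        tok = tokenizer(f"a photo of a {label}", padding="max_length",
                        max_length=tokenizer.model_max_length, return_tensors="pt")
        text_emb = text_encoder(tok.input_ids.to(device))[0]
        err = 0.0
        for _ in range(n_trials):
            # Add noise at a random timestep and ask the UNet to predict it.
            t = torch.randint(0, scheduler.config.num_train_timesteps, (1,), device=device)
            noise = torch.randn_like(latents)
            noisy = scheduler.add_noise(latents, noise, t)
            pred = unet(noisy, t, encoder_hidden_states=text_emb).sample
            err += F.mse_loss(pred, noise).item()
        errors.append(err / n_trials)
    # The best-explained label (lowest conditional denoising error) wins.
    return labels[int(torch.tensor(errors).argmin())]
```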
SDXL (July 2023), DALL-E 3 (Sept. 2023)
DALL-E 3: Reconstruct Image from Ultra-descriptive Caption
“Golden hour illuminates blooming cherry blossom
trees around a pond.
In the distance, a building with Japanese-inspired
architecture is perched on the lake.
In the pond, a group of people enjoying the serenity
of the sunset in a rowboat.
A woman underneath a cherry blossom tree is setting up a picnic on a yellow checkered blanket.”
• Takeaway from DALL-E 3: training on ultra-descriptive captions makes the model more compute-efficient, even when we measure the quality of samples produced with shorter captions
• Suggests that we can get better unconditional
models by using UD captions as scaffolding, even if
we don’t use UD captions at inference
-- Aditya Ramesh
https://openai.com/index/dall-e-3/
Video Generation Models as World Simulators
Timeline: VDM (2022), Emu Video (2023), Sora (2024)
SORA: Our results suggest that scaling video generation models is a promising path towards building general purpose simulators of the physical world.
https://openai.com/index/video-generation-models-as-world-simulators/
Recent Advances in Vision Foundation Models
• Looking back, CLIP was a big paradigm shift
• LMMs extend LLMs with multi-sensory skills to achieve generic intelligence
• Prelude of LMMs (Early Vision-Language Models, Flamingo, GIT)
• The era of LMMs starts from GPT-4V
• Landscape of open-source and proprietary LMMs
• New research areas: Grounding LMMs, Visual prompting,
Multimodal Agent
• Diffusion model as a vision-centered representation learner
• Your diffusion model is secretly a zero-shot classifier
• DALL-E 3: reconstruct image from ultra-descriptive caption
• SORA: video generation models as world simulators
Discussion: World Model
Thank You!