Vision Transformers:
Revolutionizing
Computer Vision
Transformers revolutionized NLP, and now they're transforming Vision too.
RISHAV GOYAL
2023UGCS116
Background: From CNNs
to Transformers
CNNs long dominated computer vision for local pattern recognition, but
their limitations paved the way for Transformers.
Limitations of CNNs
Struggle with long-range dependencies.
High computational costs for complex tasks.
Inductive bias limits flexibility.
Poor global understanding.
Why Explore Transformers?
Capture global relationships via self-attention.
Minimal inductive bias for flexible learning.
Highly scalable for large data.
Proven success in NLP suggests strong potential for vision.
What is a Transformer?
(Quick Recap)
The core of a Transformer is its powerful self-attention mechanism,
allowing it to weigh different parts of the input sequence.
Its key innovation is learning explicit relationships between all input
"tokens" simultaneously, capturing global dependencies effectively.
In computer vision, this idea carries over to images by treating fixed-size
image patches as individual tokens.
Transformers process input sequences in parallel, unlike sequential
models, significantly enhancing training speed and scalability.
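As a quick illustration, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; the projection matrices, token count, and dimensions are illustrative assumptions, not details from the slides.

```python
# Minimal single-head scaled dot-product self-attention (illustrative sketch).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, tokens, dim); every token attends to every other token.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)   # attention weights over all tokens
    return weights @ v                    # weighted sum of value vectors

dim = 8
x = torch.randn(1, 4, dim)                # 4 tokens of dimension 8 (assumed sizes)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([1, 4, 8])
```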
Vision Transformer
(ViT): Core Idea
01 Image Patching
Image divided into small, fixed-size patches.
02 Flatten & Embed
Each patch flattened and linearly embedded.
03 Positional Encoding
Positional information added to patch embeddings.
04 Transformer Encoder
Embedded patches fed into the Transformer encoder.
05 Self-Attention
Encoder processes patches using self-attention.
06 Global Context
Learns global relationships across the image.
07 Classification
Final classification via a special class token.
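A hedged sketch of steps 01-03 follows: splitting an image into 16x16 patches, flattening and linearly embedding each one, and adding a positional encoding. The image size, patch size, and embedding dimension are assumptions chosen to mirror a common ViT-Base-style setup.

```python
# Hypothetical sketch of patching, flattening, embedding, and positional encoding.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_size, embed_dim = 16, 768
num_patches = (224 // patch_size) ** 2      # 196 patches

# Step 01: unfold the image into 16x16 patches; Step 02 (part 1): flatten each patch.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

# Step 02 (part 2): linear embedding of each flattened patch.
embed = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = embed(patches)                     # (1, 196, 768)

# Step 03: add learnable positional encodings (wrapped as a Parameter for training).
pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens = tokens + pos
print(tokens.shape)                         # torch.Size([1, 196, 768])
```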
Architecture of ViT
Patch embeddings: Image patches are mapped to embedding vectors via a linear
projection.
Positional encoding: Adds spatial information to patches, as order is lost
during flattening.
Transformer encoder blocks: Process embedded patches using multi-head self-attention and FFNs.
Class token & classification head: A learnable token gathers global
information, feeding into the final classifier.
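To make these pieces concrete, here is a minimal, hypothetical ViT-style classifier assembled from PyTorch building blocks: a strided convolution for patch embedding, a learnable [CLS] token and positional embedding, nn.TransformerEncoder for the encoder blocks, and a linear classification head. All hyperparameters are illustrative assumptions, not the original ViT configuration.

```python
# A toy ViT-style classifier sketch (assumed hyperparameters, not the original ViT).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=4, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch embedding as a strided convolution (equivalent to flatten + linear projection).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))           # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # MHSA + FFN blocks
        self.head = nn.Linear(dim, num_classes)                         # classification head

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)         # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed       # prepend [CLS], add positions
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                                  # classify from the [CLS] token

model = TinyViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```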
Workflow of Vision
Transformers (Detailed)
Stage 1: Input image split into N patches
Stage 2: Each patch → linear embedding → add positional encoding
Stage 3: Transformer encoder processes tokens with self-attention
Stage 4: [CLS] token aggregates info → classifier predicts label
Positional encoding preserves spatial information
Multi-head attention allows for different relationship types
Feed-forward networks process features after attention
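As a sketch of Stage 3 in isolation, the snippet below runs one multi-head self-attention layer over an assumed token sequence ([CLS] plus 196 patch tokens) using PyTorch's nn.MultiheadAttention and returns one attention map per head; the dimensions and head count are assumptions, and a recent PyTorch version is assumed for the average_attn_weights argument.

```python
# Multi-head self-attention over a ViT-style token sequence (assumed sizes).
import torch
import torch.nn as nn

tokens = torch.randn(1, 197, 256)   # [CLS] + 196 patch tokens, dim 256 (assumed)
mhsa = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

# Each head learns its own attention pattern, i.e. a different "relationship type".
out, attn_weights = mhsa(tokens, tokens, tokens, average_attn_weights=False)
print(out.shape)           # torch.Size([1, 197, 256])
print(attn_weights.shape)  # torch.Size([1, 4, 197, 197]) -- one map per head
```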
Advantages of ViTs
Captures global relationships and context across the entire image.
Scales effectively, showing improved performance with larger datasets.
Achieves state-of-the-art results on many vision benchmarks.
Flexible architecture adaptable to various computer vision tasks.
Has less inductive bias, enabling more general learning.
Enhanced interpretability, as attention maps provide visual insights into feature importance.
Superior transfer learning capabilities, excelling at adapting to new tasks with limited data (see the fine-tuning sketch below).
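As one hedged example of that transfer-learning workflow, the sketch below loads torchvision's pretrained vit_b_16, replaces its classification head for a hypothetical 5-class task, and freezes the backbone so only the new head trains; it assumes a torchvision version (>= 0.13) that provides the ViT_B_16_Weights enum.

```python
# Fine-tuning sketch: pretrained ViT backbone + new head for a hypothetical 5-class task.
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pretrained ViT-B/16 backbone (downloads weights on first use).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for the new downstream task.
model.heads = nn.Linear(model.hidden_dim, 5)

# Freeze the backbone so only the new head is updated during fine-tuning.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")
```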
Challenges / Limitations
Requires large datasets: ViTs lack CNNs' inductive biases, so they need vast amounts of data to generalize.
High computational & memory costs: Self-attention scales quadratically with the number of input tokens (see the sketch after this list).
Limited inherent understanding of locality: Doesn't inherently capture fine-grained local spatial relationships.
Slower inference speed: Especially for real-time applications due to extensive self-attention.
Complex hyperparameter tuning: Requires careful tuning for optimal performance.
Limited interpretability: Full understanding of complex decision-making remains a challenge.
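To make the quadratic-cost point concrete, here is a tiny back-of-the-envelope calculation showing how the attention matrix grows with image resolution at a fixed 16x16 patch size; the image sizes are illustrative assumptions.

```python
# Each self-attention head materializes an N x N attention matrix,
# so the token count N drives memory and compute quadratically.
for side in (224, 448, 896):                 # assumed image sizes (pixels)
    n = (side // 16) ** 2                    # number of 16x16 patch tokens
    print(f"{side}px image -> {n:5d} tokens -> {n * n:>12,d} attention entries per head")
```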
Applications of ViTs
Image classification
Object detection (e.g., DETR)
Medical imaging and satellite imagery
Multi-modal tasks (image + text)
Video understanding
3D point cloud processing
Facial recognition and analysis