Vision Transformers
Biplab Banerjee
GNR 650
Sequence modeling using RNN
Processing is sequential, so all past tokens have to be processed before the next token can be attended to
We can use CNN for the same purpose
Hierarchical convolution for managing the sequence
Temporal convolution
It can perform local feature aggregation within a short pre-defined window
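The local aggregation above can be sketched in NumPy; `temporal_conv` is a hypothetical helper, not a library function, and slides a fixed window over the sequence:

```python
import numpy as np

def temporal_conv(seq, kernel):
    """Aggregate local features over a sliding window (valid padding).

    seq:    (T, d) sequence of T token vectors
    kernel: (w, d) weights for a window of w time steps
    Returns (T - w + 1,): one scalar response per window position.
    """
    T, d = seq.shape
    w = kernel.shape[0]
    out = np.empty(T - w + 1)
    for t in range(T - w + 1):
        # each output only sees a short pre-defined window of the sequence
        out[t] = np.sum(seq[t:t + w] * kernel)
    return out

seq = np.arange(12, dtype=float).reshape(6, 2)   # toy sequence, T=6, d=2
kernel = np.ones((3, 2)) / 6.0                   # window of 3 steps, mean pooling
print(temporal_conv(seq, kernel))                # → [2.5 4.5 6.5 8.5]
```

Note that tokens more than a window apart never interact within one layer, which is exactly the long-range-dependency problem discussed next.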
What are the major problems and the way forward?
• Data are processed in sequence – less parallelism
• Long-range dependencies are not modelled properly – far-apart patches are treated as independent
• Tokens – groups of neurons representing an entity that encapsulates a group of information, e.g., an image patch
• How to tokenize the image data? What are the operations
performed on the tokens?
Some operations
• What are linear combinations?
• For a neuron
• For tokens
• What can be non-linearity for tokens?
Patch embedding
Ways for patch embedding
• Convolution followed by flattening
• Flattening the patch followed by fully connected layer
• Which one is logically correct?
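The two options compute the same linear map: a P×P convolution with stride P applied to a patch is exactly a dot product of the flattened patch with flattened kernels. A minimal NumPy sketch of the flatten-then-FC route (`patch_embed_flatten` is a hypothetical name):

```python
import numpy as np

def patch_embed_flatten(img, W):
    """Flatten each PxP patch, then apply a fully connected layer W.

    img: (H, W_img) grayscale image; W: (P*P, d) projection matrix.
    Equivalent to a conv with kernel size P and stride P followed by
    flattening, since both are the same linear map per patch.
    """
    P = int(np.sqrt(W.shape[0]))
    H, W_img = img.shape
    tokens = []
    for i in range(0, H, P):
        for j in range(0, W_img, P):
            patch = img[i:i + P, j:j + P].reshape(-1)  # flatten the patch
            tokens.append(patch @ W)                   # linear projection
    return np.stack(tokens)                            # (num_patches, d)

img = np.arange(16, dtype=float).reshape(4, 4)
W = np.eye(4)  # P = 2, d = 4: identity projection keeps the raw patch pixels
print(patch_embed_flatten(img, W))
```

With the identity projection, each row of the output is just one flattened 2×2 patch, which makes the patch ordering easy to inspect.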
Problems with the second approach
• Pixels C(i, j) and C(i+1, j) within each patch end up far apart in the flattened vector. This creates two distinct issues:
• Due to 16x16 kernels, or any other big ones, local pixel-level features and
relationships are disregarded.
• Due to stride 16, or any other stride equal to a kernel size, information
along patch borders is somewhat underrepresented.
Small patch sizes are good, but they produce many small patches; hence they are not space efficient
The eyes and the nose may end up split into separate patches, even though they can be considered individual semantic units
One solution – transformer in transformer
Bigger patch – sentence, Smaller patch - word
Position masking
• It is possible to mask some tokens for some specific tasks – masked
image modeling
[mask] token is learnable
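A minimal sketch of masked image modeling's input side: every masked patch slot is replaced by the same shared vector (a trained parameter in practice; random here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(9, 16))        # 9 patch embeddings, dim 16
mask_token = rng.normal(size=(16,))      # learnable [mask] vector in practice

masked_ids = [2, 5]                      # patches hidden from the encoder
tokens[masked_ids] = mask_token          # same learned vector at every masked slot
print(np.allclose(tokens[2], tokens[5])) # → True: both carry the [mask] token
```

The model is then trained to reconstruct the original content at the masked positions.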
How can we get global knowledge of the image?
• We have patch embeddings – but none of the patches refer to the
entire image
• Pooling of the patch embeddings?
• We can introduce a new learnable token to capture the global
image information – [CLS] token
Till now, permutation invariant
Since the patch embedding network shares parameters across all the patches, any ordering of the patches will have the same effect
Positional encoding
We want to make the positional vector learnable, hence directly learning 1-D representations with the same shape as the patch embeddings
High correlation is observed between the encoding vectors of nearby patches
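Injecting learned positions is just an addition of a parameter tensor with the same shape as the token sequence (random here as a stand-in for the trained parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d = 10, 16                        # e.g. 9 patches + [CLS]
tokens = rng.normal(size=(num_tokens, d))     # patch embeddings
pos_embed = rng.normal(size=(num_tokens, d))  # learnable, one vector per position

# Positions are injected by simple addition; this is what breaks the
# permutation invariance of the patch embedding.
tokens_with_pos = tokens + pos_embed
print(tokens_with_pos.shape)                  # → (10, 16)
```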
Ways for positional encoding
• Add the absolute position of the patch [0 – K]
• As the number of patches grows, K becomes large, causing large magnitudes in the position vector
• We can normalize the vector by dividing by K
• For two sequences with K1 and K2 tokens, the same position will have different values in the respective embeddings, thus causing confusion about the position
• We want the positional encoding function to be
• Bounded
• Continuous (why?)
• Should also handle the relative distance between two encodings (absolute vs relative
encodings)
Sinusoidal PE – a function that has these properties
Binary values would be a waste of space in the world of floats. So instead, we can use their continuous float counterparts – sinusoidal functions. Indeed, they are the equivalent of alternating bits. Moreover, by decreasing their frequencies, we can go from the red bits to the orange ones.
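A self-contained NumPy sketch of the sinusoidal encoding from "Attention Is All You Need" (`sinusoidal_pe` is a hypothetical name):

```python
import numpy as np

def sinusoidal_pe(num_pos, d):
    """Sinusoidal positional encoding.

    PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    Values are bounded in [-1, 1]; the frequency decreases with dimension i,
    like the columns of a continuous binary counter.
    """
    pos = np.arange(num_pos)[:, None]        # (num_pos, 1)
    i = np.arange(0, d, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d)  # (num_pos, d/2)
    pe = np.zeros((num_pos, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 8)
print(pe[0])   # position 0: all sin terms are 0, all cos terms are 1
```

Being bounded and continuous, this function avoids both problems of the absolute-index encoding discussed above.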
How does it ensure relative positioning?
• We chose this function because we hypothesized it would allow the model to easily learn to
attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear
function of PEpos.
Rotation matrix in 2D
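Concretely, for each frequency $\omega_i = 1/10000^{2i/d}$, the (sin, cos) pair at position $pos+k$ is a 2-D rotation of the pair at $pos$, with a rotation matrix that depends only on the offset $k$ (standard angle-addition identities):

```latex
\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix}
=
\begin{pmatrix}
\cos(\omega_i k) & \sin(\omega_i k) \\
-\sin(\omega_i k) & \cos(\omega_i k)
\end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
```

Stacking these rotations for all frequencies gives a block-diagonal matrix $M_k$ with $PE_{pos+k} = M_k \, PE_{pos}$, i.e., a linear function of $PE_{pos}$ as claimed above.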
More detailed view
Attention in CNN
Bottleneck attention module
Convolutional block attention module
Attention in captioning
Some qualitative results
Self Attention
• Query – What do I want to know? Q ∈ ℝ^(n_out × d_Q)
• Key – What do I use to determine the relevance? K ∈ ℝ^(n_in × d_K)
• Value – What do I want to summarize? V ∈ ℝ^(n_in × d_V)
Projection matrices: W_Q, W_K, W_V
Attention Is All You Need, Vaswani et al., 2017 – [1]
Research Trend, D1 Anditya Arifianto
Step by step: each Query is matched against every Key to compute a Relevance score; softmax turns the relevance scores into a Weighting; the weighted sum of the Values gives the Result.
The attention we considered
Lets make some changes to generalize
To take care of large magnitudes
Introducing queries
Introducing keys and values
The full picture
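The full picture can be sketched in a few lines of NumPy: scaled dot-product attention, softmax(QKᵀ/√d_K)V, matching the Query/Key/Value roles above (hypothetical helper names):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_K)) V.

    Q: (n_out, d_K), K: (n_in, d_K), V: (n_in, d_V).
    The sqrt(d_K) scaling keeps the logit magnitudes in check.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of each key to each query
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted summary of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)          # → (4, 8)
```

Each output row is a convex combination of the value rows, which is why attention can be read as a soft, content-based lookup.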
Self attention in ViT
Some insights
1. The process corresponds to finding a linear transformation where the context is used to associate words in sentences
2. "Apple" should be close to "orange" in one sentence, and close to "phone" in another sentence
Some insights
Multi-head self attention
Each head learns to check the similarity based on different feature properties
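A minimal NumPy sketch of multi-head self-attention: the model dimension is split into per-head slices, each head attends independently, and an output projection mixes the concatenated results (hypothetical helper names, single batch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).

    Each head sees a d_model/num_heads slice, so it can specialize on
    different feature properties; Wo mixes the concatenated head outputs.
    """
    n, d = X.shape
    dh = d // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(dh)  # per-head relevance
        outs.append(softmax(scores) @ V[:, sl])       # per-head summary
    return np.concatenate(outs, axis=1) @ Wo          # (n, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]
print(multi_head_attention(X, *W, num_heads=2).shape)  # → (5, 8)
```

Real implementations batch the heads with a single reshaped matrix multiply; the loop here is just for clarity.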
Some perspectives
CNN with self-attention module
Transformer decoder
A generic view of self attention
Self to cross attention
Masked self attention
• The transformer decoder is auto-regressive during testing but non-auto-regressive during training
• That is, during training the target tokens, e.g., for a machine translation task, are produced in parallel.
• But then, to produce a given token, all the future tokens would also be considered in the self-attention, which is logically wrong.
• How to take care of this while making the self-attention matrix in the decoder?
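The standard fix is a causal mask: add −∞ to the attention scores at future positions before the softmax, so those positions receive zero weight. A NumPy sketch:

```python
import numpy as np

def causal_mask(n):
    """Strictly upper-triangular -inf mask: token i may attend only to j <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((4, 4)) + causal_mask(4)  # uniform scores + mask
# row-wise softmax: -inf entries get weight 0, so future tokens are ignored
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights)
```

Row i spreads its weight uniformly over positions 0..i and puts exactly zero on positions after i, restoring the auto-regressive structure during parallel training.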
Batch-norm vs layer-norm
For data with different context sizes, batch-norm fails; layer-norm, which uses a single instance, is the better choice
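The contrast in one sketch: layer-norm computes statistics per token over the feature dimension, so it needs no batch- or length-dependent statistics at all (hypothetical helper name, without the usual learnable scale/shift):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension.

    Statistics come from one instance only, so variable-length sequences and
    batch size 1 work fine; batch-norm would instead need statistics pooled
    across the batch/time axis.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 0.0]])
y = layer_norm(x)
print(y.mean(axis=-1))   # ~0 for every token, independent of the other rows
```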
RESULTS
Attention Map Visualisation
Representative examples of attention from the output token to the input space
Results for Different Model Variations
Model Variation                                          | CIFAR10 Train (%) | CIFAR10 Test (%) | CIFAR100 Train (%) | CIFAR100 Test (%)
ViT (12 layers, patch size 8), image 32x32               | 64.3 | 57.2 | 62.1 | 37.6
ViT (12 layers, patch size 4), image 32x32               | 82.1 | 71.3 | 83.8 | 40.7
ViT (8 layers, patch size 4), image 32x32                | 80.2 | 71.9 | 83.5 | 43.8
Hybrid ViT (12 layers, patch size 7), image 224x224      | 90.9 | 80.0 | 96.3 | 54.6
ResNet34 (from scratch), image 224x224                   | 98.4 | 92.2 | 98.6 | 69.4
Pretrained ViT (12 layers, patch size 16), image 224x224 | 99.3 | 98.1 | 97.9 | 87.2
Inference from our Results
● Patch size in the Vision Transformer decides the length of the
sequence. Lower patch size leads to higher information exchange
during the self attention mechanism. This is verified by the better
results using lower patch-size 4 over 8 on a 32x32 image.
● Increasing the number of layers of the Vision Transformer should
ideally lead to better results but the results on the 8 Layer model
are marginally better than the 12 Layer model which can be
attributed to the small datasets used to train the models. Models
with higher complexity require more data to capture the image
features.
Inference from our Results
● As noted in the paper, Hybrid Vision Transformer performs better on
small datasets compared to ViT as the initial ResNet features are able to
capture the lower level features due to the locality property of
Convolutions which normal ViT is not able to capture with the limited
data available for training.
● ResNets trained from scratch are able to outperform both ViT and
Hybrid-ViT trained from scratch due to their inherent inductive biases of
locality and translation invariance. These biases cannot be learned by the
ViT on small datasets.
● Pretrained ViT performs much better than the other methods due to
being trained on huge datasets, and thus having learned better
representations than even ResNet, since it can access much farther
information right from the very beginning, unlike a CNN.
Train vs Test Accuracy Graphs (CIFAR10)
[Plots: train vs. test accuracy on CIFAR10 for ViT (12 layers, patch size 8), ViT (12 layers, patch size 4), Hybrid ViT (12 layers, patch size 7), and ResNet34]
Train vs Test Accuracy Graphs (CIFAR100)
[Plots: train vs. test accuracy on CIFAR100 for ViT (12 layers, patch size 8), ViT (12 layers, patch size 4), Hybrid ViT (12 layers, patch size 7), and ResNet34]
Transformers are invariant to permutation of the input tokens – positional encoding is what breaks this invariance