Vision Transformers
Biplab Banerjee
GNR 650
Sequence modeling using RNN
Processing is sequential, so all past tokens have to be processed before the next token can be attended to
We can use CNN for the same purpose
Hierarchical convolution for managing the sequence
Temporal convolution
It can perform local feature aggregation within a short pre-defined window
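The local aggregation above can be sketched in NumPy; `temporal_conv` is a hypothetical helper, not a library function, and slides a fixed window over the sequence:

```python
import numpy as np

def temporal_conv(seq, kernel):
    """Aggregate local features over a sliding window (valid padding).

    seq:    (T, d) sequence of T token vectors
    kernel: (w, d) weights for a window of w time steps
    Returns (T - w + 1,): one scalar response per window position.
    """
    T, d = seq.shape
    w = kernel.shape[0]
    out = np.empty(T - w + 1)
    for t in range(T - w + 1):
        # each output only sees a short pre-defined window of the sequence
        out[t] = np.sum(seq[t:t + w] * kernel)
    return out

seq = np.arange(12, dtype=float).reshape(6, 2)   # toy sequence, T=6, d=2
kernel = np.ones((3, 2)) / 6.0                   # window of 3 steps, mean pooling
print(temporal_conv(seq, kernel))                # → [2.5 4.5 6.5 8.5]
```

Note that tokens more than a window apart never interact within one layer, which is exactly the long-range-dependency problem discussed next.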
What are the major problems and the way forward?
• Data are processed in sequence – less parallelism
• Long-range dependencies are not modelled properly – far-apart patches are treated as independent
• Tokens – groups of neurons representing an entity that encapsulates a group of information, e.g., an image patch
• How to tokenize the image data? What are the operations
performed on the tokens?
Some operations
• What are linear combinations?
• For a neuron
• For tokens
• What can be non-linearity for tokens?
Patch embedding
Ways for patch embedding
• Convolution followed by flattening
• Flattening the patch followed by fully connected layer
• Which one is logically correct?
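The two options compute the same linear map: a P×P convolution with stride P applied to a patch is exactly a dot product of the flattened patch with flattened kernels. A minimal NumPy sketch of the flatten-then-FC route (`patch_embed_flatten` is a hypothetical name):

```python
import numpy as np

def patch_embed_flatten(img, W):
    """Flatten each PxP patch, then apply a fully connected layer W.

    img: (H, W_img) grayscale image; W: (P*P, d) projection matrix.
    Equivalent to a conv with kernel size P and stride P followed by
    flattening, since both are the same linear map per patch.
    """
    P = int(np.sqrt(W.shape[0]))
    H, W_img = img.shape
    tokens = []
    for i in range(0, H, P):
        for j in range(0, W_img, P):
            patch = img[i:i + P, j:j + P].reshape(-1)  # flatten the patch
            tokens.append(patch @ W)                   # linear projection
    return np.stack(tokens)                            # (num_patches, d)

img = np.arange(16, dtype=float).reshape(4, 4)
W = np.eye(4)  # P = 2, d = 4: identity projection keeps the raw patch pixels
print(patch_embed_flatten(img, W))
```

With the identity projection, each row of the output is just one flattened 2×2 patch, which makes the patch ordering easy to inspect.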
Problems with the second approach
• Pixels C(i, j) and C(i+1, j) within each patch end up far apart in the flattened vector. This creates two distinct issues:
• Due to 16x16 kernels, or any other big ones, local pixel-level features and
relationships are disregarded.
• Due to stride 16, or any other stride equal to a kernel size, information
along patch borders is somewhat underrepresented.
Small patch sizes are good, but they produce many small patches; hence they are not space efficient
The eyes and the nose may end up split into separate patches, even though they can be considered individual semantic units
One solution – transformer in transformer
Bigger patch – sentence, Smaller patch - word
Position masking
• It is possible to mask some tokens for some specific tasks – masked
image modeling
[mask] token is learnable
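A minimal sketch of masked image modeling's input side: every masked patch slot is replaced by the same shared vector (a trained parameter in practice; random here for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
tokens = rng.normal(size=(9, 16))        # 9 patch embeddings, dim 16
mask_token = rng.normal(size=(16,))      # learnable [mask] vector in practice

masked_ids = [2, 5]                      # patches hidden from the encoder
tokens[masked_ids] = mask_token          # same learned vector at every masked slot
print(np.allclose(tokens[2], tokens[5])) # → True: both carry the [mask] token
```

The model is then trained to reconstruct the original content at the masked positions.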
How can we get global knowledge of the image?
• We have patch embeddings – but none of the patches refer to the
entire image
• Pooling of the patch embeddings?
• We can introduce a new learnable token to capture the global
image information – [CLS] token
Till now, permutation invariant
Since the patch embedding network shares parameters across all the patches, any ordering of the patches will have the same effect
Positional encoding
We want to make the positional vector learnable, hence directly learning 1-D representations with the same shape as the patch embeddings
High correlation is observed between the encoding vectors of nearby patches
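Injecting learned positions is just an addition of a parameter tensor with the same shape as the token sequence (random here as a stand-in for the trained parameter):

```python
import numpy as np

rng = np.random.default_rng(0)
num_tokens, d = 10, 16                        # e.g. 9 patches + [CLS]
tokens = rng.normal(size=(num_tokens, d))     # patch embeddings
pos_embed = rng.normal(size=(num_tokens, d))  # learnable, one vector per position

# Positions are injected by simple addition; this is what breaks the
# permutation invariance of the patch embedding.
tokens_with_pos = tokens + pos_embed
print(tokens_with_pos.shape)                  # → (10, 16)
```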
Ways for positional encoding
• Add the absolute position of the patch [0 – K]
• As the number of patches grows, K becomes large, causing large magnitudes in the position vector
• We can normalize the vector by dividing by K
• For two sequences with K1 and K2 tokens, the same position will have different values in the respective embeddings, thus causing confusion about the position
• We want the positional encoding function to be
• Bounded
• Continuous (why?)
• Should also handle the relative distance between two encodings (absolute vs relative
encodings)
Sinusoidal PE – a function that has these properties
Binary values would be a waste of space in the world of floats. So instead, we can use their continuous float counterparts – sinusoidal functions. Indeed, they are the equivalent of alternating bits. Moreover, by decreasing their frequencies, we can go from the red bits to the orange ones.
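A self-contained NumPy sketch of the sinusoidal encoding from "Attention Is All You Need" (`sinusoidal_pe` is a hypothetical name):

```python
import numpy as np

def sinusoidal_pe(num_pos, d):
    """Sinusoidal positional encoding.

    PE[pos, 2i]   = sin(pos / 10000^(2i/d))
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d))
    Values are bounded in [-1, 1]; the frequency decreases with dimension i,
    like the columns of a continuous binary counter.
    """
    pos = np.arange(num_pos)[:, None]        # (num_pos, 1)
    i = np.arange(0, d, 2)[None, :]          # even dimension indices
    angles = pos / np.power(10000.0, i / d)  # (num_pos, d/2)
    pe = np.zeros((num_pos, d))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_pe(50, 8)
print(pe[0])   # position 0: all sin terms are 0, all cos terms are 1
```

Being bounded and continuous, this function avoids both problems of the absolute-index encoding discussed above.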
How does it ensure relative positioning?
• We chose this function because we hypothesized it would allow the model to easily learn to
attend by relative positions, since for any fixed offset k, PEpos+k can be represented as a linear
function of PEpos.
Rotation matrix in 2D
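Concretely, for each frequency $\omega_i = 1/10000^{2i/d}$, the (sin, cos) pair at position $pos+k$ is a 2-D rotation of the pair at $pos$, with a rotation matrix that depends only on the offset $k$ (standard angle-addition identities):

```latex
\begin{pmatrix} \sin(\omega_i (pos+k)) \\ \cos(\omega_i (pos+k)) \end{pmatrix}
=
\begin{pmatrix}
\cos(\omega_i k) & \sin(\omega_i k) \\
-\sin(\omega_i k) & \cos(\omega_i k)
\end{pmatrix}
\begin{pmatrix} \sin(\omega_i \, pos) \\ \cos(\omega_i \, pos) \end{pmatrix}
```

Stacking these rotations for all frequencies gives a block-diagonal matrix $M_k$ with $PE_{pos+k} = M_k \, PE_{pos}$, i.e., a linear function of $PE_{pos}$ as claimed above.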
More detailed view
Attention in CNN
Bottleneck attention module
Convolutional block attention module
Attention in captioning
Some qualitative results
Self Attention
• Query – What do I want to know? Q ∈ ℝ^(n_out × d_Q)
• Key – What do I use to determine the relevance? K ∈ ℝ^(n_in × d_K)
• Value – What do I want to summarize? V ∈ ℝ^(n_in × d_V)
Projection matrices: W_Q, W_K, W_V
Attention Is All You Need, Vaswani et al., 2017 – [1]
Research Trend, D1 Anditya Arifianto
Step by step: each Query is matched against every Key to compute a Relevance score; softmax turns the relevance scores into a Weighting; the weighted sum of the Values gives the Result.
The attention we considered
Lets make some changes to generalize
To take care of large magnitudes
Introducing queries
Introducing keys and values
The full picture
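The full picture can be sketched in a few lines of NumPy: scaled dot-product attention, softmax(QKᵀ/√d_K)V, matching the Query/Key/Value roles above (hypothetical helper names):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))  # numerically stable
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_K)) V.

    Q: (n_out, d_K), K: (n_in, d_K), V: (n_in, d_V).
    The sqrt(d_K) scaling keeps the logit magnitudes in check.
    """
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # relevance of each key to each query
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V                   # weighted summary of the values

rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(4, 8)), rng.normal(size=(6, 8)), rng.normal(size=(6, 8))
print(attention(Q, K, V).shape)          # → (4, 8)
```

Each output row is a convex combination of the value rows, which is why attention can be read as a soft, content-based lookup.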
Self attention in ViT
Some insights
1. The process corresponds to finding a linear transformation where the context is used to associate words in sentences
2. "Apple" should be close to "orange" in one sentence, and close to "phone" in another sentence
Some insights
Multi-head self attention
Each head learns to check the similarity based on different feature properties
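A minimal NumPy sketch of multi-head self-attention: the model dimension is split into per-head slices, each head attends independently, and an output projection mixes the concatenated results (hypothetical helper names, single batch):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(X, Wq, Wk, Wv, Wo, num_heads):
    """X: (n, d_model); Wq/Wk/Wv/Wo: (d_model, d_model).

    Each head sees a d_model/num_heads slice, so it can specialize on
    different feature properties; Wo mixes the concatenated head outputs.
    """
    n, d = X.shape
    dh = d // num_heads
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    outs = []
    for h in range(num_heads):
        sl = slice(h * dh, (h + 1) * dh)
        scores = Q[:, sl] @ K[:, sl].T / np.sqrt(dh)  # per-head relevance
        outs.append(softmax(scores) @ V[:, sl])       # per-head summary
    return np.concatenate(outs, axis=1) @ Wo          # (n, d_model)

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W = [rng.normal(size=(8, 8)) for _ in range(4)]
print(multi_head_attention(X, *W, num_heads=2).shape)  # → (5, 8)
```

Real implementations batch the heads with a single reshaped matrix multiply; the loop here is just for clarity.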
Some perspectives
CNN with self-attention module
Transformer decoder
A generic view of self attention
Self to cross attention
Masked self attention
• The transformer decoder is auto-regressive during testing but non-auto-regressive during training
• That is, during training the target tokens, e.g., for a machine translation task, are produced in parallel.
• But then, to produce a given token, all the future tokens would also be considered in the self-attention, which is logically wrong.
• How to take care of this while making the self-attention matrix in the decoder?
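The standard fix is a causal mask: add −∞ to the attention scores at future positions before the softmax, so those positions receive zero weight. A NumPy sketch:

```python
import numpy as np

def causal_mask(n):
    """Strictly upper-triangular -inf mask: token i may attend only to j <= i."""
    return np.triu(np.full((n, n), -np.inf), k=1)

scores = np.zeros((4, 4)) + causal_mask(4)  # uniform scores + mask
# row-wise softmax: -inf entries get weight 0, so future tokens are ignored
e = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights = e / e.sum(axis=-1, keepdims=True)
print(weights)
```

Row i spreads its weight uniformly over positions 0..i and puts exactly zero on positions after i, restoring the auto-regressive structure during parallel training.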
Batch-norm vs layer-norm
For data with different context sizes, batch-norm fails; layer-norm, which uses a single instance, is the better choice
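The contrast in one sketch: layer-norm computes statistics per token over the feature dimension, so it needs no batch- or length-dependent statistics at all (hypothetical helper name, without the usual learnable scale/shift):

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token over its feature dimension.

    Statistics come from one instance only, so variable-length sequences and
    batch size 1 work fine; batch-norm would instead need statistics pooled
    across the batch/time axis.
    """
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0],
              [10.0, 0.0, -10.0, 0.0]])
y = layer_norm(x)
print(y.mean(axis=-1))   # ~0 for every token, independent of the other rows
```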
RESULTS
Attention Map Visualisation
Representative examples of attention from the output token to the input space
Results for Different Model Variations
Model Variation                                          | CIFAR10 Train (%) | CIFAR10 Test (%) | CIFAR100 Train (%) | CIFAR100 Test (%)
ViT (12 layers, patch size 8), image 32x32               | 64.3 | 57.2 | 62.1 | 37.6
ViT (12 layers, patch size 4), image 32x32               | 82.1 | 71.3 | 83.8 | 40.7
ViT (8 layers, patch size 4), image 32x32                | 80.2 | 71.9 | 83.5 | 43.8
Hybrid ViT (12 layers, patch size 7), image 224x224      | 90.9 | 80.0 | 96.3 | 54.6
ResNet34 (from scratch), image 224x224                   | 98.4 | 92.2 | 98.6 | 69.4
Pretrained ViT (12 layers, patch size 16), image 224x224 | 99.3 | 98.1 | 97.9 | 87.2
Inference from our Results
● Patch size in the Vision Transformer decides the length of the
sequence. Lower patch size leads to higher information exchange
during the self attention mechanism. This is verified by the better
results using lower patch-size 4 over 8 on a 32x32 image.
● Increasing the number of layers of the Vision Transformer should
ideally lead to better results but the results on the 8 Layer model
are marginally better than the 12 Layer model which can be
attributed to the small datasets used to train the models. Models
with higher complexity require more data to capture the image
features.
Inference from our Results
● As noted in the paper, Hybrid Vision Transformer performs better on
small datasets compared to ViT as the initial ResNet features are able to
capture the lower level features due to the locality property of
Convolutions which normal ViT is not able to capture with the limited
data available for training.
● ResNets trained from scratch are able to outperform both ViT and
Hybrid-ViT trained from scratch due to their inherent inductive biases of
locality and translation invariance. These biases cannot be learned by the
ViT on small datasets.
● Pretrained ViT performs much better than the other methods due to
being trained on huge datasets, and thus having learned better
representations than even ResNet, since it can access much farther
information right from the very beginning, unlike a CNN.
Train vs Test Accuracy Graphs (CIFAR10)
[Plots: train vs. test accuracy on CIFAR10 for ViT (12 layers, patch size 8), ViT (12 layers, patch size 4), Hybrid ViT (12 layers, patch size 7), and ResNet34]
Train vs Test Accuracy Graphs (CIFAR100)
[Plots: train vs. test accuracy on CIFAR100 for ViT (12 layers, patch size 8), ViT (12 layers, patch size 4), Hybrid ViT (12 layers, patch size 7), and ResNet34]
Transformers are invariant to permutation of the input tokens – positional encoding is what breaks this invariance