Vision Transformers:
Revolutionizing
Computer Vision
Transformers revolutionized NLP, and now they're transforming Vision too.
RISHAV GOYAL
2023UGCS116
Background: From CNNs
to Transformers
CNNs long dominated computer vision for local pattern recognition, but
their limitations paved the way for Transformers.
Limitations of CNNs
Struggle with long-range dependencies.
High computational costs for complex tasks.
Inductive bias limits flexibility.
Poor global understanding.
Why Explore Transformers?
Capture global relationships via self-attention.
Minimal inductive bias for flexible learning.
Highly scalable for large data.
Proven success in NLP suggests strong potential for vision.
What is a Transformer?
(Quick Recap)
The core of a Transformer is its powerful self-attention mechanism,
allowing it to weigh different parts of the input sequence.
Its key innovation is learning explicit relationships between all input
"tokens" simultaneously, capturing global dependencies effectively.
In computer vision, this idea carries over to images by treating fixed-size
image patches as individual tokens.
Transformers process input sequences in parallel, unlike sequential
models, significantly enhancing training speed and scalability.
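As a quick illustration, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch; the projection matrices, token count, and dimensions are illustrative assumptions, not details from the slides.

```python
# Minimal single-head scaled dot-product self-attention (illustrative sketch).
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    # x: (batch, tokens, dim); every token attends to every other token.
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)  # (batch, tokens, tokens)
    weights = F.softmax(scores, dim=-1)   # attention weights over all tokens
    return weights @ v                    # weighted sum of value vectors

dim = 8
x = torch.randn(1, 4, dim)                # 4 tokens of dimension 8 (assumed sizes)
w_q, w_k, w_v = (torch.randn(dim, dim) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)  # torch.Size([1, 4, 8])
```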
Vision Transformer
(ViT): Core Idea
01 Image Patching
Image divided into small, fixed-size patches.
02 Flatten & Embed
Each patch flattened and linearly embedded.
03 Positional Encoding
Positional information added to patch embeddings.
04 Transformer Encoder
Embedded patches fed into the Transformer encoder.
05 Self-Attention
Encoder processes patches using self-attention.
06 Global Context
Learns global relationships across the image.
07 Classification
Final classification via a special class token.
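A hedged sketch of steps 01-03 follows: splitting an image into 16x16 patches, flattening and linearly embedding each one, and adding a positional encoding. The image size, patch size, and embedding dimension are assumptions chosen to mirror a common ViT-Base-style setup.

```python
# Hypothetical sketch of patching, flattening, embedding, and positional encoding.
import torch
import torch.nn as nn

img = torch.randn(1, 3, 224, 224)           # (batch, channels, height, width)
patch_size, embed_dim = 16, 768
num_patches = (224 // patch_size) ** 2      # 196 patches

# Step 01: unfold the image into 16x16 patches; Step 02 (part 1): flatten each patch.
patches = img.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(1, num_patches, -1)  # (1, 196, 768)

# Step 02 (part 2): linear embedding of each flattened patch.
embed = nn.Linear(3 * patch_size * patch_size, embed_dim)
tokens = embed(patches)                     # (1, 196, 768)

# Step 03: add learnable positional encodings (wrapped as a Parameter for training).
pos = nn.Parameter(torch.zeros(1, num_patches, embed_dim))
tokens = tokens + pos
print(tokens.shape)                         # torch.Size([1, 196, 768])
```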
Architecture of ViT
Patch embeddings: Image patches are mapped to embedding vectors via a linear
projection.
Positional encoding: Adds spatial information to patches, as order is lost
during flattening.
Transformer encoder blocks: Process embedded patches using multi-head self-attention and FFNs.
Class token & classification head: A learnable token gathers global
information, feeding into the final classifier.
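To make these pieces concrete, here is a minimal, hypothetical ViT-style classifier assembled from PyTorch building blocks: a strided convolution for patch embedding, a learnable [CLS] token and positional embedding, nn.TransformerEncoder for the encoder blocks, and a linear classification head. All hyperparameters are illustrative assumptions, not the original ViT configuration.

```python
# A toy ViT-style classifier sketch (assumed hyperparameters, not the original ViT).
import torch
import torch.nn as nn

class TinyViT(nn.Module):
    def __init__(self, img_size=224, patch=16, dim=256, depth=4, heads=4, num_classes=10):
        super().__init__()
        num_patches = (img_size // patch) ** 2
        # Patch embedding as a strided convolution (equivalent to flatten + linear projection).
        self.patch_embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))           # learnable class token
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)   # MHSA + FFN blocks
        self.head = nn.Linear(dim, num_classes)                         # classification head

    def forward(self, x):
        tokens = self.patch_embed(x).flatten(2).transpose(1, 2)         # (B, N, dim)
        cls = self.cls_token.expand(x.shape[0], -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embed       # prepend [CLS], add positions
        tokens = self.encoder(tokens)
        return self.head(tokens[:, 0])                                  # classify from the [CLS] token

model = TinyViT()
logits = model(torch.randn(2, 3, 224, 224))
print(logits.shape)  # torch.Size([2, 10])
```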
Workflow of Vision
Transformers (Detailed)
Stage 1: Input image split into N patches
Stage 2: Each patch → linear embedding → add positional encoding
Stage 3: Transformer encoder processes tokens with self-attention
Stage 4: [CLS] token aggregates info → classifier predicts label
Positional encoding preserves spatial information
Multi-head attention allows for different relationship types
Feed-forward networks process features after attention
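As a sketch of Stage 3 in isolation, the snippet below runs one multi-head self-attention layer over an assumed token sequence ([CLS] plus 196 patch tokens) using PyTorch's nn.MultiheadAttention and returns one attention map per head; the dimensions and head count are assumptions, and a recent PyTorch version is assumed for the average_attn_weights argument.

```python
# Multi-head self-attention over a ViT-style token sequence (assumed sizes).
import torch
import torch.nn as nn

tokens = torch.randn(1, 197, 256)   # [CLS] + 196 patch tokens, dim 256 (assumed)
mhsa = nn.MultiheadAttention(embed_dim=256, num_heads=4, batch_first=True)

# Each head learns its own attention pattern, i.e. a different "relationship type".
out, attn_weights = mhsa(tokens, tokens, tokens, average_attn_weights=False)
print(out.shape)           # torch.Size([1, 197, 256])
print(attn_weights.shape)  # torch.Size([1, 4, 197, 197]) -- one map per head
```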
Advantages of ViTs
Captures global relationships and context across the entire image.
Scales effectively, showing improved performance with larger datasets.
Achieves state-of-the-art results on many vision benchmarks.
Flexible architecture adaptable to various computer vision tasks.
Has less inductive bias, enabling more general learning.
Enhanced interpretability, as attention maps provide visual insights into feature importance.
Superior transfer learning capabilities, excelling at adapting to new tasks with limited data (see the fine-tuning sketch below).
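As one hedged example of that transfer-learning workflow, the sketch below loads torchvision's pretrained vit_b_16, replaces its classification head for a hypothetical 5-class task, and freezes the backbone so only the new head trains; it assumes a torchvision version (>= 0.13) that provides the ViT_B_16_Weights enum.

```python
# Fine-tuning sketch: pretrained ViT backbone + new head for a hypothetical 5-class task.
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

# Load an ImageNet-pretrained ViT-B/16 backbone (downloads weights on first use).
model = vit_b_16(weights=ViT_B_16_Weights.IMAGENET1K_V1)

# Replace the classification head for the new downstream task.
model.heads = nn.Linear(model.hidden_dim, 5)

# Freeze the backbone so only the new head is updated during fine-tuning.
for name, param in model.named_parameters():
    param.requires_grad = name.startswith("heads")
```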
Challenges / Limitations
Requires large datasets: ViTs lack CNNs' inductive biases, so they need vast amounts of data to generalize.
High computational & memory costs: Self-attention scales quadratically with the number of input tokens (see the sketch after this list).
Limited inherent understanding of locality: Doesn't inherently capture fine-grained local spatial relationships.
Slower inference speed: Especially for real-time applications due to extensive self-attention.
Complex hyperparameter tuning: Requires careful tuning for optimal performance.
Limited interpretability: Full understanding of complex decision-making remains a challenge.
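To make the quadratic-cost point concrete, here is a tiny back-of-the-envelope calculation showing how the attention matrix grows with image resolution at a fixed 16x16 patch size; the image sizes are illustrative assumptions.

```python
# Each self-attention head materializes an N x N attention matrix,
# so the token count N drives memory and compute quadratically.
for side in (224, 448, 896):                 # assumed image sizes (pixels)
    n = (side // 16) ** 2                    # number of 16x16 patch tokens
    print(f"{side}px image -> {n:5d} tokens -> {n * n:>12,d} attention entries per head")
```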
Applications of ViTs
Image classification
Object detection (e.g., DETR)
Medical imaging and satellite imagery
Multi-modal tasks (image + text)
Video understanding
3D point cloud processing
Facial recognition and analysis