Colloquium
COC3800
Vision Transformers:
Principles, Challenges, and
Emerging Trends
Group No.: 2
Group Details
1. Mohammad Faiz Umar (22COB107 / GM1536)
2. Eizad Hamdan (22COB154 / GL5628)
3. Aamina Siddiqui (22COB186 / GL8004)
4. Ayra Riaz Khan (22COB675 / GL4004)
Vision Transformers:
Principles, Challenges, and Emerging Trends
Abstract
Vision Transformers (ViTs) have emerged as a groundbreaking technology in computer vision, revolutionizing how complex vision tasks are addressed by leveraging the self-attention mechanism. Tasks traditionally dominated by convolutional neural networks (CNNs) have benefited significantly from ViTs' ability to divide images into patches and process them as a sequence of tokens, inheriting the scalability and adaptability of the transformers used in natural language processing. We will explore the foundational principles that underpin ViTs, including their unique architecture, the role of the self-attention mechanism, and the use of positional embeddings to capture spatial relationships in images.
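To make the patch-and-position pipeline above concrete, here is a minimal PyTorch sketch. The hyperparameters (224x224 inputs, 16x16 patches, 768-dimensional embeddings, 12 attention heads) follow a common ViT-Base-style configuration but are illustrative assumptions rather than anything prescribed by this abstract:

    import torch
    import torch.nn as nn

    class PatchEmbedding(nn.Module):
        """Split an image into fixed-size patches and linearly embed each one."""
        def __init__(self, img_size=224, patch_size=16, in_chans=3, embed_dim=768):
            super().__init__()
            num_patches = (img_size // patch_size) ** 2
            # A strided convolution flattens each patch and applies a shared
            # linear projection in a single step.
            self.proj = nn.Conv2d(in_chans, embed_dim,
                                  kernel_size=patch_size, stride=patch_size)
            # Learnable positional embeddings restore the spatial order that
            # the flattened patch sequence would otherwise lose.
            self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, embed_dim))

        def forward(self, x):                     # x: (B, 3, 224, 224)
            x = self.proj(x)                      # (B, 768, 14, 14)
            x = x.flatten(2).transpose(1, 2)      # (B, 196, 768): one token per patch
            return x + self.pos_embed             # inject positional information

    tokens = PatchEmbedding()(torch.randn(1, 3, 224, 224))
    # Every patch token attends to every other patch token.
    attn = nn.MultiheadAttention(embed_dim=768, num_heads=12, batch_first=True)
    out, _ = attn(tokens, tokens, tokens)
    print(out.shape)                              # torch.Size([1, 196, 768])

A full ViT would additionally prepend a learnable classification token and stack many such attention blocks with MLPs and normalization; both are omitted here for brevity.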
We will review recent advances in ViT models, focusing on innovations in their design and training methodologies, including self-supervised learning, hybrid architectures, and hierarchical approaches. Furthermore, we will examine their applications across diverse domains such as image classification, object detection, and semantic segmentation, as well as their performance on widely used benchmark datasets. Despite their success, ViTs face challenges, including large training-data requirements and the computational cost of self-attention, which grows quadratically with the number of tokens. We will discuss how these limitations are addressed through techniques such as locality-enhancing mechanisms, efficient token processing, and integration with CNN-inspired features.
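As one concrete example of efficient token processing, a family of methods prunes the least informative patch tokens partway through the network so that later attention layers operate on shorter sequences. The sketch below is a simplified, hypothetical illustration of this idea; the function name and the use of attention scores from a classification token as the importance measure are our assumptions, not a specific published algorithm:

    import torch

    def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
        """Keep the patch tokens that the classification token attends to most.

        tokens:   (B, N, D) patch tokens
        cls_attn: (B, N) attention weight from the classification token to each patch
        """
        num_keep = max(1, int(tokens.shape[1] * keep_ratio))
        # Indices of the highest-scoring tokens for each image in the batch.
        idx = cls_attn.topk(num_keep, dim=1).indices               # (B, num_keep)
        idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])   # (B, num_keep, D)
        return tokens.gather(1, idx)

    x = torch.randn(2, 196, 768)           # 196 patch tokens per image
    scores = torch.rand(2, 196)            # toy importance scores
    print(prune_tokens(x, scores).shape)   # torch.Size([2, 98, 768])

Because the cost of self-attention is quadratic in sequence length, halving the token count in this way roughly quarters the cost of each subsequent attention layer.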
Lastly, we will delve into specific use cases, such as medical imaging and 3D object analysis, to
highlight ViTs’ practical impact and potential for further development. This comprehensive overview
aims to provide a deeper understanding of ViTs, their transformative capabilities, and their role in
shaping the future of computer vision research.