Self-Attention Mechanism in
Deep Learning
Definition, Architecture, Applications,
and How It Works
What is Self-Attention?
• Self-attention is a mechanism that allows a
model to weigh the importance of different
words in an input sequence relative to each
other.
• - Computes relationships between elements of
a sequence.
• - Produces context-aware representations: each element's output reflects the whole sequence.
• - Fundamental to Transformer models.
Architecture of Self-Attention
• Key components:
• - Input Embeddings
• - Query, Key, and Value Vectors (Q, K, V)
• - Scaled Dot-Product Attention
• - Output Weighted Sum
• Typically used in a multi-head configuration so that different heads capture
different types of relationships (see the projection sketch below).
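A minimal PyTorch sketch of these components (the shapes, variable names, and linear-projection setup are illustrative assumptions, not taken from the slides), showing how Q, K, and V are projected from the input embeddings and split into heads:

import torch
import torch.nn as nn

batch, seq_len, d_model, n_heads = 2, 5, 64, 8
d_k = d_model // n_heads                       # dimension per head

x = torch.randn(batch, seq_len, d_model)       # input embeddings (plus positional information)

W_q = nn.Linear(d_model, d_model, bias=False)  # learned projection for queries
W_k = nn.Linear(d_model, d_model, bias=False)  # learned projection for keys
W_v = nn.Linear(d_model, d_model, bias=False)  # learned projection for values

def split_heads(t):
    # Reshape so each head attends in its own d_k-dimensional subspace.
    return t.view(batch, seq_len, n_heads, d_k).transpose(1, 2)  # (batch, heads, seq, d_k)

Q, K, V = split_heads(W_q(x)), split_heads(W_k(x)), split_heads(W_v(x))
print(Q.shape)  # torch.Size([2, 8, 5, 8])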
How Does Self-Attention Work?
• 1. Generate Q, K, V vectors from input.
• 2. Compute the dot products of Q and K, scaled by √d_k, to obtain raw
attention scores.
• 3. Apply softmax to obtain attention weights.
• 4. Multiply the weights by V to produce the output as a weighted sum of the
value vectors.
• Formula: Attention(Q, K, V) = Softmax(QKᵀ /
√d_k) V
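A minimal PyTorch sketch of the formula above, following steps 2–4 directly (variable names and shapes are illustrative assumptions, not from the slides):

import math
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = Q.size(-1)
    scores = Q @ K.transpose(-2, -1) / math.sqrt(d_k)  # step 2: scaled dot-product scores
    weights = F.softmax(scores, dim=-1)                # step 3: attention weights
    return weights @ V, weights                        # step 4: weighted sum of values

Q = torch.randn(2, 8, 5, 8)   # (batch, heads, seq_len, d_k), e.g. from the projections above
K = torch.randn(2, 8, 5, 8)
V = torch.randn(2, 8, 5, 8)
output, attn = scaled_dot_product_attention(Q, K, V)
print(output.shape, attn.shape)  # torch.Size([2, 8, 5, 8]) torch.Size([2, 8, 5, 5])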
Applications of Self-Attention
• - Natural Language Processing (NLP): BERT,
GPT
• - Machine Translation and Summarization
• - Vision Transformers (ViT) in Computer Vision
• - Speech Recognition and Audio Processing
• - Protein Structure Prediction (e.g., AlphaFold)
Thank You
• Questions and Discussions Welcome!