Attention Is All You Need
A New Simple Network Architecture for Sequence Transduction
Presentation for ML Researchers and Engineers
The Old World: Recurrent and Convolutional Models
Before the Transformer, Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs) were the standard for sequence tasks. However, they suffered from fundamental
limitations in training speed and long-range memory.
RNNs: Sequential Processing
Process sequences token by token, inheriting state from the previous step. Effective for
modeling short-range dependencies, but inherently sequential, so training cannot be parallelized across time steps and is slow.
The Vanishing Gradient Problem
Gradients shrink as they are backpropagated through many time steps, so RNNs struggle to maintain
information flow across long distances, making it difficult to capture long-range dependencies in text sequences.
CNNs: Fixed Receptive Field
While faster and more parallelizable, CNNs in NLP capture only local patterns
effectively. Modeling distant relationships requires deep, multi-layered stacks.
The core pain point: Slow training due to sequential processing and limited ability to model long-term context.
The Interim Solution: Gated Architectures
Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) were critical
innovations that attempted to solve the RNN's memory problems, setting the stage for
the next breakthrough.
LSTM: Gating Mechanisms
Introduced specialized 'gates' (Forget, Input, Output) to explicitly control the flow
of information into and out of the cell state, enabling the model to remember
long-term context better.
GRU: Simplified Gates
A more streamlined design than LSTM, combining the forget and input gates into
a single 'update gate' and merging the cell state and hidden state. Offers similar
performance with fewer parameters.
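To make the gating concrete, here is a minimal NumPy sketch of a single LSTM step as described above; the dictionary-based parameters W, U, b are an illustrative assumption, not the notation of any particular library.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W, U, b):
    """One LSTM time step. W, U, b hold the weights for the
    forget (f), input (i), output (o) gates and the candidate content (g)."""
    f = sigmoid(x_t @ W["f"] + h_prev @ U["f"] + b["f"])  # forget gate: what to erase from the cell state
    i = sigmoid(x_t @ W["i"] + h_prev @ U["i"] + b["i"])  # input gate: what new information to write
    o = sigmoid(x_t @ W["o"] + h_prev @ U["o"] + b["o"])  # output gate: what to expose as the hidden state
    g = np.tanh(x_t @ W["g"] + h_prev @ U["g"] + b["g"])  # candidate cell content
    c_t = f * c_prev + i * g          # gated update of the long-term cell state
    h_t = o * np.tanh(c_t)            # hidden state read out through the output gate
    return h_t, c_t
```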
While LSTMs and GRUs significantly improved long-term memory, they remained sequential models, still
bound by the parallelization bottleneck during training.
The Transformer Revolution
Attention Is All You Need
The 2017 paper introduced the Transformer, a radical architecture that dispenses entirely with recurrence and
convolutions, relying solely on attention mechanisms.
Fully Parallelizable
Since all sequence positions are processed simultaneously, training time is dramatically reduced.
Long-Range Context
Attention allows every word to directly interact with every other word, capturing dependencies regardless of distance.
Scalability
The architecture is highly scalable, easily serving as the foundation for modern large language models.
The Transformer Architecture Overview
The Transformer maintains the standard encoder-decoder structure but replaces sequential processing blocks with stacked self-attention and position-wise feed-forward layers.
The Encoder maps an input sequence of symbol representations to a sequence of continuous representations. The Decoder takes the encoder's output and generates the output sequence one symbol at a time (auto-regressively).
Key structural enhancements include residual connections around each sub-layer followed by layer normalization.
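As a rough illustration of that pattern, the sketch below shows the post-norm arrangement LayerNorm(x + Sublayer(x)); the learnable gain and bias of layer normalization are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each position's feature vector to zero mean, unit variance
    # (learnable gain and bias omitted in this sketch).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def residual_sublayer(x, sublayer):
    # Residual connection around the sub-layer, followed by layer normalization.
    return layer_norm(x + sublayer(x))
```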
Core Innovation: The Attention Function
Attention maps a query and a set of key-value pairs to an output. This output is computed as a weighted sum of the values, where
the weight assigned to each value is determined by the compatibility function of the query with the corresponding key.
1. Query (Q): The element currently being processed (e.g., the current word looking for context).
2. Key (K): A label or descriptor for all other elements in the sequence.
3. Value (V): The actual informational content of all other elements.
Attention(Q, K, V) = softmax(QK^T / √d_k) V
The Transformer uses Scaled Dot-Product Attention because it is efficient in practice: it can be implemented entirely with highly
optimized matrix multiplication.
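A minimal NumPy sketch of the formula above, assuming Q and K have shape (sequence length, d_k) and V has shape (sequence length, d_v):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                        # compatibility of every query with every key
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)         # row-wise softmax
    return weights @ V                                     # weighted sum of the values

# Toy usage: 4 tokens, d_k = d_v = 8
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(4, 8)) for _ in range(3))
out = scaled_dot_product_attention(Q, K, V)                # shape (4, 8)
```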
Key Architectural Components
Within the encoder and decoder, three unique layers define the Transformer's power.
Multi-Head Attention
Allows the model to jointly attend to information from different representation subspaces at different positions, significantly increasing representational power.
Position-wise FFN
A fully connected, feed-forward network applied independently and identically to each position, consisting of two linear transformations with a ReLU activation in between.
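The sketch below illustrates both components, under the assumption d_model = num_heads × d_k; the projection matrices are treated as given arrays rather than learned parameters.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, W_q, W_k, W_v, W_o, num_heads):
    """Project X into num_heads subspaces, attend in each, concatenate, project back."""
    seq_len, d_model = X.shape
    d_k = d_model // num_heads
    Q, K, V = X @ W_q, X @ W_k, X @ W_v                 # (seq_len, d_model) each
    heads = []
    for h in range(num_heads):
        s = slice(h * d_k, (h + 1) * d_k)               # this head's representation subspace
        scores = Q[:, s] @ K[:, s].T / np.sqrt(d_k)
        heads.append(softmax(scores) @ V[:, s])
    return np.concatenate(heads, axis=-1) @ W_o         # concatenate heads, final linear projection

def position_wise_ffn(X, W1, b1, W2, b2):
    # Two linear transformations with a ReLU in between, applied to each position independently.
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2
```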
Positional Encoding
Since the model contains no recurrence or convolution, this is crucial for
injecting information about the relative or absolute position of the tokens in
the sequence.
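The original paper uses fixed sinusoidal encodings added to the token embeddings; the sketch below reproduces that scheme, assuming an even d_model.

```python
import numpy as np

def sinusoidal_positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)), PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]             # (max_len, 1) token positions
    dim = np.arange(0, d_model, 2)[None, :]       # even embedding dimensions 0, 2, 4, ...
    angle = pos / np.power(10000.0, dim / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                   # sine on even dimensions
    pe[:, 1::2] = np.cos(angle)                   # cosine on odd dimensions
    return pe                                     # added to the token embeddings before the first layer
```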
The Authors of Innovation
The landmark paper was primarily authored by a team of Google researchers who introduced the simple yet revolutionary design that redefined the field of
sequence modeling.
Core innovators Ashish Vaswani and Noam Shazeer were key drivers, alongside contributing authors from Google Brain, Google Research, and the University of Toronto.
Groundbreaking Performance Benchmarks
The Transformer immediately set a new standard for performance and efficiency in machine translation tasks, proving the
superiority of the attention-only approach.
28.4 — English-to-German BLEU
Achieved a 28.4 BLEU score on WMT 2014, an improvement of over 2 BLEU points over the best previously published ensemble model.
41.0 — English-to-French BLEU
Established a new single-model state-of-the-art BLEU score of 41.0 on WMT 2014.
3.5X — Faster Training
The model required significantly less time to train, achieving a fraction of the training cost compared to competitive recurrent or convolutional models.
The speed and quality benefits proved that attention is indeed all you need.
The Modern ML Landscape: Transformer Ecosystem
The Transformer architecture became the foundational block for virtually all subsequent
breakthroughs in NLP, creating a rapid explosion in model capability and scale.
Transformer: Introduced self-attention and encoder-decoder blocks.
BERT: Encoder-only, bidirectional pretraining.
GPT-2: Decoder-only, large-scale generative pretraining.
T5: Unified text-to-text with task framing.
BERT (2018): Focus on the Encoder for deep, bi-directional understanding.
GPT (2018-present): Focus on the Decoder for powerful, auto-regressive generation.
T5 (2020): Treat all NLP problems as a Text-to-Text task using the full Encoder-Decoder
structure.