ICS 603: Advanced Machine Learning
Lecture 9&10
Foundations of Text Representation, Basic Language
Models and Transformers
Dr. Caroline Sabty
Faculty of Informatics and Computer Science
German International University in Cairo
Acknowledgment
The course and the slides are based on the slides of UC Berkeley by Dr.
Sergey Karayev, Dr. Josh Tobin and Dr. Pieter Abbeel
Word Embeddings
What Does a Word Mean?
• Definition of meaning in dictionary:
• The idea that is represented by a word, phrase, etc.
• The idea that a person wants to express by using words,
signs, etc.
• The idea that is expressed in a work of writing, art, etc.
• Words, Lemmas, Senses, Definition
How to Represent Meaning in a Computer?
• Created resources for lexical semantics e.g., WordNet
• An online lexical database; you could call it an
"electronic dictionary"
• Covers most English nouns, verbs, adjectives, adverbs
• It has hypernyms (is-a) relationships and synonym sets
• Free to download
What Are Some of the Disadvantages?
• Not available for all languages
• Missing new words
• Require human labor to create and adapt
• Hard to compute accurate word similarity (e.g., how similar are
car and bicycle, or cow and horse?)
Introduction to Word Embeddings
● Understanding Word Meaning:
○ Words derive meaning from the contexts in which they appear, encapsulating
the principle that
■ "Words that occur in the same contexts tend to have similar meanings." (Zellig Harris, 1954)
■ "You shall know a word by the company it keeps." (J.R. Firth, 1957)
Types of Vector Representations
● Purpose of Vectors in NLP: Vectors are fundamental for transforming textual data
into numerical form that can be processed by machine learning algorithms.
● Sparse vs. Dense Vectors:
● Sparse Vectors: Typically high-dimensional, with most elements being zero.
They are easy to compute but inefficient in capturing semantic depth.
● Dense Vectors: Low-dimensional but densely populated with non-zero values.
They are computationally intensive but capture rich semantic information.
Sparse Vectors - Basic Techniques
● One-Hot Encoding:
● Represents each word in the vocabulary as a vector where one element is '1',
and all other elements are '0'. Each vector is as long as the vocabulary.
● Limitations:
i. Does not capture any semantic information; every word is equidistant
from every other word, making it unsuitable for most NLP tasks beyond
simple retrieval.
ii. Scales poorly with vocabulary size
iii. Very high-dimensional sparse vectors -> NN operations work poorly
iv. Violates what we know about word similarity (e.g. "run" is as far away
from "running" as from "tertiary," or "poetry")
One-hot Encoding Example
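As a minimal sketch of the idea (the helper name `one_hot` and the toy vocabulary are my own, not from the slides):

```python
import numpy as np

def one_hot(word, vocab):
    """Return a one-hot vector for `word` over the (ordered) vocabulary."""
    vec = np.zeros(len(vocab))
    vec[vocab.index(word)] = 1.0
    return vec

vocab = ["cat", "dog", "run", "running"]
v_run = one_hot("run", vocab)
v_running = one_hot("running", vocab)

# Every pair of distinct words is equally far apart (dot product 0),
# so "run" is no closer to "running" than to "cat".
print(v_run @ v_running)  # 0.0
```

Note how the vector length equals the vocabulary size, which is exactly the scaling problem mentioned above.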
Sparse Vectors - Advanced Techniques
● Co-occurrence Vectors:
● Measures how often each word appears within a certain distance of every other
word in a large text corpus.
● Advantages: Begins to account for the context by noting which words appear
near each other, offering more insight than one-hot encoding.
● TF-IDF (Term Frequency-Inverse Document Frequency):
● Adjusts the frequency of words by how often they appear across all documents,
reducing the weight of commonly used words across the corpus.
● Advantages: Highlights words that are distinctive to particular documents,
which is especially useful in search engines and information retrieval.
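A from-scratch sketch of the TF-IDF idea (an unsmoothed variant for illustration only; library implementations such as scikit-learn's use smoothed formulas):

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF-IDF weights for each token in each document.
    `docs` is a list of token lists."""
    n = len(docs)
    df = Counter()                      # document frequency per term
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({t: (tf[t] / len(doc)) * math.log(n / df[t]) for t in tf})
    return weights

docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["the", "cat", "ran"]]
w = tf_idf(docs)
# "the" occurs in every document, so its weight is zero everywhere;
# "dog" is distinctive to document 1 and gets a positive weight there.
```

This shows the key property claimed above: words common across the whole corpus are down-weighted, while document-distinctive words are highlighted.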
Embedding Matrix
[Figure: a V×V one-hot matrix multiplied by a V×E embedding matrix yields a V×E matrix of word vectors]
• Problem: how do we find the values of the embedding matrix?
Dense Vectors - Classical Word Embeddings
● Dense vectors are low-dimensional and densely populated with non-zero values.
Unlike sparse vectors, they are capable of capturing complex patterns and
semantic relationships between words.
● Efficiency and Semantic Capture: Dense vectors, while computationally more
intensive than sparse vectors, efficiently represent semantic meanings in a
compact form, facilitating faster and more effective machine learning processes.
● Classical Models like Word2Vec and GloVe
Word2Vec (2013)
"king"
[Link]
Full Stack Deep Learning - UC Berkeley
Spring 2021
● Word2Vec (Google, 2013):
■ Utilizes two architectures: Continuous Bag of Words (CBOW) and Skip-Gram:
● CBOW predicts a target word from a window of surrounding context
words.
● Skip-Gram does the opposite, predicting context words from a target
word.
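The Skip-Gram training data can be sketched as follows (the helper name `skipgram_pairs` is my own; CBOW would instead group all context words per target):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (target, context) training pairs as in Skip-Gram."""
    pairs = []
    for i, target in enumerate(tokens):
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:                      # skip the target itself
                pairs.append((target, tokens[j]))
    return pairs

tokens = "the quick brown fox".split()
print(skipgram_pairs(tokens, window=1))
# e.g. ('quick', 'the') and ('quick', 'brown') both appear as samples
```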
Representing Words by their Context
• When a word w appears in a text, its context is the set of words that
appear nearby (within a fixed-size window)
• Use the many contexts of w to build up a representation of w
Using Vector Math
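The classic analogy king − man + woman ≈ queen can be illustrated with cosine similarity. The helper `nearest` and the hand-picked 2-d vectors below are invented for illustration; real Word2Vec embeddings are learned and have hundreds of dimensions:

```python
import numpy as np

def nearest(vec, vocab_vecs, exclude=()):
    """Return the word whose embedding has the highest cosine similarity to `vec`."""
    best, best_sim = None, -1.0
    for word, v in vocab_vecs.items():
        if word in exclude:
            continue
        sim = v @ vec / (np.linalg.norm(v) * np.linalg.norm(vec))
        if sim > best_sim:
            best, best_sim = word, sim
    return best

# Toy embeddings chosen by hand to mimic the famous analogy.
E = {"king": np.array([1.0, 1.0]), "man": np.array([1.0, 0.0]),
     "woman": np.array([0.9, 0.1]), "queen": np.array([0.9, 1.1])}
print(nearest(E["king"] - E["man"] + E["woman"], E,
              exclude={"king", "man", "woman"}))  # queen
```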
Beyond Embeddings
• Word2Vec and GloVe (another type) embeddings became popular in ~2013-14
• Boosted accuracy on many tasks by low single-digit %
• Enhanced performance in tasks like text classification, sentiment analysis, and machine
translation by embedding words into a continuous vector space where semantically
similar words are mapped to nearby points.
• Some Disadvantages:
• Inability to handle unknown or out-of-vocabulary (OOV) words
• Word2vec and GloVe are classic word embedding techniques (static word
embeddings): the same word will always have the same representation regardless of
the context where it occurs
Dense Vectors - Contextual Word Embeddings
● Contextual word embeddings are a type of dense vector that dynamically encode words
based on the context in which they appear. Each usage of a word can have a different
representation, capturing nuances like polysemy and syntactic variation.
● Why Context Matters: Traditional word embeddings provide a single representation per
word, which fails to capture the variations in meaning that arise from different contexts.
● Contextual embeddings address this limitation by generating word representations that
adapt according to their textual surroundings.
● Examples: ELMo and BERT
Language Models
From Embeddings to Language Models
● Word embeddings convert words into numerical vectors.
● Language models use these vectors to understand and predict language sequences.
● By inputting these embeddings, language models can process and generate human-like text,
predicting what word comes next in a sentence.
● Definition of Language Models: Language models are statistical or neural network-based
tools that predict the next word in a sequence by learning the probabilities of word
sequences. They are essential for applications like text generation, speech recognition, and
machine translation.
● Purpose in NLP: Language models form the backbone of many NLP tasks. They not only
predict text but also help in understanding the context and generating language that is
syntactically and semantically correct.
Solution 2: Learn a Language Model
• "Pre-train" for your NLP task by learning a really good word embedding!
• How to learn a really good embedding? Train for a very general task on
a large corpus of text.
Language Model Training
• Words get their embeddings by looking at which other words they tend to
appear next to:
■ We get a lot of text data (e.g., all Wikipedia articles)
■ We have a window (e.g., three words) that we slide against all of that text
■ The sliding window generates training samples for our model
N-Grams: The Foundation of Language Models
● An n-gram model predicts the probability of a word based on the previous n−1 words. For
instance, a bigram model (an n-gram where n=2) predicts the next word based only on the
immediately preceding word.
● Limitations:
● Data Sparsity: The probability estimates of n-grams heavily rely on the frequency of
occurrences in the training dataset. Rare combinations are poorly represented, leading
to unreliable predictions.
● Context Limitation: The context considered by n-grams is fixed to n−1 words, which can
ignore important linguistic context outside this window.
● Storage and Scalability: The storage requirement grows exponentially with the size of n,
making large n-grams computationally expensive and less feasible for large datasets.
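The bigram case can be sketched with simple maximum-likelihood counts (helper name `train_bigram` is my own):

```python
from collections import Counter, defaultdict

def train_bigram(tokens):
    """Estimate P(next | prev) by maximum likelihood from bigram counts."""
    counts = defaultdict(Counter)
    for prev, nxt in zip(tokens, tokens[1:]):
        counts[prev][nxt] += 1
    return {prev: {w: c / sum(nxt.values()) for w, c in nxt.items()}
            for prev, nxt in counts.items()}

tokens = "the cat sat on the mat".split()
model = train_bigram(tokens)
print(model["the"])  # {'cat': 0.5, 'mat': 0.5}
```

The data-sparsity limitation is visible here: any bigram not seen in training gets probability zero.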
N-Grams
• Slide an N-sized window through the text, forming a dataset of
predicting the last word.
Skip-grams
• Look on both sides of the target word, and form multiple samples from each N-gram
Learn a Language Model
Speed Up Training
• Binary classification instead of multi-class prediction of the next word (as in negative sampling): faster training
Applying RNNs to Language Modeling
• Remember that RNNs handle sequential data, maintaining hidden states that capture information
from previous inputs.
• This feature makes them suitable for predicting elements in sequences, like words in sentences.
• RNNs in Language Modeling:
○ Model Architecture: an RNN is used to predict the next word in a sequence. Each
input word (as an embedding) updates the hidden state, and the output is a
probability distribution over the vocabulary for the next word.
○ Process Description: at each timestep, the RNN reads a word, updates its state,
and outputs a prediction. The state carries forward to influence the prediction at
the next step, allowing the network to consider all previous context implicitly.
• Advantages Over Static Models: unlike n-grams, RNNs do not require a predefined
context window and can theoretically capture long-range dependencies.
• Example: Text Generation: RNN trained on large text corpora can generate coherent new text
sequences.
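The loop described above can be sketched as a single vanilla-RNN forward pass in numpy (parameter names like `W_xh` are my own convention; no training loop is shown):

```python
import numpy as np

rng = np.random.default_rng(0)
V, E, H = 10, 8, 16   # vocab size, embedding dim, hidden dim (toy values)

Emb = rng.normal(size=(V, E))
W_xh = rng.normal(size=(E, H)); W_hh = rng.normal(size=(H, H))
W_hy = rng.normal(size=(H, V)); b_h = np.zeros(H); b_y = np.zeros(V)

def step(token_id, h):
    """One RNN timestep: read a word, update the hidden state, predict the next word."""
    h = np.tanh(Emb[token_id] @ W_xh + h @ W_hh + b_h)
    logits = h @ W_hy + b_y
    probs = np.exp(logits - logits.max()); probs /= probs.sum()
    return h, probs          # distribution over the vocabulary

h = np.zeros(H)
for tok in [1, 4, 2]:        # a toy input sequence of token ids
    h, p = step(tok, h)
# p is the model's distribution over the next token given all previous context
```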
Embeddings from Language Model (ELMo 2018)
• ELMo uses a deep bidirectional LSTM to generate dynamic word vectors.
• Unlike traditional embeddings, which assign a fixed vector to each word, ELMo
analyzes the entire sentence to produce context-dependent word meanings
• ELMo as a Language Model:
○ Bidirectional Language Modeling: Trains on predicting next words from previous context in
both directions.
• ELMo as an Embedding Technique:
○ Dynamic Embeddings: Computes embeddings on the fly, tailored to word usage in specific
textual contexts.
○ Example: Different embeddings for "bank" in "river bank" vs. "bank account."
○ Layer Combination: Integrates outputs from multiple BiLSTM layers.
○ Task-Specific Tuning: Weights layers differently depending on the task to optimize
performance, focusing on relevant linguistic features (syntax, semantics).
Embeddings from Language Model (ELMo 2018)
• Learns contextualized word representations based on a neural language model with a
character-based encoding layer and two BiLSTM layers.
State-of-the-art Performance on Well-known Tasks
Transformers
Introduction to Transformers
● Limitations of Prior Models:
○ Recap Limitations of RNNs and LSTMs: such as the difficulty in parallelizing the
computations and the challenges in handling very long-range dependencies.
○ Contextual Embeddings: While ELMo introduced dynamic, context-sensitive
embeddings, it still relies on sequential processing, which can be computationally
intensive and slow for longer texts.
Rise of Transformers
The Self-Attention Mechanism
• Self-attention, a key innovation in Transformers, allows each word in a sentence to process
information from every other word in the sentence simultaneously.
• Basic attention mechanisms allow models to focus on different parts of the input
sequence when performing a task, mimicking how humans pay attention to relevant parts
of what they see or hear to make decisions.
• Attention improves model performance by dynamically selecting a subset of the available
information based on what is most relevant to the current context or task.
Basic Self-attention
• Input: a sequence of vectors x_1, ..., x_n
• Output: a sequence of vectors, each one a weighted sum of the input sequence:
y_i = Σ_j w_ij x_j
where j indexes over the whole sequence and the weights sum to one over all j.
Note that x_i is the input vector at the same position as the current output vector y_i.
• w_ij is not a learned weight, but a function of x_i and x_j: the raw score is the
dot product, w'_ij = x_i^T x_j.
• The dot product gives us a value anywhere between negative and positive infinity,
so we apply a softmax, w_ij = exp(w'_ij) / Σ_j exp(w'_ij), to map the values to [0,1]
so they sum to 1 over j (normalization of scores to probabilities).
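This parameter-free form of self-attention is a few lines of numpy (the function name is my own):

```python
import numpy as np

def basic_self_attention(X):
    """w_ij = softmax_j(x_i . x_j); outputs y_i = sum_j w_ij x_j.
    No learned parameters; permuting the inputs just permutes the outputs."""
    scores = X @ X.T                               # raw dot products, any real value
    scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
    W = np.exp(scores)
    W /= W.sum(axis=1, keepdims=True)              # each row sums to 1 over j
    return W @ X

X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # toy "word" vectors
Y = basic_self_attention(X)
```

Reversing the input order simply reverses the output order, which demonstrates the point made on the next slide: the sequence order does not affect the computation.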
Basic Self-attention Illustration
The Cat is yawning
Basic Self-attention
• SO FAR:
• No learned weights
• Order of the sequence does not affect the result of computations
• Let's learn some weights!
Advanced Attention in Transformers: Query, Key, Value
• Every input vector x_i is used in 3 ways:
• Query: Compared to every other vector to
compute attention weights for its own
output y_i (Represents the element for
which we are trying to compute attention.)
• Key: Compared to every other vector to
compute attention weight w_ij for output
y_j (Represents the elements that we
compare against to determine the amount
of attention.)
• Value: Summed with other vectors to form
the result of the attention weighted sum
Transformer Attention
The attention module has three inputs: keys K, values V, and queries Q.
• Computes the dot product of Q and K to derive raw attention scores, indicating
focus levels across the input sequence.
• The raw scores are scaled down by the square root of the dimension of the key
vectors, d_k.
• Scaling stabilizes training gradients by preventing large values from flattening
the softmax response.
• The scaled dot-product attention output matrix:
A(Q, K, V) = softmax(QK^T / √d_k) V
• Multi-head attention is applied multiple times in parallel on linearly projected
versions of V, K, Q.
[Figure: Top: Scaled Dot-Product attention. Bottom: Multi-Head attention [VAS2017]]
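Scaled dot-product attention is a direct transcription of the formula (single head, no batching, for illustration):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """A(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V"""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # scaling keeps the softmax from saturating
    return softmax(scores) @ V

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8)); K = rng.normal(size=(4, 8)); V = rng.normal(size=(4, 8))
out = scaled_dot_product_attention(Q, K, V)   # shape (4, 8)
```

Each output row is a convex combination of the rows of V, weighted by how well the corresponding query matches each key.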
Query, Key, Value
• We can process each input vector to fulfill the three roles with matrix
multiplication
• Learning the matrices --> learning attention
Multi-head attention
• Multiple "heads" of attention just means learning different sets of
W_q, W_k, and W_v matrices simultaneously.
• In practice, implemented as just a single matrix multiplication per projection
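A sketch of this trick: one big projection matrix per role, reshaped into per-head subspaces (toy sizes; the final output projection W_o of the original paper is omitted here for brevity):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_heads = 16, 4
d_head = d_model // n_heads

# One d_model x d_model matrix per role instead of n_heads separate small ones.
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

def multi_head(X):
    n = X.shape[0]
    # project once, then view as (heads, seq, d_head)
    Q = (X @ W_q).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    K = (X @ W_k).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    V = (X @ W_v).reshape(n, n_heads, d_head).transpose(1, 0, 2)
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d_head)
    scores = scores - scores.max(axis=-1, keepdims=True)
    A = np.exp(scores); A /= A.sum(axis=-1, keepdims=True)   # softmax per head
    heads = A @ V                                            # (heads, seq, d_head)
    return heads.transpose(1, 0, 2).reshape(n, d_model)      # concatenate heads

X = rng.normal(size=(5, d_model))
print(multi_head(X).shape)  # (5, 16)
```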
Attention is all you need (2017)
• Encoder-decoder with only attention and
fully-connected layers (no recurrence or
convolutions)
• When proposed, it set new State-of-the-Art
(SOTA) on translation datasets.
• In the translation task the job of the Encoder is
to create an attention map for the sentences in
the source language and the job of the Decoder
is to use that attention map for translating the
source-language sentence into a target-language
sentence.
Transformer Encoder
• For simplicity, can focus just on the Encoder
• E.g. BERT is just the encoder
• The encoder in a Transformer processes the input
data by converting the entire sequence—like a
sentence or series of events—into a set of vectors.
Each vector represents a segment of the input,
enriched with contextual information from the
entire sequence.
Transformer Encoder
• The components:
• (Masked) Self-attention
• Positional encoding
• Layer normalization
Transformer Encoder
The encoder processes the input sequence using two sub-layers [VAS2017]:
• Multi-head self-attention mechanism: it attends to different parts of the
sequence in parallel, inferring meaning and context.
• Position-wise fully connected feed-forward network: two linear transformations
with a ReLU activation in between, applied to each position independently.
[Figure: Transformer encoder [VAS2017]]
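The position-wise feed-forward sub-layer can be sketched like this (toy sizes; the original paper uses d_model = 512 and d_ff = 2048):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff = 16, 64

W1 = rng.normal(size=(d_model, d_ff)); b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d_ff, d_model)); b2 = np.zeros(d_model)

def ffn(X):
    """Two linear maps with a ReLU in between, applied to every
    position independently (the same weights at each position)."""
    return np.maximum(0.0, X @ W1 + b1) @ W2 + b2

X = rng.normal(size=(5, d_model))   # 5 positions
out = ffn(X)
# Feeding a single position gives the same row: positions don't interact here.
```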
Transformer Decoder
• The decoder generates output from encoded
data, like translating a sentence into another
language. It uses the encoded vectors and its
previous outputs to produce each new element
of the sequence, ensuring the output is
coherent and contextually relevant.
• The decoder has an extra multi-head cross-attention sub-layer between the
two sub-layers of the encoder layer.
• It outputs the probability of each vocabulary token.
• Its key-value pairs K, V are obtained from the encoder output.
[Figure: Transformer decoder]
Back to the Architecture of the Transformer
• Self-attention layer -> Layer normalization -> Dense layer
Layer Normalization
• Neural net layers work best when
input vectors have uniform mean
and std in each dimension
• As inputs flow through the
network, means and std's get
blown out
• Layer Normalization is a hack to
reset things to where we want them
in between layers
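The "reset" is just a per-vector standardization followed by a learnable scale and shift (shown here with scalar gamma/beta for simplicity):

```python
import numpy as np

def layer_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize each vector to zero mean and unit std across its features,
    then apply a learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return gamma * (x - mean) / (std + eps) + beta

x = np.array([[1.0, 2.0, 3.0, 4.0]])
y = layer_norm(x)
print(y.mean(), y.std())  # ~0.0, ~1.0
```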
Transformer
• SO FAR:
• Learned query, key, value weights
• Multiple heads
• Order of the sequence does not affect result of computations
• Let's encode each vector with its position!
Transformer: Position embedding
• Position embedding: just what it sounds like! A vector encoding each position in the sequence, combined with the word embedding so that order affects the computation.
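The slide does not specify the scheme; the fixed sinusoidal encoding from "Attention Is All You Need" is one common choice (learned position embeddings are another):

```python
import numpy as np

def sinusoidal_positions(n_positions, d_model):
    """Fixed sinusoidal position encodings: even dimensions use sin,
    odd dimensions use cos, at geometrically spaced frequencies."""
    pos = np.arange(n_positions)[:, None]
    i = np.arange(d_model // 2)[None, :]
    angles = pos / (10000 ** (2 * i / d_model))
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positions(50, 16)
# Added to the input embeddings, so identical words at different
# positions get different vectors.
```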
Transformer: last trick
• Since the Transformer sees all inputs at once, to predict next vector in sequence (e.g. generate
text), we need to mask the future.
• Self-Attention with masking in Transformers involves modifying the attention score matrix.
• Matrix Type: A triangular matrix is used, specifically a lower triangular matrix when generating
English text left to right.
• Lower Triangular Part: Includes positions for current and past tokens with standard attention
scores, allowing these tokens to influence the prediction.
• Upper Triangular Part: Set to negative infinity, effectively masking future tokens to prevent them
from influencing the current position's output before applying the softmax function.
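The triangular-mask construction described above, sketched on the parameter-free self-attention for brevity (a real decoder masks the Q·K scores the same way):

```python
import numpy as np

def masked_self_attention(X):
    """Self-attention with a causal mask: the strictly upper-triangular part of
    the score matrix is set to -inf, so position i can only attend to j <= i."""
    n, d = X.shape
    scores = X @ X.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # strictly future positions
    scores[mask] = -np.inf                             # masked before the softmax
    scores = scores - scores.max(axis=1, keepdims=True)
    W = np.exp(scores)                                 # exp(-inf) -> weight 0
    W = W / W.sum(axis=1, keepdims=True)
    return W, W @ X

X = np.random.default_rng(0).normal(size=(4, 8))
W, Y = masked_self_attention(X)
# The first position can only attend to itself; no weight falls on future tokens.
```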
Attention is all you need (2017)
• Encoder-decoder were used for
translation
• Later models used mostly just the
encoder or just the decoder
• ...but then the latest models
are back to encoder-decoder