Master in Generative AI
Gen AI Models III
Enrichmentors Growing through Excellence over 40 years to become Best in Management
Purpose
The purpose of this section is to give you an in-depth view of Generative AI models.
By the end of this lecture, you will have learned about the following:
• Generative Adversarial Networks (GANs)
• Variational Autoencoders (VAEs)
• Transformer Models
Types of Generative Models
• Generative Adversarial Networks (GANs)
• Variational Autoencoders (VAEs)
• Transformer Models
What are Transformer Models?
Transformer models are a type of deep learning architecture designed for handling sequential data, particularly useful in natural language processing (NLP) tasks. Examples include GPT (Generative Pre-trained Transformer) models, which are particularly effective at generating human-like text based on the input they receive. Introduced by Vaswani et al. in the 2017 paper "Attention Is All You Need," transformers have revolutionized the field of NLP by enabling more efficient and effective processing of text compared to previous architectures such as recurrent neural networks (RNNs) and long short-term memory networks (LSTMs). Here's an in-depth look at transformer models.
Core Components of Transformer Models
1. Self-Attention Mechanism
2. Positional Encoding
3. Multi-Head Attention
4. Feed-Forward Neural Networks
5. Layer Normalization and Residual Connections
Self-Attention Mechanism
1. Self-Attention
2. Scaled Dot-Product Attention
Self-Attention
This mechanism allows the model to weigh the importance of different words in a sentence relative to each other, regardless of their position. It computes attention scores for all pairs of words in a sequence, enabling the model to focus on relevant parts of the input for each token.
• Tokens: smaller parts of a sequence of text. These tokens can be as small as characters or as long as words.
• Embeddings: numeric representations of words in a lower-dimensional space, capturing semantic and syntactic information.
Self-Attention
• Query, Key, and Value: For a given word, the self-attention mechanism computes three vectors: Query (Q), Key (K), and Value (V). These vectors are learned during training. The query is the information that is being looked for, the key is the context or reference, and the value is the content that is being searched.
• Attention Scores: The model calculates attention scores by taking the dot product of the Query vector for the current word and the Key vectors for all the words in the input sequence. These scores indicate how much focus each word should receive.
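As a rough illustration of the ideas above, the sketch below projects a few toy token embeddings into Query, Key, and Value vectors using random weight matrices (stand-ins for the weights that would normally be learned during training) and computes the raw dot-product attention scores. All dimensions and values are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy embeddings for a 4-token sequence, model dimension 8
# (random stand-ins for learned token embeddings).
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))

# Projection matrices W_q, W_k, W_v (random stand-ins for learned weights).
W_q = rng.normal(size=(d_model, d_model))
W_k = rng.normal(size=(d_model, d_model))
W_v = rng.normal(size=(d_model, d_model))

Q = X @ W_q   # what each token is looking for
K = X @ W_k   # the context/reference each token offers
V = X @ W_v   # the content that is actually passed along

# Raw attention scores: dot product of each Query with every Key.
scores = Q @ K.T          # shape (seq_len, seq_len)
print(scores.shape)       # (4, 4): one score per (query word, key word) pair
```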
Scaled Dot-Product Attention
Scaled dot-product attention calculates attention scores using the dot product of query and key vectors, scaled by the square root of the dimension of the key vectors. These scores are then passed through a softmax function to obtain attention weights.
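A minimal NumPy sketch of this computation, following the formula Attention(Q, K, V) = softmax(Q·Kᵀ / √d_k)·V from the original paper; the input matrices here are toy placeholders.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                   # scale by sqrt of key dimension
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                       # weighted sum of Value vectors

# Toy inputs: 4 tokens, key/value dimension 8.
rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

output, weights = scaled_dot_product_attention(Q, K, V)
print(weights.sum(axis=-1))   # each row of attention weights sums to 1
```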
Positional Encoding
Since transformers do not
process sequences in order,
positional encodings are
added to the input
embeddings to provide
information about the
position of each token in
the sequence. These
encodings are combined
with the input embeddings
to retain the sequential
order information
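The sketch below implements the sinusoidal positional encoding used in the original paper, where position pos and dimension index i are encoded as sin(pos / 10000^(2i/d_model)) and cos(pos / 10000^(2i/d_model)); the encoding is simply added to the token embeddings. The toy embeddings are random placeholders.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len, d_model):
    """Positional encodings from 'Attention Is All You Need'."""
    positions = np.arange(seq_len)[:, None]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]         # even dimension indices
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                     # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)                     # odd dimensions: cosine
    return pe

# Add positional information to toy embeddings (random stand-ins).
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(4, 8))
inputs = embeddings + sinusoidal_positional_encoding(4, 8)
print(inputs.shape)   # (4, 8): embeddings now carry position information
```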
Multi-Head Attention
This component allows the model to jointly attend to information from different representation subspaces at different positions. It consists of multiple self-attention layers (heads) running in parallel, whose outputs are concatenated and then linearly transformed.
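A compact sketch of this idea under the same toy setup: several attention heads run in parallel on different projections of the input, their outputs are concatenated, and a final linear layer mixes them. All weights here are random placeholders for parameters that would normally be learned.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    seq_len, d_model = X.shape
    d_head = d_model // n_heads
    head_outputs = []
    for _ in range(n_heads):                          # heads run independently, in parallel
        W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_head))
        head_outputs.append(weights @ V)
    concat = np.concatenate(head_outputs, axis=-1)    # concatenate head outputs
    W_o = rng.normal(size=(d_model, d_model))         # final linear projection
    return concat @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
print(multi_head_attention(X, n_heads=2, rng=rng).shape)   # (4, 8)
```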
Feed-Forward Neural Networks
Each position in the
sequence is
processed
independently using
fully connected feed-
forward networks.
These networks are
applied after the
multi-head attention
mechanism
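A minimal sketch of the position-wise feed-forward network: the same two-layer fully connected network (with a ReLU in between, as in the original paper) is applied to every position independently. The weights are random placeholders and the inner dimension of 32 is an arbitrary choice for illustration.

```python
import numpy as np

def position_wise_ffn(X, W1, b1, W2, b2):
    """Apply the same two-layer MLP to every position in the sequence."""
    hidden = np.maximum(0, X @ W1 + b1)   # ReLU non-linearity
    return hidden @ W2 + b2               # project back to the model dimension

rng = np.random.default_rng(0)
seq_len, d_model, d_ff = 4, 8, 32          # d_ff: inner ("expansion") dimension
X = rng.normal(size=(seq_len, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(position_wise_ffn(X, W1, b1, W2, b2).shape)   # (4, 8): one output per position
```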
Layer Normalization and Residual Connections
1. Layer normalization stabilizes and accelerates training by normalizing the inputs to each layer. It ensures that the inputs have a consistent distribution and reduces the internal covariate shift problem that can occur during training.
2. Residual connections (skip connections) help in training deeper networks by allowing gradients to flow through the network without vanishing. A gradient simply measures the change in all weights with regard to the change in error. You can also think of a gradient as the slope of a function: the higher the gradient, the steeper the slope and the faster a model can learn, but if the slope is zero, the model stops learning.
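The sketch below shows how a transformer sub-layer (attention or feed-forward) is typically wrapped: the sub-layer's output is added back to its input (the residual, or skip, connection) and the result is layer-normalized. The learned gain and bias of layer normalization are omitted here for brevity, and a random linear map stands in for the sub-layer.

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each position's features to zero mean and unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)        # learned gain/bias omitted for brevity

def residual_block(x, sublayer):
    """Post-norm residual wrapper: LayerNorm(x + Sublayer(x))."""
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W = rng.normal(size=(8, 8))                # stand-in for attention or the FFN
out = residual_block(X, lambda x: x @ W)
print(out.mean(axis=-1).round(6), out.shape)   # per-position mean ~0, shape (4, 8)
```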
Architecture
The transformer architecture consists of an encoder and a decoder:
1. Encoder
2. Decoder
Encoder
1. The encoder is composed of multiple identical layers, each containing two main components: a multi-head self-attention mechanism and a position-wise fully connected feed-forward network.
2. The encoder processes the input sequence and generates a set of attention-weighted representations.
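Putting the pieces from the previous slides together, a single encoder layer can be sketched roughly as follows: self-attention, then a position-wise feed-forward network, each wrapped in a residual connection and layer normalization. This is a simplified illustration with random weights and a single attention head per layer, not a faithful implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def layer_norm(x, eps=1e-6):
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def self_attention(X, d_model):
    # Single-head self-attention with random stand-in weights (multi-head omitted for brevity).
    W_q, W_k, W_v = (rng.normal(size=(d_model, d_model)) for _ in range(3))
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    return softmax(Q @ K.T / np.sqrt(d_model)) @ V

def feed_forward(X, d_model, d_ff=32):
    W1, W2 = rng.normal(size=(d_model, d_ff)), rng.normal(size=(d_ff, d_model))
    return np.maximum(0, X @ W1) @ W2

def encoder_layer(X):
    d_model = X.shape[-1]
    X = layer_norm(X + self_attention(X, d_model))   # sub-layer 1: attention + residual + norm
    X = layer_norm(X + feed_forward(X, d_model))     # sub-layer 2: FFN + residual + norm
    return X

# Stack several identical layers to form the encoder.
X = rng.normal(size=(4, 8))                          # toy input sequence
for _ in range(2):
    X = encoder_layer(X)
print(X.shape)   # (4, 8): attention-weighted representation of each input token
```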
Decoder
1. Similar to the encoder, the decoder also consists of multiple identical layers. Each layer includes a multi-head self-attention mechanism, an encoder-decoder attention mechanism (which attends to the encoder's output), and a feed-forward network.
2. The decoder generates the output sequence one token at a time, attending to the encoder's output and previously generated tokens.
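One detail worth illustrating is how the decoder's self-attention looks only at previously generated tokens: a causal (look-ahead) mask sets the scores for future positions to minus infinity before the softmax, so their attention weights become zero. A minimal sketch with toy matrices (the encoder-decoder attention step is not shown):

```python
import numpy as np

def masked_self_attention(Q, K, V):
    """Decoder self-attention: each position attends only to itself and earlier positions."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    causal_mask = np.triu(np.ones_like(scores, dtype=bool), k=1)  # True above the diagonal
    scores = np.where(causal_mask, -np.inf, scores)               # hide future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = K = V = rng.normal(size=(4, 8))
_, weights = masked_self_attention(Q, K, V)
print(np.round(weights, 2))   # upper triangle is 0: no attention to future positions
```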
Applications of Transformer Models
1. Language Modeling: Models like GPT (Generative Pre-trained Transformer) are used to predict the next word in a sentence, enabling text generation, auto-completion, and more.
2. Machine Translation: Transformer-based models such as T5 (Text-To-Text Transfer Transformer) are used to translate text from one language to another.
3. Text Summarization: Transformers can generate concise summaries of long documents, extracting key information effectively.
4. Question Answering: Models like BERT (Bidirectional Encoder Representations from Transformers) and RoBERTa (Robustly Optimized BERT Pretraining Approach) are employed in question-answering systems, understanding context to provide accurate answers.
5. Sentiment Analysis: Transformers analyze the sentiment of text, classifying it as positive, negative, or neutral.
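In practice, pretrained transformer models for tasks like these are often used through libraries such as Hugging Face `transformers`. The rough usage sketch below assumes the `transformers` package (plus a backend such as PyTorch) is installed; the default models are downloaded on first run, and exact defaults and generation arguments may vary by library version.

```python
# pip install transformers  (also requires a backend such as PyTorch)
from transformers import pipeline

# Sentiment analysis: classify text as positive or negative.
sentiment = pipeline("sentiment-analysis")
print(sentiment("Transformer models have revolutionized NLP."))

# Text generation with a GPT-style model (repeated next-word prediction).
generator = pipeline("text-generation", model="gpt2")
print(generator("Transformers are", max_length=20, num_return_sequences=1))

# Extractive question answering with a BERT-style model.
qa = pipeline("question-answering")
print(qa(question="Who introduced transformers?",
         context="Transformers were introduced by Vaswani et al. in 2017."))
```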
Advantages and Limitations of Transformer Models
Advantages
1. Parallelization: Unlike RNNs and LSTMs, transformers do not process data sequentially, allowing for parallel computation and faster training times.
2. Long-Range Dependencies: The self-attention mechanism effectively captures long-range dependencies in the data, improving the model's ability to understand context.
3. Scalability: Transformers scale efficiently with data and computational resources, enabling the creation of large models like GPT-3 with billions of parameters.

Challenges and Limitations
1. Computational Resources: Training large transformer models requires significant computational power and memory.
2. Data Requirements: Transformers often require large amounts of training data to achieve high performance.
3. Interpretability: The complexity of transformer models can make them difficult to interpret and understand.

Note: A Recurrent Neural Network (RNN) is a type of neural network in which the output from the previous step is fed as input to the current step. Long short-term memory (LSTM) is a type of recurrent neural network aimed at dealing with the vanishing gradient problem present in traditional RNNs; its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models, and other sequence learning methods.
What are Transformer Models?
Conclusion
Transformer models have
transformed the landscape
of natural language
processing by enabling more
efficient and effective
handling of sequential data.
Their ability to capture long-
range dependencies and
process data in parallel has
led to significant
advancements in various NLP
tasks, making them a
cornerstone of modern AI
research and applications
What is next?
Model Development and Optimization