Course: Advanced Natural Language Processing
Beyond RNN
LSTM & GRU
University Year: 2024/2025
Recall on Vanilla RNN
Input Data
Inference
I/O Mapping
Loss Function
Training
Recall on Vanilla RNN: Input Data
RNNs are designed for processing sequential data (e.g. text, video, time series)
Order
Dependencies
If the data elements are simply ordered (e.g. by size, arrival time), this does not imply the data is sequential; we also need inter-element semantic/contextual relationships (e.g. to make predictions)
Are these relationships necessary for task modelling?
● No ⇒ non-sequential models (e.g. MLP): whether or not the data is sequential, the model will not leverage the data dependencies; the past has no impact on the present
● Yes ⇒ sequential models (e.g. RNN): if the data is sequential, the model will extract the relationships between data elements; the past influences the present
Recall on Vanilla RNN: Inference
RNNs maintain a memory (hidden state) of previous inputs
RNN vs MLP: same matrices for all inputs!
Seminal Work: Rumelhart et al., “Learning internal representations by error propagation”, Tech. rep. ICS 8504, 1985
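As an illustration only (not code from the lecture), a minimal NumPy sketch of vanilla RNN inference; the names rnn_forward, W_xh, W_hh and b_h are assumptions, and the point is that the same matrices are reused at every timestep:

import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h, h0):
    # Vanilla RNN inference: the SAME W_xh, W_hh, b_h are applied at every timestep,
    # and the hidden state h carries a memory of all previous inputs.
    h = h0
    hidden_states = []
    for x in xs:                                # xs: one input vector per timestep
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# Example: 4 timesteps of 3-dim inputs with a 5-dim hidden state
# hs = rnn_forward([np.random.randn(3) for _ in range(4)],
#                  np.random.randn(5, 3), np.random.randn(5, 5),
#                  np.zeros(5), np.zeros(5))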
Recall on Vanilla RNN: I/O Mapping
Use-case Examples
● One-Hidden-Layer NN
● Text Generation
● Sentiment Analysis
● Text Translation
● POS Tagging, NER
Recall on Vanilla RNN: Loss Function
RNNs are often trained to minimise the cross-entropy loss over the entire vocabulary
Example: Part-of-Speech Tagging
Sentence: Tensorflow is very easy
POS Tags: NOUN VERB ADV ADJ
Predictions are thus a distribution over the set of unique tags/classes {NOUN, VERB, ADV, ADJ}
Classification task per timestep
● Loss for one timestep
● Loss for all timesteps (usually averaged over the sequence)
● Perplexity (the lower the perplexity, the more confident the next-word prediction)
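The formulas themselves appear as images on the slides and are not reproduced in this text version; in standard notation (notation assumed, not copied from the slide; V is the set of classes/vocabulary and \hat{y}_t the predicted distribution):

L_t = -\sum_{w \in V} y_t(w) \log \hat{y}_t(w)        (loss for one timestep: cross entropy)
L   = \frac{1}{T} \sum_{t=1}^{T} L_t                   (loss for all timesteps, averaged)
\mathrm{PPL} = \exp(L)                                 (perplexity; equivalently 2^L if log base 2 is used)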
Recall on Vanilla RNN: Training
RNNs are trained using backpropagation through time (BPTT)
Assessing Vanilla RNN
Pros
● The current state uses information from earlier steps
● RNNs process input sequences of any length
● The model size is independent of the input sequence length
● The same weight matrices are applied to all timesteps
Cons
● RNNs are sequential and thus cannot be parallelized
● Long-term dependencies (i.e. information from many steps back) are hard to capture
Vanilla RNN Issues
Cons
● RNNs are sequential and thus cannot be parallelized
○ Transformers (Vaswani et al., 2017) [Next lecture]
○ Minimal LSTM/GRU (Feng et al., 2024)
● Long-term dependencies (i.e. information from many steps back) are hard to capture
[Figure: Transformer architecture]
Vanilla RNN Issues
● Long-range dependencies (i.e. information from many steps back) are hard to capture [today's lecture]
Stack RNN cells for more memory capacity (RNNs as general-purpose computers: Turing complete)
[Figure: stacked, unrolled RNN, read bottom-up, then left to right. Red row: sequential input data; green row: unrolled RNN; blue row: output along the sequence; layers annotated "Syntactics" and "Semantics"]
Vanilla RNN Issues
Still, the problem persists: long-range dependencies are hard to capture even with stacked RNN cells!
Let’s Analyse…
RNNs are trained using backpropagation through time (BPTT)
‘T’ outputs, thus ‘T’ error terms
‘t’ timesteps, thus ‘t’ derivatives
Chain Rule (Recall)
Chain Rule (Again!)
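The equations on these slides are shown as images and are not reproduced in this text version; in the usual notation (assumed here), applying the chain rule repeatedly gives the BPTT gradient

\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W},
\qquad
\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}
\left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
\frac{\partial h_k}{\partial W}

The long product of Jacobians \partial h_j / \partial h_{j-1} is the term to watch: it is multiplied over up to T timesteps, which is exactly where the numerical trouble of the next slides comes from.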
NUMERICAL INSTABILITY
EVEN WORSE: too sensitive!
Vanishing and Exploding Gradients
Vanishing Gradients: gradients become extremely small
● negligible update of weights
● slow training
● harder to detect
Exploding Gradients: gradients become excessively large
● large update of weights
● instability and divergence
● easy to detect
Vanishing and Exploding Gradients Problem
Suppose all gradients are upper bounded by “c”, and let the input sequence length be represented by “x = t - k”.
Original Study: Pascanu et al., “On the difficulty of training Recurrent Neural Networks”, ICML, 2013
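The bound itself is not reproduced in this text version; following Pascanu et al., if every single-step Jacobian satisfies

\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le c,
\quad \text{then} \quad
\left\| \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\| \le c^{\,t-k} = c^{x}

so the gradient contribution from x steps back shrinks geometrically when c < 1 (vanishing) and can grow geometrically when c > 1 (exploding).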
Exploding Gradients are NOT ALWAYS a Problem
No Effort At All <> Small Consistent Effort
(Just Saying!)
Vanishing and Exploding Gradients Problem
Quick Question: why do we make a case distinction with respect to the value 1?
(Because the factor c^x shrinks with the sequence length when c < 1 and grows when c > 1; 1 is the boundary between the vanishing and exploding regimes.)
RECAP TIME!
What’s the story so far?
Major Issue for RNN
Problem
Long-term dependencies are hard to capture due to the vanishing gradient problem and the lack of a complex memory
Solutions
● Training
● Architecture
Solving Vanishing and Exploding Gradients
● Weight Initialization: Identity, Xavier, He, etc.
● Gradient Clipping: limit the gradients’ magnitude
● Normalization Techniques: batch and layer normalization
etc.
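As a concrete illustration of the gradient-clipping bullet (a minimal sketch, not the lecture's code; the function name and the max_norm value are assumptions):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale a list of gradient arrays so that their joint (global) L2 norm
    # never exceeds max_norm, limiting the effect of exploding gradients.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads

Deep learning frameworks ship the same idea, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch.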
Major Issue for RNN
Solutions: Architecture
Long Short-Term Memory
LSTMs are designed to have more persistent memory to capture long-term dependencies through a gating mechanism
● “Long”: remembering information over long sequences
● “Short-Term”: capturing short-term dependencies
● “Memory”: increased memory capacity over time
Seminal Work: Hochreiter S., Schmidhuber J., “Long Short-Term Memory”, Neural Computation 9(8):1735-1780, 1997
The gating mechanism allows the network to learn when to retain and when to forget a piece of information, depending on its relevance
Step-by-Step into LSTM
Image from: Zhang et al., “Dive into Deep Learning”, Cambridge University Press, 2023
Input Node: integrates the new input word into the memory (similar to the vanilla RNN update)
Memory Cell: produces the final memory as a weighted aggregation of the past information (scaled by the forget gate) and the new information (scaled by the input gate)
Input Gate: determines whether the input is worth keeping (word relevance)
Forget Gate: assesses whether the past memory is useful for the computation of the current memory
Output Gate: separates the final memory from the hidden state, deciding what parts of the memory need to be present in the hidden state
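Putting the pieces together, the usual LSTM update equations (notation assumed here, close to Zhang et al.; \odot is element-wise multiplication, \sigma the sigmoid):

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)            (input gate)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)            (forget gate)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)            (output gate)
\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)     (input node / candidate memory)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t            (memory cell)
h_t = o_t \odot \tanh(c_t)                                 (hidden state)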
Sigmoid: values in [0,1] & smooth ⇒ ideal for gates (i.e. turn-on / turn-off)
Tanh: values in [-1,1] & zero-centered ⇒ balanced activations
LSTM Solving the Vanishing Gradient Problem
More stability thanks to the memory cell!
LSTM only attenuates the vanishing gradient effect; it does not suppress it
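The derivation behind these statements is shown as an image on the slides; the usual argument, sketched under that assumption, is that the cell update is additive, so (ignoring the gates' own dependence on the past)

\frac{\partial c_t}{\partial c_{t-1}} \approx f_t

As long as the forget gate stays close to 1, gradients flow along the cell state without being multiplied by the same recurrent weight matrix at every step; when the gates saturate towards 0 the signal can still fade, which is why LSTM attenuates rather than suppresses the vanishing gradient effect.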
RECAP TIME!
What’s the story so far?
Gated Recurrent Unit
¡My Q&A Time!
Seminal Work: Cho et al., “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”, EMNLP, 2014
What are the structural differences between LSTM and GRU?
● From 3 to 2 gates ⇒ fewer parameters
● No cell state ⇒ merged memory
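For reference, the GRU update equations (notation assumed here, following the convention that matches the Q&A answers below; \odot is element-wise multiplication):

r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)                       (reset gate)
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)                       (update gate)
\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h)    (candidate state)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t                 (new hidden state)

Counting W_{xr}, W_{hr}, W_{xz}, W_{hz}, W_{xh}, W_{hh} gives the six trainable matrices asked about below.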
How many trainable matrices does a GRU have?
6 matrices (3 fully connected transformations, each with 2 inputs: the current input and the previous hidden state)
What type of model would we have if the reset gate is all 0’s? Or all 1’s?
If all 0’s, we get an MLP (the candidate state ignores the past); if all 1’s, we get a vanilla RNN
What if the update gate is all 0’s? Or all 1’s?
If all 1’s, the new state is the old state; if all 0’s, the new state is the candidate state
LSTM vs GRU
Differences between LSTM and GRU
LSTM
● Independent memory cell state for storing information ⇒ long-term dependencies
● LSTM controls the memory cell:
○ Remove from the cell (forget gate)
○ Add to the cell (input gate)
○ Extract from the cell (output gate)
GRU
● Combined memory cell with hidden state ⇒ fewer parameters
● GRU controls the hidden state:
○ Add new information (update gate)
○ Remove old information (reset gate)
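As a quick sanity check of the “fewer parameters” point (a sketch only; the helper name and the sizes are arbitrary assumptions):

def recurrent_layer_params(input_size, hidden_size, num_blocks):
    # Each gate/candidate block has one input-to-hidden matrix, one hidden-to-hidden
    # matrix and one bias vector; an LSTM has 4 such blocks, a GRU has 3.
    return num_blocks * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

input_size, hidden_size = 300, 512
print("LSTM:", recurrent_layer_params(input_size, hidden_size, 4))
print("GRU: ", recurrent_layer_params(input_size, hidden_size, 3))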
LSTM vs GRU — Autocomplete Task
Visualising long-term contextual understanding
[Link]
Course: Advanced Natural Language Processing
Beyond RNN
LSTM & GRU
Any Questions?
University Year: 2024/2025