Course: Advanced Natural Language Processing
Beyond RNN
LSTM & GRU
University Year: 2024/2025
Recall on Vanilla RNN
Input Data
Inference
I/O Mapping
Loss Function
Training
Recall on Vanilla RNN: Input Data
RNNs are designed for processing sequential data (e.g. text, video, time series)
Order
Dependencies
If the data elements are simply ordered (e.g. by size, arrival time), this does not imply the data is sequential; we also need inter-element semantic/contextual relationships (e.g. to make predictions)
Are these relationships necessary for task modelling?
● No ⇒ non-sequential models (e.g. MLP): whether or not the data is sequential, the model will not leverage the data dependencies; the past has no impact on the present
● Yes ⇒ sequential models (e.g. RNN): if the data is sequential, the model will extract the relationships between data elements; the past influences the present
Recall on Vanilla RNN: Inference
RNNs maintain a memory (hidden state) of previous inputs
RNN vs MLP: same matrices for all inputs!
Seminal Work: Rumelhart et al., “Learning internal representations by error propagation”, Tech. rep. ICS 8504, 1985
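As an illustration only (not code from the lecture), a minimal NumPy sketch of vanilla RNN inference; the names rnn_forward, W_xh, W_hh and b_h are assumptions, and the point is that the same matrices are reused at every timestep:

import numpy as np

def rnn_forward(xs, W_xh, W_hh, b_h, h0):
    # Vanilla RNN inference: the SAME W_xh, W_hh, b_h are applied at every timestep,
    # and the hidden state h carries a memory of all previous inputs.
    h = h0
    hidden_states = []
    for x in xs:                                # xs: one input vector per timestep
        h = np.tanh(W_xh @ x + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# Example: 4 timesteps of 3-dim inputs with a 5-dim hidden state
# hs = rnn_forward([np.random.randn(3) for _ in range(4)],
#                  np.random.randn(5, 3), np.random.randn(5, 5),
#                  np.zeros(5), np.zeros(5))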
Recall on Vanilla RNN: I/O Mapping
Use-case Examples
● One-Hidden-Layer NN
● Text Generation
● Sentiment Analysis
● Text Translation
● POS Tagging, NER
Recall on Vanilla RNN: Loss Function
RNNs are often trained to minimise the cross-entropy loss over the entire vocabulary
Example: Part-of-Speech Tagging
Sentence: Tensorflow is very easy
POS Tags: NOUN VERB ADV ADJ
Predictions are thus a distribution over the set of unique tags/classes {NOUN, VERB, ADV, ADJ}
Classification task per timestep
● Loss for one timestep
● Loss for all timesteps (usually averaged over the sequence)
● Perplexity (the lower the perplexity, the more confident the next-word prediction)
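The formulas themselves appear as images on the slides and are not reproduced in this text version; in standard notation (notation assumed, not copied from the slide; V is the set of classes/vocabulary and \hat{y}_t the predicted distribution):

L_t = -\sum_{w \in V} y_t(w) \log \hat{y}_t(w)        (loss for one timestep: cross entropy)
L   = \frac{1}{T} \sum_{t=1}^{T} L_t                   (loss for all timesteps, averaged)
\mathrm{PPL} = \exp(L)                                 (perplexity; equivalently 2^L if log base 2 is used)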
Recall on Vanilla RNN: Training
RNNs are trained using backpropagation through time (BPTT)
Assessing Vanilla RNN
Pros
● The current state uses information from earlier steps
● RNNs process input sequences of any length
● The model size is independent of the input sequence length
● The same weight matrices are applied to all timesteps
Cons
● RNNs are sequential and thus cannot be parallelized
● Long-term dependencies (i.e. information from many steps back) are hard to capture
Vanilla RNN Issues
Cons
● RNNs are sequential and thus cannot be parallelized
○ Transformers (Vaswani et al., 2017) [Next lecture]
○ Minimal LSTM/GRU (Feng et al., 2024)
● Long-term dependencies (i.e. information from many steps back) are hard to capture
[Figure: Transformer architecture]
Vanilla RNN Issues
● Long-range dependencies (i.e. information from many steps back) are hard to capture [today's lecture]
Stack RNN cells for more memory capacity (RNNs as general-purpose computers: Turing complete)
[Figure: stacked, unrolled RNN, read bottom-up, then left to right. Red row: sequential input data; green row: unrolled RNN; blue row: output along the sequence; layers annotated "Syntactics" and "Semantics"]
Vanilla RNN Issues
Still, the problem persists: long-range dependencies are hard to capture even with stacked RNN cells!
Let’s Analyse…
RNNs are trained using backpropagation through time (BPTT)
‘T’ outputs, thus ‘T’ error terms
‘t’ timesteps, thus ‘t’ derivatives
Chain Rule (Recall)
Chain Rule (Again!)
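The equations on these slides are shown as images and are not reproduced in this text version; in the usual notation (assumed here), applying the chain rule repeatedly gives the BPTT gradient

\frac{\partial L}{\partial W} = \sum_{t=1}^{T} \frac{\partial L_t}{\partial W},
\qquad
\frac{\partial L_t}{\partial W} = \sum_{k=1}^{t} \frac{\partial L_t}{\partial h_t}
\left( \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right)
\frac{\partial h_k}{\partial W}

The long product of Jacobians \partial h_j / \partial h_{j-1} is the term to watch: it is multiplied over up to T timesteps, which is exactly where the numerical trouble of the next slides comes from.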
NUMERICAL INSTABILITY
EVEN WORSE: too sensitive!
Vanishing and Exploding Gradients
Vanishing Gradients: gradients become extremely small
● negligible update of weights
● slow training
● harder to detect
Exploding Gradients: gradients become excessively large
● large update of weights
● instability and divergence
● easy to detect
Vanishing and Exploding Gradients Problem
Suppose all gradients are upper bounded by “c”, and let the input sequence length be represented by “x = t - k”.
Original Study: Pascanu et al., “On the difficulty of training Recurrent Neural Networks”, ICML, 2013
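The bound itself is not reproduced in this text version; following Pascanu et al., if every single-step Jacobian satisfies

\left\| \frac{\partial h_j}{\partial h_{j-1}} \right\| \le c,
\quad \text{then} \quad
\left\| \prod_{j=k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} \right\| \le c^{\,t-k} = c^{x}

so the gradient contribution from x steps back shrinks geometrically when c < 1 (vanishing) and can grow geometrically when c > 1 (exploding).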
Exploding Gradients are NOT ALWAYS a Problem
No Effort At All <> Small Consistent Effort
(Just Saying!)
Vanishing and Exploding Gradients Problem
Quick Question: why do we make a case distinction with respect to the value 1?
(Because the factor c^x shrinks with the sequence length when c < 1 and grows when c > 1; 1 is the boundary between the vanishing and exploding regimes.)
RECAP TIME!
What’s the story so far?
Major Issue for RNN
Problem
Long-term dependencies are hard to capture due to the vanishing gradient problem and the lack of a complex memory
Solutions
● Training
● Architecture
Solving Vanishing and Exploding Gradients
● Weight Initialization: Identity, Xavier, He, etc.
● Gradient Clipping: limit the gradients’ magnitude
● Normalization Techniques: batch and layer normalization
etc.
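As a concrete illustration of the gradient-clipping bullet (a minimal sketch, not the lecture's code; the function name and the max_norm value are assumptions):

import numpy as np

def clip_by_global_norm(grads, max_norm=5.0):
    # Rescale a list of gradient arrays so that their joint (global) L2 norm
    # never exceeds max_norm, limiting the effect of exploding gradients.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads

Deep learning frameworks ship the same idea, e.g. torch.nn.utils.clip_grad_norm_ in PyTorch.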
Major Issue for RNN
Solutions: Architecture
Long Short-Term Memory
LSTMs are designed to have more persistent memory to capture long-term dependencies through a gating mechanism
● “Long”: remembering information over long sequences
● “Short-Term”: capturing short-term dependencies
● “Memory”: increased memory capacity over time
Seminal Work: Hochreiter S., Schmidhuber J., “Long Short-Term Memory”, Neural Computation 9(8):1735-1780, 1997
The gating mechanism allows the network to learn when to retain and when to forget a piece of information, depending on its relevance
Step-by-Step into LSTM
Image from: Zhang et al., “Dive into Deep Learning”, Cambridge University Press, 2023
Input Node: integrates the new input word into the memory (similar to the vanilla RNN update)
Memory Cell: produces the final memory as a weighted aggregation of the past information (scaled by the forget gate) and the new information (scaled by the input gate)
Input Gate: determines whether the input is worth keeping (word relevance)
Forget Gate: assesses whether the past memory is useful for the computation of the current memory
Output Gate: separates the final memory from the hidden state, deciding what parts of the memory need to be present in the hidden state
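Putting the pieces together, the usual LSTM update equations (notation assumed here, close to Zhang et al.; \odot is element-wise multiplication, \sigma the sigmoid):

i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i)            (input gate)
f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f)            (forget gate)
o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o)            (output gate)
\tilde{c}_t = \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)     (input node / candidate memory)
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t            (memory cell)
h_t = o_t \odot \tanh(c_t)                                 (hidden state)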
Sigmoid: values in [0,1] & smooth ⇒ ideal for gates (i.e. turn-on / turn-off)
Tanh: values in [-1,1] & zero-centered ⇒ balanced activations
LSTM Solving the Vanishing Gradient Problem
More stability thanks to the memory cell!
LSTM only attenuates the vanishing gradient effect; it does not suppress it
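The derivation behind these statements is shown as an image on the slides; the usual argument, sketched under that assumption, is that the cell update is additive, so (ignoring the gates' own dependence on the past)

\frac{\partial c_t}{\partial c_{t-1}} \approx f_t

As long as the forget gate stays close to 1, gradients flow along the cell state without being multiplied by the same recurrent weight matrix at every step; when the gates saturate towards 0 the signal can still fade, which is why LSTM attenuates rather than suppresses the vanishing gradient effect.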
RECAP TIME!
What’s the story so far?
Gated Recurrent Unit
¡My Q&A Time!
Seminal Work: Cho et al., “Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation”, EMNLP, 2014
What are the structural differences between LSTM and GRU?
● From 3 to 2 gates ⇒ fewer parameters
● No cell state ⇒ merged memory
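For reference, the GRU update equations (notation assumed here, following the convention that matches the Q&A answers below; \odot is element-wise multiplication):

r_t = \sigma(W_{xr} x_t + W_{hr} h_{t-1} + b_r)                       (reset gate)
z_t = \sigma(W_{xz} x_t + W_{hz} h_{t-1} + b_z)                       (update gate)
\tilde{h}_t = \tanh(W_{xh} x_t + W_{hh} (r_t \odot h_{t-1}) + b_h)    (candidate state)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t                 (new hidden state)

Counting W_{xr}, W_{hr}, W_{xz}, W_{hz}, W_{xh}, W_{hh} gives the six trainable matrices asked about below.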
How many trainable matrices does a GRU have?
6 matrices (3 fully connected transformations, each with 2 inputs: the current input and the previous hidden state)
What type of model would we have if the reset gate is all 0’s? Or all 1’s?
If all 0’s, we get an MLP (the candidate state ignores the past); if all 1’s, we get a vanilla RNN
What if the update gate is all 0’s? Or all 1’s?
If all 1’s, the new state is the old state; if all 0’s, the new state is the candidate state
LSTM vs GRU
Differences between LSTM and GRU
LSTM
● Independent memory cell state for storing information ⇒ long-term dependencies
● LSTM controls the memory cell:
○ Remove from the cell (forget gate)
○ Add to the cell (input gate)
○ Extract from the cell (output gate)
GRU
● Combined memory cell with hidden state ⇒ fewer parameters
● GRU controls the hidden state:
○ Add new information (update gate)
○ Remove old information (reset gate)
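As a quick sanity check of the “fewer parameters” point (a sketch only; the helper name and the sizes are arbitrary assumptions):

def recurrent_layer_params(input_size, hidden_size, num_blocks):
    # Each gate/candidate block has one input-to-hidden matrix, one hidden-to-hidden
    # matrix and one bias vector; an LSTM has 4 such blocks, a GRU has 3.
    return num_blocks * (hidden_size * input_size + hidden_size * hidden_size + hidden_size)

input_size, hidden_size = 300, 512
print("LSTM:", recurrent_layer_params(input_size, hidden_size, 4))
print("GRU: ", recurrent_layer_params(input_size, hidden_size, 3))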
LSTM vs GRU — Autocomplete Task
Visualising long-term contextual understanding
[Link]
Course: Advanced Natural Language Processing
Beyond RNN
LSTM & GRU
Any Questions?
University Year: 2024/2025