LSTM & GRU


Introduction to Long Short Term Memory (LSTM)
https://www.analyticsvidhya.com/blog/2021/03/introduction-to-long-short-term-
memory-lstm/?utm_source=blog&utm_medium=gated_recurrent_unit

https://youtu.be/Z03f7Wu5a6A
Shipra Saxena — Published On March 16, 2021 and Last Modified On March 18th, 2021


Objective
● LSTM is a special kind of recurrent neural network capable of handling long-term dependencies.
● Understand the architecture and working of an LSTM network.

Introduction
Long Short Term Memory (LSTM) is an advanced RNN, a sequential network that allows information to persist. It is capable of handling the vanishing gradient problem faced by a standard RNN. A recurrent neural network, also known as an RNN, is used where persistent memory is needed.

Let's say that while watching a video you remember the previous scene, or while reading a book you know what happened in an earlier chapter. RNNs work in a similar way: they remember the previous information and use it to process the current input. The shortcoming of RNNs is that they cannot remember long-term dependencies because of the vanishing gradient. LSTMs are explicitly designed to avoid this long-term dependency problem.

LSTM Architecture
At a high level, an LSTM works very much like an RNN cell. Here is the internal functioning of the LSTM network. The LSTM consists of three parts, as shown in the image below, and each part performs an individual function.

The first part chooses whether the information coming from the previous timestamp is to be remembered, or whether it is irrelevant and can be forgotten. In the second part, the cell tries to learn new information from the input to this cell. Finally, in the third part, the cell passes the updated information from the current timestamp to the next timestamp.

These three parts of an LSTM cell are known as gates. The first part is called the Forget gate, the second part is known as the Input gate, and the last one is the Output gate.

Just like a simple RNN, an LSTM also has a hidden state, where Ht-1 represents the hidden state of the previous timestamp and Ht is the hidden state of the current timestamp. In addition, an LSTM also has a cell state, represented by Ct-1 and Ct for the previous and current timestamps respectively.

Here the hidden state is known as short-term memory and the cell state is known as long-term memory. Refer to the following image.

It is interesting to note that the cell state carries information across all the timestamps.

Let's take an example to understand how an LSTM works. Here we have two sentences separated by a full stop. The first sentence is "Bob is a nice person" and the second sentence is "Dan, on the other hand, is evil". It is very clear that the first sentence is talking about Bob, and as soon as we encounter the full stop (.) we start talking about Dan.

As we move from the first sentence to the second sentence, our network should realize that we are no longer talking about Bob; our subject is now Dan. Here, the Forget gate of the network allows it to forget about Bob. Let's understand the roles played by these gates in the LSTM architecture.

Forget Gate

In a cell of the LSTM network, the first step is to decide whether we should keep the information from the previous timestamp or forget it. Here is the equation for the forget gate:

ft = σ(Xt · Uf + Ht-1 · Wf)

Let's try to understand the equation. Here,

● Xt: input at the current timestamp
● Uf: weight matrix associated with the input
● Ht-1: the hidden state of the previous timestamp
● Wf: weight matrix associated with the hidden state

A sigmoid function is applied over this sum, which makes ft a number between 0 and 1. This ft is then multiplied with the cell state of the previous timestamp, as shown below:

Ct-1 · ft

If ft is 0 the network will forget everything, and if the value of ft is 1 it will forget nothing. Let's get back to our example: the first sentence was talking about Bob, and after the full stop the network will encounter Dan; in an ideal case the network should forget about Bob.
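To make this concrete, here is a minimal NumPy sketch of the forget gate described above. The variable names, toy dimensions, and random weights are assumptions made for illustration, not values from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)

# Assumed toy sizes: input dimension 4, hidden dimension 3
x_t    = rng.standard_normal(4)       # Xt: input at the current timestamp
h_prev = rng.standard_normal(3)       # Ht-1: previous hidden state
c_prev = rng.standard_normal(3)       # Ct-1: previous cell state
U_f    = rng.standard_normal((3, 4))  # Uf: weight matrix for the input
W_f    = rng.standard_normal((3, 3))  # Wf: weight matrix for the hidden state

# ft = sigmoid(Xt . Uf + Ht-1 . Wf); every entry lies between 0 and 1
f_t = sigmoid(U_f @ x_t + W_f @ h_prev)

# The previous cell state is scaled element-wise by ft:
# values near 0 forget, values near 1 keep
c_scaled = f_t * c_prev
```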

Input Gate

Let's take another example:

"Bob knows swimming. He told me over the phone that he had served in the navy for four long years."

In both of these sentences we are talking about Bob. However, they give different kinds of information about him. The first sentence tells us that he knows swimming, whereas the second tells us that he used the phone and served in the navy for four years.

Now just think about it: based on the context given in the first sentence, which piece of information in the second sentence is critical? That he used the phone to tell us, or that he served in the navy? In this context, it doesn't matter whether he used the phone or some other medium of communication to pass on the information. The fact that he was in the navy is the important information, and this is something we want our model to remember. This is the task of the Input gate.

The Input gate is used to quantify the importance of the new information carried by the input. Here is the equation of the input gate:

it = σ(Xt · Ui + Ht-1 · Wi)

Here,

● Xt: input at the current timestamp t
● Ui: weight matrix of the input
● Ht-1: the hidden state at the previous timestamp
● Wi: weight matrix associated with the hidden state

Again, a sigmoid function is applied over this sum. As a result, the value of it at timestamp t will be between 0 and 1.

New Information

The new information that needs to be passed to the cell state is a function of the hidden state at the previous timestamp t-1 and the input Xt at timestamp t. The activation function here is tanh, so the value of the new information lies between -1 and 1:

Nt = tanh(Xt · Uc + Ht-1 · Wc)

(Uc and Wc here denote the corresponding weight matrices.) If the value of Nt is negative, the information is subtracted from the cell state, and if the value is positive, the information is added to the cell state at the current timestamp.

However, Nt is not added to the cell state directly. Here is the updated cell-state equation:

Ct = ft · Ct-1 + it · Nt

Here, Ct-1 is the cell state at the previous timestamp, Ct is the cell state at the current timestamp, and the other terms are the values we calculated previously.
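Continuing the same kind of sketch, the snippet below computes the input gate, the new information Nt, and the updated cell state Ct. The weight names and toy sizes are again illustrative assumptions rather than values from the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)

# Toy tensors: input dimension 4, hidden dimension 3
x_t, h_prev, c_prev = rng.standard_normal(4), rng.standard_normal(3), rng.standard_normal(3)
U_f, W_f = rng.standard_normal((3, 4)), rng.standard_normal((3, 3))  # forget gate weights
U_i, W_i = rng.standard_normal((3, 4)), rng.standard_normal((3, 3))  # input gate weights
U_c, W_c = rng.standard_normal((3, 4)), rng.standard_normal((3, 3))  # new-information weights

# Forget gate, as in the earlier sketch
f_t = sigmoid(U_f @ x_t + W_f @ h_prev)

# Input gate: it = sigmoid(Xt . Ui + Ht-1 . Wi), values in (0, 1)
i_t = sigmoid(U_i @ x_t + W_i @ h_prev)

# New information: Nt = tanh(Xt . Uc + Ht-1 . Wc), values in (-1, 1)
n_t = np.tanh(U_c @ x_t + W_c @ h_prev)

# Updated cell state: Ct = ft * Ct-1 + it * Nt (element-wise products)
c_t = f_t * c_prev + i_t * n_t
```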

Output Gate

Now consider this sentence:

"Bob single-handedly fought the enemy and died for his country. For his contributions, brave ________."

In this task, we have to complete the second sentence. The minute we see the word "brave", we know that we are talking about a person. In these sentences only Bob is brave; we cannot say the enemy is brave or the country is brave. So, based on the current expectation, we have to give a relevant word to fill in the blank. That word is our output, and this is the function of the Output gate.

Here is the equation of the Output gate, which is pretty similar to the two previous gates:

ot = σ(Xt · Uo + Ht-1 · Wo)

Its value will also lie between 0 and 1 because of the sigmoid function. Now, to calculate the current hidden state, we use ot and the tanh of the updated cell state, as shown below:

Ht = ot · tanh(Ct)

It turns out that the hidden state is a function of the long-term memory (Ct) and the current output. If you need the output at the current timestamp, just apply the softmax activation on the hidden state Ht. The token with the maximum score in the output is the prediction.
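Putting the three gates together, here is a compact sketch of one full LSTM cell step, ending with a softmax over a hypothetical output projection to get a token prediction. The function name, the parameter dictionary layout, and the projection matrix V (with 10 assumed output tokens) are illustrative assumptions, not part of the article.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def lstm_cell_step(x_t, h_prev, c_prev, p):
    """One LSTM timestep following the gate equations in the text."""
    f_t = sigmoid(p["U_f"] @ x_t + p["W_f"] @ h_prev)  # forget gate
    i_t = sigmoid(p["U_i"] @ x_t + p["W_i"] @ h_prev)  # input gate
    n_t = np.tanh(p["U_c"] @ x_t + p["W_c"] @ h_prev)  # new information
    c_t = f_t * c_prev + i_t * n_t                     # updated cell state
    o_t = sigmoid(p["U_o"] @ x_t + p["W_o"] @ h_prev)  # output gate
    h_t = o_t * np.tanh(c_t)                           # current hidden state
    return h_t, c_t

rng = np.random.default_rng(2)
params = {k: rng.standard_normal((3, 4)) for k in ("U_f", "U_i", "U_c", "U_o")}
params.update({k: rng.standard_normal((3, 3)) for k in ("W_f", "W_i", "W_c", "W_o")})

h_t, c_t = lstm_cell_step(rng.standard_normal(4), np.zeros(3), np.zeros(3), params)

# Hypothetical projection from the hidden state to 10 output tokens,
# followed by softmax; the argmax is the predicted token
V = rng.standard_normal((10, 3))
probs = softmax(V @ h_t)
predicted_token = int(np.argmax(probs))
```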

Here is a more intuitive diagram of the LSTM network.

This diagram is taken from an interesting blog, which I urge you all to go through. Here is the link:

● Understanding LSTM Networks

End Notes
To summarize, in this article we saw the architecture of a sequential model, the LSTM, and how it works in detail.


Introduction to Gated Recurrent Unit (GRU)
Shipra Saxena — Published On March 17, 2021 and Last Modified On March 18th, 2021

https://www.analyticsvidhya.com/blog/2021/03/introduction-to-gated-recurrent-unit-gru/
https://youtu.be/6IBhu4tDpOI
Objective
● In sequence modeling techniques, the Gated Recurrent Unit is the newest entrant after RNN and LSTM, and it offers some improvements over the other two.
● Understand the working of GRU and how it differs from LSTM.

Introduction
GRU, or Gated Recurrent Unit, is an advancement of the standard RNN, i.e. the recurrent neural network. It was introduced by Kyunghyun Cho et al. in 2014.

Note: If you are more interested in learning concepts in an audio-visual format, we have this entire article explained in the video below. If not, you may continue reading.

GRUs are very similar to Long Short Term Memory (LSTM). Just like LSTM, GRU uses gates to control the flow of information. GRUs are relatively new compared to LSTM; they offer some improvements over LSTM and have a simpler architecture.

Another interesting thing about GRU is that, unlike LSTM, it does not have a separate cell state (Ct); it only has a hidden state (Ht). Due to this simpler architecture, GRUs are faster to train.

In case you are unaware of the LSTM network, I suggest you go through the following article:

● Introduction to Long Short Term Memory (LSTM)

The Architecture of Gated Recurrent Unit

Now let's understand how GRU works. Here we have a GRU cell, which is more or less similar to an LSTM cell or an RNN cell.

At each timestamp t, it takes an input Xt and the hidden state Ht-1 from the previous timestamp t-1. Later it outputs a new hidden state Ht, which is again passed to the next timestamp.

There are primarily two gates in a GRU, as opposed to three gates in an LSTM cell. The first gate is the Reset gate and the other one is the Update gate.

Reset Gate (Short-term memory)

The Reset gate is responsible for the short-term memory of the network, i.e. the hidden state (Ht). Here is the equation of the Reset gate:

rt = σ(Xt · Ur + Ht-1 · Wr)

If you remember the LSTM gate equations, this is very similar to them. The value of rt will range from 0 to 1 because of the sigmoid function. Here Ur and Wr are the weight matrices for the Reset gate.

Update Gate (Long-term memory)

Similarly, we have an Update gate for long-term memory, and the equation of the gate is shown below:

ut = σ(Xt · Uu + Ht-1 · Wu)

The only difference is in the weight matrices, i.e. Uu and Wu.
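As a rough parallel to the LSTM sketches above, here is a minimal NumPy illustration of the two GRU gates. The toy sizes and random weight matrices are assumptions for the example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(3)

# Toy tensors: input dimension 4, hidden dimension 3
x_t, h_prev = rng.standard_normal(4), rng.standard_normal(3)
U_r, W_r = rng.standard_normal((3, 4)), rng.standard_normal((3, 3))  # reset gate weights
U_u, W_u = rng.standard_normal((3, 4)), rng.standard_normal((3, 3))  # update gate weights

# Reset gate: rt = sigmoid(Xt . Ur + Ht-1 . Wr), values in (0, 1)
r_t = sigmoid(U_r @ x_t + W_r @ h_prev)

# Update gate: ut = sigmoid(Xt . Uu + Ht-1 . Wu), values in (0, 1)
u_t = sigmoid(U_u @ x_t + W_u @ h_prev)
```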

How GRU Works

Now let's see the functioning of these gates. To find the hidden state Ht, a GRU follows a two-step process. The first step is to generate what is known as the candidate hidden state, as shown below.

Candidate Hidden State

Ĥt = tanh(Xt · Uh + (rt · Ht-1) · Wh)

(Here Uh and Wh denote the candidate's weight matrices.) The cell takes the input and the hidden state from the previous timestamp t-1, which is multiplied by the reset gate output rt. This entire information is then passed through a tanh function, and the resulting value is the candidate hidden state.

The most important part of this equation is how we use the value of the reset gate to control how much influence the previous hidden state has on the candidate state.

If the value of rt is equal to 1, the entire information from the previous hidden state Ht-1 is being considered. Likewise, if the value of rt is 0, the information from the previous hidden state is completely ignored.

Hidden State

Once we have the candidate state, it is used to generate the current hidden state Ht. This is where the Update gate comes into the picture. Instead of using a separate gate as in LSTM, the GRU uses a single Update gate to control both the historical information, which is Ht-1, and the new information coming from the candidate state:

Ht = ut · Ht-1 + (1 - ut) · Ĥt

This is a very interesting equation. Now assume the value of ut is around 0. Then the first term in the equation vanishes, which means the new hidden state will not carry much information from the previous hidden state. At the same time, the second factor becomes almost one, which essentially means the hidden state at the current timestamp will consist of the information from the candidate state only.

Similarly, if the value of ut is 1, the second term becomes 0 and the current hidden state depends entirely on the first term, i.e. the information from the hidden state at the previous timestamp t-1.

Hence we can conclude that the value of ut is very critical in this equation, and it can range from 0 to 1.
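Here is a compact sketch of one full GRU step following the two-step process above. The function name and the candidate weights Uh and Wh are illustrative assumptions; the final line follows the article's convention Ht = ut · Ht-1 + (1 - ut) · Ĥt.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_cell_step(x_t, h_prev, p):
    """One GRU timestep following the equations in the text."""
    r_t = sigmoid(p["U_r"] @ x_t + p["W_r"] @ h_prev)  # reset gate
    u_t = sigmoid(p["U_u"] @ x_t + p["W_u"] @ h_prev)  # update gate
    # Candidate hidden state: the reset gate scales the previous hidden state
    h_cand = np.tanh(p["U_h"] @ x_t + p["W_h"] @ (r_t * h_prev))
    # Interpolate between the old hidden state and the candidate
    # (ut weights the old state, 1 - ut weights the candidate)
    h_t = u_t * h_prev + (1.0 - u_t) * h_cand
    return h_t

rng = np.random.default_rng(4)
params = {k: rng.standard_normal((3, 4)) for k in ("U_r", "U_u", "U_h")}
params.update({k: rng.standard_normal((3, 3)) for k in ("W_r", "W_u", "W_h")})

h_t = gru_cell_step(rng.standard_normal(4), np.zeros(3), params)
```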

In case you are interested to know more about GRU, I suggest you read this paper.

End Notes
To summarize, let's see how GRU differs from LSTM.

● LSTM has three gates, whereas GRU has only two.
● In LSTM the gates are the Input gate, the Forget gate, and the Output gate, whereas in GRU we have a Reset gate and an Update gate.
● LSTM has two states: the cell state, or long-term memory, and the hidden state, also known as short-term memory.
● In the case of GRU, there is only one state, i.e. the hidden state (Ht).
