
BCSE332L

DEEP LEARNING

Module: 5
RECURSIVE NEURAL NETWORKS
1. Long-Term Dependencies
2. Echo State Networks
3. Long Short-Term Memory and
4. Other Gated RNNs
5. Optimization for Long-Term Dependencies
6. Explicit Memory
1:) RECURSIVE NEURAL NETWORKS
INTRODUCTION:
Recursive Neural Networks (RvNNs) are
deep neural networks used for natural
language processing.
We get a Recursive Neural Network when
the same weights are applied recursively on a
structured input to obtain a structured
prediction.
1:) RECURSIVE NEURAL NETWORKS
INTRODUCTION:
What Is a Recursive Neural Network?
Deep Learning is a subfield of machine learning
and artificial intelligence (AI) that attempts to
imitate how the human brain processes data and gains
certain knowledge.
Neural Networks form the backbone of Deep
Learning.
These are loosely modeled after the human brain
and designed to accurately recognize underlying
patterns in a data set. If you want to predict the
unpredictable, Deep Learning is the solution.
1:) RECURSIVE NEURAL NETWORKS
INTRODUCTION:
What Is a Recursive Neural Network?
Due to their deep tree-like structure,
Recursive Neural Networks can handle hierarchical
data.
The tree structure means combining child nodes
and producing parent nodes. Each child-parent bond
has a weight matrix, and similar children have the same
weights.
The number of children for every node in the tree is
fixed to enable it to perform recursive operations and
use the same weights. RvNNs are used when there's a
need to parse an entire sentence.
1:) RECURSIVE NEURAL NETWORKS
Recurrent Neural Network vs. Recursive Neural
Networks
1:) LONG TERM DEPENDENCIES
Why long-term dependencies?
1:) LONG TERM DEPENDENCIES
INTRODUCTION:
What are long-term dependencies?
Long-term dependencies are the situations where
the output of an RNN depends on the input that occurred
many time steps ago. For instance, consider the sentence
"The cat, which was very hungry, ate the mouse".
To understand the meaning of this sentence, you need
to remember that the cat is the subject of the verb ate,
even though they are separated by a long clause.
This is a long-term dependency, and it can affect the
performance of an RNN that tries to generate or analyze
such sentences.
1:) LONG TERM DEPENDENCIES
2.) Why do long-term dependencies matter?
Recurrent neural networks (RNNs) are
powerful machine learning models that can
process sequential data, such as text, speech, or
video.
However, they often struggle to capture long-term dependencies, which are the relationships between distant elements in the sequence.
1:) LONG TERM DEPENDENCIES
2.) Why are long-term dependencies hard to
learn?
The main reason why long-term
dependencies are hard to learn is that RNNs
suffer from the vanishing or exploding gradient
problem.
This means that the gradient, which is the
signal that tells the network how to update its
weights, becomes either very small or very
large as it propagates through the network.
1:) LONG TERM DEPENDENCIES
2.) Why are long-term dependencies hard to
learn?
When the gradient vanishes, the network
cannot learn from the distant inputs, and when
it explodes, the network becomes unstable and
produces erratic outputs.
This problem is caused by the repeated
multiplication of the same matrix, which
represents the connections between the hidden
units, at each time step.
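To see this effect concretely, the short NumPy sketch below (an illustration of the idea, not part of the slides) backpropagates a gradient through many identical linear time steps and tracks its norm for a contractive and an expansive recurrent matrix:

```python
import numpy as np

# A minimal sketch (illustration only): repeated multiplication by the same
# recurrent matrix W makes the backpropagated gradient vanish or explode,
# depending on W's spectral radius.
np.random.seed(0)
hidden_size, steps = 32, 100

for rho in (0.5, 1.5):  # target spectral radius of the recurrent matrix
    W = np.random.randn(hidden_size, hidden_size)
    W *= rho / max(abs(np.linalg.eigvals(W)))   # rescale largest |eigenvalue| to rho
    grad = np.random.randn(hidden_size)         # stand-in gradient at the last time step
    for _ in range(steps):
        grad = W.T @ grad                       # one step of backpropagation through time
    print(f"spectral radius {rho}: gradient norm after {steps} steps "
          f"= {np.linalg.norm(grad):.3e}")
# The 0.5 case shrinks toward zero (vanishing); the 1.5 case blows up (exploding).
```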
1:) LONG TERM DEPENDENCIES
2.) How can you use gated units to handle long-
term dependencies?
Another way to handle long-term
dependencies is to use gated units, which are
special types of hidden units that can control
the flow of information in the network.
The most popular gated units are the long
short-term memory (LSTM) and the gated
recurrent unit (GRU).
1:) LONG TERM DEPENDENCIES
2.) How can you use gated units to handle long-
term dependencies?
These units have internal mechanisms that
allow them to remember or forget the previous
inputs and outputs, depending on the current
input and output.
This way, they can selectively access the relevant information from the distant past and ignore the irrelevant information.
1:) LONG TERM DEPENDENCIES
2.) How can you use attention mechanisms to
handle long-term dependencies?
Another way to handle long-term
dependencies is to use attention mechanisms,
which are modules that can learn to focus on the
most important parts of the input or output
sequence. The most common attention
mechanism is self-attention, which computes
1:) LONG TERM DEPENDENCIES
2.) How can you use attention mechanisms to
handle long-term dependencies?
the similarity between each element in the
sequence and assigns a weight to each one.
Then, it uses these weights to create a
context vector, which summarizes the
information from the whole sequence.
This way, it can capture the relationships
between the distant elements and enhance the
representation of the sequence.
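To illustrate how such a context vector is computed, here is a minimal NumPy sketch of scaled dot-product self-attention; the array shapes and projection matrices are assumptions for illustration, not from the slides:

```python
import numpy as np

# A minimal sketch of scaled dot-product self-attention (shapes and
# projection matrices are assumptions for illustration).
def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq, Wk, Wv: (d_model, d_k) projections."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # similarity between every pair of positions
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax: one weight per position
    return weights @ V                               # context vector for each position

# Toy usage with random inputs and projections
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))                          # 5 time steps, 8 features each
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
context = self_attention(X, Wq, Wk, Wv)
print(context.shape)                                 # (5, 8): one context vector per position
```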
1:) LONG TERM DEPENDENCIES
Challenges:
LSTM

LSTM-Notations
LSTM Inputs and Outputs of the Unit
The LSTM unit takes three inputs:
X_t is the input at the current time step.
h_t-1 is the output from the previous LSTM unit.
C_t-1 is the "memory" (cell state) of the previous unit, and is the most important input.
As for outputs:
h_t is the output of the current unit.
C_t is the memory of the current unit.
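As a rough illustration of how these inputs and outputs fit together, the following minimal NumPy sketch computes one LSTM step; the stacked parameter layout and the names W, U, b are assumptions, not taken from the slides:

```python
import numpy as np

# A minimal sketch of one LSTM step (parameter layout is an assumption).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, W, U, b):
    """W, U, b hold the parameters of the four gates, stacked row-wise."""
    z = W @ x_t + U @ h_prev + b        # pre-activations for all gates at once
    H = h_prev.shape[0]
    f = sigmoid(z[0*H:1*H])             # forget gate: how much old memory to keep
    i = sigmoid(z[1*H:2*H])             # input gate: how much new memory to let in
    o = sigmoid(z[2*H:3*H])             # output gate: how much memory to expose
    C_tilde = np.tanh(z[3*H:4*H])       # candidate new memory
    C_t = f * C_prev + i * C_tilde      # cell state: kept old memory plus gated new memory
    h_t = o * np.tanh(C_t)              # output of the current unit
    return h_t, C_t

# Toy usage with random parameters
rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
W = rng.normal(size=(4 * n_hidden, n_in))
U = rng.normal(size=(4 * n_hidden, n_hidden))
b = np.zeros(4 * n_hidden)
h, C = np.zeros(n_hidden), np.zeros(n_hidden)
h, C = lstm_step(rng.normal(size=n_in), h, C, W, U, b)
print(h, C)
```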
LSTM Process of the Cell State (Memory)
C_t is the memory of the current unit.
The way the internal memory C_t changes is similar to piping water through a pipe.
Assuming the memory is water, it flows into a pipe.
You want to change this memory flow along the way, and this change is controlled by two valves.
LSTM Process of the Cell State (Memory)
C_t is the memory of the current unit.
The first valve is called the forget valve.
If you shut it, no old memory will be kept.
If you fully open this valve, all old memory will pass through.
LSTM Process of the Cell State (Memory)
C_t is the memory of the current unit.
The second valve is the new memory valve.
New memory comes in through a T-shaped joint and merges with the old memory.
Exactly how much new memory should come in is controlled by the second valve.
LSTM Process of the Forget Gate
Example: "Ram Likes Garden"
LSTM Process of the Input Gate
Figure: the candidate value (C̃_t) and a relevance check on the new input.
LSTM Process of the Output Gate
Overall LSTM Architecture
Advantages of LSTM
LSTM cells have several advantages over simple
RNN cells, such as their ability to learn long-term
dependencies and capture complex patterns in
sequential data.

For example:
They can predict the next word in a sentence
based on the previous words and the context, or
generate captions for images based on the visual
features and the language model.
Advantages of LSTM
LSTM cells can avoid the vanishing or exploding
gradient problem, allowing them to learn from
longer sequences without losing or amplifying the
information.
For example, they can translate a long sentence from one language to another without forgetting or distorting the meaning.
They can handle noisy or missing data better than
simple RNN cells, such as filling in the blanks or
correcting errors in a text based on the surrounding
words and grammar.
Disadvantages of LSTM
LSTM cells have some drawbacks when
compared to simple RNN cells.
They are more computationally expensive and
require more memory and time to train and run
due to their additional parameters and operations.
Additionally, they are more prone to overfitting,
necessitating regularization techniques such as
dropout, weight decay, or early stopping.
Finally, they are harder to interpret and explain than simple RNN cells, since they have more internal gates and states.
Other Gated RNNs
Gated recurrent units (GRUs) are a gating
mechanism in recurrent neural networks, introduced
in 2014.
The GRU is like a long short-term memory (LSTM)
with a gating mechanism to input or forget certain
features, but lacks a context vector or output gate,
resulting in fewer parameters than LSTM.
GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling, and natural language processing was found to be similar to that of LSTM.
Other Gated RNNs
Gated Recurrent Unit (GRU)
The main difference with the LSTM is that a
single gating unit simultaneously controls the
forgetting factor and the decision to update the state
unit.

The update equations are the following (in a standard formulation):

u_t = σ(W_u X_t + U_u h_t-1 + b_u)          (update gate)
r_t = σ(W_r X_t + U_r h_t-1 + b_r)          (reset gate)
h̃_t = tanh(W X_t + U (r_t ⊙ h_t-1) + b)     (candidate state)
h_t = u_t ⊙ h_t-1 + (1 − u_t) ⊙ h̃_t

where σ is the sigmoid function, ⊙ denotes element-wise multiplication, u stands for the "update" gate and r for the "reset" gate.
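As an informal companion to these equations, the NumPy sketch below computes one GRU step; the parameter names follow the u/r convention above and are otherwise assumptions for illustration:

```python
import numpy as np

# A minimal sketch of one GRU step implementing the equations above
# (parameter names are assumptions following the u/r convention).
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, Wu, Uu, bu, Wr, Ur, br, W, U, b):
    u = sigmoid(Wu @ x_t + Uu @ h_prev + bu)            # update gate
    r = sigmoid(Wr @ x_t + Ur @ h_prev + br)            # reset gate
    h_tilde = np.tanh(W @ x_t + U @ (r * h_prev) + b)   # candidate state
    return u * h_prev + (1.0 - u) * h_tilde             # leaky-integrator style update

# Toy usage with random parameters
rng = np.random.default_rng(0)
n_in, n_hidden = 4, 3
params = [rng.normal(size=s) for s in
          [(n_hidden, n_in), (n_hidden, n_hidden), (n_hidden,)] * 3]
h = np.zeros(n_hidden)
h = gru_step(rng.normal(size=n_in), h, *params)
print(h)
```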


Other Gated RNNs
The update gates act like conditional leaky integrators that can linearly gate any dimension, thus choosing to copy it (at one extreme of the sigmoid) or completely ignore it (at the other extreme) by replacing it with the new "target state" value.
Other Gated RNNs
The reset gates control which parts of the state get used to compute the next target state, introducing an additional nonlinear effect in the relationship between past state and future state.
Other Gated RNNs
For example:
The reset gate (or forget gate) output could be shared across multiple hidden units.
Alternatively, the product of a global gate (covering a whole group of units, such as an entire layer) and a local gate (per unit) could be used to combine global control and local control.
LSTM Reference Link
https://colah.github.io/posts/2015-08-Understanding-LSTMs/
Optimization for Long-Term Dependencies
Second-order optimization algorithms may roughly be understood as dividing the first derivative by the second derivative (in higher dimensions, multiplying the gradient by the inverse Hessian).
If the second derivative shrinks at a similar rate to the first derivative, then the ratio of first and second derivatives may remain relatively constant.
Optimization for Long-Term Dependencies
Unfortunately, second-order methods have many drawbacks, including high computational cost, the need for a large minibatch, and a tendency to be attracted to saddle points.
Simpler methods such as Nesterov momentum with careful initialization can achieve similar results.
Optimization for Long-Term Dependencies
Both of these approaches have largely been replaced by simply using SGD (even without momentum) applied to LSTMs.
This is part of a continuing theme in machine learning: it is often much easier to design a model that is easy to optimize than to design a more powerful optimization algorithm.
Optimization for Long-Term Dependencies
(i) Clipping Gradients
(ii) Regularizing to Encourage Information Flow
Optimization for Long-Term Dependencies
(i) Clipping Gradients
Gradient clipping is a process that helps maintain numerical stability by preventing the gradients from growing too large.
When training a neural network, the loss gradients are computed through backpropagation. However, if these gradients become too large, the updates to the model weights can also become excessively large, leading to numerical instability.
Optimization for Long-Term Dependencies
(i) Clipping Gradients
Gradient clipping is a technique used during the training of neural networks to address the issue of exploding gradients. When the gradients of the loss function with respect to the parameters become too large, the model's weights can be updated by huge amounts, leading to numerical instability and slow or even halted convergence of the training process.
Optimization for Long-Term Dependencies
(i) Clipping Gradients
Gradient clipping is a very effective technique for addressing the exploding gradient problem during training.
By limiting the magnitude of the gradients, it prevents them from growing unchecked and becoming too large. This ensures that the model learns more effectively and prevents it from getting stuck in a local minimum.
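The idea can be sketched in a few lines of NumPy; the helper below (a hypothetical name, not a library function) rescales a set of gradients so their combined norm never exceeds a chosen threshold:

```python
import numpy as np

# A minimal sketch of clipping gradients by their global norm
# (clip_by_global_norm is a hypothetical helper, not a library function).
def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-12)   # small epsilon avoids division by zero
        grads = [g * scale for g in grads]
    return grads

# Toy usage: an artificially exploding gradient gets rescaled to norm 5.0
grads = [np.full((3, 3), 1e6), np.full(3, -1e6)]
clipped = clip_by_global_norm(grads, max_norm=5.0)
print(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))  # ~5.0
```

In PyTorch, the same idea is available as torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm), typically called after loss.backward() and before optimizer.step().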
2:) ECHO STATE NETWORK
INTRODUCTION:
Echo State Networks (ESNs) are a specific
kind of recurrent neural network (RNN)
designed to efficiently handle sequential data.

ESNs are built on the reservoir computing framework, which includes a fixed, randomly initialized recurrent layer known as the "reservoir."
2:) ECHO STATE NETWORK
INTRODUCTION:
The key feature of ESNs is their ability to
make the most of the echo-like dynamics of the
reservoir, allowing them to effectively capture
and replicate temporal patterns in sequential
input.
The way ESNs work is by linearly
combining the input and reservoir states to
generate the network’s output.
2:) ECHO STATE NETWORK
INTRODUCTION:
The reservoir weights remain fixed.
This unique approach makes ESNs
particularly useful in tasks where capturing
temporal dependencies is critical, such as time-
series prediction and signal processing.
To implement ESNs in Python, researchers and practitioners often turn to libraries like PyTorch or specialized reservoir computing frameworks.
2:) ECHO STATE NETWORK
WHAT ARE ECHO STATE NETWORKS?
Echo State Networks (ESNs) are a fascinating type of recurrent neural network (RNN) tailored for handling sequential data.
Imagine an ESN as a three-part orchestra: there is the input layer, a reservoir filled with randomly initialized interconnected neurons, and the output layer.
2:) ECHO STATE NETWORK
WHAT ARE ECHO STATE NETWORKS?
The magic lies in the reservoir, where the
weights are like a musical improvisation – fixed
and randomly assigned.
This creates an “echo” effect, capturing the
dynamics of the input signal. During training,
we tweak only the output layer, guiding it to
map the reservoir’s states to the desired output.
2:) ECHO STATE NETWORK
WHAT ARE ECHO STATE NETWORKS?
An Echo State Network (ESN) is like a smart system that can predict what comes next in a sequence of data.
Imagine you have a list of numbers or
values, like the temperature each day.
An ESN can learn from this data and then
try to guess the temperature for the next day.
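To make that concrete, here is a minimal NumPy sketch; the reservoir size, spectral radius, and the sine-wave data standing in for daily temperatures are assumptions, and only the linear readout is trained (by ridge regression), as in the reservoir computing approach described above:

```python
import numpy as np

# A minimal sketch of an Echo State Network for next-value prediction.
# The reservoir size, spectral radius, and the sine-wave data (standing in
# for a temperature record) are assumptions for illustration.
rng = np.random.default_rng(0)
n_reservoir, spectral_radius = 200, 0.9

# Fixed, randomly initialized weights: only the readout W_out is trained.
W_in = rng.uniform(-0.5, 0.5, size=n_reservoir)
W = rng.normal(size=(n_reservoir, n_reservoir))
W *= spectral_radius / max(abs(np.linalg.eigvals(W)))  # scale for the echo state property

def run_reservoir(series):
    """Collect the reservoir state produced by each input value."""
    x, states = np.zeros(n_reservoir), []
    for u in series:
        x = np.tanh(W_in * u + W @ x)                  # echo-like state update
        states.append(x.copy())
    return np.array(states)

# Toy data: a noisy sine wave; the target is the next value in the series.
series = np.sin(np.linspace(0, 20 * np.pi, 2000)) + 0.05 * rng.normal(size=2000)
X, y = run_reservoir(series[:-1]), series[1:]

# Train the linear readout with ridge regression (closed form).
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_reservoir), X.T @ y)
print("train MSE:", np.mean((X @ W_out - y) ** 2))
```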
2:) ECHO STATE NETWORK
Applications of Echo-State Networks
2:) ECHO STATE NETWORK
How do Echo-State Networks work?
Imagine an Echo State Network (ESN) as a smart musical-instrument player that you are trying to teach to mimic a song.
The player has a large collection of notes (the reservoir), and when you play the first few notes of the song (the input), the player responds with its interpretation of the melody.
2:) ECHO STATE NETWORK
Concepts of Echo-State Networks
