Deep Learning - AD3501 - Notes - Unit 3 - Recurrent Neural Networks
Introduction to RNN
Traditional neural networks have independent input and output layers, which makes them
inefficient when dealing with sequential data. Hence, a new type of neural network, the Recurrent
Neural Network (RNN), was introduced to store the results of previous outputs in an internal memory. These
results are then fed back into the network along with the new inputs in order to predict the output of the layer. This allows
RNNs to be used in applications like pattern detection, speech and voice recognition, natural language
processing, and time series prediction.
Below is how we can convert a Feed-Forward Neural Network into a Recurrent Neural Network:
The input layer ‘x’ takes in the input to the neural network and processes it and passes it onto the
middle layer.
The middle layer ‘h’ can consist of multiple hidden layers, each with its own activation functions,
weights, and biases. In a plain feed-forward network, the hidden layers are not affected by what the
network has seen at earlier time steps, i.e. the network has no memory. By feeding the hidden state
back into the network at every time step, we convert it into a recurrent neural network.
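To make this concrete, the following minimal NumPy sketch (the sizes, weight names, and the tanh non-linearity are illustrative assumptions, not taken from these notes) shows a hidden state being carried forward and the same weights being reused at every time step:

```python
import numpy as np

def rnn_forward(inputs, W_xh, W_hh, b_h):
    """Run a simple recurrent cell over a sequence of input vectors."""
    h = np.zeros(W_hh.shape[0])           # initial hidden state (the network's memory)
    hidden_states = []
    for x_t in inputs:                    # process the sequence one step at a time
        # the same weights are reused at every time step (parameter sharing)
        h = np.tanh(W_xh @ x_t + W_hh @ h + b_h)
        hidden_states.append(h)
    return hidden_states

# illustrative sizes: 3 time steps, 4-dimensional inputs, 5 hidden units
rng = np.random.default_rng(0)
inputs = [rng.normal(size=4) for _ in range(3)]
W_xh, W_hh, b_h = rng.normal(size=(5, 4)), rng.normal(size=(5, 5)), np.zeros(5)
print(rnn_forward(inputs, W_xh, W_hh, b_h)[-1])   # final hidden state summarises the sequence
```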
Fig: Feed-forward Neural Network
In a feed-forward neural network, the decisions are based on the current input. It doesn’t
memorize the past data, and there’s no future scope. Feed-forward neural networks are used in
general regression and classification problems.
Applications of RNNs:
Image Captioning: RNNs are used to caption an image by analysing the activities present in it.
Time Series Prediction: Any time series problem, like predicting the prices of stocks in a
particular month, can be solved using an RNN.
Natural Language Processing: Text mining and Sentiment analysis can be carried out using an
RNN for Natural Language Processing (NLP).
Advantages of RNNs:
Ability to Handle Variable-Length Sequences: RNNs are designed to handle input sequences
of variable length, which makes them well-suited for tasks such as speech recognition, natural
language processing, and time series analysis.
Memory of Past Inputs: RNNs have a memory of past inputs, which allows them to capture
information about the context of the input sequence. This makes them useful for tasks such as
language modelling, where the meaning of a word depends on the context in which it appears.
Parameter Sharing: RNNs share the same set of parameters across all time steps, which reduces
the number of parameters that need to be learned and can lead to better generalization.
Non-Linear Mapping: RNNs use non-linear activation functions, which allow them to learn
complex, non-linear mappings between inputs and outputs.
Sequential Processing: RNNs process input sequences step by step, which lets them model
temporal order naturally, although this sequential nature limits how easily the computation can be parallelized.
Flexibility: RNNs can be adapted to a wide range of tasks and input types, including text,
speech, and image sequences.
These advantages make RNNs a powerful tool for sequence modelling and analysis, and have
led to their widespread use in a variety of applications, including natural language processing,
speech recognition, and time series analysis.
Disadvantages of RNNs:
Vanishing and Exploding Gradients: RNNs can suffer from the problem of vanishing or
exploding gradients, which can make it difficult to train the network effectively. This occurs
when the gradients of the loss function with respect to the parameters become very small or very
large as they propagate through time.
Lack of Parallelism: RNNs are inherently sequential, which makes it difficult to parallelize the
computation. This can limit the speed and scalability of the network.
Difficulty in Choosing the Right Architecture: There are many different variants of RNNs,
each with its own advantages and disadvantages. Choosing the right architecture for a given task
can be challenging, and may require extensive experimentation and tuning.
These disadvantages are important when deciding whether to use an RNN for a given task.
However, many of these issues can be addressed through careful design and training of the
network and through techniques such as regularization and attention mechanisms.
Types of RNNs
1. One-to-One
One-to-One is the simplest type of RNN, in which a single input is mapped to a single output and
both have a fixed size. It behaves like a plain feed-forward network; image classification is a
typical example.
One-to-One
2. One-to-Many
One-to-Many is a type of RNN that gives multiple outputs when given a single input. It takes a
fixed input size and gives a sequence of data outputs. Its applications can be found in Music
Generation and Image Captioning.
One-to-Many
3. Many-to-One
Many-to-One is used when a single output is required from multiple input units or a sequence of
them. It takes a sequence of inputs to display a fixed output. Sentiment Analysis is a common
example of this type of Recurrent Neural Network.
4. Many-to-Many
Many-to-Many are used to generate a sequence of output data from a sequence of input units.
This type of RNN is further divided into the following two subcategories:
1. Equal Unit Size: In this case, the numbers of input and output units are the same. A
common application can be found in Named-Entity Recognition.
2. Unequal Unit Size: In this case, the numbers of input and output units differ. A common
application is Machine Translation.
Vanishing Gradient Problem
RNNs suffer from the problem of vanishing gradients. The gradients carry the information used to
update the RNN's parameters, and when the gradient becomes too small, the parameter updates
become insignificant. This makes learning from long data sequences difficult.
Exploding Gradient Problem
While training a neural network, if the slope tends to grow exponentially instead of decaying,
it is called an exploding gradient. This problem arises when large error gradients accumulate,
resulting in very large updates to the neural network model weights during training.
Long training time, poor performance, and bad accuracy are the major issues caused by gradient
problems.
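A rough numerical illustration of both problems, together with the commonly used gradient-clipping remedy for the exploding case (the numbers and the clipping threshold below are purely illustrative):

```python
import numpy as np

steps = 50
print(0.9 ** steps)   # ~0.005: a factor < 1 repeated 50 times -> gradient vanishes
print(1.1 ** steps)   # ~117:   a factor > 1 repeated 50 times -> gradient explodes

def clip_gradient(grad, max_norm=5.0):
    """Rescale the gradient vector if its norm exceeds max_norm (gradient clipping)."""
    norm = np.linalg.norm(grad)
    return grad * (max_norm / norm) if norm > max_norm else grad

g = np.array([30.0, -40.0])     # norm 50, far above the threshold
print(clip_gradient(g))         # rescaled to norm 5
```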
Bidirectional RNNs:
Bidirectional RNNs are designed to process input sequences in both forward and backward
directions. This allows the network to capture both past and future context, which can be useful
for speech recognition and natural language processing tasks.
Encoder-Decoder RNNs:
Encoder-decoder RNNs consist of two RNNs: an encoder network that processes the input
sequence and produces a fixed-length vector representation of the input and a decoder network
that generates the output sequence based on the encoder's representation. This architecture is
commonly used for sequence-to-sequence tasks such as machine translation.
Attention Mechanisms
Attention mechanisms are a technique that can be used to improve the performance of RNNs on
tasks that involve long input sequences. They work by allowing the network to attend to different
parts of the input sequence selectively rather than treating all parts of the input sequence equally.
This can help the network focus on the input sequence's most relevant parts and ignore irrelevant
information.
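As a rough sketch of the general idea (dot-product scoring and the variable names are assumptions made for this example, not a specific published attention variant), the attention weights can be computed as a softmax over similarity scores between a query and the encoder states, and the context is their weighted sum:

```python
import numpy as np

def attention(query, encoder_states):
    """Weight each encoder state by its (softmaxed) dot-product similarity to the query."""
    scores = encoder_states @ query                  # one similarity score per time step
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                         # softmax over the input positions
    context = weights @ encoder_states               # weighted sum of encoder states
    return context, weights

states = np.random.randn(6, 8)                       # 6 time steps, 8-dim hidden states
query = np.random.randn(8)
context, weights = attention(query, states)
print(weights.round(2), context.shape)               # weights sum to 1, context is 8-dim
```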
These are just a few examples of the many variant RNN architectures that have been developed
over the years. The choice of architecture depends on the specific task and the characteristics of
the input and output sequences.
Encoder-Decoder Model
Fig: Encoder → Hidden Vector → Decoder
Encoder-Decoder models are jointly trained to maximize the conditional probabilities of the target
sequence given the input sequence.
In order to fully understand the model’s underlying logic, we will go over the below illustration:
Encoder
Multiple RNN cells can be stacked together to form the encoder. The RNN reads each input
sequentially.
For every timestep (each input) t, the hidden state (hidden vector) h is updated according to
the input at that timestep x_t:
\[ h_t = f\left( W^{(hh)} h_{t-1} + W^{(hx)} x_t \right) \]
After all the inputs are read by the encoder model, the final hidden state of the model represents
the context/summary of the whole input sequence.
Example: Encoder
At the first timestep t1, the previous hidden state h0 will be considered as zero or randomly
chosen. So the first RNN cell will update the current hidden state with the first input and h0.
Each RNN cell outputs two things: the updated hidden state and the output for that stage. The
outputs at each stage are discarded and only the hidden states are propagated to the next
step.
At the second timestep t2, the hidden state h1 and the second input x[2] are given as input,
and the hidden state h2 is updated according to both. The same process repeats for all
four stages in the example taken.
A stack of several recurrent units (LSTM or GRU cells for better performance) where each
accepts a single element of the input sequence, collects information for that element, and
propagates it forward.
In the question-answering problem, the input sequence is a collection of all words from the
question. Each word is represented as x_i where i is the order of that word.
Encoder Vector
This is the final hidden state produced from the encoder part of the model. It is calculated
using the formula above.
This vector aims to encapsulate the information for all input elements in order to help the
decoder make accurate predictions.
It acts as the initial hidden state of the decoder part of the model.
Decoder
The Decoder generates the output sequence by predicting the next output Yt given the hidden
state ht.
The input for the decoder is the final hidden vector obtained at the end of encoder model.
Each decoder cell has three inputs: the hidden vector from the previous step ht-1, the previous
output yt-1, and the original hidden vector h.
At the first step, the encoder's output vector, the special START symbol, and an empty hidden
state ht-1 are given as input; the outputs obtained are y1 and the updated hidden state h1
(the information already emitted as output is removed from the hidden vector).
The second step takes the updated hidden state h1, the previous output y1, and the original
hidden vector h as its inputs, and produces the hidden vector h2 and output y2.
The outputs produced at each timestep of the decoder form the actual output sequence. The model
keeps predicting outputs until the END symbol occurs.
A stack of several recurrent units where each predicts an output y_t at a time step t.
In the question-answering problem, the output sequence is a collection of all words from the
answer. Each word is represented as y_i where i is the order of that word.
Example: Decoder.
As you can see, we are just using the previous hidden state to compute the next one.
Output Layer
A softmax output layer is typically used to turn the decoder's vector of values into a probability
distribution, so that the target class receives the highest probability.
The power of this model lies in the fact that it can map sequences of different lengths to each other.
As you can see, the input and output lengths are not tied to each other and can differ. This opens up a
whole new range of problems that can now be solved using such an architecture.
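A compact NumPy sketch of the flow described above (all weights, sizes, and the fixed number of decoding steps are illustrative assumptions; a real model would learn the weights and stop at an END symbol): the encoder folds the input sequence into a final hidden vector, which then seeds the decoder.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8                                           # hidden size (illustrative)
W_hh, W_hx = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
W_out = rng.normal(size=(d, d)) * 0.1           # maps decoder state to an output vector

def encode(inputs):
    h = np.zeros(d)                             # h0: zero initial hidden state
    for x in inputs:                            # outputs at each step are discarded,
        h = np.tanh(W_hh @ h + W_hx @ x)        # only the hidden state is propagated
    return h                                    # final h = context/summary of the input

def decode(context, n_steps):
    h, y = context, np.zeros(d)                 # context vector seeds the decoder
    outputs = []
    for _ in range(n_steps):                    # in practice we stop at an END symbol
        h = np.tanh(W_hh @ h + W_hx @ y)        # previous output is fed back in
        y = W_out @ h
        outputs.append(y)
    return outputs

context = encode([rng.normal(size=d) for _ in range(4)])
print(len(decode(context, 3)))                  # 4 input steps mapped to 3 output steps
```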
Applications
Speech recognition
BIDIRECTIONAL RNN
A bi-directional recurrent neural network (Bi-RNN) is a type of recurrent neural network (RNN)
that processes input data in both forward and backward directions. The goal of a Bi-RNN is
to capture the contextual dependencies in the input data by processing it in both directions,
which can be useful in a variety of natural language processing (NLP) tasks.
In a Bi-RNN, the input data is passed through two separate RNNs: one processes the data in the
forward direction, while the other processes it in the reverse direction. The outputs of these two
RNNs are then combined in some way to produce the final output.
One common way to combine the outputs of the forward and reverse RNNs is to concatenate
them, but other methods, such as element-wise addition or multiplication can also be used. The
choice of combination method can depend on the specific task and the desired properties of the
final output.
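A minimal NumPy illustration of the two passes and of combining their hidden states by concatenation (the weight names and sizes are assumptions made for the sketch):

```python
import numpy as np

def run_rnn(inputs, W_x, W_h):
    """Plain forward pass of a simple RNN, returning the hidden state at every step."""
    h, states = np.zeros(W_h.shape[0]), []
    for x in inputs:
        h = np.tanh(W_x @ x + W_h @ h)
        states.append(h)
    return states

rng = np.random.default_rng(2)
inputs = [rng.normal(size=4) for _ in range(5)]
Wx_f, Wh_f = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))   # forward RNN weights
Wx_b, Wh_b = rng.normal(size=(3, 4)), rng.normal(size=(3, 3))   # backward RNN weights

forward = run_rnn(inputs, Wx_f, Wh_f)                 # left-to-right pass
backward = run_rnn(inputs[::-1], Wx_b, Wh_b)[::-1]    # right-to-left pass, re-aligned
combined = [np.concatenate([f, b]) for f, b in zip(forward, backward)]
print(combined[0].shape)                              # (6,) = forward + backward states
```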
A standard uni-directional RNN processes the sequence in a single (forward) direction. This means
that the network can only use information from earlier time steps when making predictions at later
time steps.
This can be limiting, as the network may not capture important contextual information
relevant to the output prediction.
For example, in natural language processing tasks, a uni-directional RNN may not accurately
predict the next word in a sentence if the previous words provide important context for the
current word.
Consider an example where we could use the recurrent network to predict the masked word in a
sentence.
1. Apple is my favorite ___.
2. Apple is my favorite ___, and I work there.
3. Apple is my favorite ___, and I am going to buy one.
In the first sentence, the answer could be fruit, company, or phone. But in the second and third
sentences, it cannot be a fruit.
A Recurrent Neural Network that can only process the inputs from left to right might not be able
to accurately predict the right answer for sentences discussed above.
To perform well on natural language tasks, the model must be able to process the sequence in
both directions.
Bi-directional RNNs
A bidirectional recurrent neural network is a type of recurrent neural network (RNN)
that processes input sequences in both forward and backward directions.
This allows the RNN to capture information from the input sequence that may be relevant to
the output prediction, but the same could be lost in a traditional RNN that only processes the
input sequence in one direction.
This allows the network to consider information from the past and future when making
predictions rather than just relying on the input data at the current time step.
This can be useful for tasks such as language processing, where understanding the context of
a word or phrase can be important for making accurate predictions.
In general, bidirectional RNNs can help improve the performance of a model on a variety of
sequence-based tasks.
These two RNNs are typically referred to as the forward and backward RNNs, respectively.
During the forward pass of the RNN, the forward RNN processes the input sequence in the usual
way by taking the input at each time step and using it to update the hidden state. The updated
hidden state is then used to predict the output at that time step.
Back-propagation through time (BPTT) is a widely used algorithm for training recurrent neural
networks (RNNs). It is a variant of the back-propagation algorithm specifically designed to
handle the temporal nature of RNNs, where the output at each time step depends on the inputs
and outputs at previous time steps.
In the case of a bidirectional RNN, BPTT involves two separate Back-propagation passes: one for
the forward RNN and one for the backward RNN. During the forward pass, the forward RNN
processes the input sequence in the usual way and makes predictions for the output sequence.
These predictions are then compared to the target output sequence, and the error is
back-propagated through the network to update the weights of the forward RNN.
During the backward pass, the backward RNN processes the input sequence in reverse order and
makes predictions for the output sequence. These predictions are then compared to the target
output sequence in reverse order, and the error is back-propagated through the network to update
the weights of the backward RNN.
Once both passes are complete, the weights of the forward and backward RNNs are updated based
on the errors computed during the forward and backward passes, respectively. This process is
repeated for multiple iterations until the model converges and the predictions of the bidirectional
RNN are accurate.
Bidirectional recurrent neural networks (RNNs) can outperform traditional RNNs on various
tasks, particularly those involving sequential data processing. Some examples of tasks where
bidirectional RNNs have been shown to outperform traditional RNNs include:
Natural language processing tasks, such as language translation and sentiment analysis,
where understanding the context of a word or phrase can be important for making accurate
predictions.
Time series forecasting tasks, such as predicting stock prices or weather patterns, where the
sequence of past data can provide important clues about future trends.
Audio processing tasks, such as speech recognition or music generation, where the
information in the audio signal can be complex and non-linear.
In general, bidirectional RNNs can be useful for any task where the input data has a temporal
structure and where understanding the context of the data is important for making accurate
predictions.
Advantages:
Bidirectional Recurrent Neural Networks (RNNs) have several advantages over traditional
RNNs. Some of the key advantages of bidirectional RNNs include the following:
Improved performance on tasks that involve processing sequential data. Because bidirectional
RNNs can consider information from both past and future time steps when making predictions,
they can capture richer context than a standard uni-directional RNN.
However, Bidirectional RNNs also have some disadvantages. Some of the key disadvantages of
bidirectional RNNs include the following:
Increased computational complexity. Because bidirectional RNNs have two separate RNNs
(one for the forward pass and one for the backward pass), they can require more
computational resources to train and evaluate than traditional RNNs. This can make them
more difficult to implement and less efficient in terms of runtime performance.
More difficult to optimize. Because bidirectional RNNs have more parameters (due to the two
separate RNNs), they can be more difficult to optimize. This can make finding the right set of
weights for the model challenging and lead to slower convergence during training.
The need for longer input sequences. For a bidirectional RNN to capture long-term
dependencies in the data, it typically requires longer input sequences than a traditional RNN.
This can be a disadvantage in situations where the input data is limited or noisy, as it may not
be possible to generate enough input data to train the model effectively.
Recursive Neural Networks
Recursive Neural Networks (RvNNs) are a class of deep neural networks that can learn detailed and
structured information. With RvNN, you can get a structured prediction by recursively applying
the same set of weights on structured inputs. The word recursive indicates that the neural
network is applied to its output.
Due to their deep tree-like structure, Recursive Neural Networks can handle hierarchical data. The
tree structure means combining child nodes and producing parent nodes. Each child-parent bond
has a weight matrix, and similar children have the same weights. The number of children for
every node in the tree is fixed to enable it to perform recursive operations and use the same
weights. RvNNs are used when there's a need to parse an entire sentence.
To calculate the parent node's representation, we add the products of the weight matrices (W_i) and
the children's representations (C_i) and apply the transformation f:
\[ h = f\left( \sum_{i=1}^{c} W_i C_i \right) \], where c is the number of children.
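A small NumPy sketch of this parent-node computation (the two-child tree, the weight values, and the choice of tanh as f are illustrative assumptions):

```python
import numpy as np

def parent_representation(children, weights, f=np.tanh):
    """h = f(sum_i W_i C_i): combine child vectors into a parent vector."""
    return f(sum(W @ c for W, c in zip(weights, children)))

rng = np.random.default_rng(3)
c1, c2 = rng.normal(size=5), rng.normal(size=5)             # two child node representations
W1, W2 = rng.normal(size=(5, 5)), rng.normal(size=(5, 5))   # one weight matrix per child slot

phrase = parent_representation([c1, c2], [W1, W2])          # e.g. combine "a lot" and "of fun"
sentence = parent_representation([phrase, rng.normal(size=5)], [W1, W2])  # reuse the same weights
print(sentence.shape)
```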
Recurrent Neural Networks (RNNs) are another well-known class of neural networks used for
processing sequential data. They are closely related to the Recursive Neural Network.
Recurrent Neural Networks represent temporal sequences, which is why they find application
in Natural Language Processing (NLP), since language-related data like sentences and
paragraphs are inherently sequential.
A Recursive Neural Network is used for sentiment analysis in natural language sentences. It is one
of the most important tasks of Natural language Processing (NLP), which identifies the writing
tone and sentiments of the writer in a particular sentence. If a writer expresses any sentiment,
basic labels about the writing tone are recognized. We want to identify the smaller components
like nouns or verb phrases and order them in a syntactic hierarchy. For example, it identifies
whether the sentence showcases a constructive form of writing or negative word choices.
A variable called 'score' is calculated at each traversal of nodes, telling us which pair of phrases and
words we must combine to form the perfect syntactic tree for a given sentence.
Let us consider the representation of the phrase "a lot of fun" in the following sentence:
"Programming is a lot of fun."
An RNN representation of this phrase would not be suitable because it considers only sequential
relations. Each state varies with the preceding words' representation. So, a subsequence that
doesn't occur at the beginning of the sentence can't be represented. With RNN, when processing
the word 'fun,' the hidden state will represent the whole sentence.
However, with a Recursive Neural Network (RvNN), the hierarchical architecture can store the
representation of the exact phrase: it lies in the hidden state of the node R_{a\ lot\ of\ fun}. Thus,
syntactic parsing can be implemented naturally with the help of Recursive Neural Networks.
The two significant advantages of Recursive Neural Networks for Natural Language
Processing are their structure and reduction in network depth.
As already explained, the tree structure of Recursive Neural Networks can manage
hierarchical data like in parsing problems.
The main disadvantage of recursive neural networks can be the tree structure. Using the tree
structure introduces a particular inductive bias into our model: the assumption that the data
follow a tree hierarchy. When that assumption does not hold, the network may not be able to
learn the existing patterns.
Another disadvantage of the Recursive Neural Network is that sentence parsing can be slow
and ambiguous. Interestingly, there can be many parse trees for a single sentence.
Also, it is more time-consuming and labor-intensive to label the training data for recursive
neural networks than to construct recurrent neural networks. Manually parsing a sentence into
short components is more time-consuming and tedious than assigning a label to a sentence.
Gated Architecture
LSTM (Long Short-Term Memory) is a variety of recurrent neural network (RNN) used in the field of
Deep Learning that is capable of learning long-term dependencies, especially in sequence prediction problems.
LSTMs are predominantly used to learn, process, and classify sequential data because these
networks can learn long-term dependencies between time steps of data. Common LSTM
applications include sentiment analysis, language modelling, speech recognition, and video
analysis.
LSTM has feedback connections, i.e., it is capable of processing the entire sequence of data, apart
from single data points such as images. This finds application in speech recognition, machine
translation, etc. LSTM is a special kind of RNN, which shows outstanding performance on a
large variety of problems.
The central role of an LSTM model is held by a memory cell known as a ‘cell state’ that maintains
its state over time. The cell state is the horizontal line that runs through the top of the below
diagram. It can be visualized as a conveyor belt through which information just flows,
unchanged.
The sigmoid layer gives out numbers between zero and one, where zero means ‘nothing should
be let through,’ and one means ‘everything should be let through.’
1. Forget Gate(f): At forget gate the input is combined with the previous output to generate
a fraction between 0 and 1, that determines how much of the previous state need to be
preserved (or in other words, how much of the state should be forgotten). This output is
then multiplied with the previous state. Note: An activation output of 1.0 means
“remember everything” and activation output of 0.0 means “forget everything.” From a
different perspective, a better name for the forget gate might be the “remember gate”
2. Input Gate(i): Input gate operates on the same signals as the forget gate, but here the
objective is to decide which new information is going to enter the state of LSTM. The
output of the input gate (again a fraction between 0 and 1) is multiplied with the output of
the tanh block that produces the new values that must be added to the previous state. This gated
vector is then added to the previous state to generate the current state.
3. Input Modulation Gate(g): It is often considered as a sub-part of the input gate and much
literature on LSTM’s does not even mention it and assume it is inside the Input gate. It is
used to modulate the information that the Input gate will write onto the Internal State Cell
by adding non-linearity to the information and making the information Zero-mean. This
is done to reduce the learning time, as zero-mean input leads to faster convergence. Although
this gate’s actions are less important than the others and it is often treated as a finesse-providing
concept, it is good practice to include it in the structure of the LSTM unit.
4. Output Gate(o): It determines what the next hidden state should be. The current hidden
state is calculated by first taking the element-wise hyperbolic tangent of the current internal
cell state vector and then performing element-wise multiplication with the output gate's activation.
The above-stated working is illustrated below:
Note that the blue circles denote element-wise multiplication. The weight matrix W contains
different weights for the current input vector and the previous hidden state for each gate.
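A single LSTM time step written out in NumPy following the gate descriptions above (stacking the four gates into one weight matrix applied to the concatenation of the current input and previous hidden state is an implementation convenience, not something stated in these notes):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: W holds one row block per gate, applied to [x_t, h_prev]."""
    z = W @ np.concatenate([x_t, h_prev]) + b
    f, i, o, g = np.split(z, 4)
    f, i, o = sigmoid(f), sigmoid(i), sigmoid(o)   # forget, input, output gates in (0, 1)
    g = np.tanh(g)                                 # input modulation: candidate values in (-1, 1)
    c = f * c_prev + i * g                         # keep part of the old state, add new information
    h = o * np.tanh(c)                             # hidden state exposed to the next step
    return h, c

n_in, n_hid = 4, 6
rng = np.random.default_rng(4)
W = rng.normal(size=(4 * n_hid, n_in + n_hid)) * 0.1
b = np.zeros(4 * n_hid)
h, c = lstm_step(rng.normal(size=n_in), np.zeros(n_hid), np.zeros(n_hid), W, b)
print(h.shape, c.shape)
```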
LSTMs work in a 3-step process.
Step 1: Decide How Much Past Data It Should Remember
The first step in the LSTM is to decide which information should be omitted from the cell in that
particular time step. The sigmoid function determines this. It looks at the previous state (ht-1)
along with the current input xt and computes the function.
ft – forget gate. Decides which information from the previous time step is unimportant and should
be deleted.
1. Let the output of h(t-1) be “Alice is good in Physics. John, on the other hand, is good at
Chemistry.”
2. Let the current input at x(t) be “John plays football well. He told me yesterday over the phone
that he had served as the captain of his college football team.”
The forget gate realizes there might be a change in context after encountering the first full stop. It
compares with the current input sentence at x(t). The next sentence talks about John, so the
information on Alice is deleted. The position of the subject is vacated and assigned to John.
Step 2: Decide How Much This Unit Adds to the Current State
In the second layer, there are two parts. One is the sigmoid function, and the other is the tanh
function. The sigmoid function decides which values to let through (0 or 1). The tanh function
gives weightage to the values that are passed, deciding their level of importance (-1 to 1).
it – input gate. Determines which information to let through based on its significance in the
current time step.
With the current input at x(t), the input gate analyses the important information John plays
football, and the fact that he was the captain of his college team is important.
“He told me yesterday over the phone” is less important; hence it's forgotten. This process of
adding some new information can be done via the input gate.
Step 3: Decide What Part of the Current Cell State Makes It to the Output
The third step is to decide what the output will be. First, we run a sigmoid layer, which decides
what parts of the cell state make it to the output. Then, we put the cell state through tanh to push
the values to be between -1 and 1 and multiply it by the output of the sigmoid gate.
Let’s consider this example to predict the next word in the sentence: “John played tremendously
well against the opponent and won for his team. For his contributions, brave ___ was awarded
player of the match.” There could be many choices for the empty space. The current input brave
is an adjective, and adjectives describe a noun. So, “John” could be the best output after brave.
LSTM Applications
As noted earlier, common LSTM applications include sentiment analysis, language modelling,
speech recognition, machine translation, and video analysis.
Skip Connections
As previously explained, using the chain rule, we must keep multiplying terms with the error
gradient as we go backwards. However, in the long chain of multiplication, if we multiply many
things together that are less than one, then the resulting gradient will be very small. Thus, the
gradient becomes very small as we approach the earlier layers in a deep architecture. In
some cases, the gradient becomes zero, meaning that we do not update the early layers at all.
In general, there are two fundamental ways that one could use skip connections through different
non-sequential layers: addition (as in residual networks) and concatenation (as in densely
connected networks).
The core idea is to back-propagate through the identity function, by just using a vector
addition. Then the gradient would simply be multiplied by one and its value will be maintained
in the earlier layers. This is the main idea behind Residual Networks (ResNets): they stack
these skip residual blocks together. We use an identity function to preserve the gradient.
Mathematically, we can represent the residual block, and calculate its partial derivative
(gradient) given the loss function, like this:
\[ y = x + \mathcal{F}(x), \qquad \frac{\partial \mathcal{L}}{\partial x} = \frac{\partial \mathcal{L}}{\partial y}\left(1 + \frac{\partial \mathcal{F}(x)}{\partial x}\right) \]
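A minimal residual block in this additive style (fully connected layers and ReLU are used here only to keep the sketch short; real ResNets stack convolutional layers):

```python
import numpy as np

def residual_block(x, W1, W2):
    """y = x + F(x): the skip connection adds the input back onto the block's output."""
    fx = np.maximum(0.0, W1 @ x)      # F(x): a small two-layer transformation with ReLU
    fx = W2 @ fx
    return x + fx                     # identity path keeps the gradient multiplied by one

rng = np.random.default_rng(5)
x = rng.normal(size=8)
W1, W2 = rng.normal(size=(8, 8)) * 0.1, rng.normal(size=(8, 8)) * 0.1
print(residual_block(x, W1, W2).shape)
```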
Apart from the vanishing gradients, there is another reason that we commonly use them. For a
plethora of tasks (such as semantic segmentation, optical flow estimation, etc.) there is some
information that was captured in the initial layers and we would like to allow the later layers to
also learn from them. It has been observed that in earlier layers the learned features
correspond to lower-level semantic information extracted from the input. If we had not
used the skip connection, that information would have become too abstract.
As stated, for many dense prediction problems, there is low-level information shared between
the input and output, and it would be desirable to pass this information directly across the
net. The alternative way that we can achieve skip connections is by concatenation of previous
feature maps. The best-known architecture built on this idea is DenseNet. Below is an
example of feature reusability by concatenation across several layers:
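A sketch of this concatenation variant in the DenseNet spirit, where each new layer consumes the feature vectors of all earlier layers stacked together (the layer count and sizes are illustrative):

```python
import numpy as np

def dense_layer(features, W):
    """Each new layer consumes the concatenation of all previous feature vectors."""
    return np.maximum(0.0, W @ np.concatenate(features))

rng = np.random.default_rng(6)
features = [rng.normal(size=4)]                       # output of the first layer
for layer in range(3):                                # each later layer reuses everything so far
    in_dim = 4 * len(features)
    W = rng.normal(size=(4, in_dim)) * 0.1
    features.append(dense_layer(features, W))
print(np.concatenate(features).shape)                 # (16,): earlier feature maps stay accessible
```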
Short skip connections are used along with consecutive convolutional layers that do not change
the input dimension (see Res-Net), while long skip connections usually exist in encoder-decoder
architectures. It is known that the global information (shape of the image and other
statistics) resolves what, while local information resolves where (small details in an image
patch).
Skip connections can provide several benefits for CNNs, such as improving accuracy and
generalization, solving the vanishing gradient problem, and enabling deeper networks. Skip
connections can help the network to learn more complex and diverse patterns from the data and
reduce the number of parameters and operations needed by the network. Additionally, skip
connections can help to alleviate the problem of vanishing gradients by providing alternative
paths for the gradients to flow. Furthermore, they can make it easier and faster to train deeper
networks, which have more expressive power and can capture more features from the data.
Skip connections are a popular and powerful technique for improving the performance and
efficiency of CNNs, but they are not a panacea. They can help preserve information and
gradients, combine features, solve the vanishing gradient problem, and enable deeper networks.
However, they can also increase complexity and memory requirements, introduce redundancy
and noise, and require careful design and tuning to match the network architecture and data
domain. Different types and locations of skip connections can have different impacts on the
network performance, with some being more beneficial or harmful than others. Thus, it is
essential to understand how skip connections work and how to use them wisely and effectively
for CNNs.
Dropouts
Dropout refers to units (neurons) that are intentionally and randomly dropped from a neural network
during training in order to reduce over-fitting and improve generalization. A neural network is
software attempting to emulate the actions of the human brain.
Neural networks are the building blocks of any machine-learning architecture. They consist of
one input layer, one or more hidden layers, and an output layer.
When we train our neural network (or model) by updating each of its weights, it might
become too dependent on the dataset we are using. Therefore, when this model has to make a
prediction or classification, it will not give satisfactory results. This is known as over-fitting. We
might understand this problem through a real-world example: If a student of mathematics
learns only one chapter of a book and then takes a test on the whole syllabus, he will probably
fail.
To overcome this problem, we use a technique that was introduced by Geoffrey Hinton in 2012.
This technique is known as dropout.
We assign ‘p’ to represent the probability of a neuron, in the hidden layer, being excluded from
the network; this probability value is usually equal to 0.5. We do the same process for the input
layer whose probability value is usually lower than 0.5 (e.g. 0.2). Remember, we delete the
connections going into, and out of, the neuron when we drop it.
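A minimal "inverted dropout" sketch of this procedure (the rescaling by 1/(1-p) during training is a common implementation choice rather than something stated in these notes):

```python
import numpy as np

def dropout(activations, p=0.5, training=True):
    """Randomly zero each unit with probability p during training; do nothing at test time."""
    if not training:
        return activations
    keep = (np.random.rand(*activations.shape) >= p)   # mask of surviving neurons
    return activations * keep / (1.0 - p)              # rescale so the expected value is unchanged

hidden = np.random.randn(10)
print(dropout(hidden, p=0.5))            # roughly half the hidden units are dropped
print(dropout(hidden, training=False))   # unchanged at prediction time
```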