
RECURRENT NEURAL NETWORKS
Unit 3
Contents
Unfolding Graphs -- RNN Design Patterns: Acceptor -- Encoder -- Transducer; Gradient Computation -- Sequence Modeling Conditioned on Contexts -- Bidirectional RNN -- Sequence to Sequence RNN -- Deep Recurrent Networks -- Recursive Neural Networks -- Long Term Dependencies; Leaky Units: Skip connections and dropouts; Gated Architecture: LSTM.
• A computational graph is a way to formalize the structure of a set of computations, such as mapping inputs and parameters to outputs and a loss.
• We can unfold a recursive or recurrent computation into a computational graph that has a repetitive structure, corresponding to a chain of events.
• Unfolding this graph results in the sharing of parameters across a deep network structure.
Computations
• Computational graphs can be used for two different types of calculations:
• Forward computation
• Backward computation
• The following sections define a few key terms used in computational graphs.
• A variable is represented by a node in a graph. It could be a scalar, vector,
matrix, tensor, or even another type of variable.
• A function argument and data dependency are both represented by an edge.
These are similar to node pointers.
• A simple function of one or more variables is called an operation. There is a
set of operations that are permitted. Functions that are more complex than
these operations in this set can be represented by combining multiple
operations.
Example
For better understanding, we introduce two variables d and e so that every operation has an output variable.
We have three operations: addition, subtraction, and multiplication. To create a computational graph, we create nodes, each of which performs a different operation, along with the input variables. The direction of the arrows shows the direction in which input flows to other nodes.
We can find the final output value by initializing the input variables and computing the nodes of the graph accordingly.
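To make this concrete, here is a minimal sketch in Python. The expression f = (a + b) * (b - c) and the input values are hypothetical examples chosen so that each operation gets its own output variable; they are not taken from the original figure.

```python
# A tiny computational graph evaluated forward on a hypothetical expression
# f = (a + b) * (b - c), so each operation has its own output variable.
a, b, c = 2.0, 5.0, 1.0
d = a + b          # addition node
e = b - c          # subtraction node
f = d * e          # multiplication node: the final output
print(d, e, f)     # 7.0 4.0 28.0
```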
Computational Graphs in Deep Learning
• Computations of the neural network are organized in terms of a
forward pass or forward propagation step in which we compute the
output of the neural network, followed by a backward pass or
backward propagation step, which we use to compute
gradients/derivatives. Computation graphs explain why it is organized
this way.
• If one wants to understand derivatives in a computational graph, the key is to understand how a change in one variable affects the variables that depend on it. If a directly affects c, then we want to know how: if we make a slight change in the value of a, how does c change? We call this the partial derivative of c with respect to a.
We have to follow the chain rule to evaluate the partial derivatives of the final output variable with respect to the input variables a, b, and c. The derivatives can then be computed as shown below.
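Continuing the hypothetical graph f = (a + b) * (b - c) from the earlier sketch, a minimal reverse-mode application of the chain rule looks like this (illustrative values only):

```python
# Reverse-mode chain rule on the hypothetical graph f = (a + b) * (b - c):
# local derivatives at each node are combined from the output back to the inputs.
a, b, c = 2.0, 5.0, 1.0
d, e = a + b, b - c
f = d * e

df_dd, df_de = e, d                   # product rule: df/dd = e, df/de = d
df_da = df_dd * 1.0                   # dd/da = 1
df_db = df_dd * 1.0 + df_de * 1.0     # b reaches f through both d and e
df_dc = df_de * (-1.0)                # de/dc = -1
print(df_da, df_db, df_dc)            # 4.0 11.0 -7.0
```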
Types of computational graphs:
Type 1: Static Computational Graphs
• Involves two phases:-
• Phase 1:- Make a plan for your architecture.
• Phase 2:- To train the model and generate predictions, feed it a lot of data.
• The benefit of utilizing this graph is that it enables powerful offline
graph optimization and scheduling. As a result, they should be faster
than dynamic graphs in general.
• The drawback is that dealing with structured and variable-sized data is awkward.
Type 2: Dynamic Computational Graphs
• As the forward computation is performed, the graph is implicitly
defined.
• This graph has the advantage of being more adaptable. The library is
less intrusive and enables interleaved graph generation and evaluation.
The forward computation is implemented in your preferred
programming language, complete with all of its features and
algorithms. Debugging dynamic graphs is simple. Because it permits
line-by-line execution of the code and access to all variables, finding
bugs in your code is considerably easier. If you want to employ Deep
Learning for any genuine purpose in the industry, this is a must-have
feature.
• The disadvantage of employing this graph is that there is limited time
for graph optimization, and the effort may be wasted if the graph does
not change.
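As a rough illustration of the define-by-run idea, here is a sketch using PyTorch-style automatic differentiation. It assumes the torch package is available; the values and the branching condition are arbitrary and only serve to show that ordinary Python control flow takes part in building the graph.

```python
import torch

# Dynamic graph: the graph is recorded implicitly while the forward computation
# runs, so ordinary Python control flow decides which operations become part of it.
a = torch.tensor(2.0, requires_grad=True)
b = torch.tensor(3.0, requires_grad=True)

c = a * b if b > 0 else a - b      # the branch taken defines the graph for this run
loss = (c - 1.0) ** 2

loss.backward()                    # gradients computed by walking the recorded graph
print(a.grad, b.grad)              # d(loss)/da, d(loss)/db
```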
RNN Design Patterns: Acceptor-Encoder-Transducer
• Finite State Machines
• Simple, classical way of representing state
• Current state: saves necessary past information
• Example: email address parsing
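A toy acceptor for the email example above, written as a deterministic finite state machine. This assumes a deliberately simplified, hypothetical pattern (a single '@' followed by a non-empty remainder), nothing like real email validation:

```python
# Minimal DFA acceptor: the current state is the only memory of the past input.
def accepts(s: str) -> bool:
    state = "local"
    for ch in s:
        if state == "local":
            state = "after_at" if ch == "@" else "local"
        elif state == "after_at":
            state = "reject" if ch == "@" else "domain"
        elif state == "domain":
            state = "reject" if ch == "@" else "domain"
        elif state == "reject":
            return False
    return state == "domain"          # accept only if we stopped in a final state

print(accepts("user@example.com"))    # True
print(accepts("no-at-sign"))          # False
```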
Deterministic Finite State Machines
Types of State Machines
Sequence Models & Recurrent Neural Networks (RNNs)
• Sequence models are machine learning models that take sequences of data as input or output. Sequential data includes text streams, audio clips, video clips, time-series data, etc. Recurrent Neural Networks (RNNs) are a popular algorithm used in sequence models.
Applications of Sequence Models
1. Speech recognition: In speech recognition, an audio clip is given as
an input and then the model has to generate its text transcript. Here both
the input and output are sequences of data.
2. Sentiment Classification: In sentiment classification, the opinions expressed in a piece of text are categorized. Here the input is a sequence of words.
3. Video Activity Recognition: In video activity recognition, the model needs to identify the activity in a video clip. A video clip is a sequence of video frames, so here too the input is a sequence of data.
Bidirectional Recurrent Neural Network
• A bidirectional recurrent neural network (BRNN) is a neural network architecture designed to process sequential data. BRNNs process input sequences in both the forward and backward directions, so that the network can use information from both past and future context in its predictions. This is the main distinction between BRNNs and conventional recurrent neural networks.
• A BRNN has two distinct recurrent hidden layers, one of which
processes the input sequence forward and the other of which processes
it backward. After that, the results from these hidden layers are
collected and input into a prediction-making final layer. Any recurrent
neural network cell, such as Long Short-Term Memory (LSTM) or
Gated Recurrent Unit, can be used to create the recurrent hidden
layers.
• In the forward direction, the BRNN functions like a conventional recurrent neural network, updating the hidden state at each time step based on the current input and the previous hidden state. The backward hidden layer, on the other hand, processes the input sequence in the opposite direction, updating the hidden state based on the current input and the hidden state of the next time step.
• Compared to conventional unidirectional recurrent neural networks, the accuracy
of the BRNN is improved since it can process information in both directions and
account for both past and future contexts. Because the two hidden layers can
complement one another and give the final prediction layer more data, using two
distinct hidden layers also offers a type of model regularization.
• To update the model parameters, gradients are computed for both the forward and backward passes using the backpropagation through time (BPTT) technique that is typically used to train BRNNs. At inference time, the input sequence is processed by the BRNN in a single forward pass, and predictions are made based on the combined outputs of the two hidden layers.
Bi-directional Recurrent Neural Network
Working of Bidirectional Recurrent Neural Network
• Inputting a sequence: A sequence of data points, each represented as a vector with the same dimensionality, is fed into the BRNN. The sequences may have different lengths.
• Dual Processing: The data is processed in both the forward and backward directions. In the forward direction, the hidden state at time step t is computed from the input at that step and the hidden state at step t-1. In the backward direction, the hidden state at step t is computed from the input at step t and the hidden state at step t+1.
• Computing the hidden state: A non-linear activation function on the weighted
sum of the input and previous hidden state is used to calculate the hidden state at
each step. This creates a memory mechanism that enables the network to
remember data from earlier steps in the process.
• Determining the output: A non-linear activation function is used to determine the
output at each step from the weighted sum of the hidden state and a number of
output weights. This output has two options: it can be the final output or input for
another layer in the network.
• Training: The network is trained through a supervised learning approach where
the goal is to minimize the discrepancy between the predicted output and the
actual output. The network adjusts its weights in the input-to-hidden and hidden-
to-output connections during training through back propagation.
To calculate the output from an RNN unit, we use the following formulas:
H_t(Forward) = A(X_t * W_XH(Forward) + H_(t-1)(Forward) * W_HH(Forward) + b_H(Forward))
H_t(Backward) = A(X_t * W_XH(Backward) + H_(t+1)(Backward) * W_HH(Backward) + b_H(Backward))
where A = activation function, W = weight matrix, b = bias.
The hidden state at time t is given by a combination of H_t(Forward) and H_t(Backward). The output at any given hidden state is:
Y_t = H_t * W_AY + b_y
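The update rules above can be written out directly. The following NumPy sketch is one possible reading of them, using tanh as the activation A and hypothetical small sizes; it is an illustration, not a reference implementation.

```python
import numpy as np

d_in, d_h, d_out, T = 4, 3, 2, 6                      # hypothetical sizes and sequence length
rng = np.random.default_rng(0)
Wxh_f, Whh_f, bh_f = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)), np.zeros(d_h)
Wxh_b, Whh_b, bh_b = rng.standard_normal((d_h, d_in)), rng.standard_normal((d_h, d_h)), np.zeros(d_h)
Why, by = rng.standard_normal((d_out, 2 * d_h)), np.zeros(d_out)

X = rng.standard_normal((T, d_in))                    # one input sequence of length T
Hf, Hb = np.zeros((T, d_h)), np.zeros((T, d_h))

h = np.zeros(d_h)
for t in range(T):                                    # forward pass, left to right
    h = np.tanh(Wxh_f @ X[t] + Whh_f @ h + bh_f)
    Hf[t] = h

h = np.zeros(d_h)
for t in reversed(range(T)):                          # backward pass, right to left
    h = np.tanh(Wxh_b @ X[t] + Whh_b @ h + bh_b)
    Hb[t] = h

H = np.concatenate([Hf, Hb], axis=1)                  # combine both directions per time step
Y = H @ Why.T + by                                    # Y_t = H_t * W_AY + b_y
print(Y.shape)                                        # (6, 2)
```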
• The training of a BRNN uses the backpropagation through time (BPTT) algorithm. BPTT works as follows:
• Unroll the network and calculate errors at each time step.
• Update the weights and roll up the network.
• However, because the forward and backward passes in a BRNN occur simultaneously, updating the weights for the two processes could happen at the same time, which produces inaccurate outcomes. Thus, a BRNN is trained so that the forward and backward passes are handled individually.
Applications of BiRNN
• Sentiment Analysis: By taking into account both the prior and subsequent
context, BRNNs can be utilized to categorize the sentiment of a particular
sentence.
• Named Entity Recognition: By considering the context both before and after an entity, BRNNs can be used to identify named entities in a sentence.
• Part-of-Speech Tagging: The classification of words in a phrase into their
corresponding parts of speech, such as nouns, verbs, adjectives, etc., can be
done using BRNNs.
• Machine Translation: BRNNs can be used in encoder-decoder models for
machine translation, where the decoder creates the target sentence and the
encoder analyses the source sentence in both directions to capture its
context.
• Speech Recognition: When the input voice signal is processed in both
directions to capture the contextual information, BRNNs can be used in
automatic speech recognition systems.
Advantages of Bidirectional RNN
• Context from both past and future: With the ability to process sequential input both
forward and backward, BRNNs provide a thorough grasp of the full context of a
sequence. Because of this, BRNNs are effective at tasks like sentiment analysis and
speech recognition.
• Enhanced accuracy: BRNNs frequently yield more precise answers since they take
both historical and upcoming data into account.
• Efficient handling of variable-length sequences: When compared to conventional
RNNs, which require padding to have a constant length, BRNNs are better equipped to
handle variable-length sequences.
• Resilience to noise and irrelevant information: BRNNs may be resistant to noise and
irrelevant data that are present in the data. This is so because both the forward and
backward paths offer useful information that supports the predictions made by the
network.
• Ability to handle sequential dependencies: BRNNs can capture long-term dependencies between sequence elements, making them well suited to handling complicated sequential dependencies.
Disadvantages of Bidirectional RNN
• Computational complexity: Given that they analyze data both forward and
backward, BRNNs can be computationally expensive due to the increased amount of
calculations needed.
• Long training time: BRNNs can also take a while to train because there are many
parameters to optimize, especially when using huge datasets.
• Difficulty in parallelization: Due to the requirement for sequential processing in
both the forward and backward directions, BRNNs can be challenging to parallelize.
• Overfitting: BRNNs are prone to overfitting since their many parameters can produce overly complex models, especially when trained on small datasets.
• Interpretability: Due to the processing of data in both forward and backward
directions, BRNNs can be tricky to interpret since it can be difficult to comprehend
what the model is doing and how it is producing predictions.
Sequence to Sequence RNN
• In Sequence to Sequence Learning, RNN is trained to map an input sequence to
an output sequence which is not necessarily of the same length.
• Applications are speech recognition, machine translation, image captioning and
question answering.
Two components — an encoder and a decoder.
Encoder :
• Both the encoder and the decoder are LSTM models (or sometimes GRU models).
• Encoder reads the input sequence and summarizes the information in something
called the internal state vectors or context vector (in case of LSTM these are
called the hidden state and cell state vectors). We discard the outputs of the
encoder and only preserve the internal states. This context vector aims to
encapsulate the information for all input elements in order to help the decoder
make accurate predictions.
• The hidden states h_i are computed using the formula:
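In the simplest (vanilla RNN) form, this update can be written as h_i = f(W_hh * h_(i-1) + W_hx * x_i), where f is a non-linear activation; an LSTM encoder additionally maintains the cell state c_i through its gates.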
The LSTM reads the data one time step at a time. Thus if the input is a sequence of length 't', we say that the LSTM reads it in 't' time steps.
1. X_i = the input sequence at time step i.
2. h_i and c_i = the LSTM maintains two states ('h' for hidden state and 'c' for cell state) at each time step. Together, these form the internal state of the LSTM at time step i.
3. Y_i = the output sequence at time step i. Y_i is actually a probability distribution over the entire vocabulary, generated using a softmax activation. Thus each Y_i is a vector of size "vocab_size" representing a probability distribution.
Decoder :
• The decoder is an LSTM whose initial states are initialized to the final states of the
Encoder LSTM, i.e. the context vector of the encoder’s final cell is input to the first
cell of the decoder network. Using these initial states, the decoder starts generating the
output sequence, and these outputs are also taken into consideration for future outputs.
• The decoder is a stack of several LSTM units, where each unit predicts an output y_t at time step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
• Any hidden state h_i is computed using the formula:

• The output y_t at time step t is computed using the formula:
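In a simple formulation consistent with the weight W(S) and softmax described in the next bullet (the exact equations may differ), these can be written as:
h_t = f(W_hh * h_(t-1))
y_t = softmax(W_S * h_t)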


• We calculate the outputs using the hidden state at the current time step together
with the respective weight W(S). Softmax is used to create a probability vector
which will help us determine the final output (e.g. word in the question-answering
problem).
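A compact encoder-decoder sketch in PyTorch is shown below. The vocabulary sizes, embedding and hidden dimensions are hypothetical, and the decoder is fed the target sequence (teacher forcing); this illustrates the idea rather than any particular published model.

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    def __init__(self, src_vocab, tgt_vocab, emb=64, hidden=128):
        super().__init__()
        self.src_emb = nn.Embedding(src_vocab, emb)
        self.tgt_emb = nn.Embedding(tgt_vocab, emb)
        self.encoder = nn.LSTM(emb, hidden, batch_first=True)
        self.decoder = nn.LSTM(emb, hidden, batch_first=True)
        self.out = nn.Linear(hidden, tgt_vocab)

    def forward(self, src, tgt):
        # Encoder: discard the per-step outputs and keep only the final hidden and
        # cell states, i.e. the "context vector" summarizing the source sequence.
        _, (h, c) = self.encoder(self.src_emb(src))
        # Decoder: its initial states are the encoder's final states.
        dec_out, _ = self.decoder(self.tgt_emb(tgt), (h, c))
        return self.out(dec_out)              # per-step logits over the target vocabulary

model = Seq2Seq(src_vocab=1000, tgt_vocab=1200)
src = torch.randint(0, 1000, (2, 7))          # batch of 2 source sequences, length 7
tgt = torch.randint(0, 1200, (2, 5))          # batch of 2 target sequences, length 5
print(model(src, tgt).shape)                  # torch.Size([2, 5, 1200])
```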
Drawbacks of Encoder-Decoder Models :
There are two primary drawbacks to this architecture, both related to length.
• Firstly, as with humans, this architecture has very limited memory. The final hidden state of the LSTM, which we call S or W, is where you are trying to cram the entirety of the sentence you have to translate. S or W is usually only a few hundred units (read: floating-point numbers) long; the more you try to force into this fixed-dimensionality vector, the lossier the neural network is forced to be. Thinking of neural networks in terms of the "lossy compression" they are required to perform is sometimes quite useful.
• As a general rule of thumb, the deeper a neural network is, the harder it is to train. For recurrent neural networks, the longer the sequence is, the deeper the neural network is along the time dimension. This results in vanishing gradients, where the gradient signal from the objective that the recurrent neural network learns from disappears as it travels backward. Even with RNNs specifically designed to help prevent vanishing gradients, such as the LSTM, this remains a fundamental problem.
Deep Recurrent Networks
• Recursive neural networks (RvNNs) are deep neural networks employed in natural language processing. When the same weights are applied repeatedly to a structured input to produce a structured prediction, we get a recursive neural network. Business executives and IT specialists must understand what a recursive neural network is, what it can achieve, and how it functions.
Recursive neural networks (RvNNs)
• Recursive neural networks (RvNNs) are capable of learning organized and
detailed data. By repeatedly using the same set of weights on structured inputs,
RvNN enables you to obtain a structured prediction. Recursive refers to the neural
network's application to its output.
• Recursive neural networks are capable of handling hierarchical data because of their deep tree-like structure. In a tree structure, parent nodes are created by joining child nodes. There is a weight matrix for every child-parent bond, and similar children share the same weights. To allow for recursive operations and the use of the same weights, the number of children for each node in the tree is fixed. RvNNs are employed when it is necessary to parse a whole sentence.
• We sum the products of the weight matrices (W_i) and the child representations (C_i) and apply the transformation f to determine the parent node's representation.
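A minimal sketch of this composition step, assuming a binary parse tree, random weights, tanh as f, and hypothetical sizes:

```python
import numpy as np

d = 8                                              # dimensionality of every node vector
rng = np.random.default_rng(0)
W_left, W_right = rng.standard_normal((d, d)), rng.standard_normal((d, d))
b = np.zeros(d)

def compose(c_left, c_right):
    # parent = f(W_1 @ C_1 + W_2 @ C_2 + b): the same weights are reused at every node
    return np.tanh(W_left @ c_left + W_right @ c_right + b)

# Toy parse tree ((the movie) (was great)); leaves are (random) word vectors here.
the, movie, was, great = rng.standard_normal((4, d))
noun_phrase = compose(the, movie)
verb_phrase = compose(was, great)
sentence = compose(noun_phrase, verb_phrase)       # root representation of the whole sentence
print(sentence.shape)                              # (8,)
```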
Recurrent Neural Network vs. Recursive Neural Networks
• Another well-known family of neural networks for processing sequential data is recurrent neural networks (RNNs). They are closely related to recursive neural networks.
• Given that language-related data like sentences and paragraphs are
sequential in nature, recurrent neural networks are useful for representing
temporal sequences in natural language processing (NLP). Chain topologies
are frequently used in recurrent networks. By distributing the weights along
the entire chain length, the dimensionality is maintained.
• Recursive neural networks, on the other hand, work with hierarchical data
models because of their tree structure. The tree may perform recursive
operations and utilize the same weights at each step because each node has
a set number of offspring. Parent representations are created by combining
child representations.
• A feed-forward network is less efficient than a recursive network.
• Recursive networks are a generalization of recurrent networks: a recurrent network is a recursive network whose structure is a linear chain over time.
Recursive Neural Network Implementation
• Sentiment analysis in natural language sentences is performed using a
recursive neural network. Identifying the writing tone and thoughts of
the author in a particular sentence is one of the most crucial jobs of
natural language processing (NLP). Whenever a writer expresses any sentiment, basic characterizations of the writing tone can be recognized. To organize the words into a syntactic hierarchy, we must first recognize the smaller parts, such as noun or verb phrases. For instance, this indicates whether the sentence has a positive writing style or unfavorable word choices.
• To create the ideal syntactic tree for a particular sentence, we must combine specific pairs of phrases and words; this combination is guided by a variable called "score", which is generated at each traversal of the nodes.
RvNNs for Natural Language Processing: Benefits
• The structure and decrease in network depth of recursive neural
networks are their two main advantages for natural language
processing.
• Recursive Neural Networks' tree structure, as previously mentioned,
can manage hierarchical data, such as in parsing issues.
• The ability of trees to have logarithmic height is another advantage of RvNNs. A recursive neural network can represent a binary tree of height O(log n) when there are O(n) input words. The path length between the first and last input elements is shortened as a result, which makes long-term dependencies shorter and more manageable.
RvNNs for Natural Language Processing: Demerits
• The tree structure of recursive neural networks may be their biggest drawback.
Using the tree structure suggests giving our model a special inductive bias. The
bias is consistent with the notion that the data are organized in a tree hierarchy.
But the reality is different. As a result, the network might not be able to pick up on
the current patterns.
• The Recursive Neural Network also has a drawback in that sentence parsing can
be cumbersome and slow. It's interesting that different parse trees can exist for the
same text.
• Additionally, labeling the training data for recursive neural networks takes more
time and effort than building recurrent neural networks. It takes more time and
effort to manually break down a sentence into smaller parts than it does to give it a
label.
Long Short-Term Memory (LSTM)
• RNNs are not good at capturing long-range dependencies. This is mainly due to the vanishing gradient problem: when training a very deep network, the gradients (derivatives) decrease exponentially as they propagate back through the layers. This is known as the vanishing gradient problem. These gradients are used to update the weights of the neural network; when the gradients vanish, the weights are not updated, and sometimes this completely stops the neural network from training. The vanishing gradient problem is a common issue in very deep neural networks.
How to Overcome?
• To overcome this vanishing gradient problem in RNNs, Long Short-Term Memory (LSTM) was introduced by Sepp Hochreiter and Juergen Schmidhuber. LSTM is a modification of the RNN hidden layer. LSTM has enabled RNNs to remember their inputs over a long period of time. In an LSTM, in addition to the hidden state, a cell state is passed to the next time step.
• LSTM can capture long-range dependencies. It can have memory about previous
inputs for extended time durations. There are 3 gates in an LSTM cell. Memory
manipulations in LSTM are done using these gates. Long short-term memory
(LSTM) utilizes gates to control the gradient propagation in the recurrent
network’s memory.
• Forget Gate: The forget gate removes information that is no longer useful from the cell state.
• Input Gate: The input gate adds additional useful information to the cell state.
• Output Gate: The output gate decides which part of the cell state is exposed as the hidden state (the output) at the current time step.
• This gating mechanism of LSTM has allowed the network to learn the conditions
for when to forget, ignore, or keep information in the memory cell.
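The gating mechanism can be written out directly. The following NumPy version of a single LSTM step uses the standard gate equations with hypothetical sizes and random weights; it is a sketch, not a tuned implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

d_in, d_h = 4, 3                                       # hypothetical input and hidden sizes
rng = np.random.default_rng(0)
W = {g: rng.standard_normal((d_h, d_in + d_h)) for g in ("f", "i", "g", "o")}
b = {g: np.zeros(d_h) for g in ("f", "i", "g", "o")}

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([x_t, h_prev])
    f = sigmoid(W["f"] @ z + b["f"])      # forget gate: what to drop from the cell state
    i = sigmoid(W["i"] @ z + b["i"])      # input gate: how much new information to add
    g = np.tanh(W["g"] @ z + b["g"])      # candidate values for the cell state
    o = sigmoid(W["o"] @ z + b["o"])      # output gate: what part of the cell to expose
    c_t = f * c_prev + i * g              # cell state carries long-range information
    h_t = o * np.tanh(c_t)                # hidden state passed to the next time step
    return h_t, c_t

h, c = np.zeros(d_h), np.zeros(d_h)
for x_t in rng.standard_normal((5, d_in)):             # run over a length-5 input sequence
    h, c = lstm_step(x_t, h, c)
print(h, c)
```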
• LSTM is a very popular deep learning algorithm for sequence models.
• Apple’s Siri and Google’s voice search are some real-world examples
that have used the LSTM algorithm and it is behind the success of
those applications.
• Recent research has shown how the LSTM algorithm can improve the performance of machine learning models on sequence tasks.
• LSTM is also used for time-series prediction and text classification tasks.
References
1. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, "Deep Learning", MIT Press, 2016.
2. Andrew Glassner, "Deep Learning: A Visual Approach", No Starch Press, 2021.
