Unit 3
NETWORKS
Contents
Unfolding Graphs -- RNN Design Patterns: Acceptor -- Encoder -- Transducer; Gradient Computation -- Sequence Modeling Conditioned on Contexts -- Bidirectional RNN -- Sequence to Sequence RNN -- Deep Recurrent Networks -- Recursive Neural Networks -- Long Term Dependencies; Leaky Units: Skip connections and dropouts; Gated Architecture: LSTM.
• A Computational Graph is a way to formalize the structure of a set of
computations
Such as mapping inputs and parameters to outputs and loss
• We can unfold a recursive or recurrent computation into a
computational graph that has a repetitive structure
Corresponding to a chain of events
• Unfolding this graph results in sharing of parameters across a deep
network structure
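As a rough illustration of unfolding, the following NumPy sketch unrolls the recurrence h_t = tanh(W h_(t-1) + U x_t + b) for a few time steps. The parameter names, sizes, and the tanh nonlinearity are illustrative assumptions; the key point is that the same W, U, and b are reused at every step, which is exactly the parameter sharing produced by unfolding.

```python
import numpy as np

# Unfolding the recurrence h_t = tanh(W @ h_{t-1} + U @ x_t + b) into a chain
# of identical computations. W, U, b are illustrative parameter names; the same
# arrays are reused at every time step (shared parameters).
rng = np.random.default_rng(0)
hidden, inputs, steps = 4, 3, 5

W = rng.standard_normal((hidden, hidden)) * 0.1   # hidden-to-hidden weights
U = rng.standard_normal((hidden, inputs)) * 0.1   # input-to-hidden weights
b = np.zeros(hidden)                              # bias

x = rng.standard_normal((steps, inputs))          # an input sequence x_1..x_T
h = np.zeros(hidden)                              # initial state h_0

unfolded_states = []
for t in range(steps):                            # one graph node per time step
    h = np.tanh(W @ h + U @ x[t] + b)             # same W, U, b at every step
    unfolded_states.append(h)

print(len(unfolded_states), unfolded_states[-1].shape)  # 5 copies of the shared computation
```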
Computations
• Computational graphs can be used for two different types of calculations:
• Forward computation
• Backward computation
• The following sections define a few key terms used in computational graphs.
• A variable is represented by a node in a graph. It could be a scalar, vector,
matrix, tensor, or even another type of variable.
• Function arguments and data dependencies are both represented by edges, which act like pointers between nodes.
• A simple function of one or more variables is called an operation. Only a fixed set of operations is permitted; functions more complex than the operations in this set can be represented by combining multiple operations.
Example
Y_t = H_t · W_AY + b_y
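As a rough illustration of the two types of calculations on this example, the following NumPy sketch evaluates Y_t = H_t · W_AY + b_y in the forward computation and then backpropagates through it in the backward computation. The squared-error loss, the tensor shapes, and the variable names are illustrative assumptions.

```python
import numpy as np

# Forward and backward computation for the example operation Y_t = H_t @ W_ay + b_y.
# The squared-error loss and the tensor shapes are illustrative assumptions.
rng = np.random.default_rng(1)
H_t    = rng.standard_normal((2, 4))     # hidden state at time step t (batch of 2)
W_ay   = rng.standard_normal((4, 3))     # hidden-to-output weights
b_y    = np.zeros(3)                     # output bias
target = rng.standard_normal((2, 3))

# Forward computation: evaluate the graph from inputs/parameters to the loss.
Y_t  = H_t @ W_ay + b_y
loss = 0.5 * np.sum((Y_t - target) ** 2)

# Backward computation: propagate gradients edge by edge (chain rule).
dY    = Y_t - target                     # dL/dY_t
dW_ay = H_t.T @ dY                       # dL/dW_ay
db_y  = dY.sum(axis=0)                   # dL/db_y
dH_t  = dY @ W_ay.T                      # dL/dH_t, flows back into the recurrent part

print(loss, dW_ay.shape, db_y.shape, dH_t.shape)
```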
• The training of a BRNN is similar to the backpropagation through time (BPTT) algorithm. The BPTT algorithm works as follows:
• Unroll the network and calculate the errors at each time step.
• Update the weights and roll up the network.
• However, because the forward and backward passes in a BRNN occur simultaneously, updating the weights for the two processes could happen at the same time, which produces inaccurate results. Thus, the following approach is used to train a BRNN so that the forward and backward passes are handled individually.
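As a rough illustration, the following NumPy sketch runs the two directional passes of a BRNN independently, one left to right and one right to left, and concatenates the two hidden states at each time step to form the output. All names and sizes (Wf, Uf, Wb, Ub, Wy) are illustrative assumptions.

```python
import numpy as np

# Bidirectional RNN forward pass: one left-to-right pass and one right-to-left
# pass with separate parameters, concatenated per time step.
rng = np.random.default_rng(2)
T, n_in, n_hid, n_out = 6, 3, 4, 2
x = rng.standard_normal((T, n_in))

Wf, Uf = rng.standard_normal((n_hid, n_hid)) * 0.1, rng.standard_normal((n_hid, n_in)) * 0.1
Wb, Ub = rng.standard_normal((n_hid, n_hid)) * 0.1, rng.standard_normal((n_hid, n_in)) * 0.1
Wy = rng.standard_normal((n_out, 2 * n_hid)) * 0.1

# Forward-direction pass over t = 1..T.
hf, h_forward = np.zeros(n_hid), []
for t in range(T):
    hf = np.tanh(Wf @ hf + Uf @ x[t])
    h_forward.append(hf)

# Backward-direction pass over t = T..1, done independently of the forward pass.
hb, h_backward = np.zeros(n_hid), [None] * T
for t in reversed(range(T)):
    hb = np.tanh(Wb @ hb + Ub @ x[t])
    h_backward[t] = hb

# Per-step output uses context from both directions.
y = [Wy @ np.concatenate([h_forward[t], h_backward[t]]) for t in range(T)]
print(len(y), y[0].shape)   # T outputs, each of size n_out
```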
Applications of Bidirectional RNN
• Sentiment Analysis: By taking into account both the prior and subsequent
context, BRNNs can be utilized to categorize the sentiment of a particular
sentence.
• Named Entity Recognition: By considering the context both before and after an entity mention, BRNNs can be utilized to identify named entities in a sentence.
• Part-of-Speech Tagging: The classification of words in a phrase into their
corresponding parts of speech, such as nouns, verbs, adjectives, etc., can be
done using BRNNs.
• Machine Translation: BRNNs can be used in encoder-decoder models for machine translation, where the encoder analyzes the source sentence in both directions to capture its context and the decoder generates the target sentence.
• Speech Recognition: BRNNs can be used in automatic speech recognition systems, where the input speech signal is processed in both directions to capture contextual information.
Advantages of Bidirectional RNN
• Context from both past and future: With the ability to process sequential input both
forward and backward, BRNNs provide a thorough grasp of the full context of a
sequence. Because of this, BRNNs are effective at tasks like sentiment analysis and
speech recognition.
• Enhanced accuracy: BRNNs frequently yield more accurate predictions since they take both past and future context into account.
• Efficient handling of variable-length sequences: When compared to conventional
RNNs, which require padding to have a constant length, BRNNs are better equipped to
handle variable-length sequences.
• Resilience to noise and irrelevant information: BRNNs can be resilient to noise and irrelevant information in the input, because the forward and backward passes both contribute useful information that supports the network's predictions.
• Ability to handle sequential dependencies: BRNNs can capture long-term dependencies between sequence elements, making them well suited to handling complicated sequential dependencies.
Disadvantages of Bidirectional RNN
• Computational complexity: Given that they analyze data both forward and backward, BRNNs can be computationally expensive due to the increased number of calculations needed.
• Long training time: BRNNs can also take a long time to train because there are many parameters to optimize, especially when using large datasets.
• Difficulty in parallelization: Due to the requirement for sequential processing in
both the forward and backward directions, BRNNs can be challenging to parallelize.
• Overfitting: BRNNs are prone to overfitting since they include many parameters that can result in overly complex models, especially when trained on small datasets.
• Interpretability: Because they process data in both forward and backward directions, BRNNs can be difficult to interpret, since it is hard to understand what the model is doing and how it produces its predictions.
Sequence to Sequence RNN
• In sequence-to-sequence learning, an RNN is trained to map an input sequence to an output sequence, which is not necessarily of the same length.
• Applications include speech recognition, machine translation, image captioning, and question answering.
The model has two components: an encoder and a decoder.
Encoder :
• Both the encoder and the decoder are LSTM models (or sometimes GRU models).
• Encoder reads the input sequence and summarizes the information in something
called the internal state vectors or context vector (in case of LSTM these are
called the hidden state and cell state vectors). We discard the outputs of the
encoder and only preserve the internal states. This context vector aims to
encapsulate the information for all input elements in order to help the decoder
make accurate predictions.
• The hidden states h_i are computed using the standard recurrence, where each state depends on the previous state and the current input: h_i = f(W^(hh) · h_(i-1) + W^(hx) · x_i)
The LSTM reads the data one element at a time. Thus if the input is a sequence of length ‘t’, we say that the LSTM reads it in ‘t’ time steps.
1. Xi = Input sequence at time step i.
2. hi and ci = The LSTM maintains two states (‘h’ for hidden state and ‘c’ for cell state) at each time step. Together, these form the internal state of the LSTM at time step i.
3. Yi = Output sequence at time step i. Yi is actually a probability distribution over the
entire vocabulary which is generated by using a softmax activation. Thus each Yi is a
vector of size “vocab_size” representing a probability distribution.
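As a rough illustration of the encoder described above, the following Keras sketch builds an LSTM encoder that discards the per-step outputs and keeps only the final hidden and cell states as the context vector. The sizes (vocab_size, latent_dim) and the embedding layer are illustrative assumptions rather than values fixed by the text.

```python
from tensorflow import keras

# Minimal encoder sketch; vocab_size and latent_dim are illustrative assumptions.
vocab_size, latent_dim = 10000, 256

encoder_inputs = keras.Input(shape=(None,), dtype="int32")       # variable-length token ids X_1..X_t
enc_emb = keras.layers.Embedding(vocab_size, latent_dim)(encoder_inputs)

# return_state=True also returns the final hidden state h and cell state c.
# The per-step outputs are discarded; only the internal states are kept.
encoder_lstm = keras.layers.LSTM(latent_dim, return_state=True)
encoder_outputs, state_h, state_c = encoder_lstm(enc_emb)
encoder_states = [state_h, state_c]                              # the context vector passed to the decoder
```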
Decoder :
• The decoder is an LSTM whose initial states are initialized to the final states of the
Encoder LSTM, i.e. the context vector of the encoder’s final cell is input to the first
cell of the decoder network. Using these initial states, the decoder starts generating the
output sequence, and these outputs are also taken into consideration for future outputs.
• A stack of several LSTM units, where each unit predicts an output y_t at time step t.
• Each recurrent unit accepts a hidden state from the previous unit and produces an output as well as its own hidden state.
• Any hidden state h_i is computed from the previous hidden state using the formula: h_i = f(W^(hh) · h_(i-1))
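A matching decoder sketch, under the same assumptions as the encoder sketch above (the minimal encoder is repeated here so the snippet is self-contained): the decoder LSTM is initialized with the encoder's final (h, c) states, and a softmax Dense layer turns each step's output into a probability distribution over the vocabulary.

```python
from tensorflow import keras

vocab_size, latent_dim = 10000, 256          # illustrative assumptions

# Encoder repeated from the sketch above so this snippet is self-contained.
encoder_inputs = keras.Input(shape=(None,), dtype="int32")
enc_emb = keras.layers.Embedding(vocab_size, latent_dim)(encoder_inputs)
_, state_h, state_c = keras.layers.LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]          # context vector

# Decoder: its initial states are the encoder's final (h, c) states.
decoder_inputs = keras.Input(shape=(None,), dtype="int32")
dec_emb = keras.layers.Embedding(vocab_size, latent_dim)(decoder_inputs)
decoder_lstm = keras.layers.LSTM(latent_dim, return_sequences=True, return_state=True)
decoder_outputs, _, _ = decoder_lstm(dec_emb, initial_state=encoder_states)

# Each Y_i is a probability distribution over the vocabulary via softmax.
decoder_outputs = keras.layers.Dense(vocab_size, activation="softmax")(decoder_outputs)

model = keras.Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.summary()
```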