RECURRENT NEURAL
NETWORKS (RNNS)
Dr. Gaganpreet Kaur
RNNS
• Recurrent means “coming back of something that existed before”, i.e. an RNN
holds on to the past; hence the name. RNNs have memory and are
capable of working on sequential data.
Imagine a rocket is launched but we don’t know the path it will follow.
How do you interpret an ECG?
How about cryptocurrency price prediction?
Time Series Prediction
RNNS
RNNs are a type of deep learning model designed for sequential data, enabling
predictions based on prior inputs. They are used for language translation,
sentiment analysis, and time series forecasting.
Unlike CNNs, which take image inputs and work on grid-structured data, RNNs process
sequences of values.
Features:
Sequential Learning: RNNs excel in processing time series and sequential
data.
Memory Mechanism: They maintain context through hidden states,
influencing predictions.
Diverse Applications: Useful in language processing, flood prediction, and
more.
Gradient Challenges: RNNs face vanishing and exploding gradient problems.
RNNS
• Utility in Sequential Data: RNNs are tailored for sequential processing,
making them suitable for real-time applications where past information is
crucial. This characteristic enables them to analyze trends and make
predictions based on historical data.
• Memory and Context: The hidden state mechanism allows RNNs to
remember information over time, crucial for tasks requiring context
understanding, such as natural language processing. This memory aspect
differentiates them from traditional neural networks.
• Architectural Variants: LSTM and GRU architectures are designed to
overcome the limitations of standard RNNs, particularly in learning long-
term dependencies, making them more effective for complex tasks. This
highlights the importance of architecture in achieving better
performance.
HOW DIFFERENT FROM ANNS
• RNNs exploit parameter sharing.
• Consider “I attended an International Conference in 2019”
or
“In 2019 I had a chance to attend an International Conference”
If I want to know when I attended an International Conference, the answer
should be: 2019.
But in a fully connected ANN this is a feature-extraction problem that varies
with the position of occurrence, whereas an RNN shares its weights and
maintains them over several time steps.
HOW DIFFERENT FROM ANNS
• A limitation of vanilla Neural Networks (and also Convolutional Networks) is that their API is too
constrained: they accept a fixed-sized vector as input (e.g. an image) and
produce a fixed-sized vector as output (e.g. probabilities of different classes).
• Generally the mapping is done using a fixed number of computational steps (e.g.
the number of layers in the model).
RNNs, unlike CNNs and ANNs, allow learning from sequences of vectors.
DIFFERENT TYPES FOR
DIFFERENT USES
• ANNs: Used for general Regression and Classification problems.
• CNNs: Used for object detection and image classification.
• Deep Belief Network: Used in healthcare sectors for cancer detection.
• RNN: Used for speech recognition, voice recognition, time series
prediction, and natural language processing.
Do ANNs/CNNs have parameter
sharing?
ONE HOT
ENCODING
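The slide’s illustration is not reproduced here, so here is a minimal NumPy sketch of the idea; the vocabulary and sentence are illustrative assumptions borrowed from the conference example above.

import numpy as np

# Illustrative vocabulary; any consistent word-to-index mapping works.
vocab = {"I": 0, "attended": 1, "a": 2, "conference": 3, "in": 4, "2019": 5}

def one_hot(word, vocab):
    """Return a vector of zeros with a single 1 at the word's index."""
    vec = np.zeros(len(vocab))
    vec[vocab[word]] = 1.0
    return vec

sentence = ["I", "attended", "a", "conference", "in", "2019"]
encoded = np.stack([one_hot(w, vocab) for w in sentence])
print(encoded.shape)  # (6, 6): one row per time step, one column per vocabulary word

Each word becomes a sparse vector; an RNN then consumes these vectors one time step at a time.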
RNN
In its simplest form an RNN is as shown in Fig 1, but it can have different architectures as shown in Fig 2:
Many to one, One to many, Many to many
RNN
Recurrent Neural Networks are used for the following tasks:
• Regression, which uses score calculation
• Classification, which uses error calculation with a loss function
Because an RNN deals with long sequences with many time steps, it is trained with
backpropagation through time (BPTT).
• BPTT is an extension of the standard backpropagation algorithm for RNNs
• Truncated BPTT reduces the computational complexity
of each parameter update in a Recurrent Neural Network.
COMPUTATIONAL
GRAPHS FOR
RNNS
A recurrent network with no
outputs, shown as a computational
graph. This recurrent network
just processes information
from the input x by
incorporating it into the state
h that is passed forward
through time.
The computational graph to compute the
training loss of a recurrent network that
maps an input sequence of x values to a
corresponding sequence of output ‘o’
values. A loss L measures how far each
‘o’ is from the corresponding training
target y.
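The graph itself is not shown here, but the computation it describes can be written out. A hedged reconstruction in standard notation, assuming a tanh state update and shared parameters U, V, W with biases b, c (the specific activation and bias terms are assumptions, not stated on the slide):

\begin{aligned}
h^{(t)} &= f\big(h^{(t-1)}, x^{(t)};\, \theta\big) = \tanh\big(W h^{(t-1)} + U x^{(t)} + b\big) \\
o^{(t)} &= V h^{(t)} + c \\
L &= \sum_t L^{(t)}\big(o^{(t)}, y^{(t)}\big)
\end{aligned}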
RNN
• Autoregressive models (e.g. Google DeepMind’s PixelCNN), such as those used in
GPTs, are feed-forward models that, unlike RNNs, feed past outputs in as inputs
to the next step. In RNNs, feedback is implemented through hidden
units.
• Autoregressive models are free from vanishing gradient problems.
• Transformer models are more complex and do not rely on recurrence.
WHY RNNS FOR SEQUENTIAL
MODELLING
• Most of the data in the real world is sequential.
• CNNs fail to capture it, as they work on fixed-length vectors while sequential
data can be of varying length.
• A sequential model should be able to track long-term dependencies:
“Hello how r you? Howz your health now”
• In Sequential data order is important:
The movie is good not bad
vs
The movie is not good but bad
• Requires parameter sharing
COMPONENTS OF RNN
Three dimensions for input:
• Mini-batch size
• Number of columns in our
vector per time-step
• Time-series length
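To make the three dimensions concrete, a minimal sketch assuming a (mini-batch, time steps, features) layout; the sizes are illustrative only.

import numpy as np

batch_size = 32      # mini-batch size
time_steps = 10      # time-series length
num_features = 8     # number of columns in the vector per time step

# One mini-batch of sequential input for an RNN.
x = np.random.randn(batch_size, time_steps, num_features)
print(x.shape)  # (32, 10, 8)

Some frameworks instead expect (time steps, mini-batch, features); only the convention changes, not the three dimensions.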
RNNS
RNNs can handle variable sequence lengths
Consider:
How old are you?
or
Please tell how old are you?
Or
May I know your age please?
RNNs expand along the time series and hence handle variable-length sequences
RNNS
• RNNs use parameter sharing:
• The learning parameters (U, V, W) are the same at each time step and are
shared / unrolled along the time series. At each step, learning
depends on the previous output, so learning is distributed in time (a small
sketch follows below).
Consider, “It was a sunny day”
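A minimal NumPy sketch of the unrolled forward pass, showing that the same U, W, V are reused at every time step; the tanh activation, the sizes, and the random inputs are illustrative assumptions.

import numpy as np

def rnn_forward(x_seq, U, W, V, s0):
    """x_seq: (T, input_dim). The SAME U, W, V are applied at every time step."""
    s = s0
    outputs = []
    for x_t in x_seq:                     # step through time
        s = np.tanh(U @ x_t + W @ s)      # new state depends on the input and the previous state
        outputs.append(V @ s)             # output computed from the current state
    return np.array(outputs), s

input_dim, hidden_dim, output_dim, T = 4, 8, 3, 5
U = np.random.randn(hidden_dim, input_dim)
W = np.random.randn(hidden_dim, hidden_dim)
V = np.random.randn(output_dim, hidden_dim)
x_seq = np.random.randn(T, input_dim)     # five time steps of input features (e.g. encoded words)
outputs, s_T = rnn_forward(x_seq, U, W, V, np.zeros(hidden_dim))
print(outputs.shape)  # (5, 3): one output per time step, all produced by the shared parameters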
RNNS
Unfolding across Time:
At each time step, the state ‘sn’ contains information from the past
RNNS
An RNN captures differences in sequence order: because it learns from the
immediately previous word/input, it captures localized dependencies in the sequence
The movie is good not bad
vs
The movie is not good but bad
RNNS
• RNNs support Non-Linear Mapping
• RNNs use non-linear activation functions, which allows them to learn
complex, non-linear mappings between inputs and outputs.
STANDARDIZATION
• It generally helps to standardize the input data (e.g., zero mean,
unit variance).
• This helps transform the inputs into a range more suitable for the
standard activation functions.
• Standardization helps keep the relationship between the inputs and the
targets as simple and localized as possible; it applies only to real-
valued inputs.
• Not used for one-hot (categorical) inputs
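A small sketch of standardizing real-valued inputs to zero mean and unit variance; computing the statistics on the training set only and the small epsilon guard are the usual practice, assumed here rather than stated on the slide.

import numpy as np

def standardize(train_x, test_x, eps=1e-8):
    """Zero-mean, unit-variance scaling using statistics from the training set only."""
    mean = train_x.mean(axis=0)
    std = train_x.std(axis=0)
    scale = lambda x: (x - mean) / (std + eps)
    return scale(train_x), scale(test_x)

train_x = np.random.randn(100, 3) * 50 + 10   # real-valued features on an awkward scale
test_x = np.random.randn(20, 3) * 50 + 10
train_s, test_s = standardize(train_x, test_x)
print(train_s.mean(axis=0).round(2), train_s.std(axis=0).round(2))  # ~0 and ~1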
LEARNING
• Uses BPTT: takes the derivative of the loss function w.r.t. each parameter (U, V, W) and
modifies the parameters so as to minimize the loss
Jn is the loss computed against the target value at time step n
BATCH NORMALIZATION
• The following key points explain the intuition behind BN and how it works:
• It consists of adding an operation in the model just before or after the
activation function of each hidden layer.
• This operation simply zero-centers and normalizes each input, then scales
and shifts the result using two new parameter vectors per layer: one for
scaling, the other for shifting.
• In other words, the operation lets the model learn the optimal scale and
mean of each of the layer’s inputs.
• To zero-center and normalize the inputs, the algorithm needs to estimate
each input’s mean and standard deviation.
• It does so by evaluating the mean and standard deviation of the input over
the current mini-batch (hence the name “Batch Normalization”).
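A minimal sketch of the operation described above for a single mini-batch; gamma (scale) and beta (shift) are the two learnable parameter vectors, and the epsilon constant and variable names follow common convention rather than the slides. At inference time, running averages of the batch statistics are normally used instead.

import numpy as np

def batch_norm(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Zero-center and normalize each feature over the mini-batch,
    then scale by gamma and shift by beta (both learned during training)."""
    mu = x.mean(axis=0)                   # per-feature mean over the current mini-batch
    var = x.var(axis=0)                   # per-feature variance over the current mini-batch
    x_hat = (x - mu) / np.sqrt(var + eps)
    return gamma * x_hat + beta

x = np.random.randn(32, 8) * 3 + 5        # activations with non-zero mean and large variance
gamma, beta = np.ones(8), np.zeros(8)     # initial scale and shift; learned like any other weights
y = batch_norm(x, gamma, beta)
print(y.mean(axis=0).round(2), y.std(axis=0).round(2))  # ~0 and ~1 before scale/shift are learned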
CHALLENGES
• Suffer from the vanishing and exploding gradient problem
Source: medium/analytics-vidhya
EXPLODING VS VANISHING
GRADIENTS
• An error gradient is the direction and magnitude calculated during the training of a
neural network that is used to update the network weights in the right direction and by
the right amount.
• In deep networks or recurrent neural networks, error gradients can accumulate during
an update and result in very large gradients. These in turn result in large updates to the
network weights and, in turn, an unstable network. At an extreme, the values of the weights
can become so large as to overflow and result in NaN values.
• The explosion occurs through exponential growth from repeatedly multiplying gradients
through network layers that have values larger than 1.0.
• May lead to avalanche learning
• Solutions:
• If exploding gradients are occurring, you can check for and limit the size of the gradients
during the training of your network: gradient clipping (see the sketch after this list).
• Check the size of the network weights and apply a penalty to the network’s loss function for
large weight values: weight regularization, often with an L1 (absolute weights) or an L2
(squared weights) penalty
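A hedged sketch of gradient clipping by global norm; deep learning frameworks provide this directly (for example PyTorch’s torch.nn.utils.clip_grad_norm_), and the max_norm value here is an arbitrary illustration.

import numpy as np

def clip_gradients(grads, max_norm=1.0):
    """Rescale a list of gradient arrays so their combined L2 norm does not exceed max_norm."""
    total_norm = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total_norm > max_norm:
        scale = max_norm / (total_norm + 1e-6)
        grads = [g * scale for g in grads]
    return grads

grads = [np.random.randn(8, 8) * 100, np.random.randn(8) * 100]   # deliberately exploded gradients
clipped = clip_gradients(grads, max_norm=5.0)
print(np.sqrt(sum(float((g ** 2).sum()) for g in clipped)))       # <= 5.0

Weight regularization would instead add an L1 or L2 penalty term to the loss, so large weights are discouraged during training itself.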
• As the backpropagation algorithm advances downwards (or backwards)
from the output layer towards the input layer, the gradients often get
smaller and smaller and approach zero, which eventually leaves the
weights of the initial or lower layers nearly unchanged. As a result,
gradient descent never converges to the optimum. This is known as
the vanishing gradients problem.
• May lead to learning stagnation / saturation
• Solution: use non-saturating activation functions (ReLU, Leaky ReLU) and
Batch Normalization
CHALLENGES
• RNNs can have high Computational Complexity. RNNs can be
computationally expensive to train, especially when dealing with long
sequences. This is because the network has to process each input in
sequence, which can be slow.
• It is difficult to choose the right RNN architecture. There are many
different variants of RNNs, each with its own advantages and
disadvantages. Choosing the right architecture for a given task can be
challenging, and may require extensive experimentation and tuning.
CHALLENGES
• RNN is unable to capture Long-Term Dependencies. RNNs are designed to
capture information about past inputs, but they can struggle to capture
long-term dependencies in the input sequence as past information is
dominated by immediate inputs. This is because the gradients can
become very small as they propagate through time, which can cause the
network to forget important information.
• RNNs lack parallelism due to their inherently sequential nature, which
makes them slow. This can limit the speed and scalability of the network.
ACTIVATION FUNCTIONS IN RNN
• Sigmoid Function: It has a range between 0 and 1, which makes it useful
for binary classification tasks. The formula for the sigmoid function is:
σ(x) = 1 / (1 + e^(-x))
• Hyperbolic Tangent (Tanh) Function: It has a range between -1 and 1,
which makes it useful for non-linear classification tasks. The formula for
the tanh function is:
tanh(x) = (e^x - e^(-x)) / (e^x + e^(-x))
ACTIVATION FUNCTIONS IN RNN
• Rectified Linear Unit (ReLU) Function: It has a range between 0 and
infinity, which makes it useful for models that require positive outputs.
The formula for the ReLU function is:
ReLU(x) = max(0, x)
• Leaky ReLU Function: It introduces a small slope to negative values, which
helps to prevent dead neurons in the model. The formula for the Leaky
ReLU function is:
Leaky ReLU(x) = max(0.01x, x)
ACTIVATION FUNCTIONS IN RNN
• Softmax Function: The softmax function is often used in the output layer
of RNNs for multi-class classification tasks. It converts the network output
into a probability distribution over the possible classes. The formula for
the softmax function is:
softmax(x_i) = e^(x_i) / ∑_j e^(x_j)
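The formulas above translate directly into code; a small NumPy sketch (subtracting the maximum inside softmax is a standard numerical-stability trick, not part of the slide’s formula):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))       # range (0, 1)

def tanh(x):
    return np.tanh(x)                      # range (-1, 1)

def relu(x):
    return np.maximum(0.0, x)              # range [0, infinity)

def leaky_relu(x, slope=0.01):
    return np.maximum(slope * x, x)        # small slope keeps negative inputs alive

def softmax(x):
    e = np.exp(x - x.max())                # subtract max for numerical stability
    return e / e.sum()                     # probabilities summing to 1

x = np.array([-2.0, 0.0, 3.0])
print(softmax(x), softmax(x).sum())        # a probability distribution summing to 1.0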
VARIANTS OF RNN
• Vanilla RNN (single I/P and single O/P), LSTM, Gated Recurrent Unit (GRU),
Bidirectional LSTM
• Single I/P and single O/P:
• Single I/P and many O/P: Image Captioning
• Many I/P and many O/P: Language Translators
• Many I/P and single O/P: Sentiment Analysis
BACKPROPAGATION IN
TIME:RNNS
BACKPROPAGATION IN
TIME:RNNS
J is the loss function at each
time step: Ji denotes the loss
at time step i
BACKPROPAGATION IN
TIME:RNNS
Applying the chain rule:
s2 depends on s1, which in turn
depends on s0, and both depend on W.
So s2 can be further expanded as:
Generalizing the expression for backpropagation through time steps:
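The expanded expressions themselves appear as figures on the original slides; a hedged reconstruction of the standard BPTT expansion, using the deck’s notation (Ji for the loss at step i, st for the state, W for the shared recurrent weights):

\begin{aligned}
\frac{\partial J_2}{\partial W}
  &= \frac{\partial J_2}{\partial s_2}\frac{\partial s_2}{\partial W}
   + \frac{\partial J_2}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial W}
   + \frac{\partial J_2}{\partial s_2}\frac{\partial s_2}{\partial s_1}\frac{\partial s_1}{\partial s_0}\frac{\partial s_0}{\partial W} \\
\frac{\partial J_t}{\partial W}
  &= \sum_{k=0}^{t} \frac{\partial J_t}{\partial s_t}
     \left( \prod_{j=k+1}^{t} \frac{\partial s_j}{\partial s_{j-1}} \right)
     \frac{\partial s_k}{\partial W}
\end{aligned}

The repeated product of ∂s_j/∂s_{j-1} factors is exactly what shrinks or blows up over long sequences, giving the vanishing and exploding gradient problems discussed earlier.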
LSTMS
• Overcomes the vanishing gradient problem
• Uses gating mechanisms that control the flow of information through the
network:
• Input gate
• Forget gate
• Output gate.
• The use of gates allows the LSTM network to selectively remember or forget
information from the input sequence, which makes it more effective for
long-term dependencies.
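For reference, the standard LSTM cell equations, showing where the three gates act; the notation (W_f, W_i, W_o, W_c for the gate weights, σ for the sigmoid, ⊙ for element-wise product) follows common convention, since the slide names only the gates:

\begin{aligned}
f_t &= \sigma\big(W_f\,[h_{t-1}, x_t] + b_f\big) && \text{(forget gate)} \\
i_t &= \sigma\big(W_i\,[h_{t-1}, x_t] + b_i\big) && \text{(input gate)} \\
\tilde{c}_t &= \tanh\big(W_c\,[h_{t-1}, x_t] + b_c\big) && \text{(candidate cell state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t && \text{(cell state update)} \\
o_t &= \sigma\big(W_o\,[h_{t-1}, x_t] + b_o\big) && \text{(output gate)} \\
h_t &= o_t \odot \tanh(c_t) && \text{(hidden state)}
\end{aligned}

The forget gate decides what to drop from the cell state, the input gate decides what new information to write, and the output gate decides how much of the cell state to expose as the hidden state.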