Deep Learning: Feedforward Networks Explained

The document provides an overview of key concepts in deep learning, focusing on feedforward networks, the perceptron algorithm, gradient descent, backpropagation, empirical risk minimization, and regularization techniques. It explains how neurons function within these networks, the structure of multilayer perceptrons, and the importance of weight adjustments during training. Additionally, it covers the applications of feedforward networks and the significance of gradient descent in optimizing neural network performance.


Unit II

Feedforward Networks: Multilayer Perceptron, Gradient Descent, Backpropagation, Empirical Risk Minimization, regularization, autoencoders.
Deep Neural Networks: Difficulty of training deep neural networks, Greedy layer wise training.

1. Basic Concepts of Neurons


Neurons in deep learning models are nodes through which data and computations flow.

Neurons work like this:

● They receive one or more input signals. These input signals can come from either the
raw data set or from neurons positioned at a previous layer of the neural net.

● They perform some calculations.

● They send some output signals to neurons deeper in the neural net through a synapse.

Neurons in a deep learning model can have synapses that connect to more than one neuron in the preceding layer. Each synapse has an associated weight, which determines how strongly the preceding neuron's output influences the overall network.
Weights are a very important topic in the field of deep learning because adjusting a
model’s weights is the primary way through which deep learning models are trained. You’ll see
this in practice later on when we build our first neural networks from scratch.

Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each signal multiplied by its corresponding weight (often plus a bias term b) and passes the sum on to an activation function:

z = w₁x₁ + w₂x₂ + … + wₙxₙ + b

The activation function calculates the output value for the neuron. This output value is then
passed on to the next layer of the neural network through another synapse.

This serves as a broad overview of deep learning neurons. Do not worry if it was a lot to take in
– we’ll learn much more about neurons in the rest of this tutorial. For now, it’s sufficient for you
to have a high-level understanding of how they are structured in a deep learning model.
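As a quick sketch, here is the neuron computation described above in plain Python. The sigmoid activation, the bias term, and all the numbers are illustrative choices, not values from any particular model:

```python
import math

def neuron(inputs, weights, bias):
    """One neuron: weighted sum of inputs plus bias, passed through a sigmoid."""
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))  # sigmoid activation

# Three input signals from a previous layer (all numbers are made up)
out = neuron([0.5, 0.3, 0.2], [0.4, 0.7, 0.2], bias=0.1)
print(out)  # a value in (0, 1)
```

Any other activation function (ReLU, tanh, and so on) could be swapped in without changing the weighted-sum step.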

2. Perceptron Algorithm

● A Perceptron is an artificial neuron.

● It is the simplest possible neural network.

● Perceptrons are the building blocks of larger neural networks.

The original Perceptron was designed to take a number of binary inputs, and produce
one binary output (0 or 1).
The idea was to use different weights to represent the importance of each input, and that
the sum of the values should be greater than a threshold value before making a decision
like true or false (0 or 1).

Perceptron Example

Imagine a perceptron (in your brain).

The perceptron tries to decide if you should go to a concert.

Is the artist good? Is the weather good?

What weights should these facts have?

Algorithm
Frank Rosenblatt suggested this algorithm:

1. Set a threshold value

2. Multiply each input by its weight

3. Sum all the results

4. Activate the output


1. Set a threshold value:

● Threshold = 1.5

2. Multiply each input by its weight:

● x1 * w1 = 1 * 0.7 = 0.7

● x2 * w2 = 0 * 0.6 = 0

● x3 * w3 = 1 * 0.5 = 0.5

● x4 * w4 = 0 * 0.3 = 0

● x5 * w5 = 1 * 0.4 = 0.4

3. Sum all the results:

● 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 (The Weighted Sum)

4. Activate the Output:

● Return true if the sum > 1.5 ("Yes I will go to the Concert")
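The four steps above can be sketched directly in Python; the inputs, weights, and threshold are the ones from this example:

```python
def perceptron(inputs, weights, threshold):
    """Return the decision and the weighted sum for one perceptron."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return weighted_sum > threshold, weighted_sum

# The concert example above: x = (1, 0, 1, 0, 1), w = (0.7, 0.6, 0.5, 0.3, 0.4)
decision, s = perceptron([1, 0, 1, 0, 1], [0.7, 0.6, 0.5, 0.3, 0.4], threshold=1.5)
print(decision, s)  # True, weighted sum ≈ 1.6
```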

3. Feed Forward – Multilayer Perceptron


Feedforward neural networks, also known as deep feedforward networks or multilayer perceptrons, are the focus of this article. More elaborate architectures such as convolutional and recurrent neural networks (used extensively in computer vision applications) build on these networks. We'll do our best to grasp the key ideas in an engaging and hands-on manner without having to delve too deeply into mathematics.

Search engines, machine translation, and mobile applications all rely on deep learning technologies. Deep learning works by simulating the human brain's ability to identify and create patterns from various types of input.

A feedforward neural network is a key component of this fantastic technology since it aids software
developers with pattern recognition and classification, non-linear regression, and function
approximation.

What is a Feedforward Neural Network?


A feedforward neural network is a type of artificial neural network in which nodes’
connections do not form a loop.

Often referred to as a multi-layered network of neurons, feedforward neural networks are so named because all information flows in a forward direction only.

The data enters the input nodes, travels through the hidden layers, and eventually exits
the output nodes. The network is devoid of links that would allow the information exiting the
output node to be sent back into the network.

The purpose of feedforward neural networks is to approximate functions.

Suppose there is a classifier y = f*(x) that assigns an input x to a category y.

The feedforward network defines a mapping y = f(x; θ) and learns the value of the parameters θ that results in the best approximation of f*.

A Feedforward Neural Network’s Layers

The following are the components of a feedforward neural network:

Input layer

It contains the neurons that receive input. The data is subsequently passed on to the next layer. The input layer's total number of neurons is equal to the number of features in the dataset.

Hidden layer

This is the intermediate layer, which is concealed between the input and output layers. This layer has a large number of neurons that perform transformations on the inputs. They then communicate with the output layer.

Output layer

It is the last layer, and its form depends on the model's construction. The output layer yields the predicted value, which is compared against the desired outcome.

Neuron weights

Weights describe the strength of a connection between neurons. A weight can take any real value; weights are typically initialized to small random values, and a weight near 0 means the connection has little influence.

Cost Function in Feedforward Neural Network


The cost function is an important factor of a feedforward neural network. Generally, minor adjustments to weights and biases have little effect on the categorized data points. Thus, a smooth cost function is needed so that performance can be improved by making minor adjustments to weights and biases.

The mean squared error cost function is defined as follows:

C(w, b) = (1 / 2n) Σₓ ‖y(x) − a‖²

Where,

w = weights collected in the network

b = biases

n = number of training inputs

a = output vector of the network when x is input

y(x) = desired output for input x

x = input

‖v‖ = usual length of vector v
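As a sketch, the cost can be computed over a small hypothetical training set like this (the target and output vectors are made-up numbers):

```python
def mse_cost(targets, outputs):
    """C(w, b) = 1/(2n) * sum over inputs x of ||y(x) - a||^2."""
    n = len(targets)
    total = 0.0
    for y, a in zip(targets, outputs):
        total += sum((yi - ai) ** 2 for yi, ai in zip(y, a))
    return total / (2 * n)

# Two training examples with 2-dimensional desired outputs (made-up numbers)
cost = mse_cost(targets=[[1.0, 0.0], [0.0, 1.0]],
                outputs=[[0.9, 0.2], [0.1, 0.8]])
print(cost)  # the squared errors summed over inputs, divided by 2n
```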

Loss Function in Feedforward Neural Network


A neural network's loss function is used to identify whether the learning process needs to be adjusted.

The output layer has as many neurons as there are classes, and the loss measures the difference between the predicted and actual distributions of probabilities.

The cross-entropy loss for binary classification is:

L = −[y log ŷ + (1 − y) log(1 − ŷ)]

The cross-entropy loss for multi-class classification is:

L = −Σ_c y_c log ŷ_c
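Both loss functions can be sketched in a few lines of Python; the probability values below are purely illustrative:

```python
import math

def binary_cross_entropy(y, y_hat):
    """L = -[y*log(y_hat) + (1 - y)*log(1 - y_hat)] for one example."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

def categorical_cross_entropy(y, y_hat):
    """L = -sum_c y_c * log(y_hat_c) for a one-hot target vector y."""
    return -sum(yc * math.log(pc) for yc, pc in zip(y, y_hat))

print(binary_cross_entropy(1, 0.9))                           # ≈ 0.105
print(categorical_cross_entropy([0, 1, 0], [0.2, 0.7, 0.1]))  # ≈ 0.357
```

Note that the loss is small when the predicted probability of the true class is high, and grows without bound as that probability approaches zero.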


Applications of Feedforward Neural Network
These neural networks are utilized in a wide variety of applications. Several of them are described below:

● Physiological feedforward system: Here, feedforward control is exemplified by the normal anticipatory regulation of heartbeat prior to exercise by the central involuntary system.

● Gene regulation and feedforward: Here, one motif predominates throughout the known networks, and this motif has been demonstrated to be a feedforward system for detecting non-temporary changes in the environment.

● Automating and managing machines

● Parallel feedforward compensation with derivative: This is a relatively recent approach for converting the non-minimum phase part of an open-loop transfer system into the minimum phase part.

4. Gradient Descent
Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum/maximum of a given function. This method is commonly used in machine learning (ML) and deep learning (DL) to minimise a cost/loss function (e.g. in a linear regression). Due to its importance and ease of implementation, this algorithm is usually taught at the beginning of almost all machine learning courses.

However, its use is not limited to ML/DL only, it’s being widely used also in areas like:

● control engineering (robotics, chemical, etc.)

● computer games

● mechanical engineering

Function requirements

Gradient descent algorithm does not work for all functions. There are two specific
requirements. A function has to be:
● differentiable
● convex

First, what does it mean that a function has to be differentiable? A differentiable function has a derivative at each point in its domain; not all functions meet this criterion. Let's first see some examples of functions meeting it:

Next requirement: the function has to be convex. For a univariate function, this means that the line segment connecting any two points of the function's curve lies on or above the curve (it does not cross it). If it does cross the curve, the function has a local minimum which is not a global one.

Mathematically, for two points x₁, x₂ lying on the function's curve, this condition is expressed as:

f((1 − λ)x₁ + λx₂) ≤ (1 − λ)f(x₁) + λf(x₂)

where λ denotes a point's location on the section line; its value is between 0 (left point) and 1 (right point), e.g. λ = 0.5 means a location in the middle.

Another way to check mathematically whether a univariate function is convex is to calculate the second derivative and check whether its value is always bigger than 0.

Let's investigate a simple quadratic function, for example f(x) = x², whose second derivative is f″(x) = 2.

Because the second derivative is always bigger than 0, our function is strictly convex.

It is also possible to use quasi-convex functions with a gradient descent algorithm. However, they often have so-called saddle points (also called minimax points) where the algorithm can get stuck (we will demonstrate it later in the article). An example of a quasi-convex function is:

f(x) = x⁴ − 2x³

Its second derivative is f″(x) = 12x² − 12x = 12x(x − 1). The value of this expression is zero for x = 0 and x = 1. These locations are called inflexion points, places where the curvature changes sign, meaning the function changes from convex to concave or vice versa. By analysing this second derivative we conclude that:

● for x<0: function is convex

● for 0<x<1: function is concave (the 2nd derivative < 0)

● for x>1: function is convex again

Now we see that point x=0 has both first and second derivative equal to zero meaning this is a
saddle point and point x=1.5 is a global minimum.

Let's look at the graph of this function. As calculated before, there is a saddle point at x = 0 and a minimum at x = 1.5.

For multivariate functions, the most appropriate check of whether a point is a saddle point is to calculate the Hessian matrix, which involves slightly more complex calculations and is beyond the scope of this article.

An example of a saddle point in a bivariate function is shown below.


Gradient Descent Algorithm

The Gradient Descent Algorithm iteratively calculates the next point using the gradient at the current position, scales it (by a learning rate) and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimise the function (to maximise it we would add). This process can be written as:

pₙ₊₁ = pₙ − η∇f(pₙ)

There's an important parameter η which scales the gradient and thus controls the step size. In machine learning, it is called the learning rate and has a strong influence on performance.

● The smaller the learning rate, the longer GD takes to converge, or it may reach the maximum number of iterations before reaching the optimum point.

● If the learning rate is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely.

In summary, Gradient Descent method’s steps are:

1. choose a starting point (initialisation)

2. calculate gradient at this point

3. make a scaled step in the opposite direction to the gradient (objective: minimise)

4. repeat points 2 and 3 until one of the criteria is met:

● maximum number of iterations reached

● step size is smaller than the tolerance (due to scaling or a small gradient).
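The loop above can be sketched in plain Python. As an illustration we use a function with a saddle point at x = 0 and a global minimum at x = 1.5 (here f(x) = x⁴ − 2x³, consistent with the earlier analysis); the starting points and learning rate are arbitrary choices:

```python
def gradient_descent(grad, start, lr=0.01, max_iter=10000, tol=1e-10):
    """Iterate x <- x - lr * grad(x) until the step size falls below tol
    or the maximum number of iterations is reached."""
    x = start
    for _ in range(max_iter):
        step = lr * grad(x)
        if abs(step) < tol:
            break
        x -= step
    return x

def grad(x):                  # derivative of f(x) = x**4 - 2*x**3
    return 4 * x**3 - 6 * x**2

x_min = gradient_descent(grad, start=2.0)     # converges to the minimum at x = 1.5
x_stuck = gradient_descent(grad, start=-0.1)  # crawls toward the saddle at x = 0 and stalls
print(x_min, x_stuck)
```

Starting to the left of the saddle, the gradient vanishes near x = 0 and the algorithm stalls there, which is exactly the saddle-point problem mentioned above.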

5. Back Propagation Networks


Backpropagation is the essence of neural network training. It is the method of fine-tuning the weights of a neural network based on the error rate obtained in the previous epoch (i.e., iteration). Proper tuning of the weights reduces error rates and makes the model more reliable by increasing its generalization.

Backpropagation in neural network is a short form for “backward propagation of errors.”


It is a standard method of training artificial neural networks. This method helps calculate the
gradient of a loss function with respect to all the weights in the network.

How Backpropagation Algorithm Works


The backpropagation algorithm in a neural network computes the gradient of the loss function for a single weight by the chain rule. It efficiently computes one layer at a time, unlike a naive direct computation. It computes the gradient, but it does not define how the gradient is used. It generalizes the computation in the delta rule.

The backpropagation procedure works as follows:

1. Inputs X arrive through the preconnected path.

2. Input is modeled using real weights W. The weights are usually randomly selected.

3. Calculate the output for every neuron from the input layer, to the hidden layers, to the
output layer.

4. Calculate the error in the outputs

5. Travel back from the output layer to the hidden layer to adjust the weights such that the
error is decreased.

Keep repeating the process until the desired output is achieved.
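Steps 1 to 5 can be sketched for the smallest possible network, a single sigmoid neuron trained with squared error; the toy dataset (the logical OR function) and the learning rate are hypothetical choices:

```python
import math, random

random.seed(0)

# Step 1: toy inputs (the logical OR function, a hypothetical dataset)
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]

# Step 2: model the input with randomly selected real weights
w = [random.uniform(-0.5, 0.5), random.uniform(-0.5, 0.5)]
b = 0.0
lr = 0.5

def forward(x):                               # step 3: compute the neuron's output
    z = w[0] * x[0] + w[1] * x[1] + b
    return 1.0 / (1.0 + math.exp(-z))         # sigmoid activation

def epoch_loss():                             # step 4: squared error over the data
    return sum((forward(x) - y) ** 2 for x, y in data) / len(data)

before = epoch_loss()
for _ in range(1000):                         # keep repeating until the error is small
    for x, y in data:
        a = forward(x)
        delta = (a - y) * a * (1 - a)         # chain rule through the sigmoid
        w[0] -= lr * delta * x[0]             # step 5: travel back and adjust weights
        w[1] -= lr * delta * x[1]
        b -= lr * delta
after = epoch_loss()
print(before, after)  # the error decreases as the weights are tuned
```

With more layers, the same chain-rule factor `delta` would itself be propagated backward through each layer's weights, one layer at a time.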


Why Do We Need Backpropagation?

The most prominent advantages of backpropagation are:

● Backpropagation is fast, simple and easy to program.

● It has no parameters to tune apart from the number of inputs.

● It is a flexible method as it does not require prior knowledge about the network.

● It is a standard method that generally works well.

● It does not need any special mention of the features of the function to be learned.

Types of Backpropagation Networks

Two Types of Backpropagation Networks are:

● Static Back-propagation

● Recurrent Backpropagation

Static back-propagation:

It is a kind of backpropagation network that produces a mapping of a static input to a static output. It is useful for solving static classification problems such as optical character recognition.

Recurrent Backpropagation:

In recurrent backpropagation, activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.

The main difference between the two methods is that the mapping is rapid in static back-propagation, while it is non-static in recurrent backpropagation.

Backpropagation in a neural network can be explained with the help of the “Shoe Lace” analogy:

Too little tension =

● Not enough constraining and very loose

Too much tension =

● Too much constraint (overtraining)


● Taking too much time (relatively slow process)

● Higher likelihood of breaking

Pulling one lace more than the other =

● Discomfort (bias)

Disadvantages of using Backpropagation

● The actual performance of backpropagation on a specific problem depends on the input data.

● The backpropagation algorithm can be quite sensitive to noisy data.

● You need to use the matrix-based approach for backpropagation instead of the mini-batch approach.

6. Empirical Risk Minimization


Empirical risk minimization (ERM) is a principle in statistical learning theory that defines a family of learning algorithms and is used to give theoretical bounds on their performance. The idea is that we don't know exactly how well an algorithm will work in practice (the true "risk") because we don't know the true distribution of the data the algorithm will work on; as an alternative, we can measure its performance on a known set of training data (the "empirical" risk).
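As a minimal sketch, the empirical risk is just the average loss over the training set, and ERM picks the hypothesis that minimizes it; the two candidate hypotheses and the data below are made up for illustration:

```python
def empirical_risk(hypothesis, samples, loss):
    """Average loss of a hypothesis over a finite set of training samples."""
    return sum(loss(hypothesis(x), y) for x, y in samples) / len(samples)

def squared_loss(pred, y):
    return (pred - y) ** 2

# A toy training set and two candidate hypotheses (all hypothetical)
samples = [(1, 2.1), (2, 3.9), (3, 6.2)]
h1 = lambda x: 2 * x      # hypothesis 1: y = 2x
h2 = lambda x: x + 1      # hypothesis 2: y = x + 1

r1 = empirical_risk(h1, samples, squared_loss)
r2 = empirical_risk(h2, samples, squared_loss)
print(r1, r2)  # ERM would select h1, the hypothesis with the lower empirical risk
```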

Refer PDF

7. Regularization
Regularization is a set of techniques that can prevent overfitting in neural networks and
thus improve the accuracy of a Deep Learning model when facing completely new data from the
problem domain. In this article, we will address the most popular regularization techniques
which are called L1, L2, and dropout.

One of the most important aspects when training neural networks is avoiding overfitting.

What is Regularization?

Simply speaking, regularization refers to a set of different techniques that lower the complexity of a neural network model during training and thus prevent overfitting.

There are three very popular and efficient regularization techniques, called L1, L2, and dropout, which we are going to discuss in the following.

L2 Regularization

The L2 regularization is the most common type of all regularization techniques and is also commonly known as weight decay or Ridge Regression.

The mathematical derivation of this regularization, as well as the mathematical explanation of why this method works at reducing overfitting, is quite long and complex. Since this is a very practical article, I don't want to focus on the mathematics more than is required. Instead, I want to convey the intuition behind this technique and, most importantly, how to implement it so you can address the overfitting problem during your deep learning projects.

During L2 regularization, the loss function of the neural network is extended by a so-called regularization term, called here Ω.

The regularization term Ω is defined as the Euclidean norm (or L2 norm) of the weight matrices, which is the sum over all squared weight values of a weight matrix. The regularization term is weighted by the scalar alpha divided by two and added to the regular loss function that is chosen for the current task. This leads to a new expression for the loss function:

L_new(w) = L(w) + (α/2) ‖w‖²

Alpha is sometimes called the regularization rate and is an additional hyperparameter we introduce into the neural network. Simply speaking, alpha determines how much we regularize our model.

In the next step we can compute the gradient of the new loss function and put the gradient into the update rule for the weights:

w ← w − η(∇L(w) + αw)

Some reformulation of the update rule leads to an expression that looks very much like the update rule for the weights during regular gradient descent:

w ← (1 − ηα)w − η∇L(w)

The only difference is that by adding the regularization term we introduce an additional subtraction from the current weights (the first term in the equation).

In other words, independent of the gradient of the loss function, we make our weights a little bit smaller each time an update is performed.
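A one-step numeric sketch shows that the extended-loss form and the weight-decay form of the update are the same; all numbers are hypothetical:

```python
# One weight-update step with L2 regularization (weight decay), for a single weight
eta, alpha = 0.1, 0.01   # learning rate and regularization rate (hypothetical values)
w, grad_loss = 0.8, 0.2  # current weight and gradient of the unregularized loss

# Form 1: gradient of the extended loss  L + (alpha/2) * w^2
w1 = w - eta * (grad_loss + alpha * w)

# Form 2: "weight decay" - shrink the weight, then apply the usual update
w2 = (1 - eta * alpha) * w - eta * grad_loss

print(w1, w2)  # both forms give the same result
```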

L1 Regularization

In the case of L1 regularization (also knows as Lasso regression), we simply use another
regularization term Ω. This term is the sum of the absolute values of the weight parameters in a
weight matrix:
Dropout

In addition to L2 and L1 regularization, another famous and powerful regularization technique is called dropout regularization. The procedure behind dropout is quite simple.

In a nutshell, dropout means that during training, each neuron of the neural network gets turned off with some probability P. Let's look at a visual example.

Assume on the left side we have a feedforward neural network with no dropout. Using dropout with, let's say, a probability of P = 0.5 that a random neuron gets turned off during training would result in the neural network on the right side.

In this case, you can observe that approximately half of the neurons are not active and are
not considered as a part of the neural network. And as you can observe the neural network
becomes simpler.
A simpler version of the neural network results in less complexity that can reduce
overfitting. The deactivation of neurons with a certain probability P is applied at each forward
propagation and weight update step.
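A sketch of the dropout mask in plain Python; this uses the common "inverted dropout" formulation, which rescales the surviving activations by 1/(1 − P) so the expected activation is unchanged (an implementation detail not covered above):

```python
import random

random.seed(1)

def dropout(activations, p):
    """Inverted dropout: zero each activation with probability p during training,
    scaling the survivors by 1 / (1 - p) so the expected activation is unchanged."""
    return [0.0 if random.random() < p else a / (1 - p) for a in activations]

layer_output = [0.5, 1.2, 0.8, 0.3, 0.9, 1.1]
print(dropout(layer_output, p=0.5))  # roughly half of the activations are zeroed
```

At test time, dropout is switched off and all neurons participate.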

8. Autoencoders
What is an autoencoder?
An autoencoder is a type of artificial neural network used to learn data encodings in an unsupervised manner.

The aim of an autoencoder is to learn a lower-dimensional representation (encoding) of higher-dimensional data, typically for dimensionality reduction, by training the network to capture the most important parts of the input image.

The architecture of autoencoders


Let’s start with a quick overview of autoencoders’ architecture.

Autoencoders consist of 3 parts:

1. Encoder: A module that compresses the train-validate-test set input data into an encoded
representation that is typically several orders of magnitude smaller than the input data.

2. Bottleneck: A module that contains the compressed knowledge representations and is therefore the most important part of the network.

3. Decoder: A module that helps the network “decompress” the knowledge representations and reconstructs the data back from its encoded form. The output is then compared with a ground truth.

The architecture as a whole looks something like this:


The relationship between the Encoder, Bottleneck, and Decoder

Encoder

The encoder is a set of convolutional blocks followed by pooling modules that compress
the input to the model into a compact section called the bottleneck.

The bottleneck is followed by the decoder that consists of a series of upsampling modules
to bring the compressed feature back into the form of an image. In case of simple autoencoders,
the output is expected to be the same as the input with reduced noise.

However, for variational autoencoders it is a completely new image, formed with information the model has been provided as input.

Bottleneck

The most important part of the neural network, and ironically the smallest one, is the bottleneck. The bottleneck exists to restrict the flow of information from the encoder to the decoder, thus allowing only the most vital information to pass through.

Since the bottleneck is designed in such a way that the maximum information possessed by an image is captured in it, we can say that the bottleneck helps us form a knowledge representation of the input.
Thus, the encoder-decoder structure helps us extract the most from an image in the form
of data and establish useful correlations between various inputs within the network.

A bottleneck, as a compressed representation of the input, further prevents the neural network from memorising the input and overfitting on the data.

As a rule of thumb, remember this: the smaller the bottleneck, the lower the risk of overfitting. However, very small bottlenecks restrict the amount of information storable, which increases the chances of important information slipping out through the pooling layers of the encoder.

Decoder

Finally, the decoder is a set of upsampling and convolutional blocks that reconstructs the
bottleneck's output.

Since the input to the decoder is a compressed knowledge representation, the decoder
serves as a “decompressor” and builds back the image from its latent attributes.

How to train autoencoders?


You need to set 4 hyperparameters before training an autoencoder:

1. Code size: The code size or the size of the bottleneck is the most important
hyperparameter used to tune the autoencoder. The bottleneck size decides how much the
data has to be compressed. This can also act as a regularisation term.

2. Number of layers: Like all neural networks, an important hyperparameter to tune


autoencoders is the depth of the encoder and the decoder. While a higher depth increases
model complexity, a lower depth is faster to process.

3. Number of nodes per layer: The number of nodes per layer defines the weights we use
per layer. Typically, the number of nodes decreases with each subsequent layer in the
autoencoder as the input to each of these layers becomes smaller across the layers.

4. Reconstruction Loss: The loss function we use to train the autoencoder is highly
dependent on the type of input and output we want the autoencoder to adapt to. If we are
working with image data, the most popular loss functions for reconstruction are MSE
Loss and L1 Loss. In case the inputs and outputs are within the range [0,1], as in
MNIST, we can also make use of Binary Cross Entropy as the reconstruction loss.
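Putting the four hyperparameters together, here is a minimal sketch of training a linear autoencoder in plain Python: code size 1, one encoder and one decoder layer, two nodes in and out, and MSE reconstruction loss. The data and initial weights are hypothetical:

```python
import random

random.seed(0)

# Hypothetical 2-D data lying near the line x2 = x1, so a code size of 1 suffices
data = [(t, t + random.gauss(0, 0.05)) for t in [random.uniform(-1, 1) for _ in range(50)]]

we = [0.5, 0.4]   # encoder weights: 2 input nodes -> 1 bottleneck node (hypothetical init)
wd = [0.6, 0.5]   # decoder weights: 1 bottleneck node -> 2 output nodes
lr = 0.05         # learning rate

def reconstruct(x):
    h = we[0] * x[0] + we[1] * x[1]          # bottleneck activation (code size 1)
    return (wd[0] * h, wd[1] * h), h         # reconstruction and code

def mean_loss():                              # MSE reconstruction loss over the data
    return sum(sum((r - xi) ** 2 for r, xi in zip(reconstruct(x)[0], x))
               for x in data) / len(data)

before = mean_loss()
for _ in range(200):                          # plain stochastic gradient descent
    for x in data:
        (r0, r1), h = reconstruct(x)
        d0, d1 = 2 * (r0 - x[0]), 2 * (r1 - x[1])          # dLoss/dReconstruction
        grad_we = [(d0 * wd[0] + d1 * wd[1]) * xi for xi in x]
        grad_wd = [d0 * h, d1 * h]
        we = [w - lr * g for w, g in zip(we, grad_we)]
        wd = [w - lr * g for w, g in zip(wd, grad_wd)]
after = mean_loss()
print(before, after)  # the reconstruction loss drops as the bottleneck learns the data
```

Real autoencoders would use non-linear activations and more nodes per layer, but the structure (encoder, bottleneck, decoder, reconstruction loss) is the same.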
5 types of autoencoders
The idea of autoencoders for neural networks isn't new.

The first applications date to the 1980s. Initially used for dimensionality reduction and
feature learning, an autoencoder concept has evolved over the years and is now widely used for
learning generative models of data.

Here are five popular autoencoders that we will discuss:

1. Undercomplete autoencoders

2. Sparse autoencoders

3. Contractive autoencoders

4. Denoising autoencoders

5. Variational Autoencoders (for generative modelling)

1. Undercomplete Autoencoders
1. An undercomplete autoencoder is one of the simplest types of autoencoders.
2. The way it works is very straightforward—
3. Undercomplete autoencoder takes in an image and tries to predict the same image as
output, thus reconstructing the image from the compressed bottleneck region.
4. Undercomplete autoencoders are truly unsupervised as they do not take any form of
label, the target being the same as the input.
5. The primary use of autoencoders like these is the generation of the latent space or bottleneck, which forms a compressed substitute of the input data and can be easily decompressed back with the help of the network when needed.
6. This form of compression in the data can be modeled as a form of dimensionality
reduction.
2. Sparse Autoencoders

Sparse autoencoders are similar to the undercomplete autoencoders in that they use the
same image as input and ground truth. However—

The means via which the encoding of information is regulated is significantly different.

While undercomplete autoencoders are regulated and fine-tuned by regulating the size of the bottleneck, the sparse autoencoder is regulated by changing the number of nodes at each hidden layer.

Since it is not possible to design a neural network that has a flexible number of nodes at
its hidden layers, sparse autoencoders work by penalizing the activation of some neurons in
hidden layers.

In other words, the loss function has a term that calculates the number of neurons that
have been activated and provides a penalty that is directly proportional to that.

This penalty, called the sparsity function, prevents the neural network from activating
more neurons and serves as a regularizer.

While typical regularizers work by creating a penalty on the size of the weights at the
nodes, sparsity regularizer works by creating a penalty on the number of nodes activated.

This form of regularization allows the network to have nodes in hidden layers dedicated to finding specific features in images during training, treating the regularization problem as separate from the latent space problem.

We can thus set latent space dimensionality at the bottleneck without worrying about
regularization.

There are two primary ways in which the sparsity regularizer term can be incorporated into
the loss function.

1. L1 Loss: Here, we add the magnitude of the sparsity regularizer as we do for general regularizers:
2. KL-Divergence: In this case, we consider the activations over a collection of samples at
once rather than summing them as in the L1 Loss method. We constrain the average
activation of each neuron over this collection.

Considering the ideal distribution to be a Bernoulli distribution with rate ρ, we include a KL divergence term within the loss to reduce the difference between the current distribution of the activations and the ideal (Bernoulli) distribution:

Σⱼ [ ρ log(ρ/ρ̂ⱼ) + (1 − ρ) log((1 − ρ)/(1 − ρ̂ⱼ)) ]

where ρ̂ⱼ is the average activation of hidden neuron j.
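Both sparsity terms can be sketched as simple functions; the activation values, the target rate ρ, and the weighting factors are hypothetical:

```python
import math

def l1_sparsity(activations, lam):
    """L1 sparsity penalty: lam times the sum of absolute activation values."""
    return lam * sum(abs(a) for a in activations)

def kl_sparsity(mean_activations, rho, beta):
    """KL-divergence penalty against a target Bernoulli activation rate rho;
    each mean activation must lie strictly between 0 and 1."""
    return beta * sum(rho * math.log(rho / a) + (1 - rho) * math.log((1 - rho) / (1 - a))
                      for a in mean_activations)

acts = [0.02, 0.9, 0.05, 0.01]                        # activations of four hidden neurons
print(l1_sparsity(acts, lam=0.1))                     # grows with the total activation
print(kl_sparsity([0.05, 0.05], rho=0.05, beta=3.0))  # zero when the average matches rho
```

Either term is added to the reconstruction loss during training; the network then learns to keep most hidden activations near zero.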

3. Contractive Autoencoders

Similar to other autoencoders, contractive autoencoders perform the task of learning a representation of the image while passing it through a bottleneck and reconstructing it in the decoder.

The contractive autoencoder also has a regularization term to prevent the network from learning the identity function and simply mapping the input onto the output.

Contractive autoencoders work on the basis that similar inputs should have similar
encodings and a similar latent space representation. It means that the latent space should not vary
by a huge amount for minor variations in the input.

To train a model that works along with this constraint, we have to ensure that the
derivatives of the hidden layer activations are small with respect to the input.

Mathematically:

‖∇ₓh(x)‖²_F (the squared Frobenius norm of the Jacobian of the hidden activations with respect to the input)

should be as small as possible.


An important thing to note about the loss function (formed from the norm of the derivatives and the reconstruction loss) is that the two terms contradict each other.

While the reconstruction loss wants the model to tell the difference between two inputs and observe variations in the data, the Frobenius norm of the derivatives says that the model should be able to ignore variations in the input data.

Putting these two contradictory conditions into one loss function enables us to train a network whose hidden layers capture only the most essential information. This information is necessary to separate images and ignore information that is non-discriminatory in nature, and therefore not important.

The total loss function can be mathematically expressed as:

L = ‖x − x̂‖² + λ‖∇ₓh(x)‖²_F

The gradient is summed over all training samples, and a Frobenius norm of the same is taken.

4. Denoising Autoencoders

Denoising autoencoders, as the name suggests, are autoencoders that remove noise from
an image. As opposed to autoencoders we’ve already covered, this is the first of its kind that
does not have the input image as its ground truth.

In denoising autoencoders, we feed a noisy version of the image, where noise has been
added via digital alterations. The noisy image is fed to the encoder-decoder architecture, and the
output is compared with the ground truth image.
The denoising autoencoder gets rid of noise by learning a representation of the input
where the noise can be filtered out easily.

While removing noise directly from the image seems difficult, the autoencoder performs
this by mapping the input data into a lower-dimensional manifold (like in undercomplete
autoencoders), where filtering of noise becomes much easier.

Essentially, denoising autoencoders work with the help of non-linear dimensionality reduction. The loss function generally used in these types of networks is L2 or L1 loss.

5. Variational Autoencoders

Standard and variational autoencoders learn to represent the input just in a compressed
form called the latent space or the bottleneck.

Therefore, the latent space formed after training the model is not necessarily continuous
and, in effect, might not be easy to interpolate.

For example—

This is what a standard autoencoder would learn from the input:

While these attributes explain the image and can be used in reconstructing the image
from the compressed latent space, they do not allow the latent attributes to be expressed in a
probabilistic fashion.

Variational autoencoders deal with this specific topic and express their latent attributes as
a probability distribution, leading to the formation of a continuous latent space that can be easily
sampled and interpolated.

When fed the same input, a variational autoencoder would construct latent attributes in
the following manner:
The latent attributes are then sampled from the latent distribution formed and fed to the
decoder, reconstructing the input.

The motivation behind expressing the latent attributes as a probability distribution can be
very easily understood via statistical expressions.
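One common way this probabilistic sampling is implemented is the reparameterization trick, sketched below with made-up numbers: the encoder is assumed to output a mean and log-variance per latent attribute, and the latent vector is sampled as mu + sigma * eps.

```python
import numpy as np

rng = np.random.default_rng(0)

# Suppose the encoder outputs, for one input, a mean and log-variance
# for each of 4 latent attributes (the values here are made up).
mu = np.array([0.5, -1.0, 0.0, 2.0])
log_var = np.array([-2.0, -1.0, 0.0, -3.0])

# Reparameterization trick: sample z = mu + sigma * eps with eps ~ N(0, 1),
# keeping the sampling step differentiable with respect to mu and log_var.
sigma = np.exp(0.5 * log_var)
eps = rng.standard_normal(mu.shape)
z = mu + sigma * eps   # sampled latent attributes, fed to the decoder

# KL divergence between N(mu, sigma^2) and the standard normal prior --
# the regularizer that keeps the latent space continuous and easy to sample.
kl = -0.5 * np.sum(1 + log_var - mu ** 2 - np.exp(log_var))
print(z.shape, kl)
```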

Applications of autoencoders

1. Dimensionality reduction
2. Image denoising
3. Generation of image and time series data
4. Anomaly Detection

Deep Neural Networks: Difficulty of training deep neural networks, Greedy layer wise training.

Difficulty in Training Deep Neural Networks

Training deep learning models is a crucial part of applying this powerful technology to a wide range of
tasks. However, training a model involves many challenges: from overfitting and underfitting to slow
convergence and vanishing gradients, many factors can impact the performance and reliability of a deep
learning model. Understanding these issues and how to mitigate them makes it possible to achieve better
results and more robust models.

Network Compression

There is an increasing demand for computing power and storage. With that in mind, it is important to build
higher-efficiency models optimized for more performance with fewer computations. This is where
compression comes in, giving a better performance-to-computation ratio. A few methods for network
compression include:

 Parameter Pruning And Sharing - Reducing redundant parameters which do not affect the
performance.
 Low-Rank Factorisation - Matrix decomposition to obtain the informative parameters of a CNN.
 Compact Convolutional Filters - A special Kernel with reduced parameters to save storage and
computation space.
 Knowledge Distillation - Train a compact model to reproduce a complex one.

Pruning

Pruning is the method of reducing the number of parameters by removing redundant or insensitive
neurons. There are two methods of pruning:

 Pruning by weights involves removing individual weights from the network that are found to be
unnecessary or redundant. This can be done using a variety of methods, such as setting small
weights to zero, using magnitude-based pruning, or using functional pruning. Pruning by weights
can help to reduce the size of the network and improve its efficiency, but it can also reduce the
capacity of the network and may lead to a loss of performance. This keeps the architecture of the
model the same.

 Pruning by neurons involves removing entire neurons or groups of neurons from the network that
are found to be unnecessary or redundant. This can be done using a variety of methods, such as
using importance scores to identify and remove less important neurons or using evolutionary
algorithms to evolve smaller networks.
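A minimal sketch of magnitude-based weight pruning, one of the methods mentioned above (the layer shape and sparsity level are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# A weight matrix of a single layer (shape and values are illustrative).
weights = rng.standard_normal((8, 8))

def magnitude_prune(w, sparsity):
    """Set the smallest-magnitude fraction `sparsity` of weights to zero.

    The architecture is unchanged -- pruned weights are simply zeroed out.
    """
    threshold = np.quantile(np.abs(w), sparsity)
    mask = np.abs(w) >= threshold
    return w * mask, mask

pruned, mask = magnitude_prune(weights, sparsity=0.5)

# Roughly half of the weights are now zero; the shape is unchanged.
print(pruned.shape, np.mean(pruned == 0))
```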

Reduce the Scope of Data Values

In a neural network, the weights, biases, and other parameters are typically initialized as 32-bit
floating-point values. 32-bit variables add precision to the model by allowing more digits after the decimal
point. But in practical applications, reducing the precision from 32-bit to 16-bit floating point often does
not change the model's output.

Suppose we have a simple neural network with one input, one hidden layer with two neurons, and one
output. The weights and biases of the network are initialized using 32-bit floating point values. The
network is trained on a dataset and can achieve a certain level of accuracy.
Now, we want to improve the efficiency of the network by reducing the precision of the weights and biases
from 32-bit to 16-bit floating point values. We can do this by simply casting the 32-bit values to 16-bit
values and using them to initialize the network.

During training, we find that the network can achieve the same level of accuracy with the reduced
precision weights and biases as it did with the full precision weights and biases. This means that we were
able to improve the efficiency of the network without sacrificing performance.

Overall, reducing the precision of the weights and biases in a neural network can be a useful technique for
improving efficiency without sacrificing performance, but it is important to consider the trade-offs involved
and to carefully test the impact of reduced precision on the final model.
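The idea can be illustrated directly in NumPy (all sizes are arbitrary): casting the weights from float32 to float16 halves their storage while barely changing the layer's output.

```python
import numpy as np

rng = np.random.default_rng(0)

# 32-bit weights of a linear layer and a batch of inputs.
w32 = rng.standard_normal((64, 10)).astype(np.float32)
x = rng.standard_normal((32, 64)).astype(np.float32)

# Cast the weights down to 16-bit floats: half the storage per parameter.
w16 = w32.astype(np.float16)

out32 = x @ w32
out16 = x @ w16.astype(np.float32)   # compute with reduced-precision weights

# The outputs agree closely: reduced precision barely changes the result here.
max_diff = np.max(np.abs(out32 - out16))
print(w16.nbytes, w32.nbytes, max_diff)
```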

Bilinear CNN

A Bi-Linear CNN architecture solves the problem of image recognition in fine-grained image datasets.

A Bilinear CNN contains 2(or sometimes more) CNN feature extractors, which identify different features.
The different feature extractors are combined into a bilinear vector to find the relationship between the
different features, which is then passed through a classifier to obtain the results. For example, if the task
were to recognize a bird, one feature extractor might identify the tail while the other identifies the beak.
The two then come together at the bilinear vector to infer whether the image presented is a bird.
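A rough sketch of how the bilinear vector is formed, assuming the two extractors have already produced their feature vectors (sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Feature vectors from two CNN feature extractors for the same image
# (e.g. one attending to the tail, one to the beak; sizes are illustrative).
features_a = rng.random(16)
features_b = rng.random(16)

# Bilinear vector: outer product of the two feature vectors, flattened.
# Every pairwise interaction between the two extractors' features is kept.
bilinear_vector = np.outer(features_a, features_b).flatten()

# This 256-dimensional vector is what gets passed to the classifier.
print(bilinear_vector.shape)
```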

Vanishing/Exploding Gradients Problems


The vanishing/exploding gradient problem occurs during backpropagation. In backpropagation, we compute
the gradients of the loss function with respect to the weights. The algorithm requires multiplying values
together numerous times, once for each layer. If these values are small (less than 1), the repeated
multiplication drives the gradients smaller and smaller as they flow backward, until they become almost
zero. This is the vanishing gradient problem.

The exploding gradient problem is similar, but with very large values. If the multiplied values are greater
than 1, the gradients grow bigger and bigger as they propagate backward, making training difficult, and
sometimes impossible, to compute.

A common reason for the vanishing gradient is the use of the Tanh and Sigmoid activation functions. The
Tanh activation function outputs values between -1 and 1, while the Sigmoid activation function outputs
values between 0 and 1. Because these functions squash every input into a small range, their derivatives
are small, which creates the vanishing gradient problem.
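A toy numerical illustration of why the repeated per-layer multiplication matters:

```python
# In backpropagation the gradient at an early layer contains a product with
# one factor per layer. Repeatedly multiplying values below 1 drives the
# product toward 0 (vanishing); values above 1 blow it up (exploding).
small_factor, large_factor = 0.5, 1.5
depth = 30   # a 30-layer network

vanishing = small_factor ** depth   # ~9.3e-10: effectively zero
exploding = large_factor ** depth   # ~1.9e+05: numerically unwieldy

print(vanishing, exploding)
```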

Non-saturating Activation Function

One problem that can occur with activation functions is saturation: the function's output flattens out near
its extreme values, so its gradient approaches zero there. This can lead to several issues, such as
the inability of the network to learn, poor generalization, and slow convergence.

Several common activation functions can suffer from saturation, including the sigmoid and the tanh
activation functions. These activation functions saturate when the input is very large or very small; in
those regions their gradients are nearly zero. This can make it difficult for the network to learn and can
lead to slow convergence.
Suppose we have a simple neural network with one input, one hidden layer with two neurons, and one
output. The hidden layer uses the sigmoid activation function, which is defined as:

σ(x) = 1 / (1 + e^(−x))

where x is the input to the activation function.

The input to the first hidden neuron is -5, and the input to the second hidden neuron is 5. The output of the
first hidden neuron will be very close to 0 (since σ(−5) ≈ 0.0067), and the output of the second
hidden neuron will be very close to 1 (since σ(5) ≈ 0.9933).

This means that the sigmoid activation function has saturated for both of these input values, and the neural
network will not be able to learn effectively because the gradient of the activation function will be very
close to 0 for these inputs. This can lead to slow convergence and poor generalization.

Using a non-saturating activation function, such as the ReLU, whose gradient does not shrink toward zero
for large positive inputs, can help to avoid this issue and improve the performance of the neural network.
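The saturation in the example above can be checked numerically: the sigmoid's gradient at x = ±5 is nearly zero, while ReLU's gradient for any positive input stays at 1.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1.0 - s)   # derivative of the sigmoid

# The two hidden-neuron inputs from the example: -5 and 5.
print(sigmoid(-5.0), sigmoid(5.0))            # ~0.0067 and ~0.9933: saturated
print(sigmoid_grad(-5.0), sigmoid_grad(5.0))  # ~0.0066: gradient near zero

# ReLU does not saturate for positive inputs: its gradient stays at 1.
relu_grad = 1.0 if 5.0 > 0 else 0.0
print(relu_grad)
```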

Batch Normalization
Batch Normalization is a method of normalizing the data after every activation function. Normalizing
means bringing every value of the data within the same range. Keeping the range of the data small leads to
faster training. It also helps with the problem of internal covariate shift: during backpropagation, each
neuron tries to minimize its loss with respect to the result obtained from the previous layer, but the result
of the previous layer might change in the next iteration, and this problem gets amplified in deeper layers,
so it feels like chasing a moving target. This problem is referred to as the internal covariate shift, and
Batch Normalization helps minimize it.
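A minimal sketch of the normalization step for one mini-batch (training-time statistics only; the running averages used at inference are omitted, and the batch size and feature count are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Activations for a mini-batch: 32 examples, 4 features.
x = 5.0 + 3.0 * rng.standard_normal((32, 4))

# Batch Normalization: normalize each feature over the batch, then apply the
# learnable scale (gamma) and shift (beta).
eps = 1e-5
mean = x.mean(axis=0)
var = x.var(axis=0)
x_hat = (x - mean) / np.sqrt(var + eps)

gamma, beta = np.ones(4), np.zeros(4)   # learnable parameters, here identity
y = gamma * x_hat + beta

# Every feature of the normalized batch now has mean ~0 and variance ~1.
print(y.mean(axis=0), y.var(axis=0))
```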

Gradient Clipping

Gradient clipping is a very useful technique to overcome the exploding gradient problem. In this method,
the gradients are limited to a threshold, ensuring they do not exceed that value. Keeping the gradients in
check helps to escape the exploding gradient problem. It is especially useful in recurrent networks such as
LSTMs, where exploding gradients can occur. There are two methods to implement gradient clipping.

 Clipping by value: The gradients are given a min and max value. If a gradient exceeds these
bounds, the bound value is used instead.
 Norm clipping: The whole gradient vector is rescaled whenever its norm exceeds a threshold, so
that it always stays below the norm value.
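Both methods can be sketched in a few lines (the gradient values and thresholds are arbitrary):

```python
import numpy as np

gradients = np.array([0.9, -4.0, 2.5, -0.3])

# Clipping by value: every component is forced into [-1, 1].
clipped_by_value = np.clip(gradients, -1.0, 1.0)

# Norm clipping: if the gradient's L2 norm exceeds the threshold,
# rescale the whole vector so its norm equals the threshold.
max_norm = 2.0
norm = np.linalg.norm(gradients)
clipped_by_norm = gradients * (max_norm / norm) if norm > max_norm else gradients

print(clipped_by_value)                  # components clipped into [-1, 1]
print(np.linalg.norm(clipped_by_norm))   # ~2.0
```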
Greedy Layerwise Pre-training
1. To solve the vanishing gradient problem we use the Greedy Layerwise Pre-training
technique.
2. Greedy Layerwise Pre-training is called greedy because it uses a greedy algorithm.
Greedy algorithms break a problem into many components, then solve for the
optimal version of each component in isolation.
3. Let us see the mechanism of the greedy layer-wise pretraining method.
4. First, we make a base model of the input and output layer; later, we train the model using
the available dataset.
5. After training the model, we remove the output layer and store it in another variable.
Add a new hidden layer in the model that will be the first hidden layer of the model and
re-add the output layer in the model.
6. Now there are three layers in the model: the input layer, hidden layer 1, and the output
layer. Once again, train the model after inserting hidden layer 1. To add one more
hidden layer, remove the output layer and set all existing layers as non-trainable (no
further change in the weights of the input layer and hidden layer 1).
7. Now insert the new hidden layer2 in the model and re-add the output layer. Train the
model after inserting the new hidden layer
8. The model structure will be in the following order, input layer, hidden layer1, hidden
layer2, output layer.
9. Repeat the above steps for every new hidden layer you want to add. (each time you
insert a new hidden layer, perform training on the model using the same dataset)
Fig. Training a 4-layer network
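The steps above can be sketched in NumPy. As a stand-in for gradient-based training, the output layer here is fit by least squares; hidden layers keep fixed random weights once added, mirroring the "non-trainable" freezing of earlier layers. All sizes and names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data.
X, Y = rng.standard_normal((200, 10)), rng.standard_normal((200, 1))

def fit_output_layer(features, targets):
    """Stand-in for 'train the model': least-squares output weights."""
    return np.linalg.lstsq(features, targets, rcond=None)[0]

# Step 1: base model -- input layer connected directly to the output layer.
representation = X
W_out = fit_output_layer(representation, Y)

frozen_layers = []
for _ in range(2):   # add hidden layer 1, then hidden layer 2
    # Remove the output layer and insert a new hidden layer on top of the
    # (now frozen) earlier layers...
    W_hidden = rng.standard_normal((representation.shape[1], 16)) * 0.3
    frozen_layers.append(W_hidden)   # frozen: never changed afterwards
    representation = np.maximum(representation @ W_hidden, 0)  # ReLU layer
    # ...then re-add the output layer and train again on the same dataset.
    W_out = fit_output_layer(representation, Y)

predictions = representation @ W_out
print(len(frozen_layers), predictions.shape)
```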

Greedy layer-wise unsupervised pretraining algorithm

Greedy Layer-Wise Unsupervised Pretraining relies on a single-layer representation-learning
algorithm.
Each layer is pretrained using unsupervised learning, taking the output of the previous layer and
producing as output a new representation of the data, whose distribution is hopefully simpler.

Algorithm: Greedy Layer-wise Unsupervised Pretraining

1. Given an unsupervised feature-learning algorithm L, which takes as input a training set of examples
and returns an encoder or feature function f
2. The raw input data is X, with one row per example; f(1)(X) is the output of the first-stage encoder on X
3. In the case where fine-tuning is performed, we use a learner T which takes an initial function f and
input examples X (and, in the supervised fine-tuning case, associated targets Y) and returns a
tuned function. The number of stages is m
Greedy Layer-wise Unsupervised Pretraining algorithm
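A minimal sketch of the algorithm, using a PCA-style projection as a stand-in for the single-layer learner L (a real implementation would typically use an RBM or a single-layer autoencoder); each stage takes the previous stage's output and produces a new, lower-dimensional representation:

```python
import numpy as np

rng = np.random.default_rng(0)

def learn_encoder(data, k):
    """Unsupervised single-layer learner L: returns a PCA-style encoder f
    (a stand-in for an RBM or a single-layer autoencoder)."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    components = vt[:k]
    return lambda x: (x - data.mean(axis=0)) @ components.T

X = rng.standard_normal((100, 32))   # raw input data, one row per example

m = 3                                # number of stages
widths = [16, 8, 4]
encoders, rep = [], X
for k in widths:
    f = learn_encoder(rep, k)        # pretrain this layer unsupervised...
    encoders.append(f)
    rep = f(rep)                     # ...and feed its output to the next stage

print(rep.shape)   # final representation after m pretrained stages
```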

Transfer Learning and Domain Adaption


Transfer Learning is a research problem in machine learning that focuses on storing knowledge
gained while solving one problem and applying it to a different but related problem. For
example, knowledge gained while learning to recognize cars could apply when trying to
recognize trucks.

Transfer learning, used in machine learning, is the reuse of a pre-trained model on a new
problem. In transfer learning, a machine exploits the knowledge gained from a previous task to
improve generalization on another task.

• Conventional machine learning and deep learning algorithms have traditionally been
designed to work in isolation: these algorithms are trained to solve specific tasks, and the
models have to be rebuilt from scratch once the feature-space distribution changes.
• Traditional learning is isolated and occurs purely based on specific tasks and datasets,
training separate isolated models on them. No knowledge is retained that could be transferred
from one model to another.

• Transfer learning is the idea of overcoming the isolated learning paradigm and utilizing
knowledge acquired for one task to solve related ones.

• In transfer learning, you can leverage knowledge (features, weights etc) from previously
trained models for training newer models and even tackle problems like having less data for
the newer task

Transfer learning to overcome the limitations of traditional machine learning models.

1. Traditional machine learning models require training from scratch, which is
computationally expensive and requires a large amount of data to achieve high
performance. On the other hand, transfer learning is computationally efficient and helps
achieve better results using a small data set.
2. Transfer learning models achieve optimal performance faster than traditional ML models,
because models that leverage knowledge (features, weights, etc.) from previously
trained models already understand those features.
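A minimal sketch of the reuse idea: a frozen "pretrained" feature extractor (random weights here, standing in for genuinely pretrained ones) is kept fixed, and only a small new head is fit on the target task's few labeled examples. All names and sizes are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights come from a model pretrained on a large source task
# (e.g. recognizing cars); here they are an illustrative stand-in.
pretrained_features = rng.standard_normal((20, 8)) * 0.3

def extract(x):
    """Frozen feature extractor reused from the pretrained model."""
    return np.maximum(x @ pretrained_features, 0)

# Small target-task dataset (e.g. trucks): only 15 labeled examples.
X_new, y_new = rng.standard_normal((15, 20)), rng.integers(0, 2, 15)

# Transfer learning: keep the pretrained features, train only a new head.
features = extract(X_new)
head = np.linalg.lstsq(features, y_new.astype(float), rcond=None)[0]

predictions = (extract(X_new) @ head > 0.5).astype(int)
print(predictions.shape)
```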

Transfer Learning Strategies


There are different transfer learning strategies and techniques, which can be applied
based on the domain, task at hand, and the availability of data.
1. Inductive Transfer learning
2. Unsupervised Transfer Learning
3. Transductive Transfer Learning

Inductive Transfer learning:


• In this scenario, the source and target domains are the same, yet the source and target
tasks are different from each other.
• The algorithms try to utilize the inductive biases of the source domain to help improve
the target task.
• Depending upon whether the source domain contains labeled data or not, this can be
further divided into two subcategories, similar to multitask learning and self-taught
learning, respectively.

Unsupervised Transfer Learning:


• This setting is similar to inductive transfer itself, with a focus on unsupervised tasks in
the target domain.
• The source and target domains are similar, but the tasks are different.

Transductive Transfer Learning:


• In this scenario, there are similarities between the source and target tasks, but the
corresponding domains are different.
• In this setting, the source domain has a lot of labeled data, while the target domain
has none.

Types of Deep Transfer Learning

Multitask learning

Multitask learning is a slightly different flavor of the transfer learning world.

• In the case of multitask learning, several tasks are learned simultaneously without
distinction between the source and targets.

• In this case, the learner receives information about multiple tasks at once, as compared to
transfer learning, where the learner initially has no idea about the target task.

• This helps to transfer knowledge from each scenario and develop a rich combined feature
vector from all the varied scenarios of the same domain. The learner optimizes the
learning/performance across all of the n tasks through some shared knowledge.
One-shot Learning

• Deep learning systems are data-hungry by nature, such that they need many training
examples to learn the weights. This is one of the limiting aspects of deep neural
networks.

• One-shot learning is a variant of transfer learning, where we try to infer the required
output based on just one or a few training examples. This is essentially helpful in real-
world scenarios where it is not possible to have labeled data for every possible class.

Zero-shot Learning

• Zero-shot learning is another extreme variant of transfer learning, which relies on no
labeled examples to learn a task.

• Zero-data learning or zero-shot learning methods make clever adjustments during the
training stage itself to exploit additional information to understand unseen data.

Distributed Representation
1. The concept of distributed representations is often central to deep learning, particularly as
it applies to natural language tasks.
2. Distributed representations are a principled way of representing entities (say, cats
or dogs) in terms of vectors.
3. Entities sharing common properties have vector representations that are nearer to
each other.

To examine different types of representation, we can do a simple thought exercise.


Let’s say we have a bunch of “memory units” to store information about shapes.
We can choose to represent each individual shape with a single memory unit, as
demonstrated in Figure 1.
This non-distributed representation, referred to as “sparse” or “local,” is inefficient in
multiple ways.
First, the dimensionality of our representation will grow as the number of shapes we observe
grows. More importantly, it doesn’t provide any information about how these shapes relate to
each other.
This is the true value of a distributed representation: its ability to capture meaningful “semantic
similarity” between data through concepts.

Figure 1. Sparse or local, non-distributed representation of shapes

Figure 2. Shows a distributed representation of this same set of shapes where information about
the shape is represented with multiple “memory units” for concepts related to orientation and
shape.

Now the “memory units” contain information both about an individual shape and how each
shape relates to each other.
Figure 2. Distributed representation of shapes

When we come across a new shape with our distributed representation, such as the circle in
Figure 3, we don’t increase the dimensionality and we also know some information about the
circle, as it relates to the other shapes, even though we haven’t seen it before.

Figure 3. Distributed representation of a circle; this representation is more useful as it provides us with
information about how this new shape is related to our other shapes.

While this shape example is oversimplified, it serves as a great high-level, abstract introduction
to distributed representations.

Notice, in the case of our distributed representation for shapes, that we selected four concepts or
features (vertical, horizontal, rectangle, ellipse) for our representation.
In this case, we were required to know what these important and distinguishing features were
beforehand, and in many cases, this is a difficult or impossible thing to know. It is for this reason
that feature engineering is such a crucial task in classical machine learning techniques.

Finding a good representation of our data is critical to the success of downstream tasks like
classification or clustering. One of the reasons that deep learning has seen tremendous success is
neural networks' ability to learn rich distributed representations of data.

Example
Distributed representation is a principled way of representing entities (say, cats or dogs) in terms of
vectors.

Entities sharing common properties have vector representations that are nearer to each other.

Numeric Representations and the Role of Distributed representation

The input and output of the Machine Learning (ML) models are often numeric. This requires
finding a suitable numeric representation of text. Consider that the following sentences are used
to train an ML model.

 He is a King.
 King is a man.
 Queen is a woman.
 She is a Queen.
 King and Queen are rulers.

For the words to be fed as an input to the model, it needs a mathematical representation. One-Hot
encoding of words to vectors is one way to get this representation. The dimension of each vector
is equal to the number of unique words in all the sentences. This collection of unique words is
referred to as vocabulary.

Using one-hot representation, we have,


King represented as [0 1 0 0 0 0 0] and

Queen represented as [0 0 0 1 0 0 0].

This is an example of local representation of words if the vocabulary of the above sentences
considers only nouns and pronouns. However, this representation is not very expressive as it
does not capture much information about similar words. For example, King and Queen are
rulers.
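The encodings above can be reproduced with a small helper, assuming the vocabulary is ordered as follows (one ordering consistent with the vectors shown):

```python
# One-hot encoding over the noun/pronoun vocabulary from the sentences above.
vocabulary = ["He", "King", "man", "Queen", "woman", "She", "rulers"]

def one_hot(word):
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec

print(one_hot("King"))   # [0, 1, 0, 0, 0, 0, 0]
print(one_hot("Queen"))  # [0, 0, 0, 1, 0, 0, 0]
```

Note how the two vectors share no non-zero components: nothing in this representation says that King and Queen are related.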

Building Distributed representation with an Example

Let us imagine that we want to express the words “Man”, “Woman”, ”King”, “Queen”, “Ruler”
using 2-D vectors such that they preserve the following semantics:

 King-Man+Woman —> Queen


 King-Man —> Ruler
 Ruler+Woman —> Queen

Note that we have used the standard vector representation of the variable with an overhead
arrow. For example,

King is the vector representation of the word “King”. If the rules of vector arithmetic should
hold, one way to choose vectors satisfying the above rules is as shown below.

Man =[0,1], Woman=[2,1]


King =[1,1], Queen=[3,1]
Ruler =[1,0]

The vector representation of the words above can be visualized in a two-dimensional vector
space as shown below.
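The chosen vectors can be checked directly against the three semantic rules:

```python
import numpy as np

# The 2-D distributed representations chosen above.
man, woman = np.array([0, 1]), np.array([2, 1])
king, queen = np.array([1, 1]), np.array([3, 1])
ruler = np.array([1, 0])

# Verify that the vector arithmetic preserves the intended semantics.
print(np.array_equal(king - man + woman, queen))  # King - Man + Woman -> Queen
print(np.array_equal(king - man, ruler))          # King - Man -> Ruler
print(np.array_equal(ruler + woman, queen))       # Ruler + Woman -> Queen
```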
Variants of CNN: DenseNet
• Densely Connected Convolutional Network

• A DenseNet is a type of convolutional neural network. In the DenseNet architecture, each
layer is connected to every other layer, hence the name Densely Connected
Convolutional Network

• A DenseNet is a type of convolutional neural network that utilises dense connections


between layers, through Dense Blocks, where we connect all layers (with matching
feature-map sizes) directly with each other

• DenseNet was developed specifically to counter the decline in accuracy caused by the
vanishing gradient in deep neural networks. In simpler terms, due to the longer
path between the input layer and the output layer, the information vanishes before
reaching its destination
• DenseNets are the next step on the way to keep increasing the depth of deep
convolutional networks, addressing the problems that arise with CNNs as they go deeper:
the long distance between the input and output layers means the information vanishes
before reaching its destination
Block:
Each layer adds its features on top of the existing feature maps; all feature maps are
concatenated with each other

Transition:

Performs downsampling on the feature maps
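The dense block and transition can be sketched with flattened features in NumPy (a real DenseNet uses convolutions and pooling; this only illustrates the concatenation pattern and channel growth, with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)

def dense_block(x, n_layers, growth_rate):
    """Sketch of a dense block: each layer sees the concatenation of all
    earlier feature maps and contributes `growth_rate` new ones."""
    features = [x]
    for _ in range(n_layers):
        inp = np.concatenate(features, axis=-1)     # concat all feature maps
        w = rng.standard_normal((inp.shape[-1], growth_rate)) * 0.1
        features.append(np.maximum(inp @ w, 0))     # new features from layer
    return np.concatenate(features, axis=-1)

def transition(x, out_channels):
    """Sketch of a transition layer: reduce the concatenated feature maps."""
    w = rng.standard_normal((x.shape[-1], out_channels)) * 0.1
    return x @ w

x = rng.standard_normal((4, 16))               # batch of 4, 16 input channels
out = dense_block(x, n_layers=3, growth_rate=8)
print(out.shape)                               # 16 + 3*8 = 40 channels
print(transition(out, 20).shape)
```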
