Deep Learning: Feedforward Networks Explained
● They receive one or more input signals. These input signals can come from either the
raw data set or from neurons positioned at a previous layer of the neural net.
● They send some output signals to neurons deeper in the neural net through a synapse.
As you can see, a neuron in a deep learning model can have synapses connecting it to more than one neuron in the preceding layer. Each synapse has an associated weight, which determines how much influence the preceding neuron has on the rest of the neural network.
Weights are a very important topic in the field of deep learning because adjusting a
model’s weights is the primary way through which deep learning models are trained. You’ll see
this in practice later on when we build our first neural networks from scratch.
Once a neuron receives its inputs from the neurons in the preceding layer of the model, it adds up each input signal multiplied by its corresponding weight and passes the resulting weighted sum on to an activation function.
The activation function calculates the output value for the neuron. This output value is then
passed on to the next layer of the neural network through another synapse.
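To make this concrete, here is a minimal sketch of a single neuron in Python (the input values, weights, and the choice of a sigmoid activation are illustrative assumptions, not taken from the text):

```python
import math

def neuron_output(inputs, weights, bias):
    """Weighted sum of input signals passed through a sigmoid activation."""
    # Sum each input signal multiplied by its synapse weight, plus a bias term.
    z = sum(x * w for x, w in zip(inputs, weights)) + bias
    # The sigmoid activation squashes the sum into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# A neuron with three incoming synapses (illustrative values).
out = neuron_output([0.5, 0.1, 0.9], [0.4, -0.2, 0.7], bias=0.1)
```

The returned value would then travel through a synapse to neurons in the next layer.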
This serves as a broad overview of deep learning neurons. Do not worry if it was a lot to take in
– we’ll learn much more about neurons in the rest of this tutorial. For now, it’s sufficient for you
to have a high-level understanding of how they are structured in a deep learning model.
2. Perceptron Algorithm
The original Perceptron was designed to take a number of binary inputs, and produce
one binary output (0 or 1).
The idea was to use different weights to represent the importance of each input, and that
the sum of the values should be greater than a threshold value before making a decision
like true or false (0 or 1).
Perceptron Example
Algorithm
Frank Rosenblatt suggested this algorithm:
● Threshold = 1.5
● x1 * w1 = 1 * 0.7 = 0.7
● x2 * w2 = 0 * 0.6 = 0
● x3 * w3 = 1 * 0.5 = 0.5
● x4 * w4 = 0 * 0.3 = 0
● x5 * w5 = 1 * 0.4 = 0.4
● Return true if the sum > 1.5 ("Yes I will go to the Concert")
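The decision above can be sketched in a few lines of Python (the inputs, weights, and threshold are the ones from the example):

```python
def perceptron(inputs, weights, threshold):
    """Return True when the weighted sum of binary inputs exceeds the threshold."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return total > threshold

# Rosenblatt-style decision using the numbers from the example:
# total = 0.7 + 0 + 0.5 + 0 + 0.4 = 1.6 > 1.5, so the answer is True.
decision = perceptron([1, 0, 1, 0, 1], [0.7, 0.6, 0.5, 0.3, 0.4], threshold=1.5)
```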
Search engines, machine translation, and mobile applications all rely on deep learning technologies.
It works by simulating the human brain's ability to identify and create patterns from various types of input.
A feedforward neural network is a key component of this fantastic technology since it aids software
developers with pattern recognition and classification, non-linear regression, and function
approximation.
The data enters the input nodes, travels through the hidden layers, and eventually exits
the output nodes. The network is devoid of links that would allow the information exiting the
output node to be sent back into the network.
A feedforward network models a mapping y = f(x; θ). It learns the value of θ that gives the closest approximation to the underlying function.
Input layer
It contains the neurons that receive the input. The data is subsequently passed on to the next layer. The total number of neurons in the input layer equals the number of features (variables) in the dataset.
Hidden layer
This is the intermediate layer, concealed between the input and output layers. It has a large number of neurons that apply transformations to the inputs and then pass the results on to the output layer.
Output layer
It is the final layer, and its form depends on the model's construction. The output layer represents the predicted feature, since you know the desired outcome.
Neuron weights
Weights describe the strength of the connection between neurons. A weight can take any real value, although weights are commonly initialized to small values, for example between 0 and 1.
The output of a layer can be written as a = f(Wx + b), where:
b = biases
a = output vector
x = input vector
W = weight matrix
The output layer has as many neurons as there are classes. A loss such as cross-entropy is then used to show the difference between the predicted and actual distributions of probabilities.
● Gene regulation and feedforward: a recurring motif found across well-studied gene-regulation networks has been shown to act as a feedforward system for detecting non-transient changes in the environment.
4. Gradient Descent
Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local
minimum/maximum of a given function. This method is commonly used in machine
learning (ML) and deep learning(DL) to minimise a cost/loss function (e.g. in a linear
regression). Due to its importance and ease of implementation, this algorithm is usually taught at
the beginning of almost all machine learning courses.
However, its use is not limited to ML/DL; it is also widely used in areas such as:
● computer games
● mechanical engineering
Function requirements
Gradient descent algorithm does not work for all functions. There are two specific
requirements. A function has to be:
● differentiable
● convex
The next requirement: the function has to be convex. For a univariate function, this means that the line segment connecting any two points on the function's curve lies on or above the curve (it does not cross it). If it does cross, the function has a local minimum which is not a global one.
Mathematically, for two points x₁, x₂ lying on the function's curve, this condition is expressed as:

f(λx₁ + (1 − λ)x₂) ≤ λf(x₁) + (1 − λ)f(x₂)

where λ denotes a point's location on the section line, and its value has to be between 0 (left point) and 1 (right point), e.g. λ = 0.5 means a location in the middle.
For example, for f(x) = x² the second derivative equals 2; because the second derivative is always bigger than 0, the function is strictly convex.
It is also possible to use quasi-convex functions with a gradient descent algorithm. However,
often they have so-called saddle points (called also minimax points) where the algorithm can
get stuck (we will demonstrate it later in the article). An example of a quasi-convex function is:

f(x) = x⁴ − 2x³

with derivatives

f′(x) = 4x³ − 6x² = 2x²(2x − 3)
f″(x) = 12x² − 12x = 12x(x − 1)

The value of the second derivative is zero for x = 0 and x = 1. These locations are called inflection points, places where the curvature changes sign, meaning the function changes from convex to concave or vice versa. By analysing these equations, we conclude that the point x = 0 has both first and second derivative equal to zero, meaning it is a saddle point, and that the point x = 1.5 is a global minimum.
Let’s look at the graph of this function. As calculated before, the saddle point is at x = 0 and the minimum at x = 1.5.
For multivariate functions the most appropriate check if a point is a saddle point is to calculate a
Hessian matrix which involves a bit more complex calculations and is beyond the scope of this
article.
The gradient descent algorithm iteratively calculates the next point using the gradient at the current position, scales it (by the learning rate) and subtracts the obtained value from the current position (makes a step). It subtracts the value because we want to minimise the function (to maximise it, we would add). This process can be written as:

pₙ₊₁ = pₙ − η∇f(pₙ)

There’s an important parameter η which scales the gradient and thus controls the step size. In machine learning, it is called the learning rate and has a strong influence on performance.
● The smaller the learning rate, the longer GD takes to converge; it may reach the maximum number of iterations before reaching the optimum point.
● If the learning rate is too big, the algorithm may fail to converge to the optimal point (jump around) or even diverge completely.
The full algorithm is:
1. choose a starting point (initialisation)
2. calculate the gradient at this point
3. make a scaled step in the opposite direction to the gradient (objective: minimise)
4. repeat points 2 and 3 until one of the stopping criteria is met:
● the maximum number of iterations is reached
● the step size is smaller than the tolerance (due to scaling or a small gradient).
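The loop above can be sketched in Python (the quadratic example function and the hyperparameter values are illustrative assumptions):

```python
def gradient_descent(grad, start, learning_rate=0.1, max_iter=1000, tol=1e-6):
    """Iteratively step opposite to the gradient until the step is tiny."""
    p = start
    for _ in range(max_iter):
        step = learning_rate * grad(p)   # scale the gradient by the learning rate
        p -= step                        # subtract: we are minimising
        if abs(step) < tol:              # stop when the step is below tolerance
            break
    return p

# Minimise f(x) = (x - 3)^2, whose gradient is f'(x) = 2(x - 3).
minimum = gradient_descent(lambda x: 2 * (x - 3), start=0.0)
```

With a learning rate of 0.1 the iterates converge to the minimum at x = 3.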
Consider the following backpropagation neural network example to understand how the algorithm proceeds:
1. Inputs X arrive through the preconnected path.
2. The input is modeled using real weights W. The weights are usually selected randomly.
3. Calculate the output for every neuron from the input layer, through the hidden layers, to the output layer.
4. Calculate the error in the outputs: Error = Actual Output − Desired Output.
5. Travel back from the output layer to the hidden layers to adjust the weights such that the error is decreased.
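The numbered steps can be sketched on a tiny network in Python with NumPy (a minimal sketch: the single training example, layer sizes, learning rate, and sigmoid activation are all illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

x = np.array([[0.5]])                  # step 1: one input example arrives
target = np.array([[0.8]])             # desired output
W1 = rng.normal(size=(1, 2))           # step 2: weights chosen randomly
W2 = rng.normal(size=(2, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(2000):
    h = sigmoid(x @ W1)                # step 3: forward pass, input -> hidden
    out = sigmoid(h @ W2)              #         hidden -> output
    err = out - target                 # step 4: error = actual - desired
    d_out = err * out * (1 - out)      # step 5: propagate the error backward
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= 0.5 * (h.T @ d_out)          # adjust weights to decrease the error
    W1 -= 0.5 * (x.T @ d_h)

final = sigmoid(sigmoid(x @ W1) @ W2)[0, 0]
```

After repeating the forward/backward cycle, the network's output moves close to the desired target.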
Advantages of backpropagation:
● It is a flexible method, as it does not require prior knowledge about the network.
● It does not need any special mention of the features of the function to be learned.
There are two types of backpropagation networks:
● Static back-propagation
● Recurrent backpropagation
Static back-propagation:
It is one kind of backpropagation network which produces a mapping of a static input for
static output. It is useful to solve static classification issues like optical character recognition.
Recurrent Backpropagation:
In recurrent backpropagation, as used in data mining, activations are fed forward until a fixed value is achieved. After that, the error is computed and propagated backward.
The main difference between the two methods is that the mapping is rapid in static back-propagation, while it is nonstatic in recurrent backpropagation.
Backpropagation in a neural network can be explained with the help of a “shoe lace” analogy, in which the tension on each lace plays the role of a weight and the overall discomfort plays the role of the bias.
Disadvantages of backpropagation:
● The backpropagation algorithm, as used in data mining, can be quite sensitive to noisy data.
● You need to use the matrix-based approach for backpropagation instead of mini-batch.
7. Regularization
Regularization is a set of techniques that can prevent overfitting in neural networks and
thus improve the accuracy of a Deep Learning model when facing completely new data from the
problem domain. In this article, we will address the most popular regularization techniques
which are called L1, L2, and dropout.
One of the most important aspects when training neural networks is avoiding overfitting.
What is Regularization?
Simply speaking: regularization refers to a set of different techniques that lower the complexity of a neural network model during training and thus prevent overfitting.
There are three very popular and efficient regularization techniques called L1, L2, and
dropout which we are going to discuss in the following.
L2 Regularization
L2 regularization is the most common of all regularization techniques and is also commonly known as weight decay or Ridge Regression.
During L2 regularization, the loss function of the neural network is extended by a so-called regularization term, called Ω here. The regularization term Ω is defined as the Euclidean norm (or L2 norm) of the weight matrices, which is the sum over all squared weight values of a weight matrix. The regularization term is weighted by the scalar alpha divided by two and added to the regular loss function chosen for the current task. This leads to a new expression for the loss function:

Loss_new = Loss + (α/2) · Σ w²
Alpha is sometimes called the regularization rate and is an additional hyperparameter we introduce into the neural network. Simply speaking, alpha determines how much we regularize our model.
In the next step we can compute the gradient of the new loss function and put the gradient into the update rule for the weights:

w ← w − η(∇Loss + αw)

Some reformulation of the update rule leads to an expression which looks very much like the update rule for the weights during regular gradient descent:

w ← (1 − ηα)w − η∇Loss

The only difference is that by adding the regularization term we introduce an additional subtraction from the current weights (the first term in the equation). In other words, independent of the gradient of the loss function, we make our weights a little bit smaller each time an update is performed.
L1 Regularization
In the case of L1 regularization (also known as Lasso regression), we simply use another regularization term Ω. This term is the sum of the absolute values of the weight parameters in a weight matrix:

Ω = α · Σ |w|
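Both regularization terms can be sketched in a few lines of NumPy (the weight matrix and the value of alpha are illustrative):

```python
import numpy as np

def l2_term(weights, alpha):
    """Omega = (alpha / 2) * sum of squared weights (weight decay / Ridge)."""
    return 0.5 * alpha * np.sum(weights ** 2)

def l1_term(weights, alpha):
    """Omega = alpha * sum of absolute weight values (Lasso)."""
    return alpha * np.sum(np.abs(weights))

W = np.array([[0.5, -1.0],
              [2.0,  0.0]])
reg_l2 = l2_term(W, alpha=0.1)   # 0.05 * (0.25 + 1 + 4 + 0)
reg_l1 = l1_term(W, alpha=0.1)   # 0.1 * (0.5 + 1 + 2 + 0)
```

Either term would be added to the task's regular loss before computing gradients.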
Dropout
In a nutshell, dropout means that during training, each neuron of the neural network gets turned off with some probability P. Let’s look at a visual example.
Assume on the left side we have a feedforward neural network with no dropout. Using
dropout with let’s say a probability of P=0.5 that a random neuron gets turned off during training
would result in a neural network on the right side.
In this case, you can observe that approximately half of the neurons are not active and are
not considered as a part of the neural network. And as you can observe the neural network
becomes simpler.
A simpler version of the neural network results in less complexity that can reduce
overfitting. The deactivation of neurons with a certain probability P is applied at each forward
propagation and weight update step.
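A minimal sketch of dropout in NumPy (this uses the common “inverted dropout” convention of rescaling the surviving activations, which the text does not mention; P = 0.5 as in the example above):

```python
import numpy as np

def dropout(activations, p, rng):
    """Turn each neuron off with probability p during a training pass."""
    keep = rng.random(activations.shape) >= p        # random on/off mask
    # Inverted dropout: scale survivors by 1/(1-p) so the expected value is unchanged.
    return activations * keep / (1.0 - p)

rng = np.random.default_rng(42)
a = np.ones(10)                      # activations of a hidden layer
dropped = dropout(a, p=0.5, rng=rng)
```

Each activation either becomes 0 (the neuron was turned off) or is scaled up; at test time no neurons are dropped.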
8. Autoencoders
What is an autoencoder?
An autoencoder is a type of artificial neural network used to learn data encodings in an
unsupervised manner.
1. Encoder: A module that compresses the train-validate-test set input data into an encoded representation that is typically several orders of magnitude smaller than the input data.
2. Bottleneck: A module that contains the compressed knowledge representation and is therefore the most important part of the network.
3. Decoder: A module that helps the network “decompress” the knowledge representation and reconstructs the data back from its encoded form. The output is then compared with the ground truth.
Encoder
The encoder is a set of convolutional blocks followed by pooling modules that compress
the input to the model into a compact section called the bottleneck.
The bottleneck is followed by the decoder that consists of a series of upsampling modules
to bring the compressed feature back into the form of an image. In the case of simple autoencoders, the output is expected to be the same as the input, with reduced noise.
Bottleneck
The most important part of the neural network, and ironically the smallest one, is the
bottleneck. The bottleneck exists to restrict the flow of information to the decoder from the
encoder, thus allowing only the most vital information to pass through.
Since the bottleneck is designed in such a way that the maximum information possessed
by an image is captured in it, we can say that the bottleneck helps us form a knowledge-
representation of the input.
Thus, the encoder-decoder structure helps us extract the most from an image in the form
of data and establish useful correlations between various inputs within the network.
As a rule of thumb, remember this: the smaller the bottleneck, the lower the risk of overfitting. However, very small bottlenecks restrict the amount of information that can be stored, which increases the chances of important information slipping out through the pooling layers of the encoder.
Decoder
Finally, the decoder is a set of upsampling and convolutional blocks that reconstructs the
bottleneck's output.
Since the input to the decoder is a compressed knowledge representation, the decoder
serves as a “decompressor” and builds back the image from its latent attributes.
1. Code size: The code size, or the size of the bottleneck, is the most important hyperparameter used to tune the autoencoder. The bottleneck size decides how much the data has to be compressed. This can also act as a regularisation term.
2. Number of layers: As with all neural networks, the depth of the encoder and decoder is an important hyperparameter; greater depth adds model complexity, while lower depth is faster to process.
3. Number of nodes per layer: The number of nodes per layer defines the weights we use
per layer. Typically, the number of nodes decreases with each subsequent layer in the
autoencoder as the input to each of these layers becomes smaller across the layers.
4. Reconstruction Loss: The loss function we use to train the autoencoder is highly
dependent on the type of input and output we want the autoencoder to adapt to. If we are
working with image data, the most popular loss functions for reconstruction are MSE
Loss and L1 Loss. In case the inputs and outputs are within the range [0,1], as in
MNIST, we can also make use of Binary Cross Entropy as the reconstruction loss.
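The two reconstruction losses can be sketched in NumPy (the target and reconstruction vectors are illustrative):

```python
import numpy as np

def mse_loss(reconstruction, target):
    """Mean squared error: a common reconstruction loss for real-valued data."""
    return float(np.mean((reconstruction - target) ** 2))

def bce_loss(reconstruction, target, eps=1e-7):
    """Binary cross-entropy: suitable when inputs/outputs lie in [0, 1], as in MNIST."""
    r = np.clip(reconstruction, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-(target * np.log(r) + (1 - target) * np.log(1 - r))))

target = np.array([0.0, 1.0, 1.0, 0.0])
good = np.array([0.1, 0.9, 0.8, 0.2])   # close reconstruction
bad = np.array([0.9, 0.1, 0.2, 0.8])    # poor reconstruction
```

Under either loss, the closer reconstruction scores lower than the poor one, which is what drives training.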
5 types of autoencoders
The idea of autoencoders for neural networks isn't new.
The first applications date to the 1980s. Initially used for dimensionality reduction and
feature learning, an autoencoder concept has evolved over the years and is now widely used for
learning generative models of data.
1. Undercomplete autoencoders
2. Sparse autoencoders
3. Contractive autoencoders
4. Denoising autoencoders
5. Variational autoencoders
1. Undercomplete Autoencoders
1. An undercomplete autoencoder is one of the simplest types of autoencoders.
2. The way it works is very straightforward—
3. Undercomplete autoencoder takes in an image and tries to predict the same image as
output, thus reconstructing the image from the compressed bottleneck region.
4. Undercomplete autoencoders are truly unsupervised as they do not take any form of
label, the target being the same as the input.
5. The primary use of such autoencoders is the generation of the latent space, or bottleneck, which forms a compressed substitute of the input data and can be easily decompressed back with the help of the network when needed.
6. This form of compression in the data can be modeled as a form of dimensionality
reduction.
2. Sparse Autoencoders
Sparse autoencoders are similar to the undercomplete autoencoders in that they use the
same image as input and ground truth. However—
Since it is not possible to design a neural network that has a flexible number of nodes at
its hidden layers, sparse autoencoders work by penalizing the activation of some neurons in
hidden layers.
In other words, the loss function has a term that calculates the number of neurons that
have been activated and provides a penalty that is directly proportional to that.
This penalty, called the sparsity function, prevents the neural network from activating
more neurons and serves as a regularizer.
While typical regularizers work by creating a penalty on the size of the weights at the nodes, the sparsity regularizer works by creating a penalty on the number of nodes activated. This form of regularization allows the network to have nodes in hidden layers dedicated to finding specific features in images during training, and treats the regularization problem as separate from the latent space problem.
We can thus set latent space dimensionality at the bottleneck without worrying about
regularization.
There are two primary ways in which the sparsity regularizer term can be incorporated into
the loss function.
1. L1 Loss: Here, we add the magnitude of the sparsity regularizer to the loss as we do for general regularizers, penalizing the absolute values of the activations:

L = L_reconstruction + λ · Σ |aᵢ|
2. KL-Divergence: In this case, we consider the activations over a collection of samples at
once rather than summing them as in the L1 Loss method. We constrain the average
activation of each neuron over this collection.
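The KL-divergence penalty for a single neuron can be sketched in Python (the target sparsity of 0.05 is an illustrative choice):

```python
import math

def kl_sparsity(rho, rho_hat):
    """KL divergence between target sparsity rho and a neuron's average activation rho_hat."""
    return (rho * math.log(rho / rho_hat)
            + (1 - rho) * math.log((1 - rho) / (1 - rho_hat)))

# The penalty is zero when the average activation matches the target...
on_target = kl_sparsity(0.05, 0.05)
# ...and grows as the neuron fires much more often than we want.
too_active = kl_sparsity(0.05, 0.5)
```

Summing this term over all hidden neurons and adding it to the reconstruction loss constrains the average activation of each neuron.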
3. Contractive Autoencoders
The contractive autoencoder also has a regularization term, which prevents the network from learning the identity function and simply copying the input to the output.
Contractive autoencoders work on the basis that similar inputs should have similar
encodings and a similar latent space representation. It means that the latent space should not vary
by a huge amount for minor variations in the input.
To train a model that works along with this constraint, we have to ensure that the
derivatives of the hidden layer activations are small with respect to the input.
Mathematically:

L = L_reconstruction + λ · ‖∇ₓh(x)‖²_F

where ‖∇ₓh(x)‖²_F is the squared Frobenius norm of the derivatives (the Jacobian) of the hidden layer activations h with respect to the input x. While the reconstruction loss wants the model to tell differences between two inputs and observe variations in the data, the Frobenius norm of the derivatives says that the model should be able to ignore variations in the input data.
Putting these two contradictory conditions into one loss function enables us to train a
network where the hidden layers now capture only the most essential information. This
information is necessary to separate images and ignore information that is non-discriminatory in
nature, and therefore, not important.
The gradient is summed over all training samples, and the Frobenius norm of the result is taken.
4. Denoising Autoencoders
Denoising autoencoders, as the name suggests, are autoencoders that remove noise from
an image. As opposed to autoencoders we’ve already covered, this is the first of its kind that
does not have the input image as its ground truth.
In denoising autoencoders, we feed a noisy version of the image, where noise has been
added via digital alterations. The noisy image is fed to the encoder-decoder architecture, and the
output is compared with the ground truth image.
The denoising autoencoder gets rid of noise by learning a representation of the input
where the noise can be filtered out easily.
While removing noise directly from the image seems difficult, the autoencoder performs
this by mapping the input data into a lower-dimensional manifold (like in undercomplete
autoencoders), where filtering of noise becomes much easier.
5. Variational Autoencoders
Standard and variational autoencoders learn to represent the input just in a compressed
form called the latent space or the bottleneck.
Therefore, the latent space formed after training the model is not necessarily continuous
and, in effect, might not be easy to interpolate.
For example, an autoencoder trained on faces might learn latent attributes such as skin tone, pose, or whether the person is smiling.
While these attributes explain the image and can be used in reconstructing the image
from the compressed latent space, they do not allow the latent attributes to be expressed in a
probabilistic fashion.
Variational autoencoders deal with this specific issue by expressing their latent attributes as a probability distribution, leading to the formation of a continuous latent space that can be easily sampled and interpolated.
When fed the same input, a variational autoencoder would construct each latent attribute as a probability distribution, typically described by a mean and a variance, rather than as a single value.
The latent attributes are then sampled from the latent distribution formed and fed to the
decoder, reconstructing the input.
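The sampling step can be sketched in NumPy using the reparameterisation trick, which the text does not name explicitly (the mean and log-variance values are illustrative):

```python
import numpy as np

def sample_latent(mu, log_var, rng):
    """Reparameterisation trick: z = mu + sigma * eps, with eps ~ N(0, 1)."""
    eps = rng.standard_normal(mu.shape)     # random noise, one value per latent dim
    return mu + np.exp(0.5 * log_var) * eps # sigma = exp(log_var / 2)

rng = np.random.default_rng(0)
mu = np.zeros(2)        # mean of each latent attribute
log_var = np.zeros(2)   # log-variance of each latent attribute (sigma = 1)
z = sample_latent(mu, log_var, rng)
```

The sampled vector z is what gets passed to the decoder; expressing the randomness through eps keeps the mean and variance differentiable for training.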
The motivation behind expressing the latent attributes as a probability distribution can be
very easily understood via statistical expressions.
Applications of autoencoders
1. Dimensionality reduction
2. Image denoising
3. Generation of image and time series data
4. Anomaly Detection
Deep Neural Networks: Difficulty of training deep neural networks, Greedy layer wise training.
Training deep learning models is a crucial part of applying this powerful technology to a wide range of
tasks. However, training a model involves a lot of challenges from overfitting and underfitting to slow
convergence and vanishing gradients; many factors can impact the performance and reliability of a deep
learning model. Understanding these issues and how to mitigate them makes it possible to achieve better
results and more robust models.
Network Compression
There is an increasing demand for computing power and storage. With that in mind, it is important to build higher-efficiency models optimized for more performance with fewer computations. This is where compression comes in, giving a better performance-to-computation ratio. A few methods for network compression include:
Parameter Pruning And Sharing - Reducing redundant parameters which do not affect the
performance.
Low-Rank Factorisation - Matrix decomposition to obtain informative parameters of CNN.
Compact Convolutional Filters - A special Kernel with reduced parameters to save storage and
computation space.
Knowledge Distillation - Train a compact model to reproduce a complex one.
Pruning
Pruning is the method of reducing the number of parameters by removing redundant or insensitive neurons. There are two approaches to pruning:
Pruning by weights involves removing individual weights from the network that are found to be
unnecessary or redundant. This can be done using a variety of methods, such as setting small
weights to zero, using magnitude-based pruning, or using functional pruning. Pruning by weights
can help to reduce the size of the network and improve its efficiency, but it can also reduce the
capacity of the network and may lead to a loss of performance. This keeps the architecture of the
model the same.
Pruning by neurons involves removing entire neurons or groups of neurons from the network that
are found to be unnecessary or redundant. This can be done using a variety of methods, such as
using importance scores to identify and remove less important neurons or using evolutionary
algorithms to evolve smaller networks.
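Magnitude-based weight pruning can be sketched in NumPy (the weight matrix and pruning fraction are illustrative; note that the pruned weights are merely zeroed, so the architecture stays the same):

```python
import numpy as np

def prune_by_magnitude(weights, fraction):
    """Zero out the smallest-magnitude weights; the architecture stays the same."""
    k = int(fraction * weights.size)            # how many weights to remove
    if k == 0:
        return weights.copy()
    # Threshold = k-th smallest absolute value across the whole matrix.
    threshold = np.sort(np.abs(weights), axis=None)[k - 1]
    pruned = weights.copy()
    pruned[np.abs(pruned) <= threshold] = 0.0   # redundant weights set to zero
    return pruned

W = np.array([[0.01, -0.8],
              [0.30, -0.02]])
sparse_W = prune_by_magnitude(W, fraction=0.5)  # removes the two tiniest weights
```

The large weights survive untouched, which is why pruning can shrink a model with little loss of performance.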
Quantization
In a neural network, the weights, biases, and other parameters are initialized as 32-bit values. 32-bit variables add precision to the model by storing more digits after the decimal point. But in practical applications, reducing the precision from 32-bit to 16-bit floating point often does not change the model's output.
Suppose we have a simple neural network with one input, one hidden layer with two neurons, and one
output. The weights and biases of the network are initialized using 32-bit floating point values. The
network is trained on a dataset and can achieve a certain level of accuracy.
Now, we want to improve the efficiency of the network by reducing the precision of the weights and biases
from 32-bit to 16-bit floating point values. We can do this by simply casting the 32-bit values to 16-bit
values and using them to initialize the network.
During training, we find that the network can achieve the same level of accuracy with the reduced
precision weights and biases as it did with the full precision weights and biases. This means that we were
able to improve the efficiency of the network without sacrificing performance.
Overall, reducing the precision of the weights and biases in a neural network can be a useful technique for improving efficiency without sacrificing performance, but it is important to consider the trade-offs involved and to carefully test the impact of reduced precision on the final model.
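The 32-bit to 16-bit cast can be sketched in NumPy (the weight values are illustrative):

```python
import numpy as np

# Full-precision weights, as they would be initialised in a 32-bit network.
w32 = np.array([0.1234567, -0.9876543, 0.5], dtype=np.float32)

# Cast to half precision: each value keeps roughly three decimal digits
# of accuracy, which is often enough to leave the model's outputs unchanged.
w16 = w32.astype(np.float16)

# Rounding error introduced by the cast, and the memory saving.
max_error = float(np.max(np.abs(w32 - w16.astype(np.float32))))
```

The storage halves while the per-weight error stays tiny, illustrating the trade-off discussed above.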
Bilinear CNN
A Bi-Linear CNN architecture solves the problem of image recognition in fine-grained image datasets.
A Bilinear CNN contains two (or sometimes more) CNN feature extractors, which identify different features.
The different feature extractors are combined as a Bilinear Vector to find the relationship between the
different features. This is passed through a classifier to obtain the results. For example, If the task was to
recognize a bird, One feature extractor would identify the tail, while the other would identify the beak.
These two will then come together to infer if the image presented to it is a bird at the bilinear vector.
Vanishing And Exploding Gradients
The vanishing gradient problem occurs when gradients become smaller and smaller as they are propagated backward through many layers, so the early layers receive almost no learning signal. The exploding gradient problem is similar but with very large weight values: if the weights become too big, the backpropagation algorithm makes them bigger and bigger, making it difficult, and sometimes impossible, to compute and train.
A common reason for vanishing gradients is the tanh and sigmoid activation functions. The tanh activation function outputs values between -1 and 1, while the sigmoid activation function outputs values between 0 and 1. These activation functions squash any input into a small range, which gives rise to the vanishing gradient problem.
One problem that can occur with activation functions is saturation, which refers to the activation function being pushed to its extreme output values, where its gradient is nearly zero. This can lead to several issues, such as the inability of the network to learn, poor generalization, and slow convergence.
Several common activation functions can suffer from saturation, including the sigmoid and tanh activation functions. These activation functions saturate when the input to the activation function is very large or very small, although they behave well for intermediate input values. Saturation can make it difficult for the network to learn and can lead to slow convergence.
Suppose we have a simple neural network with one input, one hidden layer with two neurons, and one output. The hidden layer uses the sigmoid activation function, which is defined as:

σ(z) = 1 / (1 + e⁻ᶻ)

The input to the first hidden neuron is -5, and the input to the second hidden neuron is 5. The output of the first hidden neuron will be very close to 0 (since e⁵ in the denominator is very large), and the output of the second hidden neuron will be very close to 1 (since e⁻⁵ is very small).
This means that the sigmoid activation function has saturated for both of these input values, and the neural
network will not be able to learn effectively because the gradient of the activation function will be very
close to 0 for these inputs. This can lead to slow convergence and poor generalization.
Using an activation function that does not saturate for positive inputs, such as the ReLU, can help to avoid this issue and improve the performance of the neural network.
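The saturation effect in the example can be checked numerically, using the fact that the derivative of the sigmoid is σ(z)(1 − σ(z)):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1 - s)

# At the saturated inputs from the example, the gradient nearly vanishes...
grad_neg, grad_pos = sigmoid_grad(-5), sigmoid_grad(5)
# ...while near zero the gradient is at its maximum of 0.25.
grad_mid = sigmoid_grad(0)
```

With gradients this close to zero at the saturated inputs, weight updates flowing through those neurons are tiny, which is exactly the slow-convergence behaviour described above.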
Batch Normalization
Batch Normalization is a method of normalizing the data after every activation function. Normalizing means bringing every value of the data within the same range; keeping the range of the data small translates to faster training. Batch Normalization also helps with the problem of internal covariate shift: during backpropagation, each layer tries to reduce its loss with respect to the outputs it receives from the previous layer, but those outputs may change in the next iteration, and this problem gets amplified in deeper layers. It feels like chasing a moving target. Batch Normalization helps minimize this problem as well.
Gradient Clipping
Gradient clipping is a very useful technique for overcoming the exploding gradient problem. In this method, the gradients are limited to a threshold, ensuring they never exceed that value. Keeping the gradients in check helps to escape the exploding gradient problem. It is especially useful in recurrent networks such as LSTMs, where gradients can grow rapidly as they are propagated back through many time steps. There are two methods of implementing gradient clipping.
Clipping by value: The gradients are given a minimum and maximum value. If a gradient exceeds the bounds, it is replaced by the corresponding bound value.
Norm clipping: The gradient vector is rescaled whenever its norm exceeds a chosen value, so that it always stays at or below that norm.
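Both clipping methods can be sketched in NumPy (the gradient vector and bounds are illustrative):

```python
import numpy as np

def clip_by_value(grads, min_val, max_val):
    """Clamp every gradient component into [min_val, max_val]."""
    return np.clip(grads, min_val, max_val)

def clip_by_norm(grads, max_norm):
    """Rescale the whole gradient vector so its norm never exceeds max_norm."""
    norm = np.linalg.norm(grads)
    if norm > max_norm:
        return grads * (max_norm / norm)
    return grads

g = np.array([3.0, -4.0])                    # gradient with norm 5
clipped_vals = clip_by_value(g, -1.0, 1.0)   # each component clamped to [-1, 1]
clipped_norm = clip_by_norm(g, 1.0)          # whole vector rescaled to norm 1
```

Note that norm clipping preserves the direction of the gradient while value clipping can change it, which is why norm clipping is often preferred for recurrent networks.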
Greedy Layerwise Pre-training
1. To mitigate the vanishing gradient problem, we can use the greedy layer-wise pre-training technique.
2. Greedy layer-wise pre-training is called greedy because it uses a greedy algorithm: greedy algorithms break a problem into many components, then solve for the optimal version of each component in isolation.
3. Let us see the mechanism of the greedy layer-wise pretraining method.
4. First, we make a base model of the input and output layer; later, we train the model using
the available dataset.
5. After training the model, we remove the output layer and store it in another variable.
Add a new hidden layer in the model that will be the first hidden layer of the model and
re-add the output layer in the model.
6. Now there are three layers in the model: the input layer, hidden layer1, and the output layer. Once again, train the model after inserting hidden layer1. To add one more hidden layer, remove the output layer and set all the remaining layers as non-trainable (no further change in the weights of the input layer and hidden layer1).
7. Now insert the new hidden layer2 in the model and re-add the output layer. Train the
model after inserting the new hidden layer
8. The model structure will be in the following order, input layer, hidden layer1, hidden
layer2, output layer.
9. Repeat the above steps for every new hidden layer you want to add. (each time you
insert a new hidden layer, perform training on the model using the same dataset)
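The steps above can be sketched structurally in Python (a minimal sketch with a stand-in Layer class and a stubbed-out training step; a real implementation would use a deep learning framework and actually fit the model at each stage):

```python
class Layer:
    """Stand-in for a real network layer; only tracks trainability."""
    def __init__(self, name):
        self.name = name
        self.trainable = True

def train(model):
    pass  # placeholder: a real version would fit `model` on the dataset

# Step 4: base model of just the input and output layers, then train it.
model = [Layer("input"), Layer("output")]
train(model)

for i in (1, 2):
    output = model.pop()                 # step 5: remove and keep the output layer
    for layer in model:
        layer.trainable = False          # step 6: freeze the layers already trained
    model.append(Layer(f"hidden{i}"))    # insert the new hidden layer...
    model.append(output)                 # ...and re-add the output layer
    train(model)                         # step 7: retrain with the new layer in place

order = [layer.name for layer in model]  # step 8: final layer ordering
```

Only the newest hidden layer and the output layer remain trainable at each stage, which is what makes the procedure greedy.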
Fig. Training a 4-layer network
1. Given an unsupervised feature learning algorithm L, which takes as input a training set of examples and returns an encoder or feature function f.
2. The raw input data is X, with one row per example; f(1)(X) is the output of the first-stage encoder on X.
3. In the case where fine-tuning is performed, we use a learner T which takes an initial function f and input examples X (and, in the supervised fine-tuning case, associated targets Y) and returns a tuned function. The number of stages is m.
Greedy Layer-wise Unsupervised Pretraining algorithm
Transfer learning, used in machine learning, is the reuse of a pre-trained model on a new
problem. In transfer learning, a machine exploits the knowledge gained from a previous task to
improve generalization about another
• Conventional machine learning and deep learning algorithms have traditionally been designed to work in isolation: these algorithms are trained to solve specific tasks. The models have to be rebuilt from scratch once the feature-space distribution changes.
• Traditional learning is isolated and occurs purely based on specific tasks, datasets and
training separate isolated models on them. No knowledge is retained which can be transferred
from one model to another
• Transfer learning is the idea of overcoming the isolated learning paradigm and utilizing
knowledge acquired for one task to solve related ones.
• In transfer learning, you can leverage knowledge (features, weights, etc.) from previously
trained models to train newer models, and even tackle problems such as having less data for
the newer task.
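A minimal sketch of this idea, assuming a hypothetical pretrained feature extractor whose weights stay frozen while only a small new head is trained on the scarce new-task data (random weights stand in for the pretrained ones here):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "pretrained" feature extractor: in real transfer learning
# these weights come from a model trained on a previous task.  They are
# frozen -- never updated below.
W_pre = rng.normal(size=(4, 8))
features = lambda X: np.tanh(X @ W_pre)

# Small new-task dataset: classify whether the inputs sum to a positive value.
X_new = rng.normal(size=(30, 4))
y_new = (X_new.sum(axis=1) > 0).astype(float)

# Train only a logistic-regression head on top of the frozen features.
H = features(X_new)
w, b = np.zeros(8), 0.0
for _ in range(200):
    p = 1.0 / (1.0 + np.exp(-(H @ w + b)))   # sigmoid predictions
    g = p - y_new                            # logistic-loss gradient
    w -= 0.5 * H.T @ g / len(H)
    b -= 0.5 * g.mean()

preds = (H @ w + b) > 0
acc = float((preds == (y_new == 1.0)).mean())
print(acc)
```

Because only the small head is trained, far fewer labeled examples are needed than would be required to fit the whole network from scratch.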
Multitask learning
• In the case of multitask learning, several tasks are learned simultaneously without
distinction between the source and targets.
• In this case, the learner receives information about multiple tasks at once, as compared to
transfer learning, where the learner initially has no idea about the target task.
• This helps to transfer knowledge from each scenario and develop a rich combined feature
vector from all the varied scenarios of the same domain. The learner optimizes the
learning/performance across all of the n tasks through some shared knowledge.
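A minimal sketch of the shared-knowledge idea: one trunk produces the combined feature vector, and each of the n tasks has its own small head. The layer sizes and the two task heads are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)

# Shared trunk: in multitask learning its weights would be optimized
# jointly on all tasks at once.
W_shared = rng.normal(size=(5, 16))

# One small task-specific head per task (hypothetical output sizes).
heads = {"task_a": rng.normal(size=(16, 1)),
         "task_b": rng.normal(size=(16, 3))}

def forward(x, task):
    h = np.tanh(x @ W_shared)   # rich combined feature vector, shared by all tasks
    return h @ heads[task]      # task-specific output

x = rng.normal(size=(1, 5))
out_a = forward(x, "task_a")    # shape (1, 1)
out_b = forward(x, "task_b")    # shape (1, 3)
```

Gradients from every task's loss would flow into `W_shared`, which is how knowledge is transferred across the tasks.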
One-shot Learning
• Deep learning systems are data-hungry by nature: they need many training
examples to learn their weights. This is one of the limiting aspects of deep neural
networks.
• One-shot learning is a variant of transfer learning, where we try to infer the required
output based on just one or a few training examples. This is essentially helpful in real-
world scenarios where it is not possible to have labeled data for every possible class.
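One common way to realize one-shot learning is nearest-neighbour matching in an embedding space: a pretrained encoder maps inputs to vectors, and a query is labeled by its most similar single support example. The sketch below uses a trivial stand-in embedding and hypothetical classes.

```python
import numpy as np

# Stand-in embedding: a real system would use a pretrained encoder here.
embed = lambda x: x / np.linalg.norm(x)

# One labeled support example per class -- the "one shot".
support = {"cat": np.array([1.0, 0.1]),
           "dog": np.array([0.1, 1.0])}
query = np.array([0.9, 0.2])

# Classify the query by cosine similarity to each support embedding.
scores = {c: float(embed(query) @ embed(v)) for c, v in support.items()}
pred = max(scores, key=scores.get)
print(pred)  # -> "cat"
```

No weights are trained on the new classes at all; all the learning lives in the (assumed) pretrained embedding, which is why a single example per class can suffice.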
Zero-shot Learning
• Zero-data learning, or zero-shot learning, methods make clever adjustments during the
training stage itself to exploit additional information to understand unseen data.
Distributed Representation
1. The concept of distributed representations is often central to deep learning, particularly as
it applies to natural language tasks.
2. Distributed representations are a principled way of representing entities (say, cats
or dogs) in terms of vectors.
3. Entities sharing common properties have vector representations that are nearer to
each other.
Figure 2 shows a distributed representation of this same set of shapes, where information about
the shape is represented with multiple “memory units” for concepts related to orientation and
shape.
Now the “memory units” contain information both about an individual shape and how each
shape relates to each other.
Figure 2. Distributed representation of shapes
When we come across a new shape with our distributed representation, such as the circle in
Figure 3, we don’t increase the dimensionality and we also know some information about the
circle, as it relates to the other shapes, even though we haven’t seen it before.
Figure 3. Distributed representation of a circle; this representation is more useful as it provides us with
information about how this new shape is related to our other shapes.
While this shape example is oversimplified, it serves as a great high-level, abstract introduction
to distributed representations.
Notice, in the case of our distributed representation for shapes, that we selected four concepts or
features (vertical, horizontal, rectangle, ellipse) for our representation.
In this case, we were required to know what these important and distinguishing features were
beforehand, and in many cases, this is a difficult or impossible thing to know. It is for this reason
that feature engineering is such a crucial task in classical machine learning techniques.
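The four-feature shape representation described above can be written down directly. The exact vectors below are illustrative choices, not the figure's values; the point is only that shapes sharing properties end up with nearer vectors.

```python
import numpy as np

# Distributed representation over the four hand-chosen features from the
# text: (vertical, horizontal, rectangle, ellipse).
shapes = {
    "vertical rectangle":   np.array([1, 0, 1, 0]),
    "horizontal rectangle": np.array([0, 1, 1, 0]),
    "vertical ellipse":     np.array([1, 0, 0, 1]),
    "horizontal ellipse":   np.array([0, 1, 0, 1]),
    "circle":               np.array([0, 0, 0, 1]),  # new shape, same dimensions
}

# Similar shapes are nearer: the two ellipses share the "ellipse" unit,
# so they are closer to each other than to any rectangle.
dist = lambda a, b: float(np.linalg.norm(shapes[a] - shapes[b]))
print(dist("vertical ellipse", "horizontal ellipse"))
print(dist("vertical ellipse", "horizontal rectangle"))
```

Note also that adding the circle did not require a new dimension, unlike a one-hot (local) representation.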
Finding a good representation of our data is critical to the success of downstream tasks like
classification or clustering. One of the reasons that deep learning has seen tremendous success is
a neural network's ability to learn rich distributed representations of data.
Example
The inputs and outputs of machine learning (ML) models are often numeric. This requires
finding a suitable numeric representation of text. Consider that the following sentences are used
to train an ML model.
He is a King.
King is a man.
Queen is a woman.
She is a Queen.
King and Queen are rulers.
For words to be fed as input to the model, they need a mathematical representation. One-hot
encoding of words to vectors is one way to get this representation. The dimension of each vector
is equal to the number of unique words in all the sentences. This collection of unique words is
referred to as the vocabulary.
This is an example of a local representation of words, if the vocabulary of the above sentences
considers only nouns and pronouns. However, this representation is not very expressive, as it
does not capture much information about similar words. For example, it does not capture that
King and Queen are both rulers.
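A minimal sketch of one-hot encoding for the noun/pronoun vocabulary of the sentences above (the ordering of the vocabulary is an arbitrary choice):

```python
# Vocabulary: the unique nouns and pronouns from the training sentences.
vocab = ["he", "king", "man", "queen", "woman", "she", "rulers"]

def one_hot(word):
    """Return a vector with a 1 at the word's vocabulary index, 0 elsewhere."""
    v = [0] * len(vocab)
    v[vocab.index(word)] = 1
    return v

print(one_hot("king"))   # [0, 1, 0, 0, 0, 0, 0]
```

Each vector has exactly one nonzero entry, and every pair of distinct words is equally far apart, which is why this local representation says nothing about word similarity.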
Let us imagine that we want to express the words “Man”, “Woman”, ”King”, “Queen”, “Ruler”
using 2-D vectors such that they preserve the following semantics:
Note that we have used the standard vector representation of the variable with an overhead
arrow. For example,
King is the vector representation of the word “King”. If the rules of vector arithmetic should
hold, one way to choose vectors satisfying the above rules is as shown below.
The vector representation of the words above can be visualized in a two-dimensional vector
space as shown below.
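Since the figure with the chosen vectors is not reproduced here, the following illustrative 2-D vectors are one hypothetical choice under which the usual vector arithmetic (e.g. King - Man + Woman = Queen) holds; they are not the original figure's values.

```python
import numpy as np

# Hypothetical 2-D word vectors chosen so that the semantics are preserved:
# the second coordinate encodes gender, the first encodes "royalty/rank".
v = {"man":   np.array([1.0,  1.0]),
     "woman": np.array([1.0, -1.0]),
     "king":  np.array([2.0,  1.0]),
     "queen": np.array([2.0, -1.0]),
     "ruler": np.array([2.0,  0.0])}

# King - Man + Woman should land on Queen.
result = v["king"] - v["man"] + v["woman"]
print(result)   # [ 2. -1.] , i.e. the Queen vector
```

Any other assignment preserving the same differences between the vectors would satisfy the rules equally well; only the relative geometry matters.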
Variants of CNN: DenseNet
• Densely Connected Convolutional Network
• DenseNet was developed specifically to counter the decline in accuracy caused by the
vanishing gradient in very deep neural networks. In simpler terms, due to the longer
path between the input layer and the output layer, the information vanishes before
reaching its destination.
• DenseNets are the next step in further increasing the depth of deep convolutional
networks. Problems arise with CNNs when they go deeper: because of the long distance
between the input and output layers, information vanishes before reaching its destination.
DenseNet was developed specifically to recover the accuracy lost to this vanishing-gradient
problem.
Block:
Each layer adds its features on top of the existing feature maps; all feature maps are
concatenated with each other.
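A minimal sketch of a dense block's concatenation pattern, using 1-D feature vectors and hypothetical sizes in place of real convolutional feature maps:

```python
import numpy as np

def dense_block(x, num_layers, growth_rate, rng):
    """DenseNet-style block (sketch): each layer sees ALL previous feature
    maps and its output is concatenated onto them."""
    for _ in range(num_layers):
        W = rng.normal(size=(x.shape[1], growth_rate))
        new_features = np.maximum(0.0, x @ W)        # layer input = every prior map
        x = np.concatenate([x, new_features], axis=1)
    return x

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))                          # batch of 2, 8 input features
out = dense_block(x, num_layers=3, growth_rate=4, rng=rng)
print(out.shape)   # (2, 8 + 3*4) = (2, 20)
```

Because every layer receives the concatenation of all earlier outputs, gradients have short paths back to the input, which is what counters the vanishing described above.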
Transition: