AITools Unit-4
AITools Unit-4
Neural Networks are networks used in Deep Learning that work similar to
the human nervous system.
An artificial neuron is a mathematical function conceived as a model
of biological neurons, a neural network. Artificial neurons are elementary
units in an artificial neural network.
The artificial neuron receives one or more inputs and sums them to produce
an output by applying some activation function.
Fig. Artificial Neuron
For the above general model of artificial neural network, the net input can
be calculated as follows:
yin= (x1.w1+x2.w2+x3.w3…xm.wm) + bias
i.e., yin= ∑imxi.wi+ bias
where, Xi is set of features and Wi is set of weights.
Bias is the information which can impact output without being dependent
on any feature.
The output can be calculated by applying the activation function over the
net input.
Y=F(yin)
Each AN has an internal state, which is called an activation signal. Output
signals, which are produced after combining the input signals and activation
rule, may be sent to other units.
Eg.
Q) Construct a single layer neural network for implementing OR, AND,
NOT gates.
1. XOR:
XOR(A,B) = (A+B)*(AB)|
This sort of a relationship cannot be modeled using a single neuron. Thus
we will use a multi-layer network.
The idea behind using multiple layers is that complex relations can be
broken into simpler functions and combined.
2. XNOR function looks like:
Types of AF:
The Activation Functions can be basically divided into 3 types-
1. Binary step Activation Function
2. Linear Activation Function
3. Non-linear Activation Functions
TanH is also like logistic sigmoid but in better way. The range of the
TanHfunction is from -1 to +1.
TanH is often preferred over the sigmoid neuron because it is zero centred.
The advantage is that the negative inputs will be mapped strongly negative
and the zero inputs will be mapped near zero in tanh graph.
tanh(x) = 2 * sigmoid(2x) - 1
The ReLU is the most used activation function. It is used in almost all
convolution neural networks in hidden layers only.
The ReLU is half rectified(from bottom). f(z) = 0, if z < 0
= z, otherwise
R(z) = max(0,z)
The range is 0 to inf.
Advantages
Avoids vanishing gradient problem.
Computationally efficient—allows the network to converge very
quickly
Non-linear—although it looks like a linear function, ReLU has a
derivative function and allows for backpropagation
Disadvantages
Can only be used with a hidden layer
hard to train on small datasets and need much data for learning non-
linear behavior.
The Dying ReLU problem—when inputs approach zero, or are
negative, the gradient of the function becomes zero, the network
cannot perform backpropagation and cannot learn.
We needed the Leaky ReLU activation function to solve the „Dying ReLU‟
problem.
Leaky ReLU we do not make all negative inputs to zero but to a value near
to zero which solves the major issue of ReLU activation function.
R(z) = max(0.1*z,z)
Advantages
Prevents dying ReLU problem—this variation of ReLU has a small
positive slope in the negative area, so it does enable backpropagation,
even for negative input values
Otherwise like ReLU
Disadvantages
Results not consistent—leaky ReLU does not provide consistent
predictions for negative input values.
3.4 Softmax:
sigma = softmax
zi = input vector
e^{zi}} = standard exponential function for input vector
K = number of classes in the multi-class classifier
e^{zj} = standard exponential function for output vector
e^{zj} = standard exponential function for output vector
Advantages
Able to handle multiple classes only one class in other activation
functions—normalizes the outputs for each class between 0 and 1with the
sum of the probabilities been equal to 1, and divides by their sum, giving the
probability of the input value being in a specific class.
Useful for output neurons—typically Softmax is used only for the output
layer, for neural networks that need to classify inputs into multiple
categories.
These models are called feedforward because information flows through the
function being evaluated from x, through the intermediate computations
used to define f, and finally to the output y. There are no feedback
connections in which outputs of the model are fed back into itself.
Feedforward neural networks are called networks because they are typically
represented by composing together many different functions. The model is
associated with a directed acyclic graph describing how the functions are
composed together.
For example, we might have three functions f (1), f (2), and f (3) connected in a
chain, to form f(x) = f(3)(f (2)(f(1) (x ))). This chain structure is most commonly
used structure of neural networks. In this case, f (1) is called the first layer of
the network called input layer used to feed the input into the network; f (2)
is called the second layer called hidden layer used to train the neural
network, and so on. The final layer of a feedforward network is called the
output layer that provides the output of the network. The overall length of
the chain gives the depth of the model and width of the model is number of
neurons in the input layer. It is from this terminology that the name “deep
learning” arises.
3. Feature engineering:
Feature engineering is the process of transforming raw data into features
that better represent the underlying problem to the predictive models,
resulting in improved model accuracy on unseen data.Feature engineering
turn your inputs into things the algorithm can understand.
5. Execution time
Usually, a deep learning algorithm takes a long time to train. This is
because there are so many parameters in a deep learning algorithm that
training them takes longer than usual. Whereas machine learning
comparatively takes much less time to train, ranging from a few seconds to
a few hours.
This is turn is completely reversed on testing time. At test time, deep
learning algorithm takes much less time to run. Whereas, if you compare it
with k-nearest neighbors (ML algorithm), test time increases on increasing
the size of data. Although this is not applicable on all machine learning
algorithms, as some of them have small testing times too.
6. Interpretability:
Suppose we use deep learning to give automated scoring to essays. The
performance it gives in scoring is quite excellent and is near human
performance. But there‟s is an issue. It does not reveal why it has given that
score. Indeed mathematically you can find out which nodes of a deep neural
network were activated, but we don‟t know what there neurons were
supposed to model and what these layers of neurons were doing collectively.
So we fail to interpret the results.
On the other hand, machine learning algorithms like decision trees
give us crisp rules as to why it chose what it chose, so it is particularly easy
to interpret the reasoning behind it. Therefore, algorithms like decision trees
and linear/logistic regression are primarily used in industry for
interpretability.
Characteristic ML DL
requires less amount of requires large amount
Data dependencies for data for identifying of data for better
Performance rules performance
work on low-end heavily depend on high-
Hardware dependencies machines end machines
Deep learning
algorithms try to learn
high-level features from
features need to be data.
identified by an expert Deep learning reduces
and then hand-coded as the task of developing
per the domain and new feature extractor
Feature engineering data type for every problem.
Break the problem into
Problem Solving parts, finds and Solves the problem end-
approach combines the solution to-end
Takes much small time
for training but may
Execution time Takes more time for take more time for
training and less time testing depending on
for testing the algorithm like KNN
Fails to interpret the Easy to interpret the
Interpretability results results
Q) Explain various applications of Deep Learning.
There are various interesting applications for Deep Learning that made
impossible things before a decade into reality. Some of them are:
1. Color restoration, where a given image in greyscale is automatically
turned into a colored one.
2. Recognizing hand written message.
3. Adding sound to a silent video that matches with the scene taking
place.
4. Self-driving cars
5. Computer Vision: for applications like vehicle number plate
identification and facial recognition.
6. Information Retrieval: for applications like search engines, both text
search, and image search.
7. Marketing: for applications like automated email marketing, target
identification
8. Medical Diagnosis: for applications like cancer identification, anomaly
detection
9. Natural Language Processing: for applications like sentiment analysis,
photo tagging
10. Online Advertising, etc
Algorithm:
1. Initialize the weights and biases.
2. Iteratively repeat the following steps until defined number of times or
threshold value is reached:
i. Calculate network output using forward propagation.
ii. Calculate error between actual and predicted values.
iii. Propagate the error back into the network and update weights
and biases using the equations:
Fig. illustrating BP
Example:
Forward Propagation:
Therefore,
z1 = 0.415 a1 = 0.6023 z2 = 0.9210 a2 = 0.7153
Let us consider,
epochs = 1000 threshold = 0.001
learning rate = 0.4 T = 0.25
4.
E = 1/2(T-a2)2 = 0.1083
Eqn # 1: 𝑧1 = 𝑥1 ∙ 𝑤1 + 𝑏1
Eqn # 2: 𝑎1 = (𝑧1) = 1/( 1+ 𝑒 −𝑧1 )
Eqn # 3: 𝑧2 = 𝑎1 ∙ 𝑤2 + 𝑏2
Eqn # 4: 𝑎2 = (𝑧2) = 1 /(1+ 𝑒 −𝑧2)
Eqn # 5: 𝐸 = 1 /2 (𝑇 − 𝑎2)2
Updating w2:
= 0.45 - 0.4(-(0.25-0.7153))*(0.7153(1-0.7153))*(0.6023)
= 0.45 - 0.4*0.05706
= 0.427
Updating b2:
= 0.65 - 0.4*(-(0.25-0.7153))*(0.7153(1-0.7153))*1
= 0.65 - 0.4*0.0948
= 0.612
Updating w1:
= 0.40 - 0.4*(-(0.25-0.7153))*(0.7153(1-0.7153))*0.45*0.6023(1-
0.6023)*1
= 0.40-0.4*0.01021
= 0.3959
Eg. In the below problem the derivatives with respect to weights are very
small.
So when we do back propagation, we keep multiplying the factors that are
less than 1 by each other and gradients tend to smaller and smaller by
moving backward in the network.
This means the neurons in the earlier layers learn very slowly. The result is
a training process that takes too long and prediction accuracy is
compromised.
MLP‟s use one perceptron for each input (e.g. pixel in an image, multiplied
by 3 in RGB case). The amount of weights rapidly becomes unmanageable
for large images. For a 224 x 224 pixel image with 3 color channels there are
around 150,528 weights that must be trained! As a result, difficulties arise
whilst training and overfitting can occur.
In the same way we apply to remaining (above is for red image, then we do
same for green and blue) images. We can apply more than one filter. More
filters we use, we can preserve spatial dimensions better.
Pooling layer:
Pooling layer objective is to reduce the spatial dimensions of the data
propagating through the network.
1. Max Pooling is the most common, for each section of the
image we scan and keep the highest value.