Unit -4 Artificial Neural Networks

Uploaded by

dnyneswhar2655

UNIT -IV ANN

Artificial Neural Network (ANN)


An Artificial Neural Network (ANN) is a deep learning algorithm inspired by the biological neural networks of the human brain. It grew out of attempts to simulate how the brain works. An ANN behaves similarly to a biological neural network, but it is not an exact replica of one.

A plain ANN accepts only numeric, structured data as input. For unstructured, non-numeric data, specialized architectures are used: Convolutional Neural Networks (CNN) for images, and Recurrent Neural Networks (RNN) for sequential data such as text and speech. In this post, we concentrate only on Artificial Neural Networks.

Biological neurons vs Artificial neurons

Structure of Biological neurons and their functions

 Dendrites receive incoming signals.

 Soma (cell body) is responsible for processing the input and carries biochemical information.

 Axon is a tubular structure responsible for the transmission of signals.

 Synapse is present at the end of the axon and is responsible for connecting to other neurons.

Structure of Artificial neurons and their functions

 A neural network with a single layer is called a perceptron. A network with several layers, a multi-layer perceptron, is what is usually meant by an Artificial Neural Network.

 A neural network can have any number of layers, and each layer can have one or more neurons (units). In a fully connected network, every neuron in a layer is connected to every neuron in the next layer. Each layer can also use a different activation function.

 Training an ANN consists of two phases: forward propagation and backpropagation. Forward propagation multiplies the inputs by the weights, adds the bias, applies the activation function, and passes the result forward to the next layer.

 Backpropagation is the step in which the model learns. It propagates the error backward through the layers of the network to compute gradients, and an optimization function uses those gradients to search for the optimal weights.

 ANN can be applied to both regression and classification tasks by changing the activation function of the output layer accordingly (sigmoid for binary classification, softmax for multi-class classification, and linear for regression).

Perceptron. Image Source

Why Neural Networks?


 The performance of traditional Machine Learning algorithms tends to plateau as the amount of data increases, whereas an ANN typically keeps improving and outperforms them when the dataset is very large.

 Feature learning. An ANN learns hierarchically, building up representations layer by layer. For this reason, explicit feature engineering is often unnecessary.

 Neural networks can handle unstructured data like images, text, and speech. For such data, specialized architectures such as CNN (Convolutional Neural Networks) and RNN (Recurrent Neural Networks) are used.

How ANN works


The working of ANN can be broken down into two phases,

 Forward Propagation

 Back Propagation

Forward Propagation

 Forward propagation involves multiplying feature values with weights, adding a bias, and then applying an activation function, for each neuron in the neural network.

 Multiplying feature values with weights and adding a bias is the same linear combination that Linear Regression computes. If we then apply the sigmoid function to it, each neuron computes the same function as a Logistic Regression model.
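
As a minimal sketch of the forward step for one neuron, the snippet below computes the weighted sum plus bias and then applies a sigmoid. The function name `neuron_forward` and the input values are made up for illustration and do not come from the original notebook.

```python
import math

def neuron_forward(features, weights, bias):
    """Weighted sum plus bias (the linear part), followed by a sigmoid."""
    z = sum(w * x for w, x in zip(weights, features)) + bias  # linear combination
    a = 1.0 / (1.0 + math.exp(-z))                            # sigmoid activation
    return z, a

# Hypothetical feature values and parameters.
z, a = neuron_forward(features=[1.0, 2.0], weights=[0.5, -0.25], bias=0.0)
print(z, a)
```

Dropping the sigmoid leaves only the linear combination, which is the linear-regression part of the neuron.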

Activation functions

 The purpose of an activation function is to introduce non-linearity into the model. Without it, a stack of layers would collapse into a single linear transformation; with it, the network can capture complex underlying patterns. Some activation functions also squash values into a fixed interval. For example, the sigmoid activation function maps any value into the range between 0 and 1.

Logistic or Sigmoid function

 The logistic (sigmoid) function, σ(x) = 1 / (1 + e^(-x)), maps values into the range between 0 and 1.

 It is used in the output layer for binary classification.

 It may cause the vanishing gradient problem during backpropagation, which slows down training.

Sigmoid function

Tanh function

 Tanh is short for hyperbolic tangent. The tanh function maps values into the range between -1 and 1.

Hyperbolic Tangent function

ReLU function

 ReLU (Rectified Linear Unit) outputs x if x > 0 and outputs 0 otherwise.

 It mitigates the vanishing gradient problem because its gradient is 1 for positive inputs. Its output is unbounded, however, so activations and gradients can grow very large during backpropagation (the exploding gradient problem). Exploding gradients can be controlled by capping the gradients (gradient clipping).

ReLU function

Leaky ReLU function

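 Leaky ReLU is very similar to ReLU, but when x < 0 it returns 0.01 * x instead of 0.

 If the data is normalized using the Z-score, it will contain negative values, so negative values can reach the activation function. ReLU maps every negative value to 0 and passes no gradient through it, whereas Leaky ReLU keeps a small slope for negative values and so overcomes this problem.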

Leaky ReLU function
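
The four activation functions described above can be written in a few lines. This is a plain-Python sketch for scalar inputs; the 0.01 slope for Leaky ReLU is the value quoted in the text, and the function names are illustrative.

```python
import math

def sigmoid(x):
    """Maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Maps any real number into the interval (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Returns x for positive inputs and 0 otherwise."""
    return max(0.0, x)

def leaky_relu(x, slope=0.01):
    """Like ReLU, but keeps a small slope for negative inputs."""
    return x if x > 0 else slope * x

for x in (-2.0, 0.0, 2.0):
    print(x, sigmoid(x), tanh(x), relu(x), leaky_relu(x))
```

Comparing the rows for x = -2.0 shows the difference the text describes: ReLU returns 0, while Leaky ReLU returns a small negative value.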

Backpropagation

 Backpropagation finds good values for the model's parameters by computing the partial derivatives (the gradient) of the loss function with respect to each parameter and iteratively updating the parameters using those gradients.

 An optimization function (optimizer) uses the gradients computed by backpropagation to update the parameters. Its objective is to find the parameter values that minimize the loss.

The optimization functions available are,

 Gradient Descent

 Adam optimizer

 Gradient Descent with momentum

 RMS Prop (Root Mean Square Prop)

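The chain rule of calculus plays an important role in backpropagation. The formula below expresses the partial derivative of the loss (L) with respect to a weight (w).

A small change in the weight ‘w’ changes the value ‘z’ (∂z/∂w). A small change in the value ‘z’ changes the activation ‘a’ (∂a/∂z). A small change in the activation ‘a’ changes the loss ‘L’ (∂L/∂a).

Chain rule:

$$\frac{\partial L}{\partial w} = \frac{\partial L}{\partial a} \cdot \frac{\partial a}{\partial z} \cdot \frac{\partial z}{\partial w}$$

Description of the values in the chain rule: z is the weighted sum of the neuron's inputs plus the bias, a is the activation obtained by applying the activation function to z, and L is the loss computed from a.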
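
To make the chain rule concrete, here is a small sketch for a single neuron. It assumes a sigmoid activation and a squared-error loss, which are choices made for this example only, and it checks the chain-rule product against a finite-difference estimate. All names and numbers are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss_for(w, x=1.5, b=0.1, y=1.0):
    """Loss L = (a - y)^2 for one neuron with z = w*x + b and a = sigmoid(z)."""
    a = sigmoid(w * x + b)
    return (a - y) ** 2

def chain_rule_grad(w, x=1.5, b=0.1, y=1.0):
    """dL/dw as the product dL/da * da/dz * dz/dw."""
    a = sigmoid(w * x + b)
    dL_da = 2.0 * (a - y)   # derivative of the squared error
    da_dz = a * (1.0 - a)   # derivative of the sigmoid
    dz_dw = x               # derivative of w*x + b with respect to w
    return dL_da * da_dz * dz_dw

w = 0.3
analytic = chain_rule_grad(w)
h = 1e-6
numeric = (loss_for(w + h) - loss_for(w - h)) / (2 * h)  # central difference
print(analytic, numeric)
```

The two printed numbers should agree to several decimal places, which is a common sanity check when deriving gradients by hand.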

Terminologies:

Metrics

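 A metric is used to gauge the performance of the model.

 Metric functions are similar to cost functions, except that the results from evaluating a metric are not used when training the model. Note that you may use any cost function as a metric.

 We have used Mean Squared Logarithmic Error as both the metric and the cost function.

Mean Squared Logarithmic Error (MSLE) and Root Mean Squared Logarithmic Error (RMSLE):

$$\mathrm{MSLE} = \frac{1}{n} \sum_{i=1}^{n} \left( \log(1 + y_i) - \log(1 + \hat{y}_i) \right)^2, \qquad \mathrm{RMSLE} = \sqrt{\mathrm{MSLE}}$$

where y_i is the actual value and ŷ_i is the predicted value.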
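
A minimal sketch of the two metrics, assuming the common log(1 + y) convention and non-negative targets. The function names `msle` and `rmsle` and the sample values are chosen here for illustration.

```python
import math

def msle(y_true, y_pred):
    """Mean of squared differences between log(1 + actual) and log(1 + predicted)."""
    squared = [(math.log1p(t) - math.log1p(p)) ** 2 for t, p in zip(y_true, y_pred)]
    return sum(squared) / len(squared)

def rmsle(y_true, y_pred):
    """Square root of the MSLE."""
    return math.sqrt(msle(y_true, y_pred))

actual = [100.0, 200.0, 300.0]      # hypothetical targets
predicted = [110.0, 190.0, 330.0]   # hypothetical predictions
print(msle(actual, predicted), rmsle(actual, predicted))
```

Because the error is measured on a log scale, it reflects relative rather than absolute differences, which suits targets that span several orders of magnitude.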

Epoch

 A single pass through the entire training data is called an epoch. The training data is fed to the model in mini-batches, and an epoch is complete once every mini-batch has been fed to the model.
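
The relationship between mini-batches and epochs can be sketched as follows. The helper name `iterate_minibatches`, the batch size, and the toy data are assumptions made for this example.

```python
def iterate_minibatches(samples, batch_size):
    """Yield consecutive slices of the training data, one mini-batch at a time."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

data = list(range(10))   # stand-in for 10 training samples
n_epochs = 2

for epoch in range(n_epochs):
    # One epoch: every mini-batch is visited exactly once.
    batches = list(iterate_minibatches(data, batch_size=4))
    print("epoch", epoch, "batches", batches)
```

With 10 samples and a batch size of 4, each epoch consists of three mini-batches, the last one being smaller than the others.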

Hyperparameters

Hyperparameters are tunable settings that are not learned by the model during training, which means the user must provide a value for them. Because the chosen values affect the training process and the quality of the resulting model, hyperparameter optimization is used to search for good values.

The Hyperparameters used in this ANN model are,

 Number of layers

 Number of units/ neurons in a layer


 Activation function

 Initialization of weights

 Loss function

 Metric

 Optimizer

 Number of epochs

Coding ANN in Tensorflow

Load the preprocessed data

The data you feed to the ANN must be preprocessed thoroughly to yield reliable results. The training data has already been preprocessed. The preprocessing steps involved are:

 MICE Imputation

 Log transformation

 Square root transformation

 Ordinal Encoding

 Target Encoding

 Z-Score Normalization
For the detailed implementation of the above-mentioned steps, refer to my notebook on data preprocessing.

Notebook Link

Neural Architecture

 The ANN model that we are going to use consists of seven layers: one input layer, five hidden layers, and one output layer.

 The first layer (input layer) consists of 128 units/neurons with the ReLU activation function.

 The second, third, and fourth layers each consist of 256 hidden units/neurons with the ReLU activation function.

 The fifth and sixth layers each consist of 384 hidden units with the ReLU activation function.

 The last layer (output layer) consists of a single neuron, which outputs one value per input sample, that is, an array of N predictions for N samples.
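
To get a feel for the size of this architecture, the sketch below counts the trainable parameters of the layer stack. It assumes every layer is fully connected (dense) with one bias per unit, and it uses a hypothetical input width `n_features = 20`, since the real number of features is not stated here.

```python
def dense_param_count(n_inputs, layer_units):
    """Total weights and biases in a stack of fully connected layers."""
    total = 0
    fan_in = n_inputs
    for units in layer_units:
        total += fan_in * units + units  # weight matrix plus one bias per unit
        fan_in = units                   # this layer's output feeds the next layer
    return total

layer_units = [128, 256, 256, 256, 384, 384, 1]  # widths listed in the text
n_features = 20                                  # hypothetical input width
print(dense_param_count(n_features, layer_units))
```

Most of the parameters sit in the wide hidden layers; the input width only affects the first weight matrix.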

Multilayer Perceptron
A single Perceptron can only learn linearly separable patterns. The Multilayer Perceptron was developed to tackle this limitation. It is a neural network where the mapping between inputs and output is non-linear.

A Multilayer Perceptron has input and output layers, and one or more hidden layers with many neurons stacked together. And while in the Perceptron the neuron must have an activation function that imposes a threshold, like ReLU or sigmoid, neurons in a Multilayer Perceptron can use any arbitrary activation function.

Multilayer Perceptron. (Image by author)

The Multilayer Perceptron falls under the category of feedforward algorithms, because inputs are combined with the initial weights in a weighted sum and subjected to the activation function, just like in the Perceptron. The difference is that each linear combination is propagated to the next layer.

Each layer feeds the next one with the result of its computation, its internal representation of the data. This goes all the way through the hidden layers to the output layer.

But there is more to it.

If the algorithm only computed the weighted sums in each neuron, propagated results to the output layer, and stopped there, it wouldn’t be able to learn the weights that minimize the cost function. If the algorithm only computed one iteration, there would be no actual learning.

This is where Backpropagation[7] comes into play.

Backpropagation
Backpropagation is the learning mechanism that allows the Multilayer Perceptron to iteratively adjust the weights in the network, with the goal of minimizing the cost function.

There is one hard requirement for backpropagation to work properly. The function that combines inputs and weights in a neuron, for instance the weighted sum, and the threshold function, for instance ReLU, must be differentiable (ReLU is differentiable everywhere except at 0, where implementations simply pick a fixed value for the derivative). These functions must have a bounded derivative, because Gradient Descent is typically the optimization function used in the Multilayer Perceptron.

Multilayer Perceptron, highlighting the Feedforward and Backpropagation steps. (Image by author)

In each iteration, after the weighted sums are forwarded through all layers, the gradient of the Mean Squared Error is computed across all input and output pairs. Then it is propagated back: starting from the output layer and moving back toward the first hidden layer, the weights of each layer are updated using the gradient. That’s how the error signal travels back to the starting point of the neural network!

One iteration of Gradient Descent. (Image by author)

This process keeps going until the gradient for each input-output pair has converged, meaning the newly computed gradient hasn’t changed by more than a specified convergence threshold compared to the previous iteration.
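
The stopping rule described above can be sketched on the smallest possible model: a single weight w fitted with the Mean Squared Error. The data, learning rate, and threshold below are illustrative assumptions; a real Multilayer Perceptron would update many weights at once.

```python
def mse_gradient(w, xs, ys):
    """Gradient of the mean squared error of the model y = w * x with respect to w."""
    n = len(xs)
    return (2.0 / n) * sum((w * x - y) * x for x, y in zip(xs, ys))

def gradient_descent(xs, ys, lr=0.05, threshold=1e-8, max_iters=10_000):
    """Update w until the gradient changes by less than `threshold` between iterations."""
    w = 0.0
    prev_grad = None
    for step in range(max_iters):
        grad = mse_gradient(w, xs, ys)
        if prev_grad is not None and abs(grad - prev_grad) < threshold:
            break                 # converged: the gradient has stopped changing
        w -= lr * grad            # move against the gradient
        prev_grad = grad
    return w, step

# Toy data generated by y = 2 * x, so the fitted weight should approach 2.
w, steps = gradient_descent([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(w, steps)
```

The `max_iters` cap is a safety net in case the threshold is never reached, for example when the learning rate is too large.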

Let’s see this with a real-world example.

Using Perceptron for Sentiment Analysis


Your parents have a cozy bed and breakfast in the
countryside with the traditional guestbook in the lobby.
Every guest is welcome to write a note before they leave
and, so far, very few leave without writing a short note or
inspirational quote. Some even leave drawings of Molly,
the family dog.

Summer season is coming to a close, which means cleaning time, before work starts picking up again for the holidays. In the old storage room, you’ve stumbled upon a box full of guestbooks your parents kept over the years. Your first instinct? Let’s read everything!

After reading a few pages, you just had a much better idea. Why not try to understand if guests left a positive or negative message?

You’re a Data Scientist, so this is the perfect task for a binary classifier.

So you picked a handful of guestbooks at random to use as a training set, transcribed all the messages, gave each one a classification of positive or negative sentiment, and then asked your cousins to classify them as well.

In Natural Language Processing tasks, some of the text can be ambiguous, so usually you have a corpus of text where the labels were agreed upon by 3 experts, to avoid ties.

Sample of guest messages. (Image by author)

With the final labels assigned to the entire corpus, you decided to fit the data to a Perceptron, the simplest neural network of all.

But before building the model itself, you needed to turn that free text into a format the Machine Learning model could work with.

In this case, you represented the text from the guestbooks as a vector using the Term Frequency-Inverse Document Frequency (TF-IDF). This method encodes text as a statistic of how frequent each word, or term, is in each message relative to how common it is across the entire corpus.

In Python you used the TfidfVectorizer class from scikit-learn, removing English stop-words and applying L1 normalization.
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(stop_words='english', lowercase=True, norm='l1')
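
To show what TF-IDF with L1 normalization computes, here is a small pure-Python sketch. It uses the textbook weighting tf × log(N / df), which is not the exact smoothed formula scikit-learn applies, and it skips stop-word removal. The function name `tfidf_l1` and the sample messages are made up for this example.

```python
import math
from collections import Counter

def tfidf_l1(documents):
    """Return one {term: weight} dict per document, with weights summing to 1."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * math.log(n_docs / df[t]) for t, c in counts.items()}
        total = sum(abs(v) for v in weights.values())
        if total > 0:  # L1 normalization: scale so absolute weights sum to 1
            weights = {t: v / total for t, v in weights.items()}
        vectors.append(weights)
    return vectors

notes = ["lovely stay lovely dog", "noisy room", "lovely room"]  # hypothetical messages
vectors = tfidf_l1(notes)
print(vectors[0])
```

Terms that appear in fewer messages get a larger inverse document frequency, so a rare word can outweigh a word that is repeated but common across the corpus.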

BACKPROPAGATION ALGORITHM

The backpropagation algorithm is probably the most fundamental building block in a neural network. It was first introduced in the 1960s and popularized more than 20 years later, in 1986, by Rumelhart, Hinton and Williams in a paper called “Learning representations by back-propagating errors”.

The algorithm is used to effectively train a neural network through a method called the chain rule. In simple terms, after each forward pass through a network, backpropagation performs a backward pass while adjusting the model’s parameters (weights and biases).

In this article, I would like to go over the mathematical process of training and optimizing a simple 4-layer neural network. I believe this would help the reader understand how backpropagation works as well as realize its importance.

Define the neural network model

The 4-layer neural network consists of 4 neurons for the input layer, 4 neurons for the hidden layers (2 in each of the two hidden layers) and 1 neuron for the output layer.

Simple 4-layer neural network illustration

Input layer

The neurons, colored in purple, represent the input data. These can be as simple as scalars or more complex like vectors or multidimensional matrices.

Equation for input x_i:

$$a^1_i = x_i, \quad i \in \{1, 2, 3, 4\}$$

The first set of activations (a) is equal to the input values. NB: “activation” is the neuron’s value after applying an activation function. See below.

Hidden layers

The final values at the hidden neurons, colored in green, are computed using z^l (weighted inputs in layer l) and a^l (activations in layer l). For layers 2 and 3 the equations are:

 l=2

Equations for z² and a²:

$$z^2 = W^1 x + b^1, \qquad a^2 = f(z^2)$$

 l=3

Equations for z³ and a³:

$$z^3 = W^2 a^2 + b^2, \qquad a^3 = f(z^3)$$

W¹ and W² are the weights feeding layers 2 and 3, while b¹ and b² are the biases of those layers.

Activations a² and a³ are computed using an activation function f. Typically, this function f is non-linear (e.g. sigmoid, ReLU, tanh) and allows the network to learn complex patterns in data. We won’t go over the details of how activation functions work, but, if interested, I strongly recommend reading this great article.

Looking carefully, you can see that all of x, z², a², z³, a³, W¹, W², b¹ and b² are missing the subscripts shown in the 4-layer network illustration above. The reason is that we have combined all parameter values in matrices, grouped by layers. This is the standard way of working with neural networks and one should be comfortable with the calculations. However, I will go over the equations to clear up any confusion.

Let’s pick layer 2 and its parameters as an example. The same operations can be applied to any layer in the network.

 W¹ is a weight matrix of shape (n, m) where n is the number of output neurons (neurons in the next layer) and m is the number of input neurons (neurons in the previous layer). For us, n = 2 and m = 4.

Equation for W¹:

$$W^1 = \begin{bmatrix} w^1_{11} & w^1_{12} & w^1_{13} & w^1_{14} \\ w^1_{21} & w^1_{22} & w^1_{23} & w^1_{24} \end{bmatrix}$$

NB: The first number in any weight’s subscript matches the index of the neuron in the next layer (in our case this is the Hidden_1 layer) and the second number matches the index of the neuron in the previous layer (in our case this is the Input layer).

 x is the input vector of shape (m, 1) where m is the number of input neurons. For us, m = 4.

Equation for x:

$$x = \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix}$$

 b¹ is a bias vector of shape (n, 1) where n is the number of neurons in the current layer. For us, n = 2.

Equation for b¹:

$$b^1 = \begin{bmatrix} b^1_1 \\ b^1_2 \end{bmatrix}$$

Following the equation for z², we can use the above definitions of W¹, x and b¹ to derive “Equation for z²”:

$$z^2 = W^1 x + b^1 = \begin{bmatrix} w^1_{11} x_1 + w^1_{12} x_2 + w^1_{13} x_3 + w^1_{14} x_4 + b^1_1 \\ w^1_{21} x_1 + w^1_{22} x_2 + w^1_{23} x_3 + w^1_{24} x_4 + b^1_2 \end{bmatrix}$$

Now carefully observe the neural network illustration from above.

Input and Hidden_1 layers

You will see that z² can be expressed using (z_1)² and (z_2)², where (z_1)² and (z_2)² are each the sum of the products of every input x_i with the corresponding weight (W_ij)¹, plus the bias.

This leads to the same “Equation for z²” and proves that the matrix representations for z², a², z³ and a³ are correct.
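
The equivalence between the matrix form and the per-neuron sums can be checked numerically. The sketch below uses a 2 × 4 weight matrix, matching n = 2 and m = 4 from the text; the numeric values and the helper name `matvec_plus_bias` are invented for illustration.

```python
def matvec_plus_bias(W, x, b):
    """Compute W x + b for a matrix W given as a list of rows."""
    return [sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j
            for row, b_j in zip(W, b)]

# Hypothetical values: 2 neurons in the next layer, 4 input neurons.
W1 = [[0.1, 0.2, 0.3, 0.4],
      [0.5, 0.6, 0.7, 0.8]]
x = [1.0, 2.0, 3.0, 4.0]
b1 = [0.5, -0.5]

# Matrix form: z2 = W1 x + b1
z2 = matvec_plus_bias(W1, x, b1)

# Per-neuron form: each entry written out as an explicit sum.
z2_1 = W1[0][0]*x[0] + W1[0][1]*x[1] + W1[0][2]*x[2] + W1[0][3]*x[3] + b1[0]
z2_2 = W1[1][0]*x[0] + W1[1][1]*x[1] + W1[1][2]*x[2] + W1[1][3]*x[3] + b1[1]

print(z2, [z2_1, z2_2])
```

Row j of the weight matrix holds exactly the weights that feed neuron j of the next layer, which is why the two forms agree.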

Output layer

The final part of a neural network is the output layer, which produces the predicted value. In our simple example, it is presented as a single neuron, colored in blue and evaluated as follows:

Equation for output s:

$$s = W^3 a^3$$

Again, we are using the matrix representation to simplify the equation. One can use the above techniques to understand the underlying logic. Please leave any comments below if you find yourself lost in the equations; I would love to help!

Forward propagation and evaluation


The equations above form the network’s forward propagation. Here is a short overview:

Overview of forward propagation equations:

$$a^1 = x$$
$$z^2 = W^1 x + b^1, \qquad a^2 = f(z^2)$$
$$z^3 = W^2 a^2 + b^2, \qquad a^3 = f(z^3)$$
$$s = W^3 a^3$$

The final step in a forward pass is to evaluate the predicted output s against an expected output y.

The output y is part of the training dataset (x, y) where x is the input (as we saw in the previous section).

Evaluation between s and y happens through a cost function. This can be as simple as MSE (mean squared error) or more complex like cross-entropy.

We name this cost function C and denote it as follows:

Equation for cost function C:

$$C = \mathrm{cost}(s, y)$$

where cost can be equal to MSE, cross-entropy or any other cost function.
Based on C’s value, the model “knows” how much to adjust
its parameters in order to get closer to the expected
output y. This happens using the backpropagation
algorithm.

Backpropagation and computing gradients


According to the paper from 1986, backpropagation:

repeatedly adjusts the weights of the connections in the network so as to minimize a measure of the difference between the actual output vector of the net and the desired output vector.

and

the ability to create useful new features distinguishes back-propagation from earlier, simpler methods…

In other words, backpropagation aims to minimize the cost function by adjusting the network’s weights and biases. The level of adjustment is determined by the gradients of the cost function with respect to those parameters.

One question may arise: why compute gradients?

To answer this, we first need to revisit some calculus terminology:

 The gradient of a function C(x_1, x_2, …, x_m) at a point x is the vector of the partial derivatives of C at x.

Equation for derivative of C in x:

$$\frac{\partial C}{\partial x} = \left[ \frac{\partial C}{\partial x_1}, \frac{\partial C}{\partial x_2}, \ldots, \frac{\partial C}{\partial x_m} \right]$$

 The derivative of a function C measures the sensitivity of the function value (output value) to a change in its argument x (input value). In other words, the derivative tells us the direction C is going.

 The gradient shows how much the parameter x needs to change (in the positive or negative direction) to minimize C.

Computing those gradients is done using a technique called the chain rule.

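For a single weight (w_jk)^l, the gradient is:

Equations for derivative of C in a single weight (w_jk)^l, with the weights of layer l feeding the weighted inputs of layer l+1 as in the forward equations:

$$z^{l+1}_j = \sum_k w^l_{jk} a^l_k + b^l_j$$
$$\frac{\partial C}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^{l+1}_j} \cdot \frac{\partial z^{l+1}_j}{\partial w^l_{jk}} = \frac{\partial C}{\partial z^{l+1}_j} \cdot a^l_k$$

A similar set of equations can be applied to (b_j)^l:

Equations for derivative of C in a single bias (b_j)^l:

$$\frac{\partial C}{\partial b^l_j} = \frac{\partial C}{\partial z^{l+1}_j} \cdot \frac{\partial z^{l+1}_j}{\partial b^l_j} = \frac{\partial C}{\partial z^{l+1}_j}$$

The common part in both equations is often called the “local gradient” and is expressed as follows:

Equation for local gradient:

$$\delta^{l+1}_j = \frac{\partial C}{\partial z^{l+1}_j}$$

The “local gradient” can easily be determined using the chain rule. I won’t go over the process now but if you have any questions, please comment below.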

The gradients allow us to optimize the model’s parameters:

Algorithm for optimizing weights and biases (also called “Gradient descent”):

    while the termination condition is not met:
        w := w − ε · ∂C/∂w
        b := b − ε · ∂C/∂b

 Initial values of w and b are randomly chosen.

 Epsilon (ε) is the learning rate. It determines the gradient’s influence.

 w and b are matrix representations of the weights and biases. The derivative of C in w or b can be calculated using the partial derivatives of C in the individual weights or biases.

 The termination condition is met once the cost function is minimized.

I would like to dedicate the final part of this section to a simple example in which we will calculate the gradient of C with respect to a single weight (w_22)².

Let’s zoom in on the bottom part of the above neural network:

Visual representation of backpropagation in a neural network

Weight (w_22)² connects (a_2)² and (z_2)³, so computing the gradient requires applying the chain rule through (z_2)³ and (a_2)³:

Equation for derivative of C in (w_22)²:

$$\frac{\partial C}{\partial w^2_{22}} = \frac{\partial C}{\partial a^3_2} \cdot \frac{\partial a^3_2}{\partial z^3_2} \cdot \frac{\partial z^3_2}{\partial w^2_{22}} = \frac{\partial C}{\partial a^3_2} \cdot f'(z^3_2) \cdot a^2_2$$

Calculating the final value of the derivative of C in (a_2)³ requires knowledge of the function C. Since C is dependent on (a_2)³, calculating the derivative should be fairly straightforward.
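
As a worked check of this example, the sketch below builds a tiny 4-2-2-1 network, computes the gradient of C with respect to (w_22)² by the chain rule, and compares it with a finite-difference estimate. Several things are assumptions made here rather than facts from the article: f is taken to be the sigmoid, the cost is half the squared error, the output is s = W³a³ without a bias, the weight (w_22)² is taken to feed (z_2)³ as in z³ = W²a² + b², and all parameter values are arbitrary.

```python
import copy
import math

def f(z):
    """Sigmoid, used here as the assumed activation function f."""
    return 1.0 / (1.0 + math.exp(-z))

def affine(W, a, b):
    """Return W a + b for a matrix W given as a list of rows."""
    return [sum(w * v for w, v in zip(row, a)) + b_j for row, b_j in zip(W, b)]

def forward(x, W1, b1, W2, b2, W3):
    a2 = [f(z) for z in affine(W1, x, b1)]       # first hidden layer
    z3 = affine(W2, a2, b2)
    a3 = [f(z) for z in z3]                      # second hidden layer
    s = sum(w * a for w, a in zip(W3[0], a3))    # output s = W3 a3
    return a2, z3, a3, s

def cost(s, y):
    return 0.5 * (s - y) ** 2                    # assumed cost function

# Arbitrary illustrative values.
x, y = [1.0, 0.5, -1.0, 2.0], 1.0
W1 = [[0.1, -0.2, 0.3, 0.4], [-0.5, 0.6, 0.7, -0.8]]
b1 = [0.05, -0.05]
W2 = [[0.2, -0.3], [0.4, 0.5]]
b2 = [0.1, -0.1]
W3 = [[0.6, -0.7]]

a2, z3, a3, s = forward(x, W1, b1, W2, b2, W3)

# Chain rule: dC/dw = dC/da3_2 * da3_2/dz3_2 * dz3_2/dw, with w = W2[1][1].
dC_da3_2 = (s - y) * W3[0][1]
da3_2_dz3_2 = a3[1] * (1.0 - a3[1])              # sigmoid derivative f'(z3_2)
dz3_2_dw = a2[1]
analytic = dC_da3_2 * da3_2_dz3_2 * dz3_2_dw

def cost_with_w22(value):
    """Cost of the network when only W2[1][1] is replaced by `value`."""
    W2_shifted = copy.deepcopy(W2)
    W2_shifted[1][1] = value
    return cost(forward(x, W1, b1, W2_shifted, b2, W3)[3], y)

h = 1e-6
numeric = (cost_with_w22(W2[1][1] + h) - cost_with_w22(W2[1][1] - h)) / (2 * h)
print(analytic, numeric)
```

If the two numbers agree, the hand-derived chain of three factors is consistent with how the cost actually responds to a small change in that weight.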
 LIMITATIONS OF MLPs

Multilayer perceptrons (MLPs) are a type of artificial neural network that is commonly
used for classification and regression tasks. However, MLPs have some limitations
that can make them difficult to use in certain applications.

Here are some of the limitations of MLPs:

 Overfitting: MLPs can be prone to overfitting, which means that they can learn
the training data too well and not generalize well to new data. This can be a
problem if the training data is not large or representative enough.
 Vanishing gradients: MLPs with many hidden layers can suffer from vanishing gradients, which means that the gradients propagated back to the early layers can become very small. This can make it difficult for the network to learn, as the updates to the weights in those layers become very small.
 Interpretability: MLPs can be difficult to interpret, as the weights of the
connections between neurons can be complex and difficult to understand.
This can make it difficult to understand how the network is making its
predictions.
 Computational complexity: MLPs can be computationally expensive to train,
especially if the training data is large or the network has many hidden layers.

Despite these limitations, MLPs are a powerful tool for machine learning and can be
used to solve a variety of problems. However, it is important to be aware of the
limitations of MLPs and to take steps to mitigate them.

Here are some ways to mitigate the limitations of MLPs:

 Regularization: Regularization is a technique that can help to prevent


overfitting. There are a variety of regularization techniques available, such as
L1 regularization and L2 regularization.
 Data augmentation: Data augmentation is a technique that increases the size of the training data by generating modified copies of existing examples. This can help to prevent overfitting by providing the network with more data to learn from.
 Deep learning: Deep learning is a type of machine learning that uses neural networks with multiple hidden layers. Deep networks can be more powerful than shallow MLPs, but they can also be more difficult to train and interpret.
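
As a small illustration of the regularization item in the list above, the sketch below adds an L2 penalty to a gradient descent step. The penalty form λ·Σw², the function names, and the numbers are assumptions chosen for this example.

```python
def l2_penalty(weights, lam):
    """L2 regularization term added to the cost: lam * sum of squared weights."""
    return lam * sum(w * w for w in weights)

def regularized_step(weights, grads, lr, lam):
    """One gradient descent step on cost + L2 penalty.

    The penalty contributes 2 * lam * w to each weight's gradient,
    which pulls every weight toward zero.
    """
    return [w - lr * (g + 2.0 * lam * w) for w, g in zip(weights, grads)]

weights = [1.0, -2.0]
data_grads = [0.0, 0.0]   # pretend the data gradient is zero to isolate the penalty
updated = regularized_step(weights, data_grads, lr=0.1, lam=0.5)
print(l2_penalty(weights, 0.5), updated)
```

With a zero data gradient, each weight simply shrinks by a constant factor, which is why L2 regularization is also known as weight decay.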

 Generalized Delta Learning Rule


The generalized delta rule is a learning rule used in artificial neural networks. It is a
generalization of the delta rule, which is a simple learning rule that can be used to
train single-layer neural networks.

The generalized delta rule can be used to train neural networks with multiple layers.
It works by adjusting the weights of the connections between neurons so that the
network's output is closer to the desired output.

The generalized delta rule is a gradient descent algorithm, which means that it
updates the weights of the connections in the direction of the steepest descent of the
error function. The error function is a measure of how far the network's output is from
the desired output.

The generalized delta rule is a powerful learning rule that can be used to train neural
networks to solve a variety of problems. However, it can be computationally
expensive to train neural networks with many layers using the generalized delta rule.

Here is the formula for the generalized delta rule:

w_ji = w_ji + η * δ_j * x_i

where:

 w_ji is the weight of the connection from neuron i to neuron j
 η is the learning rate
 δ_j is the error term at neuron j
 x_i is the input that neuron j receives from neuron i

For an output neuron, the error term δ_j is the difference between the desired output and the actual output of the neuron, multiplied by the derivative of the neuron's activation function. For a hidden neuron, it is computed from the error terms of the neurons in the next layer, which is what lets the rule train multi-layer networks. The learning rate is a parameter that controls how much the weights are updated. The input x_i is the output of neuron i.
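
The update can be sketched for a single sigmoid output neuron. In this example the error term is taken as (target − output) multiplied by the sigmoid derivative, and each weight is moved by the learning rate times that error term times the input arriving on the connection. The function name `delta_rule_step` and the values are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def delta_rule_step(weights, inputs, target, lr):
    """Apply one generalized delta rule update to a single sigmoid neuron."""
    net = sum(w * x for w, x in zip(weights, inputs))
    out = sigmoid(net)
    delta = (target - out) * out * (1.0 - out)   # error times sigmoid derivative
    new_weights = [w + lr * delta * x for w, x in zip(weights, inputs)]
    return new_weights, out

inputs, target = [1.0, 0.5], 1.0     # hypothetical training example
weights = [0.0, 0.0]

weights_after, out_before = delta_rule_step(weights, inputs, target, lr=0.5)
_, out_after = delta_rule_step(weights_after, inputs, target, lr=0.5)
print(out_before, out_after)
```

After one update the neuron's output moves closer to the target, which is the behaviour the rule is designed to produce.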

