MACHINE LEARNING
(20BT60501)
COURSE DESCRIPTION:
Concept learning, General-to-specific ordering, Decision tree learning, Support vector
machines, Artificial neural networks, Multilayer neural networks, Bayesian learning,
Instance-based learning, Reinforcement learning.
Subject: MACHINE LEARNING (20BT60501)
Topic: Unit III – ARTIFICIAL NEURAL NETWORKS
Prepared By:
Dr.J.Avanija
Professor
Dept. of CSE
Sree Vidyanikethan Engineering College
Tirupati.
Unit III – ARTIFICIAL NEURAL NETWORKS
Neural network representations
Appropriate problems for neural network learning
Perceptrons
Multilayer networks and Backpropagation algorithm
Convergence and local minima
Representational power of feedforward networks
Hypothesis space search and inductive bias
Hidden layer representations, Generalization
Overfitting, Stopping criterion
An Example - Face Recognition.
Features of ANNs
ANNs perform well, and generally better with a larger number of hidden units; more hidden units usually produce lower error.
Determining the network topology is difficult.
Choosing a single learning rate that works well throughout training is difficult.
It is difficult to reduce training time by altering the network topology or learning parameters.
ANNs give highly accurate predictive models for a large number of different types of problems.
Ease of use and deployment is poor, since several design choices must be made:
o Connections between nodes
o Number of units
o Training level
Learning capability: the model is built one record at a time.
Features of ANNs . . .
Weaknesses
Long training time
Require a number of parameters that are typically best determined
empirically, e.g., the network topology or "structure".
Poor interpretability: it is difficult to interpret the symbolic
meaning behind the learned weights and the "hidden units"
in the network.
Strengths
High tolerance to noisy data
Ability to classify untrained patterns
Well-suited for continuous-valued inputs and outputs
Successful on a wide array of real-world data
Algorithms are inherently parallel
Techniques have recently been developed for the
extraction of rules from trained neural networks
Hypothesis Space and Inductive Bias
Every possible combination of network weights is a potential candidate.
All potential candidates form the Hypothesis Space.
The hypothesis space can be viewed as an N-dimensional Euclidean space of the N network weights.
The hypothesis space of a neural network is a continuous space.
The error E of a network is differentiable with respect to the continuous parameters of the hypothesis
space.
These two factors result in a well-defined error gradient, which leads to efficient search
strategies.
Inductive Bias – can be defined as the set of assumptions (implicit or explicit) made by learning
algorithms in order to perform induction (or generalization).
Inductive Bias of NN – “Smooth Interpolation between the data points”.
Representational Power of Feedforward NNs
Representational power specifies what set of functions (problems) can be represented by a NN.
Boolean Functions –
• Every Boolean function can be represented by a NN with 2 layers (see the sketch after this list).
• The maximum number of hidden neurons required equals the number of samples in the training data.
• In practice, the NN may be designed with fewer hidden neurons.
Continuous Functions –
• Every bounded continuous function can be approximated with arbitrarily small error E by a
NN with 2 layers.
• Hidden layer uses sigmoid function.
• Output layer uses unthresholded linear function.
Arbitrary Functions –
• Any arbitrary function can be approximated by a NN with 3 layers.
• Hidden layer uses sigmoid function.
• Output layer uses unthresholded linear function.
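As an illustration of the Boolean-function case above, here is a minimal sketch (Python/NumPy) of a 2-layer network of sigmoid units computing XOR. The weights are hand-picked for illustration, not learned, and the specific values are my assumptions, not taken from the slides.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hand-picked weights (illustrative, not learned) for XOR with
# 2 hidden sigmoid units and 1 sigmoid output unit.
W_hidden = np.array([[ 20.0,  20.0],    # unit h1 ~ OR(x1, x2)
                     [-20.0, -20.0]])   # unit h2 ~ NAND(x1, x2)
b_hidden = np.array([-10.0, 30.0])
W_out = np.array([20.0, 20.0])          # output ~ AND(h1, h2) = XOR(x1, x2)
b_out = -30.0

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = sigmoid(W_hidden @ np.array(x) + b_hidden)
    y = sigmoid(W_out @ h + b_out)
    print(x, "->", round(float(y)))      # prints the XOR truth table
```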
Hidden Layer Representations
Hidden layers capture the characteristics of training data to learn the target function.
The training samples only constrain the number of input neurons and output neurons.
The representations formed in the hidden layers and hidden neurons are not explicitly specified by the human designer.
Hence, the NN has the capability to adjust its weights so as to discover internal representations that solve
the given problem with the minimum possible error E.
This ability of multilayer NNs to automatically discover useful hidden-layer representations
is a key feature of ANN learning (a small sketch follows).
The greater the number of hidden layers/neurons, the more complex the problems that can be
represented by the NN.
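A minimal sketch of this idea, assuming the classic 8x3x8 identity-mapping task (my example, not part of the slides): a network with 8 inputs, 3 sigmoid hidden units and 8 outputs is trained to reproduce its one-hot input, and the hidden activations are then inspected to see the compact code the hidden layer has discovered. The learning rate, momentum and iteration count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# 8 one-hot training patterns; the target output equals the input.
X = np.eye(8)
T = X.copy()

# 8 -> 3 -> 8 network with small random initial weights (last column = bias).
W1 = rng.uniform(-0.1, 0.1, (3, 9))
W2 = rng.uniform(-0.1, 0.1, (8, 4))

eta, alpha = 0.3, 0.9                  # learning rate and momentum (assumed values)
dW1_prev, dW2_prev = 0.0, 0.0

for _ in range(5000):
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)                    # add bias input
        h = sigmoid(W1 @ xb)
        hb = np.append(h, 1.0)
        o = sigmoid(W2 @ hb)
        # Backpropagation for squared error with sigmoid units.
        delta_o = o * (1 - o) * (t - o)
        delta_h = h * (1 - h) * (W2[:, :3].T @ delta_o)
        dW2 = eta * np.outer(delta_o, hb) + alpha * dW2_prev
        dW1 = eta * np.outer(delta_h, xb) + alpha * dW1_prev
        W2 += dW2; W1 += dW1
        dW2_prev, dW1_prev = dW2, dW1

# Inspect the learned hidden representation (roughly a 3-bit code per input).
for x in X:
    h = sigmoid(W1 @ np.append(x, 1.0))
    print(x.argmax(), np.round(h, 1))
```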
Convergence and Local Minima
BPN (the Backpropagation Network) uses gradient descent to search through the hypothesis space.
The objective is to move through the hypothesis space in the direction that reduces the error
E.
Because the error surface for multilayer NNs may contain multiple local minima, the gradient
descent search may get stuck at one of these local minima.
As a result, BPN is not guaranteed to converge to the global minimum; it may converge to a local
minimum instead.
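A toy one-dimensional illustration of this point (the error function below is an assumed example, not a real network error surface): gradient descent started from different points converges to different minima, and only one of them is the global minimum.

```python
# Toy 1-D "error surface" with two minima: a local one near w = +0.96
# and the global one near w = -1.04 (illustrative function, not a real NN error).
def E(w):
    return w**4 - 2*w**2 + 0.3*w

def dE(w):
    return 4*w**3 - 4*w + 0.3

def gradient_descent(w, eta=0.01, steps=2000):
    for _ in range(steps):
        w -= eta * dE(w)          # step downhill along the gradient
    return w

for w0 in (-2.0, 2.0):
    w = gradient_descent(w0)
    print(f"start {w0:+.1f} -> converged to w = {w:+.3f}, E = {E(w):.3f}")
```

Starting from -2.0 the search reaches the global minimum; starting from +2.0 it gets stuck in the shallower local minimum.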
Convergence and Local Minima . . .
In spite of this disadvantage, BPN remains a highly popular model, even for NNs with a large number
of weights.
A large number of weights corresponds to a very high-dimensional error surface;
a local minimum with respect to one weight may not be a local minimum with respect to the other weights;
hence the extra dimensions provide escape routes, and BPN may not get stuck at a local minimum.
If the initial weights are near zero:
during early iterations, the sigmoid function behaves as a smooth, nearly linear function;
as iterations pass, the weights tend to increase in magnitude in order to reduce the error E;
this is when the NN starts representing complex/nonlinear functions;
this is also the region that may contain more local minima, where BPN may get stuck at a
local minimum;
but it may be hoped that by this time BPN has already come close enough to the global
minimum;
so it is acceptable even if BPN gets stuck in a local minimum that is close to the global minimum.
Convergence and Local Minima . . .
Regarding gradient descent over complex error surfaces, the following heuristics may be
attempted to alleviate the problem of local minima:
1) Add a momentum term to the weight-update rule –
Δw_ji(n) = η δ_j x_ji + α Δw_ji(n−1), where α (0 ≤ α < 1) is the momentum constant and the
second term is the momentum term (a short sketch follows the two effects listed below).
Momentum has two effects on the gradient descent:
It keeps the descent in the same direction through the iterations and
It keeps the descent going through local minima and flat regions.
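A minimal sketch of the momentum update written as code; the function name, the quadratic toy error and the values of η and α below are illustrative assumptions.

```python
import numpy as np

def update_with_momentum(W, grad_E, prev_delta, eta=0.05, alpha=0.9):
    """One gradient-descent step with a momentum term:
    delta_W(n) = -eta * dE/dW + alpha * delta_W(n-1)."""
    delta = -eta * grad_E + alpha * prev_delta
    return W + delta, delta

# Illustrative use on a toy quadratic error E(W) = 0.5 * ||W||^2 (gradient = W).
W = np.array([1.0, -2.0])
prev = np.zeros_like(W)
for step in range(200):
    W, prev = update_with_momentum(W, grad_E=W, prev_delta=prev)
print(W)   # W approaches the minimum at the origin
```

Because each step reuses a fraction α of the previous step, the search keeps moving in the same direction and can roll through small dips and flat regions.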
Convergence and Local Minima . . .
2) Use Stochastic Gradient Descent instead of Standard Gradient Descent
Stochastic gradient descent descends a slightly different (approximate) error surface for each
training example, and these surfaces have different local minima. Hence, it can be hoped that BPN
will not get stuck in one of these local minima.
3) Train multiple NNs with different initial weights.
Different initial weights lead to different starting points on the error surface, and hence possibly to different local minima;
the NN with the best performance can be selected as the final model.
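A sketch of this heuristic using scikit-learn's MLPClassifier as a stand-in network; the dataset, network size and number of restarts are arbitrary choices for illustration. The same data is used to train several networks that differ only in their random initial weights, and the one with the best validation accuracy is kept.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic data standing in for a real training set.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)

best_net, best_score = None, -1.0
for seed in range(5):                        # 5 different weight initializations
    net = MLPClassifier(hidden_layer_sizes=(8,), max_iter=1000, random_state=seed)
    net.fit(X_train, y_train)
    score = net.score(X_val, y_val)          # validation accuracy
    if score > best_score:
        best_net, best_score = net, score

print("best validation accuracy:", best_score)
```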
Generalization, Overfitting
Generalization is the capability of the model to perform well on unseen data.
Overfitting – why does overfitting occur at later iterations of the learning process?
Through the iterations, weights tend to increase their values to reduce the error E.
Larger weight values increase model complexity.
This leads to overfitting.
Solutions:
Weight Decay (see the sketch below)
Validation Data
K-Fold Cross Validation
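A minimal sketch of weight decay, the first of the solutions listed above: an L2 penalty λ‖W‖² is added to the error E, which shrinks every weight a little on each update and counteracts the growth of weights that drives overfitting. The function name and constants are illustrative assumptions.

```python
import numpy as np

def update_with_weight_decay(W, grad_E, eta=0.05, lam=0.001):
    """Gradient step on E(W) + lam * ||W||^2 (L2 weight decay).

    The extra 2*lam*W term pulls every weight towards zero on each
    iteration, discouraging the large weights associated with overfitting."""
    return W - eta * (grad_E + 2.0 * lam * W)
```

The other two solutions (holding out validation data and k-fold cross validation) monitor error on data not used for training and stop or select models accordingly.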
Case Study: ALVINN
Autonomous Land Vehicle In a Neural Network - 1989
ALVINN is a neural network designed to steer an
autonomous vehicle driving at normal speeds on
public highways.
A forward-pointed camera is mounted on the
vehicle.
The camera takes images of resolution 120 x 128
pixels.
Currently ALVINN takes images from a camera and
a laser range finder as input and produces as
output the direction the vehicle should travel in
order to follow the road.
ALVINN is trained for 5 minutes to observe and
learn from human driving.
It has then been tested successfully for
autonomous driving of 90 miles at speeds of up
to 70 miles per hour on public highways (driving
in the left lane of the highway, with other
vehicles present).
Case Study: ALVINN . . .
ALVINN is a 2-layer Backpropagation NN
with
960 input neurons, 4 hidden neurons
and 30 output neurons.
Here the individual units are interconnected
in layers that form a directed acyclic graph.
It is a feedforward network.
It is a fully connected network.
The output layer is a linear representation of
the direction the vehicle should travel in order
to keep the vehicle on the road.
Case Study: ALVINN . . .
The 120 x 128 image taken by camera is
converted into a coarse-resolution image of 30
x 32.
Each coarse resolution pixel intensity is
obtained by selecting the intensity of a single
pixel at random from the appropriate region
within the high-resolution image.
This 30 x 32 coarse-resolution image is
used as input to the network.
This method significantly reduces the
computation required to produce the coarse-
resolution image from the available high-
resolution image.
This efficiency is especially important when
the network must be used to process many
images per second while autonomously driving
the vehicle.
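A minimal sketch of the random-pixel coarse encoding just described, assuming each 30 x 32 output pixel is drawn from the corresponding 4 x 4 block of the 120 x 128 image (120/30 = 128/32 = 4); the random array stands in for a real camera frame.

```python
import numpy as np

rng = np.random.default_rng(0)
high_res = rng.integers(0, 256, size=(120, 128))         # stand-in camera image

# One randomly chosen pixel per 4x4 block gives the 30x32 coarse image.
coarse = np.empty((30, 32), dtype=high_res.dtype)
for i in range(30):
    for j in range(32):
        di, dj = rng.integers(0, 4, size=2)               # random offset inside the block
        coarse[i, j] = high_res[4 * i + di, 4 * j + dj]

print(coarse.shape)   # (30, 32)
```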
The output from each output unit corresponds
to a particular steering direction, and the output
values of these units determine which steering
direction is recommended most strongly.
Case Study: ALVINN . . .
The large matrix of black and white boxes
depicts the weights from the 30 x 32 pixel
inputs into the hidden unit. Here, a white
box indicates a positive weight, a black box
a negative weight, and the size of the box
indicates the weight magnitude.
The smaller rectangular diagram directly
above the large matrix shows the weights
from this hidden unit to each of the 30
output units.
Case Study: Face Recognition
Application of a Backpropagation NN to learn the direction in which a person is facing in an image – left, right, straight,
up.
The learning task here involves classifying camera images of faces of various people in various
poses.
Case Study: Face Recognition . . . .
Data Collection
Images of 20 different people were collected, including approximately 32 images per person, varying:
• the person's expression (happy, sad, angry, neutral),
• the direction in which they were looking (left, right, straight ahead, up),
• whether or not they were wearing sunglasses,
• the background behind the person,
• the clothing worn by the person,
• the position of the person's face within the image.
In total, 624 greyscale images were collected, each with a resolution of 120 x 128, with each
image pixel described by a greyscale intensity value between 0 (black) and 255 (white).
Case Study: Face Recognition . . . .
Input Encoding
The 120 x 128 image is encoded into a coarse-resolution image of 30 x 32 pixels.
Each coarse resolution pixel intensity is calculated as the mean of the corresponding high-resolution
pixel intensities.
This 30 x 32 coarse-resolution image is used as input to the network.
Data Scaling - The pixel intensity values ranging from 0 to 255 were linearly scaled to range from
0 to 1 so that network inputs would have values in the same interval as the hidden unit and output
unit activations.
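A minimal sketch of this input encoding, assuming each coarse pixel corresponds to a 4 x 4 block of the 120 x 128 image (120/30 = 128/32 = 4): the block mean is taken and the result is linearly scaled from 0..255 to 0..1.

```python
import numpy as np

def encode_image(img_120x128):
    """Coarse 30x32 encoding: mean of each 4x4 block, scaled to [0, 1]."""
    blocks = img_120x128.reshape(30, 4, 32, 4)   # split rows/columns into 4x4 blocks
    coarse = blocks.mean(axis=(1, 3))            # mean intensity of each block
    return coarse / 255.0                        # linear scaling 0..255 -> 0..1

x = encode_image(np.random.randint(0, 256, size=(120, 128)))
print(x.shape, x.min(), x.max())                 # (30, 32), values in [0, 1]
```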
Output Encoding
1-of-n output encoding is used (one output neuron per class).
Each output neuron produces a real-valued number between 0.1 and 0.9.
The NN's prediction is the class of the output neuron with the highest value.
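A minimal sketch of this 1-of-n encoding and decoding; the class order in DIRECTIONS is an assumption, and the 0.1/0.9 target values follow the description above.

```python
import numpy as np

DIRECTIONS = ["left", "right", "straight", "up"]   # assumed class order

def encode_target(direction):
    """1-of-n target using 0.1/0.9 rather than 0/1 (sigmoid outputs never reach 0 or 1)."""
    t = np.full(len(DIRECTIONS), 0.1)
    t[DIRECTIONS.index(direction)] = 0.9
    return t

def decode_output(outputs):
    """The prediction is the class of the output unit with the highest value."""
    return DIRECTIONS[int(np.argmax(outputs))]

print(encode_target("up"))                     # [0.1 0.1 0.1 0.9]
print(decode_output([0.2, 0.7, 0.15, 0.1]))    # right
```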
Case Study: Face Recognition . . . .
Network Graph Structure
The Backpropagation network is an acyclic directed graph of sigmoid units.
It is a feedforward network.
It is a fully connected network.
A 2-layer NN with 960 input neurons and 4 output neurons (2899 weights for the 3-hidden-neuron network).
Experimentation was done with
• 3 hidden neurons – produced a model with an accuracy of 90% (less training time),
• up to 30 hidden neurons – produced models with accuracies of 91%–92% (more training
time).
Using 260 training images, the training time on a Sun Sparc5 workstation was approximately
• 5 minutes for the 3 hidden unit network,
• 1 hour for the 30 hidden unit network.
Case Study: Face Recognition . . . .
In these learning experiments the
• Learning rate η was set to 0.3,
• Momentum α was set to 0.3.
Lower values for both parameters produced roughly equivalent generalization accuracy, but longer
training times.
If these values are set too high, training fails to converge to a network with acceptable error over
the training set.
Full gradient descent was used in all these experiments (in contrast to the stochastic
approximation to gradient descent).
Input unit weights were initialized to zero.
Network weights in the output units were initialized to small random values.
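A minimal sketch, under the settings above, of one epoch of full (batch) gradient descent with momentum for a 960-3-4 sigmoid network: input-side weights start at zero, output-side weights at small random values, and the gradient is accumulated over all training examples before a single weight update. The dummy data and the exact initialization range are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

n_in, n_hidden, n_out = 960, 3, 4
eta, alpha = 0.3, 0.3                                   # learning rate and momentum

# Input-side weights zero, output-side weights small random (last column = bias).
W1 = np.zeros((n_hidden, n_in + 1))
W2 = rng.uniform(-0.05, 0.05, (n_out, n_hidden + 1))
dW1_prev, dW2_prev = np.zeros_like(W1), np.zeros_like(W2)

def full_gradient_epoch(X, T, W1, W2, dW1_prev, dW2_prev):
    """One epoch of full (batch) gradient descent with momentum."""
    gW1, gW2 = np.zeros_like(W1), np.zeros_like(W2)
    for x, t in zip(X, T):
        xb = np.append(x, 1.0)
        h = sigmoid(W1 @ xb)
        hb = np.append(h, 1.0)
        o = sigmoid(W2 @ hb)
        delta_o = o * (1 - o) * (t - o)
        delta_h = h * (1 - h) * (W2[:, :-1].T @ delta_o)
        gW2 += np.outer(delta_o, hb)                    # accumulate over all examples
        gW1 += np.outer(delta_h, xb)
    dW1 = eta * gW1 + alpha * dW1_prev
    dW2 = eta * gW2 + alpha * dW2_prev
    return W1 + dW1, W2 + dW2, dW1, dW2

# Dummy data standing in for encoded 30x32 images and 1-of-n targets.
X = rng.random((10, n_in))
T = np.tile([0.1, 0.1, 0.1, 0.9], (10, 1))
W1, W2, dW1_prev, dW2_prev = full_gradient_epoch(X, T, W1, W2, dW1_prev, dW2_prev)
```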
Case Study: Face Recognition . . . .
Number of training iterations was selected by partitioning the available data into a training set
and a separate validation set.
Gradient descent was used to minimize the error over the training set, and after every 50 gradient
descent steps the performance of the network was evaluated over the validation set.
The final selected network was the one with the highest accuracy over the validation set.
The final reported accuracy was measured over a separate test dataset.
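A minimal sketch of this stopping criterion; net, gradient_step and accuracy are hypothetical stand-ins for the actual network and training routines, not functions defined in the slides.

```python
import copy

def train_with_validation(net, gradient_step, accuracy, train_data, val_data,
                          total_steps=5000, check_every=50):
    """Sketch: run gradient descent, evaluate on the validation set every
    `check_every` steps, and keep the network snapshot with the best
    validation accuracy. `net`, `gradient_step` and `accuracy` are
    hypothetical stand-ins for the real network and routines."""
    best_net, best_acc = copy.deepcopy(net), accuracy(net, val_data)
    for step in range(1, total_steps + 1):
        gradient_step(net, train_data)                  # one gradient-descent step
        if step % check_every == 0:
            acc = accuracy(net, val_data)
            if acc > best_acc:
                best_net, best_acc = copy.deepcopy(net), acc
    return best_net, best_acc    # final accuracy is then measured on a held-out test set
```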
Case Study: Face Recognition . . . .
(Figure: visualization of the learned hidden-unit weights, with large positive and large negative weights highlighted.)