Multi Layer Perceptron Annotated
FALL 2021-2022
Assoc. Prof. Yusuf Yaslan & Assist. Prof. Ayşe Tosun
Lecture Notes from Alpaydın 2010 Introduction to Machine Learning 2e © The MIT Press and the
Coursera Introduction to Machine Learning Course by Duke University
Introduction
• Artificial Neural Networks take their inspiration from the brain.
• Our aim is not to understand the brain per se but to build useful
machines.
The Seasons of Neural Networks
This slide is adopted from the Coursera Introduction to Machine Learning Course by Duke University.
Perceptron

$$y = \sum_{j=1}^{d} w_j x_j + w_0 = \mathbf{w}^T \mathbf{x}$$
$$\mathbf{w} = [w_0, w_1, \ldots, w_d]^T$$
$$\mathbf{x} = [1, x_1, \ldots, x_d]^T$$
(Rosenblatt, 1962)
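As a concrete illustration (not part of the original slides), here is a minimal NumPy sketch of this linear output with the bias w_0 absorbed into the weight vector through a constant input x_0 = +1; the numbers and variable names are made up for the example.

```python
import numpy as np

def perceptron_output(w, x):
    """Linear perceptron output y = w^T x, where x already includes x_0 = +1."""
    return np.dot(w, x)

# Illustrative weights [w0, w1, w2] and an augmented input [1, x1, x2]
w = np.array([-0.5, 1.0, 2.0])
x = np.array([1.0, 0.3, 0.4])
print(perceptron_output(w, x))   # -0.5 + 1.0*0.3 + 2.0*0.4 = 0.6
```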
What a Perceptron Does
• Regression: y=wx+w0
• Classification: y=1(wx+w0>0)
[Figure: perceptron diagrams for regression, classification, and the sigmoid output; the weights w and bias w_0 are shown, with the bias implemented as a weight on the constant input x0=+1.]
$$y = \text{sigmoid}(o) = \frac{1}{1 + \exp(-\mathbf{w}^T \mathbf{x})}$$
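A hedged sketch (illustrative values, not from the slides) showing the same unit used in the three ways above: the raw linear output for regression, a threshold for classification, and the sigmoid as a posterior probability estimate.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

w = np.array([-1.0, 2.0])       # [w0, w]
x = np.array([1.0, 0.8])        # x0 = +1 carries the bias

o = np.dot(w, x)                # regression:      y = wx + w0
y_class = 1 if o > 0 else 0     # classification:  y = 1(wx + w0 > 0)
y_prob = sigmoid(o)             # probability:     sigmoid(w^T x)
print(o, y_class, y_prob)
```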
K Outputs

• Regression:
$$y_i = \sum_{j=1}^{d} w_{ij} x_j + w_{i0} = \mathbf{w}_i^T \mathbf{x} \qquad \mathbf{y} = W\mathbf{x}$$

• Classification:
$$o_i = \mathbf{w}_i^T \mathbf{x} \qquad y_i = \frac{\exp o_i}{\sum_k \exp o_k}$$
choose $C_i$ if $y_i = \max_k y_k$
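A minimal sketch (the weight matrix and input are illustrative) of K parallel perceptrons sharing one input, with softmax turning the K linear outputs into class posteriors and the prediction being the class with the largest output.

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())      # shift for numerical stability
    return e / e.sum()

# K = 3 outputs, d = 2 inputs; each row of W is [w_i0, w_i1, w_i2]
W = np.array([[ 0.1,  1.0, -0.5],
              [-0.2,  0.3,  0.8],
              [ 0.0, -1.0,  0.4]])
x = np.array([1.0, 0.5, 0.5])    # x_0 = +1

o = W @ x                        # o_i = w_i^T x
y = softmax(o)                   # y_i = exp(o_i) / sum_k exp(o_k)
print(y, "-> choose C_%d" % np.argmax(y))
```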
Training
• Online (instances seen one by one) vs batch (whole
sample) learning:
• No need to store the whole sample
• Problem may change in time
• Wear and degradation in system components
• Stochastic gradient-descent: Update after a single
pattern
• Generic update rule (LMS rule):
$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$
Update = LearningFactor · (DesiredOutput − ActualOutput) · Input
Training a Perceptron: Regression
• Regression (Linear output):
$$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = \frac{1}{2}\left(r^t - y^t\right)^2 = \frac{1}{2}\left(r^t - \mathbf{w}^T\mathbf{x}^t\right)^2$$
$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$
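A hedged sketch of this online training rule on synthetic 1-D data (the data, learning rate, and epoch count are illustrative): each instance is seen one at a time and the weights move by η(r − y)x.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative): r = 2x + 1 + noise
X = np.c_[np.ones(100), rng.uniform(-1, 1, 100)]    # prepend x_0 = +1
r = 2 * X[:, 1] + 1 + 0.1 * rng.standard_normal(100)

w = np.zeros(2)
eta = 0.1
for epoch in range(50):
    for x_t, r_t in zip(X, r):           # online: one pattern at a time
        y_t = w @ x_t                    # linear output
        w += eta * (r_t - y_t) * x_t     # Delta w_j = eta (r - y) x_j
print(w)                                 # should approach roughly [1, 2]
```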
Classification
• Single sigmoid output
$$y^t = \text{sigmoid}(\mathbf{w}^T\mathbf{x}^t)$$
$$E^t(\mathbf{w} \mid \mathbf{x}^t, r^t) = -r^t \log y^t - (1 - r^t)\log(1 - y^t)$$
$$\Delta w_j^t = \eta\,(r^t - y^t)\,x_j^t$$
• K>2 softmax outputs
$$y_i^t = \frac{\exp \mathbf{w}_i^T\mathbf{x}^t}{\sum_k \exp \mathbf{w}_k^T\mathbf{x}^t} \qquad E^t(\{\mathbf{w}_i\}_i \mid \mathbf{x}^t, \mathbf{r}^t) = -\sum_i r_i^t \log y_i^t$$
$$\Delta w_{ij}^t = \eta\,(r_i^t - y_i^t)\,x_j^t$$
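A hedged sketch of online training of a single sigmoid output with the cross-entropy gradient above; the toy data and hyperparameters are illustrative. Note the update has the same form as in regression, only y is now the sigmoid output.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Toy two-class data (illustrative): label = 1 when x1 + x2 > 0
X = np.c_[np.ones(200), rng.uniform(-1, 1, (200, 2))]   # x_0 = +1
r = (X[:, 1] + X[:, 2] > 0).astype(float)

w = np.zeros(3)
eta = 0.5
for epoch in range(100):
    for x_t, r_t in zip(X, r):
        y_t = sigmoid(w @ x_t)
        w += eta * (r_t - y_t) * x_t     # cross-entropy gradient step
print(w, np.mean((sigmoid(X @ w) > 0.5) == r))   # weights and training accuracy
```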
Learning Boolean AND
XOR
x1 XOR x2 = (x1 AND ~x2) OR (~x1 AND x2)
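XOR is not linearly separable, so a single perceptron cannot learn it, but the decomposition above can be realized with one hidden layer. A sketch with hand-picked, purely illustrative weights:

```python
import numpy as np

def step(o):
    return (np.asarray(o) > 0).astype(int)

# Hand-picked weights (illustrative); each vector is [w0, w1, w2]
w_h1 = np.array([-0.5,  1.0, -1.0])    # hidden unit 1: x1 AND NOT x2
w_h2 = np.array([-0.5, -1.0,  1.0])    # hidden unit 2: NOT x1 AND x2
v    = np.array([-0.5,  1.0,  1.0])    # output unit: OR of the hidden units

for x1 in (0, 1):
    for x2 in (0, 1):
        x = np.array([1.0, x1, x2])                    # x_0 = +1
        z = step([w_h1 @ x, w_h2 @ x])                 # hidden layer
        y = int(v @ np.concatenate(([1.0], z)) > 0)    # output layer
        print(x1, "XOR", x2, "=", y)
```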
The slides in this part are obtained from Nando Freitas' lecture notes.
Multilayer Perceptrons
$$y_i = \mathbf{v}_i^T \mathbf{z} = \sum_{h=1}^{H} v_{ih} z_h + v_{i0}$$
$$z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x}) = \frac{1}{1 + \exp\!\left[-\left(\sum_{j=1}^{d} w_{hj} x_j + w_{h0}\right)\right]}$$
$$\frac{\partial E}{\partial w_{hj}} = \frac{\partial E}{\partial y_i}\,\frac{\partial y_i}{\partial z_h}\,\frac{\partial z_h}{\partial w_{hj}}$$
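A minimal sketch (dimensions and initialization are illustrative) of the forward pass these equations describe, with the bias terms handled by prepending a constant +1 to the input and to the hidden vector.

```python
import numpy as np

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

def mlp_forward(W, V, x):
    """One-hidden-layer MLP forward pass.
    W: H x (d+1) first-layer weights, V: K x (H+1) second-layer weights,
    x: input of length d (the +1 bias inputs are prepended here)."""
    z = sigmoid(W @ np.concatenate(([1.0], x)))   # z_h = sigmoid(w_h^T x)
    y = V @ np.concatenate(([1.0], z))            # y_i = v_i^T z
    return y, z

rng = np.random.default_rng(0)
d, H, K = 4, 3, 2
W = rng.normal(scale=0.1, size=(H, d + 1))
V = rng.normal(scale=0.1, size=(K, H + 1))
print(mlp_forward(W, V, rng.standard_normal(d)))
```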
Regression
$$E(\mathbf{W}, \mathbf{v} \mid \mathcal{X}) = \frac{1}{2}\sum_t \left(r^t - y^t\right)^2 \qquad y^t = \sum_{h=1}^{H} v_h z_h^t + v_0$$
$$\Delta v_h = \eta \sum_t (r^t - y^t)\, z_h^t$$
Forward: $z_h = \text{sigmoid}(\mathbf{w}_h^T \mathbf{x})$
Backward:
$$\Delta w_{hj} = -\eta \frac{\partial E}{\partial w_{hj}} = -\eta \sum_t \frac{\partial E}{\partial y^t}\frac{\partial y^t}{\partial z_h^t}\frac{\partial z_h^t}{\partial w_{hj}} = \eta \sum_t (r^t - y^t)\, v_h\, z_h^t (1 - z_h^t)\, x_j^t$$
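A hedged sketch of backpropagation for this single-output regression case, applied online on a toy target (the data, network size, learning rate, and epoch count are illustrative). Both deltas are computed from the same forward pass before the weights are changed. The multiple-output case on the next slide only replaces the error term in Δw_hj by a sum of the backpropagated errors over the outputs i.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(o):
    return 1.0 / (1.0 + np.exp(-o))

# Toy 1-D regression target (illustrative): r = sin(2x)
X = rng.uniform(-1, 1, (200, 1))
r = np.sin(2 * X[:, 0])

d, H = 1, 8
W = rng.normal(scale=0.5, size=(H, d + 1))     # hidden-layer weights w_hj
v = rng.normal(scale=0.5, size=H + 1)          # output weights v_h (v_0 is the bias)
eta = 0.1

for epoch in range(500):
    for x_t, r_t in zip(X, r):                           # stochastic (per-pattern) updates
        x1 = np.concatenate(([1.0], x_t))                # x_0 = +1
        z = sigmoid(W @ x1)                              # forward: hidden units z_h
        z1 = np.concatenate(([1.0], z))
        y = v @ z1                                       # forward: linear output y
        err = r_t - y
        dv = eta * err * z1                              # Delta v_h  = eta (r - y) z_h
        dW = eta * err * np.outer(v[1:] * z * (1 - z), x1)  # Delta w_hj = eta (r - y) v_h z_h (1 - z_h) x_j
        v += dv
        W += dW

# Training error after learning
Z = sigmoid(np.c_[np.ones(len(X)), X] @ W.T)
print(np.mean((np.c_[np.ones(len(X)), Z] @ v - r) ** 2))
```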
Regression with Multiple Outputs
$$E(\mathbf{W}, \mathbf{V} \mid \mathcal{X}) = \frac{1}{2}\sum_t \sum_i \left(r_i^t - y_i^t\right)^2 \qquad y_i^t = \sum_{h=1}^{H} v_{ih} z_h^t + v_{i0}$$
$$\Delta v_{ih} = \eta \sum_t (r_i^t - y_i^t)\, z_h^t$$
$$\Delta w_{hj} = \eta \sum_t \left[\sum_i (r_i^t - y_i^t)\, v_{ih}\right] z_h^t (1 - z_h^t)\, x_j^t$$
MLP with Two Hidden Layers

Momentum:
$$\Delta w_i^t = -\eta \frac{\partial E^t}{\partial w_i} + \alpha\, \Delta w_i^{t-1}$$
Improving Convergence
Adaptive learning rate: In gradient descent, the learning factor
η determines the magnitude of change to be made in the
parameter. It is generally taken between 0.0 and 1.0, mostly
less than or equal to 0.2. It can be made adaptive for faster
convergence, where it is kept large when learning takes place
and is decreased when learning slows down
$$\Delta\eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\,\eta & \text{otherwise} \end{cases}$$
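A small sketch of this heuristic (the constants a and b and the error sequence are made up for illustration): η grows additively while the error keeps falling and is cut multiplicatively when it rises.

```python
def adapt_learning_rate(eta, err_prev, err_curr, a=0.01, b=0.5):
    """Illustrative constants a and b: increase eta additively while the
    error keeps decreasing, cut it multiplicatively when the error rises."""
    if err_curr < err_prev:
        return eta + a            # Delta eta = +a
    return eta - b * eta          # Delta eta = -b * eta

# Pretend per-epoch training errors from some run (illustrative)
errors = [1.00, 0.80, 0.65, 0.70, 0.55]
eta = 0.1
for e_prev, e_curr in zip(errors, errors[1:]):
    eta = adapt_learning_rate(eta, e_prev, e_curr)
    print(round(eta, 3))
```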
Overfitting/Overtraining
• We know from previous chapters that an overcomplex model
memorizes the noise in the training set and does not generalize to
the validation set.
• Similarly in an MLP, when the number of hidden units is large, the
generalization accuracy deteriorates
Overfitting/Overtraining
Number of weights: H (d+1)+(H+1)K
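For example, an MLP with d = 10 inputs, H = 5 hidden units, and K = 3 outputs has 5·(10+1) + (5+1)·3 = 55 + 18 = 73 weights.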
Overfitting/Overtraining
• A similar behavior happens when training is continued too long: as more
training epochs are made, the error on the training set decreases, but the
error on the validation set starts to increase beyond a certain point.
• Early stopping: Learning should be stopped early to alleviate this
problem of overtraining.
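A minimal sketch of early stopping (the helper callables train_epoch and val_error, the patience rule, and the simulated error curve are all assumptions for illustration): training stops once the validation error has not improved for a given number of epochs.

```python
import numpy as np

def train_with_early_stopping(train_epoch, val_error, max_epochs=60, patience=5):
    """Illustrative early-stopping loop. train_epoch and val_error are assumed
    callables supplied by the caller: one runs a single training epoch, the
    other returns the current validation error."""
    best_err, best_epoch, wait = np.inf, 0, 0
    for epoch in range(max_epochs):
        train_epoch()
        err = val_error()
        if err < best_err:
            best_err, best_epoch, wait = err, epoch, 0   # validation still improving
        else:
            wait += 1
            if wait >= patience:                         # no improvement: stop early
                break
    return best_epoch, best_err

# Simulated validation curve: falls, then rises again (overtraining)
errs = iter(list(np.linspace(1.0, 0.2, 30)) + list(np.linspace(0.25, 0.6, 30)))
print(train_with_early_stopping(lambda: None, lambda: next(errs)))
```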
Tuning the Network Size
• To find the optimal network size, the most common approach is to
try many different architectures, train them all on the training set,
and choose the one that generalizes best to the validation set.
• Another approach is to incorporate this structural adaptation into
the learning algorithm.
• In the destructive approach, we start with a large network and
gradually remove units and/or connections that are not necessary.
• In the constructive approach, we start with a small network and
gradually add units and/or connections to improve performance.
Tuning the Network Size
• Destructive: weight decay
• Constructive: growing networks

Weight decay:
$$\Delta w_i = -\eta \frac{\partial E}{\partial w_i} - \lambda w_i$$
$$E' = E + \frac{\lambda}{2}\sum_i w_i^2$$
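A short sketch of one weight-decay update (η, λ, and the weights are illustrative); with a zero error gradient the penalty alone shrinks every weight toward zero, which is what prunes unnecessary connections.

```python
import numpy as np

def weight_decay_step(w, grad_E, eta=0.1, lam=0.01):
    """One gradient step with weight decay (eta and lam are illustrative):
    Delta w_i = -eta * dE/dw_i - lam * w_i,
    i.e. gradient descent on the penalized error E' = E + (lam/2) * sum_i w_i^2."""
    return w - eta * grad_E - lam * w

# With a zero error gradient, the penalty alone shrinks the weights each step
w = np.array([1.0, -2.0, 0.5])
for _ in range(3):
    w = weight_decay_step(w, grad_E=np.zeros_like(w))
print(w)   # each step multiplies the weights by (1 - lam)
```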
Dimensionality Reduction
Learning Time
• Applications:
• Sequence recognition: Speech recognition
• Sequence reproduction: Time-series prediction
• Sequence association
• Network architectures
• Time-delay networks (Waibel et al., 1989)
• Recurrent networks (Rumelhart et al., 1986)
Recurrent Networks
Recurrent Networks
If the sequences have a small maximum length, then unfolding in
time can be used to convert an arbitrary recurrent network to an
equivalent feedforward network.