Neural Networks and
Fuzzy Systems
Multi-layer Feed forward Networks
Dr. Tamer Ahmed Farrag
Course No.: 803522-3
Course Outline
Part I : Neural Networks (11 weeks)
• Introduction to Machine Learning
• Fundamental Concepts of Artificial Neural Networks
(ANN)
• Single-layer Perceptron Classifier
• Multi-layer Feed forward Networks
• Single layer FeedBack Networks
• Unsupervised learning
Part II : Fuzzy Systems (4 weeks)
• Fuzzy set theory
• Fuzzy Systems
2
Outline
• Why do we need Multi-layer Feed forward Networks
(MLFF)?
• Error Function (or Cost Function or Loss function)
• Gradient Descent
• Backpropagation
3
Why do we need Multi-layer Feed forward
Networks (MLFF)?
• Overcoming the failure of the single-layer perceptron in
solving nonlinear problems.
• First Suggestion:
• Divide the problem space into smaller linearly separable
regions
• Use a perceptron for each linearly separable region
• Combine the outputs of multiple hidden neurons to
produce a final decision neuron.
4
Region 1
Region 2
Why do we need Multi-layer Feed forward
Networks (MLFF)?
• Second suggestion
• In some cases we need a curved decision boundary, or we want to solve
more complicated classification and regression problems.
• So, we need to:
• Add more layers.
• Increase the number of neurons in each layer.
• Use a non-linear activation function in
the hidden layers.
• So, we need Multi-layer Feed forward Networks (MLFF).
5
Notation for Multi-Layer Networks
• Dealing with multi-layer networks is easy if a sensible notation is adopted.
• We simply need another label (n) to tell us which layer in the network we
are dealing with.
• Each unit $j$ in layer $n$ receives activations $out_i^{(n-1)} w_{ij}^{(n)}$ from the previous
layer of processing units and sends activations $out_j^{(n)}$ to the next layer of
units.
6
[Figure: units 1–3 of layer (n−1) connected to units 1–2 of layer (n) by weights $w_{ij}^{(n)}$; the first pair of layers is labelled layer (0) and layer (1) with weights $w_{ij}^{(1)}$]
ANN Representation
(1 input layer + 1 hidden layer + 1 output layer)
7
for example:
$$z_1^{(1)} = w_{11}^{(1)} x_1 + w_{21}^{(1)} x_2 + w_{31}^{(1)} x_3 + b_1^{(1)}$$
$$a_1^{(1)} = f\big(z_1^{(1)}\big) = \sigma\big(z_1^{(1)}\big)$$
$$z_2^{(2)} = w_{12}^{(2)} a_1^{(1)} + w_{22}^{(2)} a_2^{(1)} + w_{32}^{(2)} a_3^{(1)} + b_2^{(2)}$$
$$y_2 = a_2^{(2)} = f\big(z_2^{(2)}\big) = \sigma\big(z_2^{(2)}\big)$$
In general:
$$z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}, \qquad a_j^{(l)} = f\big(z_j^{(l)}\big) = \sigma\big(z_j^{(l)}\big)$$
[Figure: network with inputs $x_1, x_2, x_3$ as layer (0) activations $a_i^{(0)}$, three hidden units $(z_i^{(1)}, a_i^{(1)})$ in layer (1), and two output units $(z_j^{(2)}, a_j^{(2)})$ with outputs $y_1, y_2$ in layer (2), connected by weights $w_{ij}^{(1)}$ and $w_{ij}^{(2)}$]
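To make the notation concrete, here is a minimal NumPy sketch of the forward pass for a 3-input, 3-hidden-unit, 2-output network like the one above; the random initial weights and the use of the sigmoid at every layer are illustrative assumptions, not part of the slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes: 3 inputs, 3 hidden units (layer 1), 2 outputs (layer 2)
rng = np.random.default_rng(0)
W1 = rng.normal(size=(3, 3))   # W1[i, j] = w_ij^(1)
b1 = np.zeros(3)               # b_j^(1)
W2 = rng.normal(size=(3, 2))   # W2[i, j] = w_ij^(2)
b2 = np.zeros(2)               # b_j^(2)

x = np.array([0.5, -1.0, 2.0])  # a^(0)

# z_j^(l) = sum_i w_ij^(l) a_i^(l-1) + b_j^(l),  a_j^(l) = sigma(z_j^(l))
z1 = x @ W1 + b1
a1 = sigmoid(z1)
z2 = a1 @ W2 + b2
a2 = sigmoid(z2)                # y = a^(2)
print("network output y =", a2)
```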
Gradient Descent
and Backpropagation
Error Function
● How can we evaluate the performance of a neuron?
● We can use an error function (also called a cost function or
loss function) to measure how far off we are from
the expected value.
● Choosing an appropriate error function helps the
learning algorithm reach the best values for the
weights and biases.
● We’ll use the following variables:
○ D to represent the true value (desired value)
○ y to represent the neuron’s prediction
9
Error Functions
(Cost Function or Loss Function)
• There are many formulas for error functions.
• In this course, we will deal with two error function
formulas.
Sum Squared Error (SSE), for a single perceptron $e_{pj} = (y_j - D_j)^2$, and over all outputs:
$$E_{SSE} = \sum_{j=1}^{n} (y_j - D_j)^2 \qquad (1)$$
Cross entropy (CE):
$$E_{CE} = -\frac{1}{n}\sum_{j=1}^{n} \big[\,D_j \ln(y_j) + (1 - D_j)\ln(1 - y_j)\,\big] \qquad (2)$$
10
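A small sketch of the two error functions of equations (1) and (2); the NumPy array inputs and the clipping used to avoid ln(0) are assumptions for illustration.

```python
import numpy as np

def sse(y, D):
    """Sum Squared Error, equation (1): sum_j (y_j - D_j)^2."""
    return np.sum((y - D) ** 2)

def cross_entropy(y, D, eps=1e-12):
    """Cross entropy, equation (2), averaged over the n outputs."""
    y = np.clip(y, eps, 1.0 - eps)   # keep ln() away from 0
    return -np.mean(D * np.log(y) + (1.0 - D) * np.log(1.0 - y))

y = np.array([0.9, 0.2, 0.6])   # predictions
D = np.array([1.0, 0.0, 1.0])   # desired values
print(sse(y, D), cross_entropy(y, D))
```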
Why does the error in an ANN occur?
• Each weight and bias in the network contributes to
the error.
• To solve this we need:
• A cost function or error function to compute the error
(the SSE or CE error function).
• An optimization algorithm to minimize the error
function (Gradient Descent).
• A learning algorithm to modify the weights and biases to
new values that bring the error down (Backpropagation).
• Repeat this process until the best solution is found.
11
Gradient Descent (in 1 dimension)
• Assume we have an error function E and we need to
use it to update one weight w.
• The figure shows the error function in terms of w.
• Our target is to learn the value of w that produces the
minimum value of E.
How?
12
[Figure: error curve E versus w, with its minimum marked]
Gradient Descent (in 1 dimension)
• In the Gradient Descent algorithm, we use the following
equation to get a better value of w:
$$w = w - \alpha\,\Delta w \quad \text{(the Delta rule)}$$
Where:
α is the learning rate
Δw is computed using the derivative of E with respect to w, $\frac{dE}{dw}$, so:
$$w = w - \alpha\,\frac{dE}{dw} \qquad (3)$$
13
[Figure: error curve E versus w, with its minimum marked]
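A minimal sketch of equation (3) for a single weight, assuming a toy error function E(w) = (w − 2)² purely so the loop has something to minimize.

```python
def dE_dw(w):
    # derivative of the assumed error E(w) = (w - 2)**2
    return 2.0 * (w - 2.0)

w = 10.0          # initial weight
alpha = 0.1       # learning rate
for step in range(50):
    w = w - alpha * dE_dw(w)   # equation (3): w = w - alpha * dE/dw
print(w)          # approaches the minimum at w = 2
```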
Local Minima problem
14
Choosing learning rate
15
Gradient Descent (multi dimension)
• In an ANN with many layers and many neurons in each layer, the
error function is a multi-variable function.
• So, the derivative in equation (3) becomes a partial derivative:
$$w_{ij} = w_{ij} - \alpha\,\frac{\partial E_j}{\partial w_{ij}} \qquad (4)$$
• For short, we write equation (4) as:
$$w_{ij} = w_{ij} - \alpha\,\partial w_{ij}$$
• The same process is used to get the
new bias value:
$$b_j = b_j - \alpha\,\partial b_j$$
16
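Equation (4) applied element-wise to a whole layer, as a sketch; here dE_dW and dE_db are assumed to be arrays already holding the partial derivatives ∂E/∂w_ij and ∂E/∂b_j, however they were obtained.

```python
import numpy as np

def update_parameters(W, b, dE_dW, dE_db, alpha=0.1):
    """Equation (4) for every weight and bias of one layer:
    w_ij <- w_ij - alpha * dE/dw_ij,  b_j <- b_j - alpha * dE/db_j."""
    return W - alpha * dE_dW, b - alpha * dE_db

# toy usage with arbitrary gradient values
W = np.ones((3, 2)); b = np.zeros(2)
W, b = update_parameters(W, b, dE_dW=np.full((3, 2), 0.5), dE_db=np.array([0.1, -0.2]))
```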
Derivatives of activation functions
17
[Table: common activation functions and their derivatives; for the Sigmoid, $\sigma(z) = \frac{1}{1 + e^{-z}}$ and $\sigma'(z) = \sigma(z)\,(1 - \sigma(z))$]
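The sigmoid row of that table written out as code; the derivative σ'(z) = σ(z)(1 − σ(z)) is the identity used on the next slide.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # sigma'(z) = sigma(z) * (1 - sigma(z))
```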
Learning Rule in the output layer
(using SSE as the error function and sigmoid as the activation
function)
$$\frac{\partial E_j}{\partial w_{ij}} = \frac{\partial E_j}{\partial a_j^{(l)}} \cdot \frac{\partial a_j^{(l)}}{\partial z_j^{(l)}} \cdot \frac{\partial z_j^{(l)}}{\partial w_{ij}^{(l)}}$$
Where:
$$E_j = (y_j - D_j)^2, \qquad y_j = a_j^{(l)} = f\big(z_j^{(l)}\big) = \sigma\big(z_j^{(l)}\big), \qquad z_j^{(l)} = \sum_i w_{ij}^{(l)} a_i^{(l-1)} + b_j^{(l)}$$
From the previous table:
$$\sigma'\big(z_j^{(l)}\big) = \sigma\big(z_j^{(l)}\big)\,\Big(1 - \sigma\big(z_j^{(l)}\big)\Big) = y_j\,(1 - y_j)$$
18
Learning Rule in the output layer (cont.)
So (How?),
$$\frac{\partial y_j}{\partial z_j} = y_j\,(1 - y_j), \qquad \frac{\partial z_j}{\partial w_{ij}} = a_i^{(l-1)}, \qquad \frac{\partial E_j}{\partial y_j} = 2\,(y_j - D_j)$$
• Then:
$$\frac{\partial E_j}{\partial w_{ij}} = 2\,a_i^{(l-1)}\,(y_j - D_j)\,y_j\,(1 - y_j)$$
$$w_{ij} = w_{ij} - 2\,\alpha\,a_i^{(l-1)}\,(y_j - D_j)\,y_j\,(1 - y_j)$$
19
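The final update rule of this slide as a sketch for a whole output layer at once, assuming SSE error, sigmoid outputs, and NumPy arrays; a_prev stands for the activations $a_i^{(l-1)}$ feeding the output layer, and applying the same rule to the biases is an assumption, not written out on the slide.

```python
import numpy as np

def output_layer_update(W, b, a_prev, y, D, alpha=0.1):
    """w_ij = w_ij - 2*alpha * a_i^(l-1) * (y_j - D_j) * y_j * (1 - y_j)."""
    delta = 2.0 * (y - D) * y * (1.0 - y)        # dE_j/dz_j for SSE + sigmoid
    W_new = W - alpha * np.outer(a_prev, delta)  # one gradient term per w_ij
    b_new = b - alpha * delta                    # same rule assumed for the biases
    return W_new, b_new
```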
Learning Rule in the Hidden layer
• Now we have to determine the appropriate
weight change for an input-to-hidden weight.
• This is more complicated because it depends on
the error at all of the nodes that this weighted
connection can lead to.
• The mathematical proof is beyond the scope of this course.
20
Gradient Descent (Notes)
Note 1:
• The neuron activation function (f) should be a defined
and differentiable function.
Note 2:
• The previous calculation is repeated for every
weight and every bias in the ANN.
• So, we need substantial computational power (what about
deeper networks?).
Note 3:
• Calculating 𝜕𝑤𝑖𝑗 for the hidden layers will be
more difficult (Why?).
21
Gradient Descent (Notes)
• 𝜕𝑤𝑖𝑗 represents the change in the value of 𝑤𝑖𝑗
needed to get a better output.
• The equation for 𝜕𝑤𝑖𝑗 depends on the choice of
the error (cost) function and the activation function.
• The Gradient Descent algorithm helps to calculate the
new values of the weights and biases.
• Question: is one iteration (one trial) enough to
get the best values for the weights and biases?
• Answer: No, we need an extended procedure:
Backpropagation
22
How Does Backpropagation Work?
23
[Figure: a 3–2–1 network (layer 0, layer 1, layer 2) with weights $w_{ij}^{(1)}$ and $w_{ij}^{(2)}$ and output $y$; forward propagation runs left to right, back propagation runs right to left, updating each weight, e.g. $w_{11}^{(1)} = w_{11}^{(1)} - \alpha\,\partial w_{11}^{(1)}$ and $w_{11}^{(2)} = w_{11}^{(2)} - \alpha\,\partial w_{11}^{(2)}$]
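Putting the two passes together, here is a minimal sketch of one backpropagation step for a small two-weight-layer network, under the same illustrative assumptions as before (SSE error, sigmoid activations everywhere); the hidden-layer delta shown here is the result whose derivation the earlier slide declared out of scope.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, D, W1, b1, W2, b2, alpha=0.1):
    # Forward propagation (layer 0 -> layer 1 -> layer 2)
    z1 = x @ W1 + b1;  a1 = sigmoid(z1)
    z2 = a1 @ W2 + b2; y  = sigmoid(z2)

    # Back propagation (layer 2 -> layer 1), SSE error E = sum (y - D)^2
    delta2 = 2.0 * (y - D) * y * (1.0 - y)       # dE/dz at the output layer
    delta1 = (delta2 @ W2.T) * a1 * (1.0 - a1)   # error pushed back to layer 1

    # Delta rule for every weight and bias
    W2 -= alpha * np.outer(a1, delta2); b2 -= alpha * delta2
    W1 -= alpha * np.outer(x, delta1);  b1 -= alpha * delta1
    return W1, b1, W2, b2
```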
Online Learning vs. Offline Learning
• Online: Pattern-by-Pattern
learning
• Error calculated for each
pattern
• Weights updated after each
individual pattern
$$\Delta w_{ij} = -\alpha\,\frac{\partial E_p}{\partial w_{ij}}$$
• Offline: Batch learning
• Error calculated for all
patterns
• Weights updated once at
the end of each epoch
$$\Delta w_{ij} = -\alpha \sum_{p} \frac{\partial E_p}{\partial w_{ij}}$$
24
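The two schemes side by side as a sketch; grad is assumed to be any routine returning ∂E_p/∂w for a single pattern p, and the details are for illustration only.

```python
def online_epoch(w, patterns, grad, alpha=0.1):
    """Pattern-by-pattern: update w after every single pattern."""
    for p in patterns:
        w = w - alpha * grad(w, p)
    return w

def batch_epoch(w, patterns, grad, alpha=0.1):
    """Batch: accumulate the gradients, update w once per epoch."""
    total = sum(grad(w, p) for p in patterns)
    return w - alpha * total
```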
Choosing Appropriate Activation and Cost
Functions
• We already know, from our consideration of single-layer networks, which
output activations and cost functions should be used for
particular problem types.
• We have also seen that non-linear hidden unit activations are
needed, such as sigmoids.
• So we can summarize the required network properties:
• Regression/ Function Approximation Problems
• SSE cost function, linear output activations, sigmoid hidden activations
• Classification Problems (2 classes, 1 output)
• CE cost function, sigmoid output and hidden activations
• Classification Problems (multiple-classes, 1 output per class)
• CE cost function, softmax outputs, sigmoid hidden activations
• In each case, application of the gradient descent learning
algorithm (by computing the partial derivatives) leads to
appropriate back-propagation weight update equations.
25
Overall picture: the learning process in an ANN
26
Neural network simulator
• Search the internet for a neural network simulator and
report on it.
For example:
• https://www.mladdict.com/neural-network-
simulator
• http://playground.tensorflow.org/
27