Machine Learning:
Neural Networks
Hajar Mousannif
[email protected]
Acknowledgment
This lecture is inspired by Prof. Andrew Ng’s
Machine Learning class on Coursera
2
Why do we need neural networks?
• Say we have a complex supervised learning
classification problem
– We can use logistic regression with many polynomial
terms
– It works well when you have 1-2 features
3
Why do we need neural networks?
• In a housing example with 100 house features, predict whether a
house will be sold in the next 6 months
– Here, if you included all the quadratic terms (second order)
• There are lots of them (x1², x1x2, x1x3, ..., x1x100)
• For the case of n = 100, you have about 5,000 features
• The number of features grows as O(n²)
• This would be computationally expensive to work with as a feature set
• If you include the cubic terms
– e.g. (x1²x2, x1x2x3, x1x4x23, etc.)
– The number of features grows even faster, as O(n³)
– About 170,000 features for n = 100! (a quick count is sketched after this list)
• Not a good way to build classifiers when n is large
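As a quick illustration of this growth (my own sketch, not from the slides), the number of distinct quadratic and cubic terms can be counted as combinations with repetition:

```python
from math import comb

def num_poly_terms(n, degree):
    """Number of distinct monomials of exactly `degree` in n variables
    (combinations with repetition: C(n + degree - 1, degree))."""
    return comb(n + degree - 1, degree)

n = 100
print(num_poly_terms(n, 2))  # 5050 quadratic terms  -> grows O(n^2)
print(num_poly_terms(n, 3))  # 171700 cubic terms    -> grows O(n^3)
```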
4
Example: Problems where n is large -
computer vision
Computer vision sees a matrix of pixel intensity values 5
Example: Problems where n is large -
computer vision
To build a car detector
• Build a training set of:
- Not cars
- Cars
• Then test against a car
6
Example: Problems where n is large -
computer vision
7
Example: Problems where n is large -
computer vision
• We need a non-linear hypothesis to separate the
classes
• Feature space:
– If we used 50 x 50 pixel greyscale images --> 2,500 pixels, so n = 2,500
– If RGB, then n = 7,500
– If 100 x 100 greyscale images, n = 10,000, and including all the quadratic terms gives roughly 50,000,000 features
• Too big - way too big
– Logistic regression here is not appropriate for large complex
systems
– Neural networks are much better for a complex nonlinear
hypothesis even when feature space is huge
8
Neurons and the brain
• Neural networks (NNs) were originally motivated
by looking at machines which replicate the brain's
functionality
• Build learning systems that mimic the brain
• Used a lot in the 80s and 90s. Popularity
diminished in late 90s
• Recent major resurgence
– NNs are computationally expensive, so only recently
large scale neural networks became
computationally feasible
9
Neurons and the brain
• Auditory cortex --> takes sound signals
• If you cut the wiring from the ear to the auditory cortex
• Re-route optic nerve to the auditory cortex
• Auditory cortex learns to see
• Brain learns by itself how to learn
The “one learning algorithm” hypothesis
10
Neurons and the brain
11
Model representation 1
• Three things to notice
– Cell body
– Number of input wires (dendrites)
– Output wire (axon)
• Simple level
– Neuron gets one or more inputs through
dendrites
– Does processing
– Sends output down axon
• Neurons communicate through electric
spikes
– Pulse of electricity via axon to another
neuron 12
Artificial neural network - representation
of a neuron
• In an artificial neural network, a neuron is a logistic unit
– Feed input via input wires
– Logistic unit does computation
– Sends output down output wires
• This is an artificial neuron with a sigmoid (logistic)
activation function
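A minimal sketch of such a logistic unit in Python/NumPy (the function name and the example numbers are illustrative, not from the lecture):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """One artificial neuron: weighted sum of the inputs, then the sigmoid.
    x includes the bias input x0 = 1 as its first element."""
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2])       # x0 (bias), x1, x2
theta = np.array([0.1, 2.0, -0.5])   # weights, including the bias weight
print(logistic_unit(x, theta))
```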
13
Artificial neural network – Model
representation
• Often good to include an x0 input - the bias unit (equal
to 1)
• Ɵ vector may also be called the weights of a model
• Below we have a group of neurons strung together
14
Artificial neural network – Model
representation
• Here, the inputs are x1, x2 and x3
– We could also call the inputs the activations of the first layer,
i.e. (a1(1), a2(1) and a3(1))
• First layer is the input layer
• Final layer is the output layer - produces value computed by a
hypothesis
• Middle layer(s) are called the hidden layers
15
Neural networks - notation
• ai(j) - activation of unit i in layer j
• Ɵ(j) - matrix of parameters controlling the function
mapping from layer j to layer j + 1
• If the network has sj units in layer j and sj+1 units in layer j + 1,
then Ɵ(j) will be of dimension [sj+1 x (sj + 1)]
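To make the dimension rule concrete, here is a small sketch (layer sizes chosen to match the example network shown later in these slides) that prints the shape of each Ɵ(j):

```python
# Layer sizes s1, s2, ..., sL (not counting bias units)
layer_sizes = [3, 5, 5, 4]

# Theta(j) maps layer j to layer j+1 and has shape [s_{j+1} x (s_j + 1)]
for j, (s_j, s_next) in enumerate(zip(layer_sizes, layer_sizes[1:]), start=1):
    print(f"Theta({j}) shape: {s_next} x {s_j + 1}")
# Theta(1): 5 x 4, Theta(2): 5 x 6, Theta(3): 4 x 6
```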
16
Neural networks – Model representation
The activation value on each hidden unit (e.g. a1(2)) is equal to the
sigmoid function applied to the linear combination of inputs
– Three input units
• So Ɵ(1) is the matrix of parameters governing the mapping of the input units to
hidden units
– Ɵ(1) here is a [3 x 4] dimensional matrix
– Three hidden units
• Then Ɵ(2) is the matrix of parameters governing the mapping of the hidden layer
to the output layer
– Ɵ(2) here is a [1 x 4] dimensional matrix (i.e. a row vector)
– One output unit
17
Neural networks – Model representation
• Ɵab(c)
– a = ranges from 1 to the
number of units in layer c+1
– b = ranges from 0 to the
number of units in layer c
– c is the layer you're moving
FROM
For example, Ɵ13(1) means:
1 - we're mapping to node 1 in layer 2
3 - we're mapping from node 3 in layer 1
(1) - we're mapping from layer 1
18
Neural networks - Exercise
Compute the activation values on each layer
19
Neural networks - Solution
Example of network, with the associated calculations :
20
Model Representation II
Objective: carry out the computation efficiently through a
vectorized implementation.
- Some additional terms:
z1(2) = Ɵ10(1)x0 + Ɵ11(1)x1 + Ɵ12(1)x2 + Ɵ13(1)x3
z2(2) = Ɵ20(1)x0 + Ɵ21(1)x1 + Ɵ22(1)x2 + Ɵ23(1)x3
z3(2) = Ɵ30(1)x0 + Ɵ31(1)x1 + Ɵ32(1)x2 + Ɵ33(1)x3
- Activation values become:
a1(2) = g(z1(2))
a2(2) = g(z2(2))
a3(2) = g(z3(2))
21
Model Representation II
• We can vectorize the computation of the
neural network as follows:
– z(2) = Ɵ(1)x
– a(2) = g(z(2))
• z(2) is a 3x1 vector, and a(2) is also a 3x1 vector
• g() applies the sigmoid (logistic) function
element-wise to each member of the z(2) vector
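A minimal NumPy sketch of this single vectorized step (the weight values are random, purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

Theta1 = np.random.randn(3, 4)        # Theta(1): 3 hidden units x (3 inputs + bias)
x = np.array([1.0, 0.2, -0.7, 1.5])   # [x0 = 1, x1, x2, x3]

z2 = Theta1 @ x     # z(2) = Theta(1) x   -> 3x1 vector
a2 = sigmoid(z2)    # a(2) = g(z(2))      -> sigmoid applied element-wise
```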
22
Model Representation II
• To make the notation for the input layer consistent:
– a(1) = x
• a(1) is the vector of activations in the input layer
• Obviously the "activation" for the input layer is just the
input!
– a(1) is the vector of inputs
– a(2) is the vector of values calculated by the g(z(2)) function
• We also need the bias unit a0(2) = 1 for the final
hypothesis calculation
23
Model Representation II
This process is called
forward propagation
– Start off with activations
of input unit
• i.e. the x vector as input
– Forward propagate and
calculate the activation
of each layer
sequentially
– The equations above give a
vectorized version of this
implementation
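A minimal sketch of the full forward-propagation loop, assuming sigmoid activations and a list of weight matrices (the function name and example sizes are mine):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, Thetas):
    """Propagate a single input vector x through the network.
    Thetas[l] maps layer l+1 to layer l+2 (0-indexed list)."""
    a = x
    for Theta in Thetas:
        a = np.concatenate(([1.0], a))  # add the bias unit a0 = 1
        a = sigmoid(Theta @ a)          # a(l+1) = g(Theta(l) a(l))
    return a                            # activations of the output layer

# Example: 3 inputs -> 3 hidden units -> 1 output
Thetas = [np.random.randn(3, 4), np.random.randn(1, 4)]
print(forward_propagate(np.array([0.5, -1.0, 2.0]), Thetas))
```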
24
Neural networks learning their own features
• Diagram below looks a lot like logistic regression
• Layer 3 is a logistic regression node. The hypothesis output =
g(Ɵ10(2)a0(2) + Ɵ11(2)a1(2) + Ɵ12(2)a2(2) + Ɵ13(2)a3(2))
• This is just logistic regression
– The only difference is that, instead of the input feature vector, the features
fed in are the values calculated by the hidden layer
25
Neural networks learning their own features
• The features a1(2), a2(2), and a3(2) are calculated/learned - not
original features
• The mapping from layer 1 to layer 2 (i.e. the calculations which
generate the a(2) features) is determined by another set of
parameters - Ɵ(1)
• So instead of being constrained by the original input
features, a neural network can learn its own features to
feed into logistic regression
• If we compare this to plain logistic regression, you
would have to design your own features to define the
best way to classify or describe something
26
Other architectures
• Other architectures (topologies) are possible:
– More/less nodes per layer
– More layers
27
Practice 1
Compute a1(3)
28
Practice 1 - Solution
Compute a1(3)
29
Practice 2
30
Practice 2 - Solution
31
Practice 3
32
Practice 3 - Solution
33
Neural Network example 1: AND function
• Ɵ10(1) = -30
• Ɵ11(1) = 20
• Ɵ12(1) = 20
• So hƟ(x) = g(-30 + 20x1 + 20x2) (checked numerically below)
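A small sketch verifying that these weights implement AND (the sigmoid saturates to roughly 0 or 1 for inputs of magnitude 10 or more):

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def and_unit(x1, x2):
    # h(x) = g(-30 + 20*x1 + 20*x2), using the weights above
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_unit(x1, x2)))  # prints the AND truth table: 0, 0, 0, 1
```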
34
Neural Network example 2: NOT function
• Ɵ10(1) = 10
• Ɵ11(1) = -20
• So hƟ(x) = g(10 - 20x1)
• Negation is achieved by putting a large
negative weight in front of the variable you
want to negate
35
Neural Network example 3: XNOR function
• XNOR is short for NOT XOR, i.e. NOT an exclusive or
• The XNOR truth table is:
x1 | x2 | XNOR
 0 |  0 |  1
 0 |  1 |  0
 1 |  0 |  0
 1 |  1 |  1
• Can you find a neural network representation of the XNOR
function?
• Hint: structure the network so that the inputs which produce a
positive output are:
– x1 AND x2 (i.e. both true)
– OR (NOT x1) AND (NOT x2) (i.e. both false)
36
Neural Network example 3: XNOR function
37
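The original slide shows the network as a figure; as a sketch of one standard construction (the weights follow the AND and NOT examples above, so this is an illustration rather than the only possible answer), combine an AND unit, a (NOT x1) AND (NOT x2) unit, and an OR output unit:

```python
from math import exp

def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))

def xnor(x1, x2):
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)    # hidden unit 1: x1 AND x2
    a2 = sigmoid(10 - 20 * x1 - 20 * x2)     # hidden unit 2: (NOT x1) AND (NOT x2)
    return sigmoid(-10 + 20 * a1 + 20 * a2)  # output unit: a1 OR a2

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))   # reproduces the XNOR truth table
```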
Practice 1
38
Practice 1- Solution
39
Practice 2
40
Practice 2- Solution
41
Multiclass classification
• Multiclass classification is when you distinguish
between more than two categories.
• Example: recognizing pedestrian, car, motorbike, or
truck requires building a neural network with four
output units.
• Previously we had written y as an integer in {1, 2, 3, 4}
• Now we represent y as a vector of four numbers
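A short sketch of this representation (the class ordering here is an arbitrary choice for illustration):

```python
import numpy as np

classes = ["pedestrian", "car", "motorbike", "truck"]

def one_hot(label):
    """Represent y as a vector of four numbers: 1 for the true class, 0 elsewhere."""
    y = np.zeros(len(classes))
    y[classes.index(label)] = 1.0
    return y

print(one_hot("car"))  # [0. 1. 0. 0.]
```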
42
Neural network cost function
• Focus on application of NNs for classification problems
• Training set is {(x(1), y(1)), (x(2), y(2)), (x(3), y(3)), ..., (x(m), y(m))}
• L = number of layers in the network
• sl = number of units (not counting bias unit) in layer l
L = 4
s1 = 3
s2 = 5
s3 = 5
s4 = 4 43
Cost function for neural networks
• The (regularized) logistic regression cost function is
as follows:
• For neural networks our cost function is a
generalization of the equation above
• Instead of one output we generate K outputs (one per class)
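For reference (the slide's formula image is not reproduced here; this is the standard regularized form used in Ng's course, with K output units):

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\Big[ y_k^{(i)} \log\big(h_\Theta(x^{(i)})\big)_k + \big(1-y_k^{(i)}\big)\log\big(1-(h_\Theta(x^{(i)}))_k\big)\Big]
          + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} \big(\Theta_{ji}^{(l)}\big)^2
```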
44
Cost function for neural networks
• We want to find parameters Ɵ which minimize J(Ɵ)
• To do so we can use one of the algorithms already
described such as:
– Gradient descent
– Advanced optimization algorithms
• For this, we need to compute J(Ɵ) and the partial derivatives ∂J(Ɵ)/∂Ɵij(l)
45
Remember
• Ɵ(j) is of dimension [sj+1 x (sj + 1)]
– the network has sj units in layer j and sj+1 units in layer
j+1
• The partial derivative term ∂J(Ɵ)/∂Ɵij(l) is a REAL number (not a
vector or a matrix)
• It is the partial derivative of J(Ɵ) with respect to a single
parameter Ɵij(l), taken from the 3-way indexed set of
parameters
• How to compute this partial derivative term? 46
Gradient Computation
• We've already described forward propagation
• This is the algorithm which takes your neural network and
the initial input and pushes the input through the network
47
Back propagation Algorithm
• Back propagation takes the output you got from your network, compares
it to the real value (y) and calculates how wrong the parameters were
• Using the calculated error, it back-calculates the error associated with
each unit from the preceding layer
• This goes on until you reach the input layer (where obviously there is no
error)
• These "error" measurements for each unit can be used to calculate
the partial derivatives
48
Back propagation Algorithm
• For each node we calculate δj(l) - this is the error of node j in layer l
• We can first calculate δj(4) = aj(4) - yj
= [activation of the unit] - [the actual value observed in the training example]
• Instead of focusing on each node, let's think about this as a
vectorized problem: δ(4) = a(4) - y
– δ(4) is the vector of errors for the 4th layer
– a(4) is the vector of activation values for the 4th layer
49
Back propagation Algorithm
• With δ(4) calculated, we can determine the error terms for the other
layers: δ(3) = (Ɵ(3))T δ(4) .* g'(z(3))
• If we do the calculus: g'(z(3)) = a(3) .* (1 - a(3))
• So, more simply: δ(3) = (Ɵ(3))T δ(4) .* (a(3) .* (1 - a(3)))
(.* is the element-wise multiplication between the two vectors)
• If we ignore regularization (and skip a fairly involved
derivation), we get: ∂J(Ɵ)/∂Ɵij(l) = aj(l) δi(l+1)
50
Putting it all together !
When j = 0 (the bias weights) there is no regularization term 51
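The full algorithm on the original slide is a figure; below is a sketch of the usual accumulation step (function and variable names are mine, sigmoid activations assumed), which also shows why the j = 0 (bias) column gets no regularization:

```python
import numpy as np

def backprop_gradients(X, Y, Thetas, lam):
    """Accumulate Delta over all m examples, then form D = dJ/dTheta.
    X: (m, n) inputs; Y: (m, K) one-hot targets; Thetas: list of weight matrices."""
    m = X.shape[0]
    Deltas = [np.zeros_like(T) for T in Thetas]
    for x, y in zip(X, Y):
        # Forward pass, keeping the bias-augmented activation of every layer
        activations = []
        a = x
        for T in Thetas:
            a = np.concatenate(([1.0], a))          # add the bias unit a0 = 1
            activations.append(a)
            a = 1.0 / (1.0 + np.exp(-(T @ a)))      # a(l+1) = g(Theta(l) a(l))
        delta = a - y                               # delta(L) = a(L) - y
        # Backward pass: accumulate gradients and propagate delta
        for l in range(len(Thetas) - 1, -1, -1):
            Deltas[l] += np.outer(delta, activations[l])
            if l > 0:
                a_prev = activations[l]             # bias-augmented a(l+1)
                delta = (Thetas[l].T @ delta) * a_prev * (1 - a_prev)
                delta = delta[1:]                   # drop the bias unit's delta
    D = []
    for T, Delta in zip(Thetas, Deltas):
        reg = lam * T
        reg[:, 0] = 0.0                             # no regularization when j = 0
        D.append((Delta + reg) / m)
    return D
```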
Back propagation intuition
In the example, we will use two features: x1 and x2
52
Back propagation intuition
With our input data present we use forward
propagation
53
Back propagation intuition
• The sigmoid function applied to the z values gives the
activation values.
• Below we show exactly how the z value is calculated for an
example
54
Back propagation intuition
• Back propagation is doing something very similar to forward
propagation, but backwards
• Below we have the cost function if there is a single output (i.e.
binary classification)
• This function cycles over each example, so the cost for one
example really boils down to this:
• We can think about a δ term on a unit as the "error" of cost for
the activation value associated with a unit
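Since the slide's formula images are not reproduced here, the standard forms (single output, ignoring regularization) are:

```latex
J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\Big[ y^{(i)}\log h_\Theta(x^{(i)}) + \big(1-y^{(i)}\big)\log\big(1-h_\Theta(x^{(i)})\big)\Big]
\qquad
\mathrm{cost}(i) = -\,y^{(i)}\log h_\Theta(x^{(i)}) - \big(1-y^{(i)}\big)\log\big(1-h_\Theta(x^{(i)})\big)
```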
55
Back propagation intuition
• So for the output layer, back propagation sets the δ value
as [a - y]
– Difference between activation and actual value
• We then propagate these values backwards
56
Back propagation intuition
Looking at another example to see how we actually calculate
the delta value
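The worked figure is not reproduced here; as an illustration of the general pattern (ignoring the g'(z) factor shown earlier, as the intuition slides do), the δ of a hidden unit is roughly the weighted sum of the δ values of the units it feeds into, for example δ2(2) ≈ Ɵ12(2) δ1(3) + Ɵ22(2) δ2(3).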
57
Practice 1
58
Practice 1- Solution
59
Practice 2
60
Practice 2- Solution
61
Practice 3
62
Practice 3 - Solution
63
How to tune the weights? (Learning)
Implementation in Python
A step-by-step tutorial:
https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/
65