Chapter 5 ML

Machine Learning:

Neural Networks

Hajar Mousannif
[email protected]
Acknowledgment

This lecture is inspired by Prof. Andrew Ng’s Machine Learning class on Coursera

2
Why do we need neural networks?
• Say we have a complex supervised learning
classification problem
– We can use logistic regression with many polynomial
terms
– It works well when you have 1-2 features

3
Why do we need neural networks?
• In a housing example with 100 house features, predict whether a
house will be sold in the next 6 months
– Here, if you included all the quadratic (second-order) terms
• There are lots of them (x1², x1x2, x1x3, ..., x1x100)
• For the case of n = 100, you have about 5000 features
• The number of features grows as O(n²)
• This would be computationally expensive to work with as a feature set
• If you include the cubic terms
– e.g. x1²x2, x1x2x3, x1x4x2³, etc.
– There are even more features: the number grows as O(n³)
– About 170 000 features for n = 100!
• Not a good way to build classifiers when n is large

4
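The O(n²) and O(n³) growth above can be checked directly by counting monomials; a minimal sketch (the helper name `n_poly_terms` is chosen here for illustration):

```python
from math import comb

def n_poly_terms(n, degree):
    """Number of distinct monomials of exactly `degree` in n variables
    (combinations with repetition: C(n + degree - 1, degree))."""
    return comb(n + degree - 1, degree)

n = 100
print(n_poly_terms(n, 2))  # 5050 quadratic terms -- "about 5000"
print(n_poly_terms(n, 3))  # 171700 cubic terms -- "about 170 000"
```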
Example: Problems where n is large -
computer vision

Computer vision sees a matrix of pixel intensity values


Example: Problems where n is large -
computer vision

To build a car detector


• Build a training set of:
- Not cars
- Cars
• Then test against a car
6
Example: Problems where n is large -
computer vision

7
Example: Problems where n is large -
computer vision
• We need a non-linear hypothesis to separate the
classes
• Feature space:
– If we use 50 x 50 pixel grayscale images --> 2500 pixels, so n = 2500
– If RGB then n = 7500
– If 100 x 100 pixels then n = 10 000, and including just the quadratic terms gives roughly 50 000 000 features
• Too big - way too big
– Logistic regression here is not appropriate for large complex
systems
– Neural networks are much better for a complex nonlinear
hypothesis even when feature space is huge
8
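A quick sketch of the feature counts above, assuming (as in the original Coursera lecture) that the 50 million figure refers to the quadratic terms of a 100 x 100 grayscale image:

```python
def raw_pixel_features(width, height, channels=1):
    """Each pixel intensity (per channel) is one input feature."""
    return width * height * channels

n_gray = raw_pixel_features(50, 50)        # 2500 features
n_rgb = raw_pixel_features(50, 50, 3)      # 7500 features
n_big = raw_pixel_features(100, 100)       # 10 000 features
quadratic = n_big * (n_big + 1) // 2       # ~50 million quadratic terms
print(n_gray, n_rgb, n_big, quadratic)
```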
Neurons and the brain
• Neural networks (NNs) were originally motivated by the goal of building machines that replicate the brain's functionality
• Build learning systems that mimic the brain
• Used a lot in the 80s and 90s. Popularity
diminished in late 90s
• Recent major resurgence
– NNs are computationally expensive, so only recently
large scale neural networks became
computationally feasible
9
Neurons and the brain
• Auditory cortex --> takes sound signals
• If you cut the wiring from the ear to the auditory cortex
• Re-route optic nerve to the auditory cortex
• Auditory cortex learns to see
• Brain learns by itself how to learn
 The “one learning algorithm” hypothesis

10
Neurons and the brain

11
Model representation I
• Three things to notice
– Cell body
– Number of input wires (dendrites)
– Output wire (axon)
• Simple level
– Neuron gets one or more inputs through
dendrites
– Does processing
– Sends output down axon
• Neurons communicate through electric
spikes
– Pulse of electricity sent via axon to another neuron
Artificial neural network - representation
of a neuron
• In an artificial neural network, a neuron is a logistic unit
– Feed input via input wires
– Logistic unit does computation
– Sends output down output wires
• This is an artificial neuron with a sigmoid (logistic)
activation function

13
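A minimal NumPy sketch of such a logistic unit (the input vector and weights here are arbitrary placeholders):

```python
import numpy as np

def sigmoid(z):
    """The sigmoid (logistic) activation function."""
    return 1.0 / (1.0 + np.exp(-z))

def logistic_unit(x, theta):
    """One artificial neuron: weighted sum of inputs, then sigmoid.
    x includes the bias unit x0 = 1 as its first element."""
    return sigmoid(theta @ x)

x = np.array([1.0, 0.5, -1.2, 3.0])     # x0 = 1 plus three inputs
theta = np.array([-1.0, 2.0, 0.5, 0.1]) # the "weights" of the unit
print(logistic_unit(x, theta))          # a value in (0, 1)
```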
Artificial neural network – Model
representation

• Often good to include an x0 input - the bias unit (equal


to 1)
• Ɵ vector may also be called the weights of a model
• Below we have a group of neurons strung together

14
Artificial neural network – Model
representation
• Here, input is x1, x2 and x3
– We could also call the input the activation of the first layer, i.e. (a1(1), a2(1) and a3(1))
• First layer is the input layer
• Final layer is the output layer - produces value computed by a
hypothesis
• Middle layer(s) are called the hidden layers

15
Neural networks - notation
• ai(j) - activation of unit i in layer j
• Ɵ(j) - matrix of parameters controlling the function
mapping from layer j to layer j + 1
• If the network has sj units in layer j and sj+1 units in layer j + 1, then Ɵ(j) will be of dimensions [sj+1 x (sj + 1)]

16
Neural networks – Model representation

The activation value on each hidden unit (e.g. a1(2) ) is equal to the
sigmoid function applied to the linear combination of inputs
– Three input units
• So Ɵ(1) is the matrix of parameters governing the mapping of the input units to
hidden units
– Ɵ(1) here is a [3 x 4] dimensional matrix
– Three hidden units
• Then Ɵ(2) is the matrix of parameters governing the mapping of the hidden layer
to the output layer
– Ɵ(2) here is a [1 x 4] dimensional matrix (i.e. a row vector)
– One output unit

17
Neural networks – Model representation

• Ɵab(c)
– a = ranges from 1 to the number of units in layer c+1
– b = ranges from 0 to the number of units in layer c
– c is the layer you're moving FROM
For example Ɵ13(1) means:
1 - we're mapping to node 1 in layer 2
3 - we're mapping from node 3 in layer 1
1 - we're mapping from layer 1

18
Neural networks - Exercise
Compute the activation values on each layer

19
Neural networks - Solution
Example of network, with the associated calculations :

20
Model Representation II
Objective: carry out the computation efficiently through a
vectorized implementation.

- Some additional terms:


z1(2) = Ɵ10(1)x0 + Ɵ11(1)x1 + Ɵ12(1)x2 + Ɵ13(1)x3
z2(2) = Ɵ20(1)x0 + Ɵ21(1)x1 + Ɵ22(1)x2 + Ɵ23(1)x3
z3(2) = Ɵ30(1)x0 + Ɵ31(1)x1 + Ɵ32(1)x2 + Ɵ33(1)x3

- Activation values become:


a1(2) = g(z1(2))
a2(2) = g(z2(2))
a3(2) = g(z3(2))

21
Model Representation II
• We can vectorize the computation of the neural network as follows:
– z(2) = Ɵ(1)x
– a(2) = g(z(2))
• z(2) is a 3x1 vector; a(2) is also a 3x1 vector
• g() applies the sigmoid (logistic) function element-wise to each member of the z(2) vector

22
Model Representation II
• To make the notation with input layer make
sense;
– a(1) = x
• a(1) is the vector of activations in the input layer
• Obviously the "activation" for the input layer is just the input!
– a(1) is the vector of inputs
– a(2) is the vector of values calculated by the g(z(2)) function

• We also need the bias unit a0(2) = 1 for the final hypothesis calculation
23
Model Representation II
This process is called
forward propagation
– Start off with activations
of input unit
• i.e. the x vector as input
– Forward propagate and
calculate the activation
of each layer
sequentially
– This is a vectorized
version of this
implementation
24
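The forward-propagation procedure just described can be sketched in NumPy as follows (the weight matrices are random placeholders, and `forward_propagate` is a name chosen for illustration):

```python
import numpy as np

def sigmoid(z):
    """Element-wise sigmoid (logistic) activation."""
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Push input x through the network layer by layer.
    Each theta has shape (units_out, units_in + 1); the extra
    column handles the bias unit prepended to each activation."""
    a = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)  # prepend bias unit a0 = 1
        a = sigmoid(theta @ a)    # z = Theta * a, then element-wise g()
    return a

rng = np.random.default_rng(0)
thetas = [rng.normal(size=(3, 4)),  # Theta(1): 3 inputs -> 3 hidden units
          rng.normal(size=(1, 4))]  # Theta(2): 3 hidden units -> 1 output
x = np.array([0.2, -0.5, 1.0])
print(forward_propagate(x, thetas))  # the hypothesis output
```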
Neural networks learning its own features

• Diagram below looks a lot like logistic regression


• Layer 3 is a logistic regression node. The hypothesis output =
g(Ɵ10(2)a0(2) + Ɵ11(2)a1(2) + Ɵ12(2)a2(2) + Ɵ13(2)a3(2))
• This is just logistic regression
– The only difference is that, instead of the input feature vector, the features are just the values calculated by the hidden layer

25
Neural networks learning their own features
• The features a1(2), a2(2), and a3(2) are calculated/learned - not original features
• The mapping from layer 1 to layer 2 (i.e. the calculations which generate the a(2) features) is determined by another set of parameters - Ɵ(1)
• So instead of being constrained by the original input features, a neural network can learn its own features to feed into logistic regression
• If we compare this to standard logistic regression, there you would have to design your own features to define the best way to classify or describe something
26
Other architectures
• other architectures (topology) are possible:
– More/less nodes per layer
– More layers

27
Practice 1
Compute a1(3)

28
Practice 1 - Solution
Compute a1(3)

29
Practice 2

30
Practice 2 - Solution

31
Practice 3

32
Practice 3 - Solution

33
Neural Network example 1: AND function

• Ɵ10(1) = -30
• Ɵ11(1) = 20
• Ɵ12(1) = 20

34
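With these weights, a single logistic unit reproduces the AND truth table; a minimal sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def and_unit(x1, x2):
    """Single neuron computing logical AND with the weights above:
    z = -30 + 20*x1 + 20*x2, so z is large and positive only
    when both inputs are 1."""
    return sigmoid(-30 + 20 * x1 + 20 * x2)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(and_unit(x1, x2)))
```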
Neural Network example 2: NOT function

• Ɵ10(1) = 10
• Ɵ11(1) = -20
• Negation is achieved by putting a large
negative weight in front of the variable you
want to negate

35
Neural Network example 3: XNOR function

• XNOR is short for NOT XOR, i.e. NOT an exclusive or


• XNOR is :
x1  x2  XNOR
0   0   1
0   1   0
1   0   0
1   1   1

• Can you find a Neural Network representation of the XNOR


function?
• Hint: structure the network so the inputs which produce a positive output are:
– x1 AND x2 (i.e. both true)
– (NOT x1) AND (NOT x2) (i.e. both false)
combined with an OR
Neural Network example 3: XNOR function

37
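One common construction, and the one the hint above points at, combines an AND unit, a (NOT x1) AND (NOT x2) unit, and an OR unit; a minimal sketch using the weights from the earlier examples:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def xnor(x1, x2):
    """XNOR built from three logistic units:
    a1 = x1 AND x2, a2 = (NOT x1) AND (NOT x2), output = a1 OR a2."""
    a1 = sigmoid(-30 + 20 * x1 + 20 * x2)    # AND unit
    a2 = sigmoid(10 - 20 * x1 - 20 * x2)     # (NOT x1) AND (NOT x2) unit
    return sigmoid(-10 + 20 * a1 + 20 * a2)  # OR unit

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, round(xnor(x1, x2)))
```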
Practice 1

38
Practice 1- Solution

39
Practice 2

40
Practice 2- Solution

41
Multiclass classification
• Multiclass classification is when you distinguish between more than two categories.
• Example: recognizing a pedestrian, car, motorbike, or truck requires building a neural network with four output units.
• Previously we had written y as an integer in {1, 2, 3, 4}
• Now we represent y as a vector of four numbers

42
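Converting an integer label into such a vector is a one-line operation; a minimal sketch (the helper name `to_one_hot` is chosen for illustration):

```python
import numpy as np

def to_one_hot(y, num_classes=4):
    """Turn an integer class label (1..num_classes) into a 0/1 vector
    with a single 1 at the position of the class."""
    v = np.zeros(num_classes)
    v[y - 1] = 1.0
    return v

print(to_one_hot(3))  # [0. 0. 1. 0.]
```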
Neural network cost function
• Focus on application of NNs for classification problems
• Training set is {(x1, y1), (x2, y2), (x3, y3), ..., (xm, ym)}
• L = number of layers in the network
• sl = number of units (not counting bias unit) in layer l

L = 4
s1 = 3
s2 = 5
s3 = 5
s4 = 4
Cost function for neural networks
• The (regularized) logistic regression cost function is
as follows:

• For neural networks our cost function is a


generalization of the equation above
• Instead of one output we generate K outputs

44
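The K-output generalization can be sketched for a single training example as follows (the values of h and y are placeholders; regularization is omitted):

```python
import numpy as np

def nn_cost(h, y):
    """Cross-entropy cost for one example with K outputs:
    sum over k of -y_k*log(h_k) - (1 - y_k)*log(1 - h_k)."""
    return float(np.sum(-y * np.log(h) - (1 - y) * np.log(1 - h)))

h = np.array([0.9, 0.1, 0.2, 0.05])  # network outputs for one example
y = np.array([1.0, 0.0, 0.0, 0.0])   # one-hot target vector
print(nn_cost(h, y))
```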
Cost function for neural networks
• We want to find parameters Ɵ which minimize J(Ɵ)

• To do so we can use one of the algorithms already


described such as:
– Gradient descent
– Advanced optimization algorithms
• For this, we need to compute J(Ɵ) and its partial derivatives ∂J(Ɵ)/∂Ɵij(l)
45
Remember
• Ɵ(j) is of dimensions [sj+1 x (sj + 1)]
– the network has sj units in layer j and sj+1 units in layer j+1
• The partial derivative term is a REAL number (not a vector or a matrix):

• It is the partial derivative of J(Ɵ) (a real number) with respect to a single element Ɵij(l) of the 3-way indexed parameter set
• How to compute this partial derivative term?
Gradient Computation
• We've already described forward propagation
• This is the algorithm which takes your neural network and
the initial input and pushes the input through the network

47
Back propagation Algorithm
• Back propagation takes the output you got from your network, compares
it to the real value (y) and calculates how wrong the parameters were
• Using the calculated error, it back-calculates the error associated with
each unit from the preceding layer
• This goes on until you reach the input layer (where obviously there is no
error)
• These "error" measurements for each unit can be used to calculate
the partial derivatives

48
Back propagation Algorithm
• For each node we calculate δj(l) - the error of node j in layer l
• We can first calculate δj(4) = aj(4) - yj
= [activation of the unit] - [the actual value observed in the training example]
• Instead of focusing on each node, let's think about this as a vectorized problem: δ(4) = a(4) - y
– δ(4) is the vector of errors for the 4th layer
– a(4) is the vector of activation values for the 4th layer

49
Back propagation Algorithm
• With δ(4) calculated, we can determine the error terms for the other layers:

• If we do the calculus: g'(z(3)) = a(3) .* (1 - a(3))
• So, more simply: δ(3) = (Ɵ(3))T δ(4) .* (a(3) .* (1 - a(3)))
where .* is the element-wise multiplication between the two vectors
• If we ignore regularization
(and through a very complicated
derivation ), we get:

50
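A minimal sketch of the δ computations above for the last two layers of a 4-layer network (all values here are random placeholders; dropping Ɵ(3)'s bias column when back-propagating is one common implementation choice):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def output_and_hidden_deltas(a3, a4, y, theta3):
    """Error terms for the last two layers:
    delta(4) = a(4) - y
    delta(3) = (Theta(3))^T delta(4) .* a(3) .* (1 - a(3))
    a3 excludes the bias unit, so theta3's bias column is dropped."""
    delta4 = a4 - y
    delta3 = (theta3[:, 1:].T @ delta4) * a3 * (1 - a3)
    return delta3, delta4

rng = np.random.default_rng(1)
theta3 = rng.normal(size=(4, 6))  # maps 5 hidden units (+bias) -> 4 outputs
a3 = sigmoid(rng.normal(size=5))  # hidden-layer activations (no bias)
a4 = sigmoid(rng.normal(size=4))  # output-layer activations
y = np.array([1.0, 0.0, 0.0, 0.0])
d3, d4 = output_and_hidden_deltas(a3, a4, y, theta3)
print(d3.shape, d4.shape)
```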
Putting it all together !

When j = 0 we have no regularization term


Back propagation intuition
In the example, we will use two features: x1 and x2

52
Back propagation intuition
With our input data present we use forward
propagation

53
Back propagation intuition
• The sigmoid function applied to the z values gives the
activation values.
• Below we show exactly how the z value is calculated for an
example

54
Back propagation intuition
• Back propagation is doing something very similar to forward
propagation, but backwards
• Below we have the cost function if there is a single output (i.e.
binary classification)

• This function cycles over each example, so the cost for one
example really boils down to this:

• We can think about the δ term on a unit as the "error" of cost for the activation value associated with that unit

55
Back propagation intuition
• So for the output layer, back propagation sets the δ value
as [a - y]
– Difference between activation and actual value
• We then propagate these values backwards

56
Back propagation intuition
Looking at another example to see how we actually calculate
the delta value

57
Practice 1

58
Practice 1- Solution

59
Practice 2

60
Practice 2- Solution

61
Practice 3

62
Practice 3 - Solution

63
How to tune the weights? (Learning)
Implementation in Python

A step-by-step tutorial:
https://machinelearningmastery.com/implement-backpropagation-algorithm-scratch-python/

65
