
Introduction to Feed Forward Neural Networks

Lorenzo Servadei, Sebastian Schober, Daniela Lopera, Wolfgang Ecker


Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Optimization

(This image is CC0 1.0 public domain)

Optimization

(Walking man image is CC0 1.0 public domain)
Optimization
Strategy #1: A first, very bad idea: random search
Optimization

Let's see how well this works on the test set...

15.5% accuracy! Not bad! (SOTA is ~95%)
Optimization

Strategy #2: Follow the slope
Gradient

Strategy #2: Follow the slope

In 1 dimension, the derivative of a function:

    df(x)/dx = lim_{h -> 0} [f(x + h) - f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.

The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
Optimization Example (numeric gradient)

current W = [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
gradient dW = [?, ?, ?, ?, ?, ?, ?, ?, ?, ...]

W + h (first dim):  [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25322
    dW[0] = (1.25322 - 1.25347) / 0.0001 = -2.5

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25353
    dW[1] = (1.25353 - 1.25347) / 0.0001 = 0.6

W + h (third dim):  [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
    dW[2] = (1.25347 - 1.25347) / 0.0001 = 0

So far: dW = [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, ...]

Numeric gradient:
› Slow! Need to loop over all dimensions
› Approximate
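The procedure above can be written as a short loop. Below is a minimal sketch, assuming a generic loss(W) callable over a flat numpy parameter array; the names are illustrative, not from the slides.

```python
import numpy as np

def numeric_gradient(loss, W, h=1e-4):
    """Approximate dL/dW by perturbing one dimension of W at a time."""
    dW = np.zeros_like(W)
    loss_original = loss(W)
    for i in range(W.size):                 # slow: loops over every dimension
        W_plus = W.copy()
        W_plus.flat[i] += h                  # nudge one coordinate by h
        dW.flat[i] = (loss(W_plus) - loss_original) / h   # finite difference
    return dW
```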
How to proceed

This is silly. The loss is just a function of W: we want its gradient with respect to W.

Use calculus to compute an analytic gradient.

(Hammer image and the other images on this slide are in the public domain.)
Analytical Gradient

current W = [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347

dW = ... (some function of the data and W)

gradient dW = [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]
Analytical Gradient

In summary:

› Numerical gradient: approximate, slow, easy to write
› Analytic gradient: exact, fast, error-prone

=> In practice: always use the analytic gradient, but check the implementation with the numerical gradient. This is called a gradient check.
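A minimal sketch of such a gradient check, assuming the hypothetical numeric_gradient helper above plus an analytic_gradient(W) you have implemented; the relative-error formula is a common convention, not something prescribed by the slides.

```python
import numpy as np

def gradient_check(loss, analytic_gradient, W, h=1e-4):
    """Compare the analytic gradient against the numeric one."""
    num = numeric_gradient(loss, W, h)       # slow but trustworthy
    ana = analytic_gradient(W)               # fast but easy to get wrong
    # relative error; small values (e.g. < 1e-6) suggest a correct implementation
    rel_error = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
    return rel_error.max()
```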
Analytical Gradient

[Figure: loss landscape over two weight dimensions W_1 and W_2; from the original W, we step in the negative gradient direction.]
Stochastic Gradient Descent (SGD)

The full sum over all N training examples is expensive when N is large!

Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.
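A minimal sketch of the SGD loop under these assumptions; loss_and_gradient, the data arrays, and the hyperparameters are placeholders, not code from the slides.

```python
import numpy as np

# X_train: (N, D) data, y_train: (N,) labels, W: parameters.
# loss_and_gradient(X_batch, y_batch, W) -> (loss, dW) is assumed to exist.

def sgd(W, X_train, y_train, loss_and_gradient,
        learning_rate=1e-3, batch_size=64, num_steps=1000):
    N = X_train.shape[0]
    for step in range(num_steps):
        idx = np.random.choice(N, batch_size, replace=False)   # sample a minibatch
        loss, dW = loss_and_gradient(X_train[idx], y_train[idx], W)
        W -= learning_rate * dW                                 # step along -gradient
    return W
```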
Demo

Interactive Web Demo time...

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Computational graphs

[Graph: inputs x and W feed a multiply node (*) producing the scores s; s goes through the hinge loss; the regularization term R(W) is added (+) to give the total loss L.]
Convolutional network (AlexNet)

[Figure: the AlexNet computational graph, from the input image and weights down to the loss.]

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Neural Turing Machine

[Figure: the Neural Turing Machine computational graph, from the input image down to the loss.]

Figure reproduced with permission from a Twitter post by Andrej Karpathy.
Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Want: the gradient of the output with respect to each input (x, y, z).

Chain rule: work backward through the graph, multiplying each node's local gradient by the gradient flowing in from above.
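The slides' formula images are not reproduced here. As a grounding aid, here is a minimal sketch assuming the standard textbook instance of this example, f(x, y, z) = (x + y) * z, which is consistent with the values above; treat the function choice as an assumption.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: apply the chain rule node by node
df_dq = z          # d(q*z)/dq = z = -4
df_dz = q          # d(q*z)/dz = q = 3
df_dx = df_dq * 1  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1  # dq/dy = 1, so df/dy = -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```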
[Figure, repeated over several slides: a generic gate f inside the graph. Each gate knows its "local gradient" (the derivative of its output with respect to its inputs); during backpropagation it multiplies this local gradient by the upstream gradient arriving from above and passes the resulting gradients on to its inputs.]
Patterns in backward flow

› add gate: gradient distributor

Q: What is a max gate?
› max gate: gradient router

Q: What is a mul gate?
› mul gate: gradient switcher
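These patterns can be made concrete with a small sketch (my own illustration, not code from the slides) of the backward rule for each scalar gate, given an upstream gradient d_out:

```python
def add_backward(x, y, d_out):
    # add gate distributes the upstream gradient to both inputs unchanged
    return d_out, d_out

def max_backward(x, y, d_out):
    # max gate routes the full upstream gradient to the larger input only
    return (d_out, 0.0) if x >= y else (0.0, d_out)

def mul_backward(x, y, d_out):
    # mul gate "switches" the inputs: each input's gradient is scaled
    # by the *other* input's value
    return d_out * y, d_out * x

print(add_backward(3.0, -1.0, 2.0))  # (2.0, 2.0)
print(max_backward(3.0, -1.0, 2.0))  # (2.0, 0.0)
print(mul_backward(3.0, -1.0, 2.0))  # (-2.0, 6.0)
```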
Gradients add at branches

[Figure: when a value feeds several downstream nodes, the gradients flowing back along the branches are summed (+).]
Gradients for vectorized code

(x, y, z are now vectors)

The "local gradient" is now the Jacobian matrix (the derivative of each element of z with respect to each element of x).

[Figure: the same generic gate f, now with vector-valued inputs, Jacobian local gradients, and upstream/downstream gradients.]
Vectorized operations

f(x) = max(0, x), applied elementwise, maps a 4096-d input vector to a 4096-d output vector.

Q: What is the size of the Jacobian matrix?
[4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\

Q2: What does it look like?
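The point the slides are driving at is that this Jacobian is diagonal (each output element depends only on the matching input element), so it is never formed explicitly. A minimal sketch of what the backward pass looks like in practice, with assumed shapes:

```python
import numpy as np

x = np.random.randn(100, 4096)         # a minibatch of 100 examples
out = np.maximum(0, x)                  # forward: elementwise ReLU

d_out = np.random.randn(*out.shape)     # upstream gradient dL/dout
d_x = d_out * (x > 0)                   # backward: just a mask, no [409,600 x 409,600] matrix
```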
Modularized implementation: forward / backward API

[Figure: a single multiply gate (*) with inputs x, y and output z; x, y, z are scalars.]
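A minimal sketch of what such a gate object might look like; the class and method names are illustrative, not taken from a particular framework.

```python
class MultiplyGate:
    def forward(self, x, y):
        # compute the output and cache the inputs needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradients (y, x) times the upstream gradient dz
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)      # -12.0
dx, dy = gate.backward(1.0)      # (-4.0, 3.0)
```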
Example: Caffe layers

[Figure: a listing of Caffe's layer implementations. Caffe is licensed under BSD 2-Clause.]

Caffe Sigmoid Layer

[Code screenshot: in the backward pass, the local sigmoid gradient is multiplied by top_diff (chain rule). Caffe is licensed under BSD 2-Clause.]
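Caffe's layer is written in C++; as a rough illustration of the same forward/backward pair, here is a Python sketch (an assumption for exposition, not Caffe's actual code):

```python
import numpy as np

class SigmoidLayer:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def backward(self, top_diff):
        # local gradient of the sigmoid is out * (1 - out);
        # multiply by the upstream gradient top_diff (chain rule)
        return top_diff * self.out * (1.0 - self.out)
```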
Summary so far...

› neural nets will be very large: impractical to write down the gradient formula by hand for all parameters
› backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
› implementations maintain a graph structure, where the nodes implement the forward() / backward() API
› forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
› backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Next: Neural Networks
Neural networks: without the brain stuff

(Before) Linear score function:   s = W x

(Now) 2-layer Neural Network:     s = W2 max(0, W1 x)
or 3-layer Neural Network:        s = W3 max(0, W2 max(0, W1 x))

x (3072)  ->  W1  ->  h (100)  ->  W2  ->  s (10)
Full implementation of training a 2-layer Neural Network needs ~20 lines:
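The slide's code image is not reproduced; the following is a minimal sketch of such a ~20-line implementation (random toy data, sigmoid hidden layer, L2 loss, plain gradient descent; all of these choices are assumptions, not the slide's exact code).

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10                 # batch size and layer sizes
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H) / np.sqrt(D_in)          # small random weights
w2 = np.random.randn(H, D_out) / np.sqrt(H)

for t in range(2000):
    # forward pass
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))               # hidden layer (sigmoid)
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()                  # L2 loss

    # backward pass (chain rule, written out by hand)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T) * h * (1 - h)
    grad_w1 = x.T.dot(grad_h)

    # gradient descent step
    w1 -= 1e-3 * grad_w1
    w2 -= 1e-3 * grad_w2
```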
Be very careful with your brain analogies!

Biological Neurons:
› Many different types
› Dendrites can perform complex non-linear computations
› Synapses are not a single weight but a complex non-linear
dynamical system
› Rate code may not be adequate

[Dendritic Computation. London and Hausser]


Activation functions

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
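For reference, a minimal numpy sketch of these activations (the formulas are the standard ones; the Maxout comment is a simplification):

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))

# Maxout is not elementwise: it takes the max over several linear projections,
# e.g. max(W1 @ x + b1, W2 @ x + b2).
```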
Neural networks: Architectures

“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
“Fully-connected” layers
Example feed-forward computation of a neural network
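The slide's code image is not reproduced; a minimal sketch of such a forward pass for a small fully-connected 3-layer network (sizes and variable names are assumptions):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))           # activation function (sigmoid)
x = np.random.randn(3, 1)                         # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(np.dot(W1, x) + b1)                        # first hidden layer (4x1)
h2 = f(np.dot(W2, h1) + b2)                       # second hidden layer (4x1)
out = np.dot(W3, h2) + b3                         # output neuron (1x1)
```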
Summary

› We arrange neurons into fully-connected layers
› The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
› Neural networks are not really neural
Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

input: 3072 x 1
weights W: 10 x 3072
activation: 10 x 1

Each activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
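A minimal sketch of this layer with the shapes above (a simple illustration, not code from the slides):

```python
import numpy as np

image = np.random.rand(32, 32, 3)          # 32x32x3 input image
x = image.reshape(3072, 1)                 # stretch to 3072 x 1
W = np.random.randn(10, 3072)              # weights: one 3072-d row per output
b = np.random.randn(10, 1)

scores = W.dot(x) + b                      # 10 x 1: each entry is one dot product
```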
Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Two non-separable cases

First case

Lesson learned

Looking at the data before choosing the model can be hazardous to your E_out.

Data snooping
Logistic regression - Outline

• The model

• Error measure

• Learning algorithm

Generalized Linear Models

Logistic regression - Outline

• The model

• Error measure

• Learning algorithm
Sources

› https://work.caltech.edu/telecourse
› http://cs231n.stanford.edu/2017/syllabus.html
