
Introduction to Feed Forward Neural Networks

Lorenzo Servadei, Sebastian Schober, Daniela Lopera, Wolfgang Ecker


Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Optimization

(This image is CC0 1.0 public domain)

Optimization

(Walking man image is CC0 1.0 public domain)
Optimization
Strategy #1: A first, very bad idea: random search
Optimization

Let's see how well this works on the test set...

15.5% accuracy! Not bad! (SOTA is ~95%)
Optimization

Strategy #2: Follow the slope
Gradient

Strategy #2: Follow the slope

In 1 dimension, the derivative of a function:

    df(x)/dx = lim_{h -> 0} [f(x + h) - f(x)] / h

In multiple dimensions, the gradient is the vector of partial derivatives along each dimension.

The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
Optimization Example (numeric gradient)

current W = [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
gradient dW = [?, ?, ?, ?, ?, ?, ?, ?, ?, ...]

W + h (first dim):  [0.34 + 0.0001, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25322
    dW[0] = (1.25322 - 1.25347) / 0.0001 = -2.5

W + h (second dim): [0.34, -1.11 + 0.0001, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25353
    dW[1] = (1.25353 - 1.25347) / 0.0001 = 0.6

W + h (third dim):  [0.34, -1.11, 0.78 + 0.0001, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347
    dW[2] = (1.25347 - 1.25347) / 0.0001 = 0

So far: dW = [-2.5, 0.6, 0, ?, ?, ?, ?, ?, ?, ...]

Numeric gradient:
› Slow! Need to loop over all dimensions
› Approximate
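The procedure above can be written as a short loop. Below is a minimal sketch, assuming a generic loss(W) callable over a flat numpy parameter array; the names are illustrative, not from the slides.

```python
import numpy as np

def numeric_gradient(loss, W, h=1e-4):
    """Approximate dL/dW by perturbing one dimension of W at a time."""
    dW = np.zeros_like(W)
    loss_original = loss(W)
    for i in range(W.size):                 # slow: loops over every dimension
        W_plus = W.copy()
        W_plus.flat[i] += h                  # nudge one coordinate by h
        dW.flat[i] = (loss(W_plus) - loss_original) / h   # finite difference
    return dW
```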
How to proceed

This is silly. The loss is just a function of W: we want its gradient with respect to W.

Use calculus to compute an analytic gradient.

(Hammer image and the other images on this slide are in the public domain.)
Analytical Gradient

current W = [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, ...]    loss 1.25347

dW = ... (some function of the data and W)

gradient dW = [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, ...]
Analytical Gradient

In summary:

› Numerical gradient: approximate, slow, easy to write
› Analytic gradient: exact, fast, error-prone

=> In practice: always use the analytic gradient, but check the implementation with the numerical gradient. This is called a gradient check.
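A minimal sketch of such a gradient check, assuming the hypothetical numeric_gradient helper above plus an analytic_gradient(W) you have implemented; the relative-error formula is a common convention, not something prescribed by the slides.

```python
import numpy as np

def gradient_check(loss, analytic_gradient, W, h=1e-4):
    """Compare the analytic gradient against the numeric one."""
    num = numeric_gradient(loss, W, h)       # slow but trustworthy
    ana = analytic_gradient(W)               # fast but easy to get wrong
    # relative error; small values (e.g. < 1e-6) suggest a correct implementation
    rel_error = np.abs(num - ana) / np.maximum(1e-8, np.abs(num) + np.abs(ana))
    return rel_error.max()
```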
Analytical Gradient

[Figure: loss landscape over two weight dimensions W_1 and W_2; from the original W, we step in the negative gradient direction.]
Stochastic Gradient Descent (SGD)

The full sum over all N training examples is expensive when N is large!

Approximate the sum using a minibatch of examples; minibatch sizes of 32 / 64 / 128 are common.
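A minimal sketch of the SGD loop under these assumptions; loss_and_gradient, the data arrays, and the hyperparameters are placeholders, not code from the slides.

```python
import numpy as np

# X_train: (N, D) data, y_train: (N,) labels, W: parameters.
# loss_and_gradient(X_batch, y_batch, W) -> (loss, dW) is assumed to exist.

def sgd(W, X_train, y_train, loss_and_gradient,
        learning_rate=1e-3, batch_size=64, num_steps=1000):
    N = X_train.shape[0]
    for step in range(num_steps):
        idx = np.random.choice(N, batch_size, replace=False)   # sample a minibatch
        loss, dW = loss_and_gradient(X_train[idx], y_train[idx], W)
        W -= learning_rate * dW                                 # step along -gradient
    return W
```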
Demo

Interactive Web Demo time...

http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Computational graphs

[Graph: inputs x and W feed a multiply node (*) producing the scores s; s goes through the hinge loss; the regularization term R(W) is added (+) to give the total loss L.]
Convolutional network (AlexNet)

[Figure: the AlexNet computational graph, from the input image and weights down to the loss.]

Figure copyright Alex Krizhevsky, Ilya Sutskever, and Geoffrey Hinton, 2012. Reproduced with permission.
Neural Turing Machine

[Figure: the Neural Turing Machine computational graph, from the input image down to the loss.]

Figure reproduced with permission from a Twitter post by Andrej Karpathy.
Backpropagation: a simple example

e.g. x = -2, y = 5, z = -4

Want: the gradient of the output with respect to each input (x, y, z).

Chain rule: work backward through the graph, multiplying each node's local gradient by the gradient flowing in from above.
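The slides' formula images are not reproduced here. As a grounding aid, here is a minimal sketch assuming the standard textbook instance of this example, f(x, y, z) = (x + y) * z, which is consistent with the values above; treat the function choice as an assumption.

```python
x, y, z = -2.0, 5.0, -4.0

# Forward pass
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass: apply the chain rule node by node
df_dq = z          # d(q*z)/dq = z = -4
df_dz = q          # d(q*z)/dz = q = 3
df_dx = df_dq * 1  # dq/dx = 1, so df/dx = -4
df_dy = df_dq * 1  # dq/dy = 1, so df/dy = -4

print(df_dx, df_dy, df_dz)   # -4.0 -4.0 3.0
```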
[Figure, repeated over several slides: a generic gate f inside the graph. Each gate knows its "local gradient" (the derivative of its output with respect to its inputs); during backpropagation it multiplies this local gradient by the upstream gradient arriving from above and passes the resulting gradients on to its inputs.]
Patterns in backward flow

› add gate: gradient distributor

Q: What is a max gate?
› max gate: gradient router

Q: What is a mul gate?
› mul gate: gradient switcher
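These patterns can be made concrete with a small sketch (my own illustration, not code from the slides) of the backward rule for each scalar gate, given an upstream gradient d_out:

```python
def add_backward(x, y, d_out):
    # add gate distributes the upstream gradient to both inputs unchanged
    return d_out, d_out

def max_backward(x, y, d_out):
    # max gate routes the full upstream gradient to the larger input only
    return (d_out, 0.0) if x >= y else (0.0, d_out)

def mul_backward(x, y, d_out):
    # mul gate "switches" the inputs: each input's gradient is scaled
    # by the *other* input's value
    return d_out * y, d_out * x

print(add_backward(3.0, -1.0, 2.0))  # (2.0, 2.0)
print(max_backward(3.0, -1.0, 2.0))  # (2.0, 0.0)
print(mul_backward(3.0, -1.0, 2.0))  # (-2.0, 6.0)
```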
Gradients add at branches

[Figure: when a value feeds several downstream nodes, the gradients flowing back along the branches are summed (+).]
Gradients for vectorized code

(x, y, z are now vectors)

The "local gradient" is now the Jacobian matrix (the derivative of each element of z with respect to each element of x).

[Figure: the same generic gate f, now with vector-valued inputs, Jacobian local gradients, and upstream/downstream gradients.]
Vectorized operations

f(x) = max(0, x), applied elementwise, maps a 4096-d input vector to a 4096-d output vector.

Q: What is the size of the Jacobian matrix?
[4096 x 4096!]

In practice we process an entire minibatch (e.g. 100 examples) at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\

Q2: What does it look like?
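The point the slides are driving at is that this Jacobian is diagonal (each output element depends only on the matching input element), so it is never formed explicitly. A minimal sketch of what the backward pass looks like in practice, with assumed shapes:

```python
import numpy as np

x = np.random.randn(100, 4096)         # a minibatch of 100 examples
out = np.maximum(0, x)                  # forward: elementwise ReLU

d_out = np.random.randn(*out.shape)     # upstream gradient dL/dout
d_x = d_out * (x > 0)                   # backward: just a mask, no [409,600 x 409,600] matrix
```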
Modularized implementation: forward / backward API

[Figure: a single multiply gate (*) with inputs x, y and output z; x, y, z are scalars.]
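A minimal sketch of what such a gate object might look like; the class and method names are illustrative, not taken from a particular framework.

```python
class MultiplyGate:
    def forward(self, x, y):
        # compute the output and cache the inputs needed for the backward pass
        self.x, self.y = x, y
        return x * y

    def backward(self, dz):
        # chain rule: local gradients (y, x) times the upstream gradient dz
        dx = self.y * dz
        dy = self.x * dz
        return dx, dy

gate = MultiplyGate()
z = gate.forward(3.0, -4.0)      # -12.0
dx, dy = gate.backward(1.0)      # (-4.0, 3.0)
```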
Example: Caffe layers

[Figure: a listing of Caffe's layer implementations. Caffe is licensed under BSD 2-Clause.]

Caffe Sigmoid Layer

[Code screenshot: in the backward pass, the local sigmoid gradient is multiplied by top_diff (chain rule). Caffe is licensed under BSD 2-Clause.]
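Caffe's layer is written in C++; as a rough illustration of the same forward/backward pair, here is a Python sketch (an assumption for exposition, not Caffe's actual code):

```python
import numpy as np

class SigmoidLayer:
    def forward(self, x):
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def backward(self, top_diff):
        # local gradient of the sigmoid is out * (1 - out);
        # multiply by the upstream gradient top_diff (chain rule)
        return top_diff * self.out * (1.0 - self.out)
```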
Summary so far...

› neural nets will be very large: impractical to write down the gradient formula by hand for all parameters
› backpropagation = recursive application of the chain rule along a computational graph to compute the gradients of all inputs/parameters/intermediates
› implementations maintain a graph structure, where the nodes implement the forward() / backward() API
› forward: compute the result of an operation and save any intermediates needed for gradient computation in memory
› backward: apply the chain rule to compute the gradient of the loss function with respect to the inputs
Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Next: Neural Networks
Neural networks: without the brain stuff

(Before) Linear score function:   s = W x

(Now) 2-layer Neural Network:     s = W2 max(0, W1 x)
or 3-layer Neural Network:        s = W3 max(0, W2 max(0, W1 x))

x (3072)  ->  W1  ->  h (100)  ->  W2  ->  s (10)
Full implementation of training a 2-layer Neural Network needs ~20 lines:
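The slide's code image is not reproduced; the following is a minimal sketch of such a ~20-line implementation (random toy data, sigmoid hidden layer, L2 loss, plain gradient descent; all of these choices are assumptions, not the slide's exact code).

```python
import numpy as np

N, D_in, H, D_out = 64, 1000, 100, 10                 # batch size and layer sizes
x, y = np.random.randn(N, D_in), np.random.randn(N, D_out)
w1 = np.random.randn(D_in, H) / np.sqrt(D_in)          # small random weights
w2 = np.random.randn(H, D_out) / np.sqrt(H)

for t in range(2000):
    # forward pass
    h = 1.0 / (1.0 + np.exp(-x.dot(w1)))               # hidden layer (sigmoid)
    y_pred = h.dot(w2)
    loss = np.square(y_pred - y).sum()                  # L2 loss

    # backward pass (chain rule, written out by hand)
    grad_y_pred = 2.0 * (y_pred - y)
    grad_w2 = h.T.dot(grad_y_pred)
    grad_h = grad_y_pred.dot(w2.T) * h * (1 - h)
    grad_w1 = x.T.dot(grad_h)

    # gradient descent step
    w1 -= 1e-3 * grad_w1
    w2 -= 1e-3 * grad_w2
```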
Be very careful with your brain analogies!

Biological Neurons:
› Many different types
› Dendrites can perform complex non-linear computations
› Synapses are not a single weight but a complex non-linear
dynamical system
› Rate code may not be adequate

[Dendritic Computation. London and Hausser]


Activation functions

Sigmoid, tanh, ReLU, Leaky ReLU, Maxout, ELU
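For reference, a minimal numpy sketch of these activations (the formulas are the standard ones; the Maxout comment is a simplification):

```python
import numpy as np

def sigmoid(x):            return 1.0 / (1.0 + np.exp(-x))
def tanh(x):               return np.tanh(x)
def relu(x):               return np.maximum(0, x)
def leaky_relu(x, a=0.01): return np.where(x > 0, x, a * x)
def elu(x, a=1.0):         return np.where(x > 0, x, a * (np.exp(x) - 1))

# Maxout is not elementwise: it takes the max over several linear projections,
# e.g. max(W1 @ x + b1, W2 @ x + b2).
```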
Neural networks: Architectures

“2-layer Neural Net”, or “1-hidden-layer Neural Net”
“3-layer Neural Net”, or “2-hidden-layer Neural Net”
“Fully-connected” layers
Example feed-forward computation of a neural network
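The slide's code image is not reproduced; a minimal sketch of such a forward pass for a small fully-connected 3-layer network (sizes and variable names are assumptions):

```python
import numpy as np

f = lambda x: 1.0 / (1.0 + np.exp(-x))           # activation function (sigmoid)
x = np.random.randn(3, 1)                         # random input vector (3x1)
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(4, 4), np.random.randn(4, 1)
W3, b3 = np.random.randn(1, 4), np.random.randn(1, 1)

h1 = f(np.dot(W1, x) + b1)                        # first hidden layer (4x1)
h2 = f(np.dot(W2, h1) + b2)                       # second hidden layer (4x1)
out = np.dot(W3, h2) + b3                         # output neuron (1x1)
```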
Summary

› We arrange neurons into fully-connected layers
› The abstraction of a layer has the nice property that it allows us to use efficient vectorized code (e.g. matrix multiplies)
› Neural networks are not really neural
Fully Connected Layer

32x32x3 image -> stretch to 3072 x 1

input: 3072 x 1
weights W: 10 x 3072
activation: 10 x 1

Each activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
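A minimal sketch of this layer with the shapes above (a simple illustration, not code from the slides):

```python
import numpy as np

image = np.random.rand(32, 32, 3)          # 32x32x3 input image
x = image.reshape(3072, 1)                 # stretch to 3072 x 1
W = np.random.randn(10, 3072)              # weights: one 3072-d row per output
b = np.random.randn(10, 1)

scores = W.dot(x) + b                      # 10 x 1: each entry is one dot product
```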
Outline

• Optimization

• Computational Graphs

• Neural Networks
− Intuition
− Theory
Two non-separable cases

First case

Lesson learned

Looking at the data before choosing the model can be hazardous to your E_out.

Data snooping
Logistic regression - Outline

• The model

• Error measure

• Learning algorithm

Generalized Linear Models

Logistic regression - Outline

• The model

• Error measure

• Learning algorithm
Sources

› https://work.caltech.edu/telecourse
› http://cs231n.stanford.edu/2017/syllabus.html
