Introduction To Feed Forward Neural Networks
• Optimization
• Computational Graphs
• Neural Networks
− Intuition
− Theory
Optimization
The slope in any direction is the dot product of that direction with the gradient. The direction of steepest descent is the negative gradient.
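In symbols (a standard identity, not specific to these slides): for a unit vector v, the directional derivative is the dot product with the gradient, and it is most negative when v points opposite the gradient:

```latex
\nabla_{\mathbf v} f(W) = \nabla f(W)\cdot \mathbf v,
\qquad
\text{minimized over } \|\mathbf v\| = 1 \text{ at } \mathbf v = -\frac{\nabla f(W)}{\|\nabla f(W)\|}.
```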
Optimization Example
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]
gradient dW: [?, ?, ?, ?, ?, ?, ?, ?, ?, …]
loss 1.25347
Optimization Example
Perturb one dimension at a time (W + h in the first dim), recompute the loss, and use the finite difference to fill in that entry of dW. How to proceed: repeating this for every dimension is far too slow; what we want is the analytic gradient.
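A minimal numpy sketch of this finite-difference procedure; the function name `numerical_gradient` and the step size h are my own illustrative choices:

```python
import numpy as np

def numerical_gradient(loss_fn, W, h=1e-5):
    """Estimate dW one entry at a time with finite differences."""
    base_loss = loss_fn(W)
    dW = np.zeros_like(W)
    for i in range(W.size):
        old = W.flat[i]
        W.flat[i] = old + h                      # perturb one dimension by h
        dW.flat[i] = (loss_fn(W) - base_loss) / h
        W.flat[i] = old                          # restore before the next dimension
    return dW
```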
Analytical Gradient
current W: [0.34, -1.11, 0.78, 0.12, 0.55, 2.81, -3.1, -1.5, 0.33, …]
gradient dW: [-2.5, 0.6, 0, 0.2, 0.7, -0.5, 1.1, 1.3, -2.1, …]
dW = … (some function of data and W)
loss 1.25347
Analytical Gradient
In summary: the numerical gradient is approximate and slow, but easy to write; the analytic gradient is exact and fast, but error-prone. => In practice, use the analytic gradient, and check the implementation against the numerical gradient (a gradient check).
Figure: loss contours over the (W_1, W_2) plane; from the original W, the arrow points in the negative gradient direction.
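A sketch of the resulting vanilla gradient-descent loop; the toy quadratic loss and step size are illustrative placeholders, not a specific library API:

```python
import numpy as np

def loss_fn(W):
    # Toy quadratic loss standing in for the real training loss.
    return np.sum((W - 1.0) ** 2)

def evaluate_gradient(W):
    # Analytic gradient of the toy loss above.
    return 2.0 * (W - 1.0)

W = np.random.randn(10)
step_size = 0.1
for _ in range(100):
    dW = evaluate_gradient(W)
    W += -step_size * dW      # step in the negative gradient direction
```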
Stochastic Gradient Descent (SGD)
The full sum over all N training examples is expensive when N is large! Approximate the sum using a minibatch of examples; 32 / 64 / 128 are common minibatch sizes.
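The minibatch variant as a sketch, with a toy least-squares gradient standing in for the real loss; all names and sizes here are illustrative:

```python
import numpy as np

N, D = 10_000, 50
data = np.random.randn(N, D)          # toy dataset
weights = np.zeros(D)
step_size, batch_size = 1e-2, 64      # 32 / 64 / 128 are common batch sizes

def evaluate_gradient(batch, W):
    # Toy least-squares gradient standing in for the real loss gradient.
    return batch.T.dot(batch.dot(W) - 1.0) / len(batch)

for _ in range(1000):
    idx = np.random.choice(N, batch_size, replace=False)  # sample a minibatch
    weights += -step_size * evaluate_gradient(data[idx], weights)
```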
Demo
Interactive Web Demo time...
http://vision.stanford.edu/teaching/cs231n-demos/linear-classify/
Outline
• Optimization
• Computational Graphs
• Neural Networks
− Intuition
− Theory
Computational graphs
Figure: inputs x and W feed a multiply node (*) that produces the scores s; the hinge loss on s and a regularization term R(W) are combined by an add node (+) into the total loss L.
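A sketch of the forward pass this graph describes, assuming a multiclass SVM (hinge) loss on a single example and L2 regularization; all names are illustrative:

```python
import numpy as np

def graph_forward(W, x, y, lam=1e-3):
    """Forward pass of the graph above: s = W x, hinge loss plus regularization."""
    s = W.dot(x)                            # * node: class scores
    margins = np.maximum(0, s - s[y] + 1)   # hinge loss terms, margin 1
    margins[y] = 0                          # the correct class contributes no loss
    data_loss = margins.sum()
    R = lam * np.sum(W * W)                 # L2 regularization node
    return data_loss + R                    # + node: total loss L
```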
Convolutional network (AlexNet)
Figure: the computational graph of a convolutional network (AlexNet), from the input image and weights through many layers to the loss.
Backpropagation: a simple example
f(x, y, z) = (x + y) z
e.g. x = -2, y = 5, z = -4
Forward pass: q = x + y = 3, then f = q z = -12
Want: ∂f/∂x, ∂f/∂y, ∂f/∂z
Local gradients: ∂f/∂q = z, ∂f/∂z = q, ∂q/∂x = 1, ∂q/∂y = 1
Chain rule: ∂f/∂x = (∂f/∂q)(∂q/∂x) = z · 1 = -4; likewise ∂f/∂y = z · 1 = -4; and ∂f/∂z = q = 3
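The same example written as code, a minimal sketch:

```python
# Forward pass
x, y, z = -2.0, 5.0, -4.0
q = x + y          # q = 3
f = q * z          # f = -12

# Backward pass (chain rule), seeded with df/df = 1
dq = z             # df/dq = z = -4
dz = q             # df/dz = q = 3
dx = dq * 1.0      # dq/dx = 1  ->  df/dx = -4
dy = dq * 1.0      # dq/dy = 1  ->  df/dy = -4
```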
Figure: each node f receives the upstream gradient flowing back from the output, multiplies it by its "local gradient", and passes the resulting gradients on to its inputs.
Patterns in backward flow
The add (+) gate distributes gradients: its local gradient with respect to each input is 1, so it passes the upstream gradient through unchanged to both inputs.
Gradients for vectorized code
Figure: the same node picture as before, but the inputs, outputs, and gradients are now vectors, so each node's "local gradient" is a Jacobian matrix.
Vectorized operations
A 4096-d input vector passes through f(x) = max(0, x) (elementwise) to give a 4096-d output vector.
Q: what is the size of the Jacobian matrix? [4096 x 4096!]
In practice we process an entire minibatch (e.g. 100) of examples at one time, i.e. the Jacobian would technically be a [409,600 x 409,600] matrix :\
Q2: what does it look like? Because f is elementwise, output i depends only on input i, so the Jacobian is diagonal (1 where x > 0, 0 elsewhere) and is never formed explicitly.
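In code the diagonal Jacobian is never built; the backward pass is just an elementwise mask. A minimal sketch:

```python
import numpy as np

def relu_forward(x):
    return np.maximum(0, x)

def relu_backward(dout, x):
    # Multiplying by the diagonal Jacobian == elementwise masking:
    # the upstream gradient passes through only where x > 0.
    return dout * (x > 0)
```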
Modularized implementation: forward / backward API
Figure: a multiply gate z = x * y (x, y, z are scalars); each gate exposes a forward pass (compute z from x and y) and a backward pass (compute dx and dy from dz).
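A sketch of what the forward/backward API looks like for this gate; the class and method names are illustrative:

```python
class MultiplyGate:
    def forward(self, x, y):
        self.x, self.y = x, y      # cache inputs for the backward pass
        return x * y

    def backward(self, dz):
        dx = self.y * dz           # local gradient dz/dx = y, times upstream dz
        dy = self.x * dz           # local gradient dz/dy = x, times upstream dz
        return dx, dy
```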
Example: Caffe layers
Figure: in a Caffe layer's Backward pass, the local gradient is multiplied by top_diff, the gradient flowing in from the layer above (chain rule).
Outline
• Optimization
• Computational Graphs
• Neural Networks
− Intuition
− Theory
Next: Neural Networks
Neural networks: without the brain stuff
(Before) Linear score function: f = W x
(Now) 2-layer Neural Network: f = W2 max(0, W1 x)
x (3072-d) → W1 → h (100-d) → W2 → s (10 scores)
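A minimal numpy sketch of this 2-layer network's forward pass, with the dimensions from the slide; random weights stand in for learned ones:

```python
import numpy as np

x = np.random.randn(3072)        # input (e.g. a flattened 32x32x3 image)
W1 = np.random.randn(100, 3072)  # first-layer weights
W2 = np.random.randn(10, 100)    # second-layer weights

h = np.maximum(0, W1.dot(x))     # hidden layer: ReLU(W1 x), 100-d
s = W2.dot(h)                    # class scores: W2 h, 10-d
```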
Neural networks: without the brain stuff
Biological Neurons:
› Many different types
› Dendrites can perform complex non-linear computations
› Synapses are not a single weight but a complex non-linear
dynamical system
› Rate code may not be adequate
Figure: activation functions (tanh, ReLU, Maxout, ELU).
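Quick numpy sketches of these activations (Maxout shown with an assumed k = 2 linear pieces):

```python
import numpy as np

def tanh(x):
    return np.tanh(x)

def relu(x):
    return np.maximum(0, x)

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

def maxout(x, W1, b1, W2, b2):
    # Maxout takes the max over k learned linear functions (here k = 2).
    return np.maximum(W1.dot(x) + b1, W2.dot(x) + b2)
```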
Neural networks: Architectures
Fully Connected Layer
Figure: the input x is a 3072 × 1 vector and W is a 10 × 3072 weight matrix, giving a 10 × 1 output. Each output activation is 1 number: the result of taking a dot product between a row of W and the input (a 3072-dimensional dot product).
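The same figure in numpy; shapes follow the slide:

```python
import numpy as np

W = np.random.randn(10, 3072)   # weights: one 3072-d row per output neuron
x = np.random.randn(3072)       # input vector
s = W.dot(x)                    # 10 outputs; s[i] == np.dot(W[i], x)
```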
Outline
• Optimization
• Computational Graphs
• Neural Networks
− Intuition
− Theory
Two non-separable cases
First case
Lesson learned
Looking at the data before choosing the model can be hazardous to your E_out.
Data snooping
Logistic regression - Outline
• The model
• Error measure
• Learning algorithm
Generalized Linear Models
Logistic regression - Outline
• The model
• Error measure
• Learning algorithm
Sources
› https://work.caltech.edu/telecourse
› http://cs231n.stanford.edu/2017/syllabus.html