Comp3314 8. Convolutional Neural Networks
Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 1 27 Jan 2016
Convolution Layer
A 32x32x3 image: 32 (height) x 32 (width) x 3 (depth).
Convolution Layer
A 32x32x3 image and a 5x5x3 filter.
Convolution Layer
Filters always extend the full depth of the input volume: a 5x5x3 filter on a 32x32x3 image.
Convolution Layer
Placing the 5x5x3 filter on a small 5x5x3 chunk of the 32x32x3 image gives 1 number: the dot product between the filter and the chunk (i.e. a 5*5*3 = 75-dimensional dot product) plus a bias.
Convolution Layer
Convolve (slide) the 5x5x3 filter over all spatial locations of the 32x32x3 image. Each position yields one number, producing a 28x28x1 activation map: the spatial size decreases from 32 to 28.
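The sliding dot product above can be sketched as a naive loop (a minimal sketch with random data, not an efficient implementation):

```python
import numpy as np

# Shapes from the slide: 32x32x3 image, one 5x5x3 filter.
image = np.random.randn(32, 32, 3)
filt = np.random.randn(5, 5, 3)
bias = 0.1  # arbitrary illustrative bias

H, W, _ = image.shape
F = filt.shape[0]
out = np.zeros((H - F + 1, W - F + 1))  # 28x28 activation map

# Slide the filter over all spatial locations (stride 1, no padding).
for i in range(out.shape[0]):
    for j in range(out.shape[1]):
        patch = image[i:i + F, j:j + F, :]       # a 5x5x3 chunk of the image
        out[i, j] = np.sum(patch * filt) + bias  # 75-dim dot product + bias

print(out.shape)  # (28, 28)
```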
Convolution Layer
Consider a second (green) filter: convolving a second 5x5x3 filter over the same 32x32x3 image produces a second 28x28x1 activation map.
For example, if we had 6 separate 5x5x3 filters, we'd get 6 separate activation maps: applying CONV, ReLU with these 6 filters turns the 32x32x3 input into a new 28x28x6 volume.
Preview: a ConvNet is a sequence of convolutional layers, interspersed with activation functions, e.g.:
32x32x3 input -> CONV, ReLU (6 filters of 5x5x3) -> 28x28x6 -> CONV, ReLU (10 filters of 5x5x6) -> 24x24x10 -> ...
The convolutional filters need to be trained.
Preview [from recent Yann LeCun slides]: early layers capture edges and represent the boundary of the object; middle layers look more complicated and capture texture variations; later layers look like real objects and collect object parts (e.g. a wheel, in different orientations) in order to recognize the real object. More layers = can recognize more complicated objects.
Preview: why use convolutional layers instead of fully connected layers in the beginning?
A fully connected layer has a weight for every input-output pair, so a huge number of neurons means a huge number of weights. If we directly apply an FC layer to the image, the weight matrix will be too huge --> overfitting.
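A quick parameter count makes the contrast concrete (the hidden-layer width of 1000 is a hypothetical choice for illustration):

```python
# Compare parameter counts for a 32x32x3 input.
H, W, C = 32, 32, 3

# Fully connected: every output neuron sees all 32*32*3 = 3072 inputs.
n_hidden = 1000                      # hypothetical hidden-layer width
fc_params = (H * W * C) * n_hidden   # weights only, ignoring biases
print(fc_params)                     # 3072000

# Conv layer: 6 filters of size 5x5x3, weights shared across all locations.
conv_params = 6 * (5 * 5 * C + 1)    # +1 bias per filter
print(conv_params)                   # 456
```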
A closer look at spatial dimensions:
A 32x32x3 image convolved with a 5x5x3 filter gives a 28x28x1 activation map.
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 1. Moving to the right by 1 pixel at a time, the filter can only shift 5 times while staying inside the image => 5x5 output.
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 2. The filter can only jump 3 times => 3x3 output!
A closer look at spatial dimensions:
7x7 input (spatially), assume a 3x3 filter applied with stride 3? It doesn't fit! We cannot apply a 3x3 filter to a 7x7 input with stride 3.
Output size: (N - F) / stride + 1
e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\ (doesn't fit)
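The output-size formula is easy to wrap in a small helper (a sketch; the `None` return for a non-fitting stride is our own convention):

```python
def conv_output_size(N, F, stride):
    """Spatial output size (N - F) / stride + 1; None if the filter doesn't fit."""
    if (N - F) % stride != 0:
        return None  # e.g. N=7, F=3, stride=3 -> 2.33, doesn't fit
    return (N - F) // stride + 1

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
print(conv_output_size(7, 3, 3))  # None
```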
In practice: Common to zero pad the border
Pad 0s outside of the original image to keep the output from shrinking.
e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1 pixel border => what is the output?
(recall: (N - F)/stride + 1, now with N = 7 + 2 = 9: (9 - 3)/1 + 1 = 7)
7x7 output!
In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding of (F-1)/2, which preserves the spatial size:
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
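The size-preserving padding rule can be checked numerically (a minimal sketch on a single-channel 7x7 input):

```python
import numpy as np

def pad_for_same(F):
    """Zero-padding that preserves spatial size at stride 1 (odd F)."""
    return (F - 1) // 2

x = np.random.randn(7, 7)              # one 7x7 activation map
P = pad_for_same(3)                    # F = 3 -> pad with 1
xp = np.pad(x, P)                      # 9x9 after padding with zeros
out_len = (xp.shape[0] - 3) // 1 + 1   # (9 - 3)/1 + 1 = 7: size preserved
print(P, out_len)                      # 1 7
```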
Remember back to the preview: a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially (32 -> 28 -> 24 -> ...). Shrinking too fast is not good and doesn't work well; zero padding avoids this.
Examples time:
Common settings:
(Btw, 1x1 convolution layers make perfect sense.)
A 1x1 CONV with 32 filters on a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product, giving a 56x56x32 output. A 1x1 convolution can reduce or increase the number of channels (here from 64 to 32).
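Because a 1x1 convolution is just a per-pixel dot product over channels, it can be sketched as a matrix multiply (random weights, bias omitted for brevity):

```python
import numpy as np

# 56x56x64 input, 32 filters of size 1x1x64 (shapes from the slide).
x = np.random.randn(56, 56, 64)
w = np.random.randn(64, 32)  # each column is one 1x1x64 filter

# A 1x1 conv is a 64-dimensional dot product at every spatial location,
# i.e. a per-pixel linear map over channels: 64 -> 32.
y = x @ w
print(y.shape)  # (56, 56, 32)
```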
The brain/neuron view of the CONV Layer
A 32x32x3 image and a 5x5x3 filter: each output is 1 number, the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product).
The brain/neuron view of the CONV Layer
"5x5 filter" -> "5x5 receptive field for each neuron" (the window size each neuron looks at).
The brain/neuron view of the CONV Layer
E.g. with 5 filters (each with different weights), the CONV layer consists of neurons arranged in a 3D grid (28x28x5).
two more layers to go: POOL/FC
Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently

MAX POOLING (another type of pooling: average pooling)

Common settings:
F = 2, S = 2
F = 3, S = 2
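Max pooling with F = 2, S = 2 can be sketched with a reshape trick on a single activation map (the 4x4 input values here are illustrative):

```python
import numpy as np

# One 4x4 activation map.
x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# Split into non-overlapping 2x2 blocks, then take the max of each block.
out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(out)  # [[6 8]
            #  [3 4]]
```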
Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks.
Conv layers are designed for image classification, and pooling layers are likewise designed for images; usually the 2nd-to-last layer is a pooling layer. The FC layer is more general and can be used for any input, not only images.
Case Study: LeNet-5
[LeCun et al., 1998]
Detects handwritten digits; it was used to read handwritten zip codes at the post office. 7 layers.
Conv filters were 5x5, applied at stride 1.
Subsampling (pooling) layers were 2x2, applied at stride 2.
No zero padding; that's okay because the boundary pixels are mostly just background noise.
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]
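The spatial sizes through LeNet-5's conv/pool stack can be traced with the output-size formula (no padding, per the slide):

```python
def out_size(N, F, S):
    """Spatial output size for an NxN input, FxF filter, stride S, no padding."""
    return (N - F) // S + 1

# 5x5 conv at stride 1, then 2x2 subsampling (pooling) at stride 2, twice.
sizes = [32]  # 32x32 input
for F, S in [(5, 1), (2, 2), (5, 1), (2, 2)]:
    sizes.append(out_size(sizes[-1], F, S))
print(sizes)  # [32, 28, 14, 10, 5]
```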
Case Study: AlexNet
[Krizhevsky et al. 2012]
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)
(slide from Kaiming He's recent presentation)
More layers => the error decreased.
At runtime, ResNet is faster than a VGGNet! (even though it has 8x more layers)
The 224x224x3 input is quickly reduced to a spatial dimension of only 56x56!
Summary