Comp3314 8. Convolutional Neural Networks

The document provides an overview of Convolutional Neural Networks (CNNs), focusing on the convolution layer's structure and function. It explains how filters convolve over input images to produce activation maps, detailing the effects of filter size, stride, and padding on output dimensions. Additionally, it discusses the benefits of using convolutional layers over fully connected layers to reduce complexity and prevent overfitting.


Convolutional Neural Networks

Fei-Fei Li & Andrej Karpathy & Justin Johnson Lecture 7 - 1 27 Jan 2016
Convolution Layer

A 32x32x3 image: 32 (height) x 32 (width) x 3 (depth).
Convolution Layer

32x32x3 image, 5x5x3 filter.

Convolve the filter with the image, i.e. "slide over the image spatially, computing dot products".
Filters always extend the full depth of the input volume (here: a 5x5x3 filter for a 32x32x3 image).
Each filter position gives 1 number: the result of taking a dot product between the filter and a small 5x5x3 chunk of the image (i.e. a 5*5*3 = 75-dimensional dot product, plus a bias).
Convolving (sliding) the 5x5x3 filter over all spatial locations of the 32x32x3 image produces an activation map of size 28x28x1; the spatial size decreases from 32 to 28.
Consider a second (green) filter: it produces a second 28x28x1 activation map over the same spatial locations.
For example, if we had 6 5x5 filters, we'd get 6 separate activation maps of size 28x28. We stack these up to get a "new image" of size 28x28x6!
Preview: a ConvNet is a sequence of convolution layers, interspersed with activation functions.

e.g. 6 5x5x3 filters + ReLU: 32x32x3 -> CONV, ReLU -> 28x28x6
Continuing the sequence, e.g. with 10 5x5x6 filters next:

32x32x3 -> CONV, ReLU (6 5x5x3 filters) -> 28x28x6 -> CONV, ReLU (10 5x5x6 filters) -> 24x24x10 -> ...

The convolutional filters need to be trained.
Preview [from recent Yann LeCun slides]:
- early layers capture edges and represent the boundary of the object;
- middle layers look more complicated, capturing texture variations;
- late layers look like real object parts (a wheel, the head of a bird) and collect object parts to recognize the real object.

More layers = the network can recognize more complicated objects, in different orientations.
Preview: why use convolutional layers instead of fully connected layers at the beginning?

A fully connected layer has one weight per input-neuron connection, so a huge number of neurons means a huge number of weights. If we applied FC directly to the image, the weight matrix would be too large --> overfitting.

A convolutional layer uses a small filter = few weights, few parameters; it tries to reduce the complexity and avoid overfitting.
A closer look at spatial dimensions:

32x32x3 image, 5x5x3 filter, convolved (slid) over all spatial locations => a 28x28x1 activation map.
A closer look at spatial dimensions:

7x7 input (spatially), assume a 3x3 filter.
Move the filter to the right by 1 pixel at a time; it can only shift to 5 positions in each direction (to keep the filter inside the image)
=> 5x5 output
With the 3x3 filter applied at stride 2, it can only jump to 3 positions in each direction
=> 3x3 output!
Applied with stride 3?

It doesn't fit! You cannot apply a 3x3 filter to a 7x7 input with stride 3.
Output size:
(N - F) / stride + 1

e.g. N = 7, F = 3:
stride 1 => (7 - 3)/1 + 1 = 5
stride 2 => (7 - 3)/2 + 1 = 3
stride 3 => (7 - 3)/3 + 1 = 2.33 :\
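The output-size formula can be checked with a tiny helper (a sketch; `conv_output_size` is my own name, not from the slides):

```python
def conv_output_size(N, F, stride):
    """Spatial output size of a conv layer with no padding: (N - F) / stride + 1."""
    span = N - F
    if span % stride != 0:
        # the filter positions don't tile the input evenly
        raise ValueError(f"stride {stride} doesn't fit: N - F = {span}")
    return span // stride + 1

print(conv_output_size(7, 3, 1))  # 5
print(conv_output_size(7, 3, 2))  # 3
# conv_output_size(7, 3, 3) raises: (7 - 3)/3 + 1 = 2.33, doesn't fit
```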
In practice: common to zero pad the border
Pad with 0s outside the original image => so the output does not get smaller.

e.g. input 7x7, 3x3 filter applied with stride 1, pad with a 1 pixel border => what is the output?
(recall: (N - F) / stride + 1, where N now includes the padding)

7x7 output!

In general, it is common to see CONV layers with stride 1, filters of size FxF, and zero-padding with (F-1)/2 pixels (this will preserve the spatial size).
e.g. F = 3 => zero pad with 1
F = 5 => zero pad with 2
F = 7 => zero pad with 3
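With zero padding the bookkeeping becomes (N + 2*pad - F) / stride + 1; a quick sketch checking that pad = (F-1)/2 preserves the spatial size (the helper name is mine):

```python
def conv_output_size(N, F, stride=1, pad=0):
    """Spatial output size with zero padding: (N + 2*pad - F) / stride + 1."""
    return (N + 2 * pad - F) // stride + 1

# 7x7 input, 3x3 filter, stride 1, 1 pixel border => size preserved
print(conv_output_size(7, 3, stride=1, pad=1))  # 7

# pad = (F - 1) / 2 preserves the spatial size for stride 1
for F in (3, 5, 7):
    assert conv_output_size(32, F, stride=1, pad=(F - 1) // 2) == 32
```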
Remember back to…
E.g. a 32x32 input convolved repeatedly with 5x5 filters shrinks volumes spatially: 32 -> 28 -> 24 ... Shrinking too fast is not good; it doesn't work well.
Examples time:

Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Output volume size:
(32 + 2*2 - 5)/1 + 1 = 32 spatially, so 32x32x10
(in general: (N + 2*pad - F) / stride + 1)
Input volume: 32x32x3
10 5x5 filters with stride 1, pad 2

Number of parameters in this layer?
Each filter has 5*5*3 + 1 = 76 params (+1 for the bias)
=> 76*10 = 760
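The parameter count above can be reproduced with a one-liner (a sketch; `conv_params` is my own helper name):

```python
def conv_params(F, in_depth, num_filters):
    """Parameters of a conv layer: each filter has F*F*in_depth weights plus 1 bias."""
    per_filter = F * F * in_depth + 1
    return per_filter * num_filters

print(conv_params(5, 3, 10))  # 760 = (5*5*3 + 1) * 10
```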
Common settings (K = number of filters, F = filter size, S = stride, P = padding):

K = powers of 2, e.g. 32, 64, 128, 512
- F = 3, S = 1, P = 1
- F = 5, S = 1, P = 2
- F = 5, S = 2, P = ? (whatever fits)
- F = 1, S = 1, P = 0
(btw, 1x1 convolution layers make perfect sense)

A 1x1 CONV with 32 filters on a 56x56x64 input: each filter has size 1x1x64 and performs a 64-dimensional dot product, giving a 56x56x32 output. A 1x1 convolution can reduce or increase the number of channels (here from 64 to 32).
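A 1x1 convolution is just a per-pixel matrix multiply across channels; a minimal numpy sketch with the shapes from the slide (random weights, purely illustrative):

```python
import numpy as np

x = np.random.randn(56, 56, 64)   # input volume: 56x56 spatially, 64 channels
w = np.random.randn(64, 32)       # 32 filters, each of size 1x1x64
b = np.zeros(32)

# At every spatial location, each filter takes a 64-dimensional dot
# product with that pixel's channel vector: a per-pixel matrix multiply.
out = x @ w + b
print(out.shape)  # (56, 56, 32): channels reduced from 64 to 32
```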
The brain/neuron view of CONV Layer

32x32x3 image, 5x5x3 filter. Each output is 1 number: the result of taking a dot product between the filter and this part of the image (i.e. a 5*5*3 = 75-dimensional dot product).

It's just a neuron with local connectivity...
An activation map is a 28x28 sheet of neuron outputs:
1. Each is connected to a small region in the input
2. All of them share parameters

"5x5 filter" -> "5x5 receptive field for each neuron" (the receptive field = the window size)
E.g. with 5 filters, the CONV layer consists of neurons arranged in a 3D grid (28x28x5).

There will be 5 different neurons all looking at the same region in the input volume, each with different weights.
two more layers to go: POOL/FC

Pooling layer
- makes the representations smaller and more manageable
- operates over each activation map independently:

MAX POOLING
(another type of pooling: average pooling)

Single depth slice (4x4):

1 1 2 4
5 6 7 8
3 2 1 0
1 2 3 4

Max pooling = take the max value inside each window. With 2x2 filters and stride 2:

6 8
3 4

The stride is chosen > 1 so the output gets smaller; stride 2 halves the spatial size.
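The max-pooling example above can be reproduced in a few lines of numpy (a sketch of the 2x2/stride-2 case only):

```python
import numpy as np

x = np.array([[1, 1, 2, 4],
              [5, 6, 7, 8],
              [3, 2, 1, 0],
              [1, 2, 3, 4]])

# 2x2 max pooling with stride 2: carve the 4x4 slice into
# non-overlapping 2x2 windows and keep the max of each.
out = x.reshape(2, 2, 2, 2).max(axis=(1, 3))
print(out)
# [[6 8]
#  [3 4]]
```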
Common settings:

F = 2, S = 2
F = 3, S = 2

Fully Connected Layer (FC layer)
- Contains neurons that connect to the entire input volume, as in ordinary Neural Networks

Notes: conv layers are designed for image classification, and pooling layers likewise operate on the image grid; the second-to-last layer is usually a pooling layer. FC is more general and can be used for any input (not only images).
Case Study: LeNet-5
[LeCun et al., 1998]

Detects handwritten digits; it was used to read handwritten zip codes at the post office. 7 layers; the final layer size = the number of classes.
Conv filters were 5x5, applied at stride 1
Subsampling (pooling) layers were 2x2, applied at stride 2
No zero padding, which is okay because the image boundary is mostly just background noise.
i.e. architecture is [CONV-POOL-CONV-POOL-CONV-FC]
Case Study: AlexNet
[Krizhevsky et al. 2012]

Input: 227x227x3 images
First layer (CONV1): 96 11x11 filters applied at stride 4

Q: what is the output volume size? Hint: (227-11)/4+1 = 55
=> Output volume [55x55x96]

Q: What is the total number of parameters in this layer?
Parameters: (11*11*3)*96 = 35K
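The CONV1 numbers check out with a couple of lines (the slide's 35K rounds the weight count and ignores the 96 biases):

```python
N, F, stride, depth, num_filters = 227, 11, 4, 3, 96

out = (N - F) // stride + 1
print(out)       # 55 -> output volume 55x55x96

weights = F * F * depth * num_filters
print(weights)   # 34848, i.e. ~35K (excluding the 96 biases)
```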
After CONV1: 55x55x96
Second layer (POOL1): 3x3 filters applied at stride 2

Q: what is the output volume size? Hint: (55-3)/2+1 = 27
=> Output volume: 27x27x96

Q: what is the number of parameters in this layer?
Parameters: 0! (pooling layers have no weights)

After POOL1: 27x27x96
...
Full (simplified) AlexNet architecture:

[227x227x3] INPUT
[55x55x96] CONV1: 96 11x11 filters at stride 4, pad 0
[27x27x96] MAX POOL1: 3x3 filters at stride 2
[27x27x96] NORM1: Normalization layer
[27x27x256] CONV2: 256 5x5 filters at stride 1, pad 2 (larger depth)
[13x13x256] MAX POOL2: 3x3 filters at stride 2
[13x13x256] NORM2: Normalization layer
[13x13x384] CONV3: 384 3x3 filters at stride 1, pad 1
[13x13x384] CONV4: 384 3x3 filters at stride 1, pad 1
[13x13x256] CONV5: 256 3x3 filters at stride 1, pad 1
[6x6x256] MAX POOL3: 3x3 filters at stride 2
[4096] FC6: 4096 neurons
[4096] FC7: 4096 neurons
[1000] FC8: 1000 neurons (class scores)
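The spatial sizes in this list can be reproduced by folding the output-size formula over the layer configs (a sketch; the layer tuples are my own encoding, not the paper's):

```python
def out_size(N, F, stride, pad=0):
    return (N + 2 * pad - F) // stride + 1

size, depth = 227, 3
# (name, filter size F, stride S, pad P, num filters K; K=None for pooling)
layers = [
    ("CONV1", 11, 4, 0, 96),  ("POOL1", 3, 2, 0, None),
    ("CONV2", 5, 1, 2, 256),  ("POOL2", 3, 2, 0, None),
    ("CONV3", 3, 1, 1, 384),  ("CONV4", 3, 1, 1, 384),
    ("CONV5", 3, 1, 1, 256),  ("POOL3", 3, 2, 0, None),
]
for name, F, S, P, K in layers:
    size = out_size(size, F, S, P)
    depth = K if K is not None else depth   # pooling keeps the depth
    print(f"{name}: {size}x{size}x{depth}")
# ends at 6x6x256, matching MAX POOL3 above
```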
Details/Retrospectives:
- first use of ReLU
- used Norm layers (not common anymore)
- heavy data augmentation
- dropout 0.5
- batch size 128
- SGD Momentum 0.9
- Learning rate 1e-2, reduced by 10 manually when val accuracy plateaus
- L2 weight decay 5e-4
- 7 CNN ensemble: 18.2% -> 15.4%
Case Study: ResNet [He et al., 2015]
ILSVRC 2015 winner (3.6% top 5 error)

(Slides from Kaiming He's presentation: https://www.youtube.com/watch?v=1PGLj-uKT1w)

With residual connections, adding more layers keeps decreasing the error.
2-3 weeks of training on an 8-GPU machine.

At runtime: faster than a VGGNet! (even though it has 8x more layers)
Case Study: ResNet [He et al., 2015]
The 224x224x3 input is downsampled aggressively early in the network: the spatial dimension quickly becomes only 56x56!
The identity shortcut gives a path with a non-zero gradient that can propagate directly through x.
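The shortcut can be sketched in a few lines of numpy (an illustration of the F(x) + x idea, not the paper's exact block; `residual_branch` stands in for the stacked conv layers):

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, residual_branch):
    # Output is relu(F(x) + x): even if the branch outputs ~0 (or its
    # gradient vanishes), the identity path passes x, and its gradient,
    # straight through.
    return relu(residual_branch(x) + x)

x = np.random.randn(8, 64)
zero_branch = lambda t: np.zeros_like(t)   # degenerate branch: F(x) = 0
out = residual_block(x, zero_branch)
print(np.allclose(out, relu(x)))  # True: the block reduces to identity + ReLU
```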
The bottleneck design: a 1x1 conv first reduces the number of channels (e.g. from 256 to 64) to cut complexity; after the 3x3 conv, another 1x1 conv has to restore the original channel count.
Summary

- ConvNets stack CONV, POOL, FC layers
- Trend towards smaller filters and deeper architectures
- Trend towards getting rid of POOL/FC layers (just CONV)
- Typical architectures look like
  [(CONV-RELU)*N-POOL?]*M-(FC-RELU)*K,SOFTMAX
  where N is usually up to ~5, M is large, 0 <= K <= 2.
- but recent advances such as ResNet/GoogLeNet challenge this paradigm
