Convolutional Neural Networks
Eunbyung Park
Assistant Professor
Department of Artificial Intelligence
Eunbyung Park (silverbottlep.github.io)
Convolution
1D Convolution
• Convolution is a mathematical operation on two functions ($f$, $g$) that produces a third function $f * g$

Continuous: $(f * g)(t) := \int_{-\infty}^{\infty} f(t - \tau)\, g(\tau)\, d\tau$

Discrete: $(f * g)[t] := \sum_{\tau} f[t - \tau]\, g[\tau]$
1D Convolution $(f * g)[t] := \sum_{\tau} f[t - \tau]\, g[\tau]$
• Flip the filter and slide it along the signal
• [Figure: signal $f = [1, 3, 2, -1]$, filter $g = [1, 2, 1]$; the flipped filter slides over $f$, producing $(f * g)[t]$ step by step]
  - $0\cdot1 + 0\cdot2 + 1\cdot1 = 1$
  - $0\cdot1 + 1\cdot2 + 3\cdot1 = 5$
  - $1\cdot1 + 3\cdot2 + 2\cdot1 = 9$
  - $3\cdot1 + 2\cdot2 + (-1)\cdot1 = 6$
  - $2\cdot1 + (-1)\cdot2 + 0\cdot1 = 0$
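The step-by-step computation above can be checked with a direct implementation of the discrete convolution (a plain-Python sketch; the signal and filter values are taken from the slides):

```python
def conv1d(f, g):
    """Full discrete convolution: (f*g)[t] = sum_tau f[t - tau] * g[tau]."""
    n = len(f) + len(g) - 1
    out = []
    for t in range(n):
        s = 0
        for tau in range(len(g)):
            if 0 <= t - tau < len(f):  # treat out-of-range samples as zero
                s += f[t - tau] * g[tau]
        out.append(s)
    return out

f = [1, 3, 2, -1]   # signal from the slides
g = [1, 2, 1]       # filter from the slides
print(conv1d(f, g))  # -> [1, 5, 9, 6, 0, -1]
```

The first five entries match the five positions worked out on the slides; the full convolution also includes the final position where only the last sample of $f$ overlaps the filter.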
1D Convolution
• Example
[Animation: Convolution – Wikipedia]
1D Convolution
• Gaussian filter
[Figure: input signal $f(t)$, Gaussian filter $g(t)$, and the smoothed output $f(t) * g(t)$]
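Gaussian filtering is just the 1D convolution above with a bell-shaped filter. A minimal sketch (the signal, sigma, and radius are illustrative choices, not from the slides):

```python
import math

def gaussian_kernel(sigma, radius):
    """Discrete Gaussian filter g[t], normalized so its weights sum to 1."""
    g = [math.exp(-(t * t) / (2 * sigma * sigma)) for t in range(-radius, radius + 1)]
    s = sum(g)
    return [v / s for v in g]

def smooth(f, g):
    """'Same'-size convolution of signal f with a symmetric filter g."""
    r = len(g) // 2
    out = []
    for t in range(len(f)):
        s = 0.0
        for k, w in enumerate(g):
            idx = t + k - r
            if 0 <= idx < len(f):
                s += f[idx] * w
        out.append(s)
    return out

g = gaussian_kernel(sigma=1.0, radius=2)
noisy = [0, 0, 1, 5, 1, 0, 0]
print(smooth(noisy, g))  # the sharp peak is spread out over its neighbours
```

Because the weights sum to 1, smoothing preserves the overall signal level while damping sharp peaks.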
2D Convolution

$(f * g)(s, t) := \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(s - \tau_1, t - \tau_2)\, g(\tau_1, \tau_2)\, d\tau_1\, d\tau_2$

$(f * g)[s, t] := \sum_{\tau_1} \sum_{\tau_2} f[s - \tau_1, t - \tau_2]\, g[\tau_1, \tau_2]$
2D Convolution
• One input channel, e.g. a grayscale image
• Padding=1, stride=1
• [Figure: a 3×3 filter slides over the zero-padded 5×5 input; the first output values are 16, 28, 24, …]
2D Convolution
[Figure: the input convolved with a set of filters produces the feature maps]
2D Convolution
• Input_channel=1, output_channel=1, padding=1, stride=1
• [Figure: the 3×3 filter slides over the zero-padded input; the first output row begins 16, 28, 24, …]
2D Convolution
• Input_channel=1, output_channel=1, padding=1, stride=1

Input (5×5, zero-padded to 7×7):
1 3 2 3 3
3 1 2 1 1
3 3 3 1 2
2 2 1 2 1
2 3 2 1 2

Filter (3×3):
1 3 2
1 3 3
3 1 1

Output (5×5):
16 28 24 28 16
27 41 38 37 21
33 40 33 25 18
32 40 37 29 17
25 27 21 20 12
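The full example can be reproduced with a direct implementation (a plain-Python sketch; note that, as is conventional in CNNs, the filter is not flipped, so this is technically cross-correlation):

```python
def conv2d(x, w, padding=1, stride=1):
    """2D convolution (CNN convention, no filter flip) with zero padding."""
    h, wi = len(x), len(x[0])
    k = len(w)
    p = padding
    # zero-pad the input on all sides
    xp = [[0] * (wi + 2 * p) for _ in range(h + 2 * p)]
    for i in range(h):
        for j in range(wi):
            xp[i + p][j + p] = x[i][j]
    oh = (h + 2 * p - k) // stride + 1
    ow = (wi + 2 * p - k) // stride + 1
    out = [[0] * ow for _ in range(oh)]
    for i in range(oh):
        for j in range(ow):
            out[i][j] = sum(
                xp[i * stride + a][j * stride + b] * w[a][b]
                for a in range(k) for b in range(k)
            )
    return out

x = [[1, 3, 2, 3, 3],
     [3, 1, 2, 1, 1],
     [3, 3, 3, 1, 2],
     [2, 2, 1, 2, 1],
     [2, 3, 2, 1, 2]]
w = [[1, 3, 2],
     [1, 3, 3],
     [3, 1, 1]]
print(conv2d(x, w, padding=1, stride=1)[0])  # -> [16, 28, 24, 28, 16]
```

The same function with `stride=2` reproduces the 3×3 strided output, and with `padding=2` the 7×7 output shown on the following slides.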
2D Convolution
• Input_channel=1, output_channel=1, padding=1, stride=2
• [Figure: the filter now moves two pixels at a time; the output values 16, 24, 16, 33, … are produced one by one]
2D Convolution
• Input_channel=1, output_channel=1, padding=1, stride=2

Same 5×5 input (zero-padded) and 3×3 filter as before. Output (3×3):
16 24 16
33 33 18
25 21 12
2D Convolution
• Input_channel=1, output_channel=1, padding=2, stride=1

Same 5×5 input, zero-padded to 9×9, same 3×3 filter. Output (7×7):
 1  4  8 14 12 12  9
 6 16 28 24 28 16  6
14 27 41 38 37 21 10
17 33 40 33 25 18  6
14 32 40 37 29 17  9
10 25 27 21 20 12  3
 4 12 15 11  9  7  2
2D Convolution
• Input_channel=3, output_channel=1, padding=1, stride=1
• [Figure: three zero-padded input channels; one filter with three channels slides over all of them, and the per-channel results are summed into a single output feature map]
2D Convolution
• Input_channel=3, output_channel=2, padding=1, stride=1
• [Figure: two filters, each with three channels, slide over the three zero-padded input channels; each filter produces one of the two output feature maps]
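With multiple channels the filter bank has shape (output_channels, input_channels, k, k): each output channel has its own filter, and each filter spans all input channels. A quick parameter count for the configurations on these slides (bias terms included, one per output channel, which is the usual convention):

```python
def conv_params(c_in, c_out, k, bias=True):
    """Number of learnable parameters in a 2D convolution layer."""
    n = c_out * c_in * k * k  # one k x k kernel per (input, output) channel pair
    if bias:
        n += c_out            # one bias per output channel
    return n

print(conv_params(3, 1, 3))    # -> 28    (1*3*3*3 weights + 1 bias)
print(conv_params(3, 2, 3))    # -> 56    (2*3*3*3 weights + 2 biases)
print(conv_params(64, 64, 3))  # -> 36928 (64*64*3*3 weights + 64 biases)
```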
2D Convolution
• Input_channel=64, output_channel=64, kernel_size=3, padding=1, stride=1
• [Figure: 64 filters, each of size 64×3×3; each filter produces one of the 64 output feature maps]
Convolutions in PyTorch
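A minimal sketch of the layer above in PyTorch (the batch size and 32×32 spatial size are illustrative choices, not from the slides):

```python
import torch
import torch.nn as nn

# convolution layer from the previous slides: 64 -> 64 channels, 3x3 kernel
conv = nn.Conv2d(in_channels=64, out_channels=64,
                 kernel_size=3, padding=1, stride=1)

x = torch.randn(1, 64, 32, 32)  # (batch, channels, height, width)
y = conv(x)
print(y.shape)  # padding=1 with a 3x3 kernel and stride=1 preserves spatial size
```

The layer's weight tensor has shape (output_channels, input_channels, kernel_size, kernel_size), matching the filter-bank picture on the previous slides.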
Max Pooling
• Takes the maximum value within each window
• Used to reduce the size of feature maps
• Example) stride=2, padding=1, 3×3 window

Input (5×5, zero-padded to 7×7):
1 3 2 3 3
3 1 2 1 1
3 3 3 1 2
2 2 1 2 1
2 3 2 1 2

[Figure: the 3×3 window slides with stride 2; the first output row is 3, 3, 3]
Max Pooling in PyTorch
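The pooling example above can be reproduced directly in PyTorch (a minimal sketch using the same 5×5 input as the previous slides):

```python
import torch
import torch.nn as nn

pool = nn.MaxPool2d(kernel_size=3, stride=2, padding=1)

# the 5x5 input from the max pooling example
x = torch.tensor([[1, 3, 2, 3, 3],
                  [3, 1, 2, 1, 1],
                  [3, 3, 3, 1, 2],
                  [2, 2, 1, 2, 1],
                  [2, 3, 2, 1, 2]], dtype=torch.float32)

y = pool(x.unsqueeze(0).unsqueeze(0))  # add batch and channel dimensions
print(y.squeeze())  # 3x3 output: the feature map is halved in each dimension
```

Note one subtlety: `nn.MaxPool2d` pads with negative infinity rather than zeros, which makes no difference here because all inputs are positive.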
AlexNet
[Figure: AlexNet architecture; Dropout is applied in the fully connected layers]
(Understanding AlexNet | LearnOpenCV)
Fully Connected Layer vs Convolutional Layer
• Translation equivariance and parameter sharing
• Fully connected: a 32-channel 64×64 feature map is flattened into a vector of $32 \cdot 64 \cdot 64 = 131072$ values, multiplied as $Wx$ with $W \in \mathbb{R}^{131072 \times 131072}$, then reshaped back
[Figure: Flatten → FC Layer → Reshape]
Fully Connected Layer vs Convolutional Layer
• Translation equivariance and parameter sharing
• Convolutional: a 3×3 convolution mapping 32 channels to 32 channels reuses the same weights $W \in \mathbb{R}^{32 \times 32 \times 3 \times 3}$ at every spatial location of the 64×64 feature map
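The parameter-sharing argument above is easy to quantify with the numbers from the slides:

```python
# Fully connected layer on the flattened 32x64x64 feature map
d = 32 * 64 * 64                 # 131072 values after flattening
fc_params = d * d                # weight matrix W in R^{131072 x 131072}

# 3x3 convolution, 32 -> 32 channels, shared across all spatial positions
conv_params = 32 * 32 * 3 * 3    # W in R^{32 x 32 x 3 x 3}

print(fc_params)    # -> 17179869184 (~17 billion parameters)
print(conv_params)  # -> 9216
```

The convolutional layer uses roughly a million times fewer parameters, precisely because the same small filter is shared across every spatial location.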
Visualization of Learned Filter
• First layer conv filters
Applied Deep Learning - Part 4: Convolutional Neural Networks | by Arden Dertat | Towards Data Science
Visualization of Learned Feature Maps
Applied Deep Learning - Part 4: Convolutional Neural Networks | by Arden Dertat | Towards Data Science
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC)
ILSVRC
• ImageNet is an image database organized according to the WordNet
hierarchy (nouns)
• 1000 object classes
• About 1.2M training images, 50K validation images, 100K test images
ILSVRC winners by year:
• 2012: AlexNet
• 2013: ZFNet
• 2014: GoogLeNet (VGGNet runner-up)
• 2015: ResNet
• 2016: Trimps-Soushen (Inception + WRN)
• 2017: SENet
Architecture comparison of AlexNet, VGGNet, ResNet, Inception, DenseNet | by Khush Patel | Towards Data Science
VGGNet
[Figure: VGG architecture]
Architecture comparison of AlexNet, VGGNet, ResNet, Inception, DenseNet | by Khush Patel | Towards Data Science
GoogLeNet
• Winner of ILSVRC 2014
• Also called ‘Inception’
• [Figure: Inception module — parallel convolutions and max pooling whose outputs are joined by concatenation]
Dropout
• A simple way to train deep neural networks for improving generalization performance
• Avoiding co-adaptation: a hidden unit cannot rely on other hidden units being present
• Model averaging
Improving neural networks by preventing co-adaptation of feature detectors, hinton et al, arXiv 2012
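The idea can be sketched in a few lines. This is the now-standard "inverted" dropout variant, which scales the surviving units by $1/(1-p)$ at training time so that no rescaling is needed at test time (the original paper instead scales the weights at test time):

```python
import random

def dropout(x, p, training=True):
    """Inverted dropout: zero each unit with probability p, scale survivors."""
    if not training or p == 0.0:
        return list(x)  # identity at test time
    return [0.0 if random.random() < p else v / (1.0 - p) for v in x]

random.seed(0)
h = [0.5, -1.2, 0.8, 2.0, -0.3]
print(dropout(h, p=0.5))  # each unit is either zeroed or doubled
```

Because each unit is randomly absent during training, no unit can rely on specific other units being present.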
Stochastic Depth (a.k.a. DropPath)
• Train short networks, then use the full deep network at test time
• During training, randomly drop a subset of layers and bypass them with the identity function
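A simplified sketch of one residual block with stochastic depth (scalar activations for clarity; the published method additionally rescales the residual branch by its survival probability at test time, which is omitted here):

```python
import random

def stochastic_depth_block(x, residual_fn, survival_prob, training=True):
    """Residual block that is sometimes skipped entirely during training."""
    if training and random.random() > survival_prob:
        return x                 # block dropped: identity shortcut only
    return x + residual_fn(x)    # x_{l+1} = x_l + f(x_l)

random.seed(0)
y = stochastic_depth_block(1.0, lambda v: 0.5 * v, survival_prob=0.8)
print(y)  # either 1.0 (block dropped) or 1.5 (block kept)
```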
Batch Normalization
• Normalize each feature $x_1, x_2, \dots$ over the $N$ examples in a batch:

$\mu_1 = \frac{1}{N}\sum_{i=1}^{N} x_1^{(i)}$ (mean), then center: $x_1^{(i)} := x_1^{(i)} - \mu_1$

$\sigma_1^2 = \frac{1}{N}\sum_{i=1}^{N} \big(x_1^{(i)}\big)^2$ (variance of the centered values), then scale: $x_1^{(i)} := x_1^{(i)} / \sigma_1$

After normalization each feature is standardized: $x_1 \sim N(0, 1)$, $x_2 \sim N(0, 1)$.

In general, with mean $\mu$ and standard deviation $\sigma$:

$z = \dfrac{x - \mu}{\sigma}, \qquad z \sim N(0, 1)$
Batch Normalization
• When inputs are un-normalized, the loss surface is more skewed (elongated)
• This happens when the input feature scales differ greatly from one another
Batch Normalization
• Normalizing inputs (and also hidden units) based on mini-batch statistics
• Computing mean and variance from the current batch
• During testing, the batch may be too small for reliable statistics (e.g. batch size 1), so we instead use the mean and variance accumulated during the training phase
Batch normalization: accelerating deep network training by reducing internal covariate shift, Ioffe et al, ICML 2015
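The core normalization step can be sketched in a few lines (scalar features for clarity; the learnable scale and shift parameters $\gamma, \beta$ of the full method are omitted, and the batch values are illustrative):

```python
import math

def batch_norm(xs, eps=1e-5):
    """Normalize a batch of scalars using its own mean and variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return [(x - mu) / math.sqrt(var + eps) for x in xs]

batch = [10.0, 12.0, 9.0, 14.0, 11.0]
z = batch_norm(batch)
# after normalization the batch has (approximately) zero mean and unit variance
```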
Batch Normalization
Batch normalization: accelerating deep network training by reducing internal covariate shift, Ioffe et al, ICML 2015
Batch Normalization in CNN
$\mu \in \mathbb{R}^{?}, \quad \sigma^2 \in \mathbb{R}^{?}$
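For the question on the slide: in a CNN, batch-norm statistics are computed per channel, averaging over the batch and both spatial dimensions, so $\mu, \sigma^2 \in \mathbb{R}^{C}$. A sketch with plain nested lists (the batch, channel, and spatial sizes are illustrative):

```python
def channel_mean(x):
    """Per-channel mean of a batch of feature maps x[n][c][h][w]."""
    n, c = len(x), len(x[0])
    h, w = len(x[0][0]), len(x[0][0][0])
    mu = [0.0] * c
    for ch in range(c):
        total = 0.0
        for img in x:           # average over the batch...
            for row in img[ch]:  # ...and over both spatial dimensions
                total += sum(row)
        mu[ch] = total / (n * h * w)
    return mu

# batch of 2 images, 3 channels, 2x2 spatial; channel c is filled with value c
x = [[[[c] * 2 for _ in range(2)] for c in range(3)] for _ in range(2)]
print(channel_mean(x))  # -> [0.0, 1.0, 2.0]: one statistic per channel
```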
Why Batch Normalization Works?
1. Normalization usually makes loss surface less ‘skewed’
2. BN may reduce the internal covariate shift
• [1502.03167] Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift (arxiv.org)
$\mu^l = \frac{1}{H}\sum_{i=1}^{H} a_i^l, \qquad \sigma^l = \sqrt{\frac{1}{H}\sum_{i=1}^{H} \big(a_i^l - \mu^l\big)^2}$