6.874, 6.802, 20.390, 20.490, HST.506
Computational Systems Biology
Deep Learning in the Life Sciences
Lecture 3:
Convolutional Neural Networks
Prof. Manolis Kellis
http://mit6874.github.io
Slides credit: 6.S191, Dana Erlich, Param Vir Singh,
David Gifford, Alexander Amini, Ava Soleimany
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
1a. What do you see, and how?
Can we teach machines to see?
What do you see?
How do you see?
How can we help
computers see?
What computers ‘see’: Images as Numbers
What the computer "sees"
Levin, Image Processing & Computer Vision
An image is just a matrix of numbers in [0,255], e.g., 1080x1080x3 for an RGB image.
Question: is this Lincoln? Washington? Jefferson? Obama?
How can the computer answer this question?
What you see
Input Image | Input Image + values | Pixel intensity values
(“pix-el” = picture-element)
What you both see
Can I just do classification on the 1,166,400-long image vector directly?
No. Instead: exploit image spatial structure. Learn patches. Build them up
1b. Classical machine vision roots
in study of human/animal brains
Inspiration: human/animal visual cortex
• Layers of neurons: pixels, edges, shapes, primitives, scenes
• E.g. Layer 4 responds to bands w/ given slant, contrasting edges
Primitives: Neurons & action potentials
•Chemical accumulation across
dendritic connections
•Pre-synaptic axon → post-synaptic dendrite → neuronal cell body
•Each neuron receives multiple
signals from its many dendrites
•When threshold crossed, it fires
•Its axon then sends outgoing
signal to downstream neurons
•Weak stimuli ignored
•Sufficiently strong stimuli cross the activation threshold
•Non-linearity within
each neuronal level
•Neurons connected into circuits (neural networks): emergent properties, learning, memory
•Simple primitives arranged in simple, repetitive, and extremely large networks
•86 billion neurons, each connecting to ~10k neurons: ~1 quadrillion (10^15) connections
Abstraction layers: edges, bars, dir., shapes, objects, scenes
LGN: Small dots
V1: Orientation,
disparity, some color
V4: Color, basic shapes,
2D/3D, curvature
VTC: Complex features and objects (VTC: ventral temporal cortex)
•Abstraction layers ≈ visual cortex layers
•Complex concepts from simple parts, hierarchy
•Primitives of visual concepts encoded in neuronal connections in early cortical layers
• Massive recent expanse of human brain has re-used a
relatively simple but general learning architecture
General “learning machine”, reused widely
• Hearing, taste, smell, sight, touch all re-use similar learning architecture
Motor Cortex
Visual Cortex • Interchangeable
circuitry
• Auditory cortex
learns to ‘see’ if
sent visual signals
• Tasks of injured areas shift to uninjured areas
• Not fully-general learning, but well-adapted to our world
• Humans co-opted this circuitry to many new applications
• Modern tasks accessible to any homo sapiens (<70k years)
• ML primitives not too different from animals: more to come?
[Figure: human vs. chimp brain; hardware expansion]
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
2a. Spatial structure
for image recognition
Using Spatial Structure
Idea: connect
patches of input to
neurons in hidden
layer.
Neuron connected
to region of input.
Only “sees” these values.
Input: 2D
image.
Array of pixel
values
Using Spatial Structure
Connect patch in input layer to a single neuron in subsequent layer.
Use a sliding window to define connections.
How can we weight the patch to detect particular features?
Feature Extraction with Convolution
- Filter of size 4x4: 16 different weights
- Apply this same filter to 4x4 patches in input
- Shift by 2 pixels for next patch
This “patchy” operation is convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
Fully Connected Neural Network
Fully Connected:
• Each neuron in
hidden layer
connected to all
neurons in input
layer
• No spatial information
• Many, many
parameters
Input:
• 2D image
• Vector of pixel
values
Key idea: Use spatial structure in input to inform architecture
of the network
High Level Feature Detection
Let’s identify key features in each image category
Wheels, License Plate, Headlights
Door, Windows, Steps
Nose, Eyes, Mouth
Fully Connected Neural Network
2b. Convolutions and filters
Convolution operation is element-wise multiply and add
Filter / Kernel
Producing Feature Maps
Original | Sharpen | Edge Detect | “Strong” Edge Detect
A simple pattern: Edges
How can we detect edges with a kernel?
[Figure: an input image convolved with a small edge-detection filter (e.g., [-1, 1]) to produce an output highlighting edges. (Goodfellow 2016)]
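To make the element-wise multiply-and-add concrete, here is a minimal NumPy sketch (illustrative, not from the original slides) of valid convolution with a [-1, 1] edge filter:

import numpy as np

def conv2d_valid(image, kernel):
    # Valid cross-correlation (what deep learning libraries call 'convolution'):
    # slide the kernel over the image, multiply element-wise, and sum.
    H, W = image.shape
    kH, kW = kernel.shape
    out = np.zeros((H - kH + 1, W - kW + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kH, j:j+kW] * kernel)
    return out

# A vertical-edge detector: responds where intensity changes left-to-right
image = np.array([[0, 0, 1, 1]] * 4, dtype=float)
kernel = np.array([[-1, 1]], dtype=float)
print(conv2d_valid(image, kernel))  # each row is [0, 1, 0]: strong response at the 0->1 boundary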
Simple Kernels / Filters
X or X?
Image is represented as matrix of pixel values… and computers are literal!
We want to be able to classify an X as an X even if it’s shifted, shrunk, rotated, or deformed.
Rohrer How do CNNs work?
There are three approaches to edge cases in
convolution
(Goodfellow 2016)
Zero Padding Controls Output Size
• Full convolution: zero pad input so output is produced whenever an output value
contains at least one input value (expands output)
• Valid-only convolution: output only when
entire kernel contained in input (shrinks output)
• Same convolution: zero pad input so output
is same size as input dimensions
x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME')
• TF convolution operator takes stride and zero fill option as parameters
• Stride is distance between kernel applications in each dimension
• Padding can be SAME or VALID
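A quick sketch (array sizes are illustrative) of how the padding argument changes the output shape:

import tensorflow as tf

x = tf.random.normal([1, 8, 8, 1])   # [batch, height, width, channels]
W = tf.random.normal([3, 3, 1, 4])   # [kernel_h, kernel_w, in_channels, out_channels]

same = tf.nn.conv2d(x, W, strides=[1, 2, 2, 1], padding='SAME')
valid = tf.nn.conv2d(x, W, strides=[1, 2, 2, 1], padding='VALID')
print(same.shape)   # (1, 4, 4, 4): SAME pads so output size = ceil(8 / 2) = 4
print(valid.shape)  # (1, 3, 3, 4): VALID gives (8 - 3)//2 + 1 = 3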
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
3a. Learning Visual Features
de novo
Key idea:
learn hierarchy of features
directly from the data
(rather than hand-engineering them)
Low level features (edges, dark spots) → Mid level features (eyes, ears, nose) → High level features (facial structure)
Lee+ ICML 2009
Key idea: re-use parameters
Convolution shares parameters
Example 3x3 convolution on a 5x5 image
Feature Extraction with Convolution
1) Apply a set of weights – a filter – to extract local features
2) Use multiple filters to extract different features
3) Spatially share parameters of each filter
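A quick sketch of why sharing matters: a convolutional layer's parameter count is independent of image size, unlike a fully connected layer (layer sizes below are illustrative):

import tensorflow as tf

# 3x3 conv on a 5x5 single-channel image: parameters shared across all patches
conv = tf.keras.layers.Conv2D(filters=1, kernel_size=3)
conv.build(input_shape=(None, 5, 5, 1))
print(conv.count_params())   # 10 = 3*3 weights + 1 bias, regardless of image size

# Fully connected layer mapping the flattened 5x5 image to a 3x3 output
dense = tf.keras.layers.Dense(9)
dense.build(input_shape=(None, 25))
print(dense.count_params())  # 234 = 25*9 weights + 9 biases, and it grows with image size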
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
[LeCun et al., 1998]
LeNet-5
32×32×1 → conv 5×5, s=1 → 28×28×6 → avg pool f=2, s=2 → 14×14×6
→ conv 5×5, s=1 → 10×10×16 → avg pool f=2, s=2 → 5×5×16
→ FC 120 → FC 84 → ŷ (10 outputs)
Reminder: Output size = (N + 2P − F)/stride + 1
[LeCun et al., 1998]
This slide is taken from Andrew Ng
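A small helper (an illustrative sketch) applying the output-size formula to trace LeNet-5's spatial dimensions:

def conv_output_size(N, F, P=0, S=1):
    # Output spatial size = (N + 2P - F) // S + 1
    return (N + 2 * P - F) // S + 1

print(conv_output_size(32, 5))       # 28: conv 5x5, stride 1
print(conv_output_size(28, 2, S=2))  # 14: avg pool f=2, stride 2
print(conv_output_size(14, 5))       # 10: conv 5x5, stride 1
print(conv_output_size(10, 2, S=2))  # 5:  avg pool f=2, stride 2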
LeNet-5
• Only 60K parameters
• As we go deeper in the network: N_H ↓, N_W ↓, N_C ↑
• General structure:
conv->pool->conv->pool->FC->FC->output
• Different filters look at different channels
• Sigmoid and Tanh nonlinearity
[LeCun et al., 1998]
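A minimal Keras sketch of LeNet-5 under simplifying assumptions (tanh throughout and plain average pooling; the original used per-layer connection tables and a Gaussian output layer):

import tensorflow as tf

lenet5 = tf.keras.Sequential([
    tf.keras.layers.Conv2D(6, kernel_size=5, activation='tanh', input_shape=(32, 32, 1)),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Conv2D(16, kernel_size=5, activation='tanh'),
    tf.keras.layers.AveragePooling2D(pool_size=2, strides=2),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(120, activation='tanh'),
    tf.keras.layers.Dense(84, activation='tanh'),
    tf.keras.layers.Dense(10, activation='softmax'),
])
lenet5.summary()  # ~62K parameters, matching the slide's "only 60K"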
Backpropagation of convolution
Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
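In modern frameworks the backward pass through a convolution comes for free from automatic differentiation; a brief sketch with tf.GradientTape (shapes are illustrative):

import tensorflow as tf

x = tf.random.normal([1, 5, 5, 1])
W = tf.Variable(tf.random.normal([3, 3, 1, 1]))  # shared 3x3 filter

with tf.GradientTape() as tape:
    y = tf.nn.conv2d(x, W, strides=[1, 1, 1, 1], padding='VALID')
    loss = tf.reduce_sum(y ** 2)

dW = tape.gradient(loss, W)
print(dW.shape)  # (3, 3, 1, 1): one gradient per shared filter weight,
                 # accumulated over every patch the filter touched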
3b. Convolutional Neural
Networks (CNNs)
An image classification CNN
Representation Learning in Deep CNNs
Conv Layer 1: Low level features (edges, dark spots)
Conv Layer 2: Mid level features (eyes, ears, nose)
Conv Layer 3: High level features (facial structure)
Lee+ ICML 2009
CNNs for Classification
1. Convolution: Apply filters to generate feature maps.
2. Non-linearity: Often ReLU.
3. Pooling: Downsampling operation on each feature map.
Train model with image data.
Learn weights of filters in convolutional layers.
tf.keras.layers.Conv2D
tf.keras.activations.*
tf.keras.layers.MaxPool2D
Example – Six convolutional layers
Convolutional Layers: Local Connectivity
For a neuron in
hidden layer:
- Take inputs from patch
- Compute weighted
sum
- Apply bias
tf.keras.layers.Conv2D
Convolutional Layers: Local Connectivity
For a neuron in hidden layer:
• Take inputs from patch
• Compute weighted sum
• Apply bias
4x4 filter: matrix of weights w_ij
For neuron (p,q) in hidden layer:
1) applying a window of weights
2) computing linear combinations
3) activating with non-linear function
tf.keras.layers.Conv2D
CNNs: Spatial Arrangement of Output
Volume
depth
width
height
Layer Dimensions:
h × w × d
where h and w are spatial dimensions, d (depth) = number of filters
Stride:
Filter step size
Receptive Field:
Locations in input image
that a node is path
connected to
tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
Introducing Non-Linearity
Rectified Linear Unit
(ReLU)
- Apply after every convolution operation (i.e., after convolutional layers)
- ReLU: pixel-by-pixel operation that replaces all negative values by zero.
- Non-linear operation
tf.keras.layers.ReLU
Karn Intuitive CNNs
Pooling
Max pooling, average pooling
1) Reduced
dimensionality
2) Spatial invariance
tf.keras.layers.MaxPool2D( pool_size=(2,2), strides=2 )
The Rectified Linear Unit (ReLU) is a common non-linear detector stage after convolution
x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME')
x = tf.nn.bias_add(x, b)
x= tf.nn.relu(x)
f(x) = max(0, x)
When will we backpropagate through this?
Once it “dies” what happens to it?
Pooling reduces dimensionality by giving up
spatial location
• max pooling reports the maximum output
within a defined neighborhood
• Padding can be SAME or VALID
x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME')
ksize and strides follow the [batch, height, width, channels] layout; the k×k spatial entries define the pooling neighborhood
Dilated Convolution
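Dilated (atrous) convolutions space the kernel taps apart to enlarge the receptive field without adding parameters; a minimal sketch using Keras' standard dilation_rate argument (sizes illustrative):

import tensorflow as tf

x = tf.random.normal([1, 32, 32, 1])
dilated = tf.keras.layers.Conv2D(8, kernel_size=3, dilation_rate=2, padding='same')
print(dilated(x).shape)  # (1, 32, 32, 8); each output now 'sees' a 5x5 input region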
CNNs for Classification: Feature Learning
1. Learn features in input image through convolution
2. Introduce non-linearity through activation function (real-world data is
non-linear!)
3. Reduce dimensionality and preserve spatial invariance with pooling
CNNs for Classification: Class Probabilities
- CONV and POOL layers output high-level features of input
- Fully connected layer uses these features for classifying input image
- Express output as probability of image belonging to a particular class
Putting it all together
import tensorflow as tf
def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')  # 10 outputs
    ])
    return model
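For example, building and compiling it for MNIST-sized inputs (the input shape and training configuration here are illustrative assumptions):

model = generate_model()
model.build(input_shape=(None, 28, 28, 1))  # 28x28 grayscale images
model.summary()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])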
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
4a. Real-world feature invariance is
hard
How can computers recognize objects?
How can computers recognize objects?
Challenge:
• Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc.
• How can we overcome this challenge?
Answer:
• Learn a ton of features (millions) from the bottom up
• Learn the convolutional filters, rather than pre-computing them
Detect features to classify
Li/Johnson/Yeung CS231n
Feature invariance to perturbation is hard
Next-generation models
explode # of parameters
LeNet-5
• Gradient Based Learning Applied To Document Recognition -
Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998
• Helped establish how we use CNNs today
• Replaced manual feature extraction
[LeCun et al., 1998]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
• ImageNet Classification with Deep Convolutional
Neural Networks - Alex Krizhevsky, Ilya Sutskever,
Geoffrey E. Hinton; 2012
• Facilitated by GPUs, highly optimized convolution
implementation and large datasets (ImageNet)
• One of the largest CNNs to date
• Has 60 million parameters, compared to the 60k parameters of LeNet-5
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
• The annual “Olympics” of computer vision.
• Teams from across the world compete to see who has the
best computer vision model for tasks such as classification,
localization, detection, and more.
• 2012 marked the first year where a CNN was used to
achieve a top 5 test error rate of 15.3%.
• The next best entry achieved an error of 26.2%.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
[Krizhevsky et al., 2012]
Architecture
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
• Input: 227x227x3 images (224x224 before
padding)
• First layer: 96 11x11 filters applied at stride 4
• Output volume size?
(N-F)/s+1 = (227-11)/4+1 = 55 ->
[55x55x96]
• Number of parameters in this layer?
(11*11*3)*96 = 35K
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
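A quick sketch checking those CONV1 numbers:

def conv_output_size(N, F, P=0, S=1):
    return (N + 2 * P - F) // S + 1

print(conv_output_size(227, 11, S=4))  # 55 -> output volume 55x55x96
print(11 * 11 * 3 * 96)                # 34,848 weights (~35K), plus 96 biases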
AlexNet
[Krizhevsky et al., 2012]
AlexNet
[Krizhevsky et al., 2012]
• Input: 227x227x3 images (224x224 before
padding)
• After CONV1: 55x55x96
• Second layer: 3x3 filters applied at stride 2
• Output volume size?
(N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96]
• Number of parameters in this layer?
0!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Architecture
CONV1
MAX POOL1
NORM1
CONV2
MAX POOL2
NORM2
CONV3
CONV4
CONV5
Max POOL3
FC6
FC7
FC8
AlexNet
227×227×3 → conv 11×11, s=4, P=0 → 55×55×96 → max pool 3×3, s=2 → 27×27×96
→ conv 5×5, s=1, P=2 → 27×27×256 → max pool 3×3, s=2 → 13×13×256
→ conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×384
→ conv 3×3, s=1, P=1 → 13×13×256 → max pool 3×3, s=2 → 6×6×256 → …
[Krizhevsky et al., 2012]
This slide is taken from Andrew Ng
AlexNet
… → FC 4096 → FC 4096 → Softmax 1000
[Krizhevsky et al., 2012]
This slide is taken from Andrew Ng
AlexNet
[Krizhevsky et al., 2012]
Details/Retrospectives:
• first use of ReLU
• used Norm layers (not common anymore)
• heavy data augmentation
• dropout 0.5
• batch size 128
• 7 CNN ensemble
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
[Krizhevsky et al., 2012]
• Trained on GTX 580 GPU with only 3 GB of memory.
• Network spread across 2 GPUs, half the neurons (feature
maps) on each GPU.
• CONV1, CONV2, CONV4, CONV5:
Connections only with feature maps on same GPU.
• CONV3, FC6, FC7, FC8:
Connections with all feature maps in preceding layer,
communication across GPUs.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
AlexNet
AlexNet was the coming-out party for CNNs in the computer vision community. This was the first time a model performed so well on the historically difficult ImageNet dataset. The paper illustrated the benefits of CNNs and backed them up with record-breaking performance in the competition.
[Krizhevsky et al., 2012]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet
• Very Deep Convolutional Networks For Large Scale
Image Recognition - Karen Simonyan and Andrew
Zisserman; 2015
• The runner-up at the ILSVRC 2014 competition
• Significantly deeper than AlexNet
• 140 million parameters
[Simonyan and Zisserman, 2014]
VGGNet
• Smaller filters
Only 3x3 CONV filters, stride 1, pad 1
and 2x2 MAX POOL, stride 2
• Deeper network
AlexNet: 8 layers
VGGNet: 16 - 19 layers
• ZFNet: 11.7% top 5 error in ILSVRC’13
• VGGNet: 7.3% top 5 error in ILSVRC’14
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Simonyan and Zisserman, 2014]
Input
3x3 conv, 64
3x3 conv, 64
Pool 1/2
3x3 conv, 128
3x3 conv, 128
Pool 1/2
3x3 conv, 256
3x3 conv, 256
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool 1/2
FC 4096
FC 4096
FC 1000
Softmax
VGGNet
[Simonyan and Zisserman, 2014]
• Why use smaller filters? (3x3 conv)
Stack of three 3x3 conv (stride 1) layers has the same effective
receptive field as one 7x7 conv layer.
• What is the effective receptive field of three 3x3 conv (stride
1) layers?
7x7
But deeper, more non-linearities
And fewer parameters: 3 × (3²C²) vs. 7²C² for C channels per layer
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
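A quick check of that parameter comparison (C = 64 is an illustrative channel count):

C = 64
three_3x3 = 3 * (3 * 3 * C * C)  # three stacked 3x3 conv layers, ignoring biases
one_7x7 = 7 * 7 * C * C          # one 7x7 conv layer with the same receptive field
print(three_3x3, one_7x7)        # 110592 vs. 200704: deeper, yet ~45% fewer parameters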
VGGNet
[Simonyan and Zisserman, 2014]
VGG16:
TOTAL memory: 24M * 4 bytes ~= 96MB / image
TOTAL params: 138M parameters
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Input
3x3 conv, 64
3x3 conv, 64
Pool
3x3 conv, 128
3x3 conv, 128
Pool
3x3 conv, 256
3x3 conv, 256
3x3 conv, 256
Pool
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool
3x3 conv, 512
3x3 conv, 512
3x3 conv, 512
Pool
FC 4096
FC 4096
FC 1000
Softmax
[Simonyan and Zisserman, 2014]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Input memory: 224*224*3=150K params: 0
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*3)*64 = 1,728
3x3 conv, 64 memory: 224*224*64=3.2M params: (3*3*64)*64 = 36,864
Pool memory: 112*112*64=800K params: 0
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*64)*128 = 73,728
3x3 conv, 128 memory: 112*112*128=1.6M params: (3*3*128)*128 = 147,456
Pool memory: 56*56*128=400K params: 0
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*128)*256 = 294,912
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
3x3 conv, 256 memory: 56*56*256=800K params: (3*3*256)*256 = 589,824
Pool memory: 28*28*256=200K params: 0
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*256)*512 = 1,179,648
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 28*28*512=400K params: (3*3*512)*512 = 2,359,296
Pool memory: 14*14*512=100K params: 0
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
3x3 conv, 512 memory: 14*14*512=100K params: (3*3*512)*512 = 2,359,296
Pool memory: 7*7*512=25K params: 0
FC 4096 memory: 4096 params: 7*7*512*4096 = 102,760,448
FC 4096 memory: 4096 params: 4096*4096 = 16,777,216
FC 1000 memory: 1000 params: 4096*1000 = 4,096,000
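As a sanity check, the stock Keras VGG16 reproduces the ~138M total (weights=None skips downloading pretrained weights):

import tensorflow as tf

vgg16 = tf.keras.applications.VGG16(weights=None)  # random init, ImageNet-shaped head
print(f'{vgg16.count_params():,}')  # 138,357,544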
VGGNet
[Simonyan and Zisserman, 2014]
Details/Retrospectives :
• ILSVRC’14 2nd in classification, 1st in localization
• Similar training procedure as AlexNet
• No Local Response Normalisation (LRN)
• Use VGG16 or VGG19 (VGG19 only slightly better, more
memory)
• Use ensembles for best results
• FC7 features generalize well to other tasks
• Trained on 4 Nvidia Titan Black GPUs for two to three weeks.
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
VGGNet
VGG Net reinforced the notion that convolutional neural
networks have to have a deep network of layers in order for
this hierarchical representation of visual data to work.
Keep it deep.
Keep it simple.
[Simonyan and Zisserman, 2014]
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
GoogLeNet
• Going Deeper with Convolutions - Christian Szegedy et
al.; 2015
• ILSVRC 2014 competition winner
• Also significantly deeper than AlexNet
• 12x fewer parameters than AlexNet
• Focused on computational efficiency
[Szegedy et al., 2014]
GoogLeNet
• 22 layers
• Efficient “Inception” module - strayed from
the general approach of simply stacking conv
and pooling layers on top of each other in a
sequential structure
• No FC layers
• Only 5 million parameters!
• ILSVRC’14 classification winner (6.7% top 5
error)
[Szegedy et al., 2014]
GoogLeNet
“Inception module”: design a good local network topology (network within
a network) and then stack these modules on top of each other
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
[Inception module: parallel 1×1, 3×3, and 5×5 convolutions (the 3×3 and 5×5 each preceded by a 1×1 dimensionality-reduction convolution) plus a 3×3 max-pooling branch with a 1×1 projection, all reading from the previous layer and merged by filter concatenation]
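A Keras sketch of one such module (the helper name is ours; the filter counts are borrowed from GoogLeNet's "inception 3a" stage):

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1, f3_reduce, f3, f5_reduce, f5, pool_proj):
    # Parallel branches, each preceded by a 1x1 'bottleneck' where needed,
    # concatenated along the channel axis.
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f3_reduce, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b3)
    b5 = layers.Conv2D(f5_reduce, 1, padding='same', activation='relu')(x)
    b5 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b5)
    bp = layers.MaxPool2D(3, strides=1, padding='same')(x)
    bp = layers.Conv2D(pool_proj, 1, padding='same', activation='relu')(bp)
    return layers.Concatenate()([b1, b3, b5, bp])

inp = layers.Input(shape=(28, 28, 192))
out = inception_module(inp, 64, 96, 128, 16, 32, 32)
print(out.shape)  # (None, 28, 28, 256) = 64 + 128 + 32 + 32 channels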
GoogLeNet
Details/Retrospectives :
• Deeper networks, with computational efficiency
• 22 layers
• Efficient “Inception” module
• No FC layers
• 12x less params than AlexNet
• ILSVRC’14 classification winner (6.7% top 5 error)
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
GoogLeNet
Introduced the idea that CNN layers didn’t always have to be stacked up sequentially. Coming up with the Inception module, the authors showed that a creative structuring of layers can lead to improved performance and computational efficiency.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
• Deep Residual Learning for Image Recognition -
Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun;
2015
• Extremely deep network – 152 layers
• Deeper neural networks are more difficult to train.
• Deep networks suffer from vanishing and
exploding gradients.
• Present a residual learning framework to ease the
training of networks that are substantially deeper
than those used previously.
[He et al., 2015]
ResNet
• ILSVRC’15 classification winner (3.57% top 5 error; humans generally hover around a 5-10% error rate)
Swept all classification and detection
competitions in ILSVRC’15 and COCO’15!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• What happens when we continue stacking deeper layers on a
convolutional neural network?
• 56-layer model performs worse on both training and test error
-> The deeper model performs worse (not caused by overfitting)!
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
• Hypothesis: The problem is an optimization problem. Very
deep networks are harder to optimize.
• Solution: Use network layers to fit residual mapping instead
of directly trying to fit a desired underlying mapping.
• We will use skip connections allowing us to take the activation
from one layer and feed it into another layer, much deeper
into the network.
• Use layers to fit residual F(x) = H(x) – x
instead of H(x) directly
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
ResNet
Residual Block
Input x goes through conv-relu-conv series and gives us F(x).
That result is then added to the original input x. Let’s call that
H(x) = F(x) + x.
In traditional CNNs, H(x) would just be equal to F(x). So, instead
of just computing that transformation (straight from x to F(x)),
we’re computing the term that we have to add, F(x), to the
input, x.
[He et al., 2015]
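A minimal Keras sketch of this residual block (batch normalization omitted for brevity; the helper name is ours):

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    fx = layers.Conv2D(filters, 3, padding='same')(x)   # conv
    fx = layers.ReLU()(fx)                              # relu
    fx = layers.Conv2D(filters, 3, padding='same')(fx)  # conv -> F(x)
    hx = layers.Add()([fx, x])                          # skip connection: H(x) = F(x) + x
    return layers.ReLU()(hx)

inp = layers.Input(shape=(56, 56, 64))
out = residual_block(inp, 64)
print(out.shape)  # (None, 56, 56, 64): identity shortcut requires matching shapes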
ResNet
Shortcut / skip connection
Main path: a[l] → Linear → ReLU → a[l+1] → Linear → ReLU → a[l+2]
z[l+1] = W[l+1] a[l] + b[l+1],  a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2]
Without the shortcut: a[l+2] = g(z[l+2])
With the shortcut: a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l])
[He et al., 2015]
ResNet
Full ResNet architecture:
• Stack residual blocks
• Every residual block has two 3x3 conv layers
• Periodically, double # of filters and
downsample spatially using stride 2 (in each
dimension)
• Additional conv layer at the beginning
• No FC layers at the end (only FC 1000 to
output classes)
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
• Total depths of 34, 50, 101, or 152 layers for
ImageNet
• For deeper networks (ResNet-50+), use
“bottleneck” layer to improve efficiency
(similar to GoogLeNet)
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
Experimental Results:
• Able to train very deep networks without degrading
• Deeper networks now achieve lower training errors as
expected
[He et al., 2015]
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ResNet
The best-performing CNN architecture at the time, and a great innovation in the idea of residual learning.
Even better than human performance!
[He et al., 2015]
Accuracy comparison
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Forward pass time and power
consumption
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
ImageNet Large Scale Visual Recognition
Challenge (ILSVRC) winners
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
Countless applications
An Architecture for Many Applications
Detection
Semantic segmentation
End-to-end robotic control
Semantic Segmentation: Fully Convolutional Networks
FCN: Fully Convolutional Network.
Network designed with all convolutional layers, with downsampling and upsampling operations
tf.keras.layers.Conv2DTranspose
Long+ CVPR 2015
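A minimal sketch of the downsample-then-upsample pattern (layer sizes and the n_classes value are illustrative):

import tensorflow as tf
from tensorflow.keras import layers

n_classes = 21
fcn = tf.keras.Sequential([
    layers.Conv2D(64, 3, strides=2, padding='same', activation='relu',
                  input_shape=(128, 128, 3)),                             # -> 64x64
    layers.Conv2D(128, 3, strides=2, padding='same', activation='relu'),  # -> 32x32
    layers.Conv2DTranspose(64, 3, strides=2, padding='same',
                           activation='relu'),                            # -> 64x64
    layers.Conv2DTranspose(n_classes, 3, strides=2, padding='same'),      # -> 128x128
])
print(fcn.output_shape)  # (None, 128, 128, 21): one score per pixel per class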
Facial Detection & Recognition
Self-Driving Cars
Amini+ ICRA 2019.
Self-Driving Cars: Navigation from Visual Perception
Raw Perception I (e.g., camera) + Coarse Maps M (e.g., GPS) → Possible Control Commands
Amini+ ICRA 2019
End-to-End Framework for Autonomous Navigation
Entire model trained end-to-end
without any human labelling or annotations
Amini+ ICRA 2019
Automatic Colorization of Black and White Images
Optimizing Images
Post Processing Feature Optimization
(Illumination)
Post Processing Feature Optimization
(Color Curves and Details)
Post Processing Feature Optimization
(Color Tone: Warmness)
Up-scaling low-resolution images
Medicine, Biology, Healthcare
Gulshan+ JAMA 2016.
Breast Cancer Screening
Breast cancer case missed by radiologist but detected by AI
[Figure: ROC curves comparing AI vs. MD readers]
CNN-based system outperformed expert
radiologists at detecting breast
cancer from mammograms
Semantic Segmentation: Biomedical Image Analysis
Brain Tumors (Dong+ MIUA 2017); Malaria Infection (Soleimany+ arXiv 2019)
[Figure panels: Original, Ground Truth, Segmentation, Uncertainty]
DeepBind
[Alipanahi et al., 2015]
Predicting disease mutations
[Alipanahi et al., 2015]
Today: Convolutional Neural Networks (CNNs)
1. Scene understanding and object recognition for machines (and humans)
– Scene/object recognition challenge. Illusions reveal primitives, conflicting info
– Human neurons/circuits. Visual cortex layers==abstraction. General cognition
2. Classical machine vision foundations: features, scenes, filters, convolution
– Spatial structure primitives: edge detectors & other filters, feature recognition
– Convolution: basics, padding, stride, object recognition, architectures
3. CNN foundations: LeNet, de novo feature learning, parameter sharing
– Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning
– CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected
4. Modern CNN architectures: millions of parameters, dozens of layers
– Feature invariance is hard: apply perturbations, learn for each variation
– ImageNet progression of best performers
– AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU
– VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles
– GoogLeNet: new primitive=inception module, 5M params, no FC, efficiency
– ResNet: 152 layers, vanishing gradients → fit residuals to enable learning
5. Countless applications: General architecture, enormous power
– Semantic segmentation, facial detection/recognition, self-driving, image
colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
Deep Learning for Computer Vision: Summary
Foundations
• Why computer vision?
• Representing images
• Convolutions for feature extraction
CNNs
• CNN architecture
• Application to classification
• ImageNet
Applications
• Segmentation, image captioning, control
• Security, medicine, robotics

More Related Content

PPTX
CONVOLUTIONAL NEURAL NETWORK
PPTX
Convolutional neural network
PPTX
Convolutional Neural Network (CNN) - image recognition
PDF
Convolutional neural network
PPTX
Deep Learning - CNN and RNN
PPTX
CNN Machine learning DeepLearning
PPTX
CNN and its applications by ketaki
CONVOLUTIONAL NEURAL NETWORK
Convolutional neural network
Convolutional Neural Network (CNN) - image recognition
Convolutional neural network
Deep Learning - CNN and RNN
CNN Machine learning DeepLearning
CNN and its applications by ketaki

What's hot (20)

PPTX
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
PPTX
Object classification using CNN & VGG16 Model (Keras and Tensorflow)
PPTX
Transfer Learning and Fine-tuning Deep Neural Networks
PDF
Introduction to Recurrent Neural Network
PDF
Generative adversarial networks
PPT
Cnn method
PPTX
Optimization in Deep Learning
PPTX
Machine Learning - Convolutional Neural Network
PDF
Convolutional Neural Network Models - Deep Learning
PPTX
INTRODUCTION TO NLP, RNN, LSTM, GRU
PPTX
Transfer Learning and Fine Tuning for Cross Domain Image Classification with ...
PPTX
Feature Selection in Machine Learning
PPTX
Convolutional Neural Network (CNN)
PPTX
Background subtraction
PDF
Deep Learning - Convolutional Neural Networks
PPTX
Transformers in Vision: From Zero to Hero
PPTX
Introduction For seq2seq(sequence to sequence) and RNN
PDF
PPTX
ViT.pptx
PPTX
Intro to deep learning
Convolutional Neural Network - CNN | How CNN Works | Deep Learning Course | S...
Object classification using CNN & VGG16 Model (Keras and Tensorflow)
Transfer Learning and Fine-tuning Deep Neural Networks
Introduction to Recurrent Neural Network
Generative adversarial networks
Cnn method
Optimization in Deep Learning
Machine Learning - Convolutional Neural Network
Convolutional Neural Network Models - Deep Learning
INTRODUCTION TO NLP, RNN, LSTM, GRU
Transfer Learning and Fine Tuning for Cross Domain Image Classification with ...
Feature Selection in Machine Learning
Convolutional Neural Network (CNN)
Background subtraction
Deep Learning - Convolutional Neural Networks
Transformers in Vision: From Zero to Hero
Introduction For seq2seq(sequence to sequence) and RNN
ViT.pptx
Intro to deep learning
Ad

Similar to CNN Algorithm (20)

PPTX
Introduction to Convolutional Neural Networks (CNNs).pptx
PPTX
Introduction to computer vision
PPTX
Introduction to computer vision with Convoluted Neural Networks
PDF
PyDresden 20170824 - Deep Learning for Computer Vision
PPTX
Introduction to Convolutional Neural Networks (CNNs).pptx
PDF
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
PDF
“Introduction to Computer Vision with Convolutional Neural Networks,” a Prese...
PPTX
PyConZA'17 Deep Learning for Computer Vision
PDF
DLD meetup 2017, Efficient Deep Learning
PPTX
Convolutional-Neural-Networks-Revolutionizing-Computer-Vision (1).pptx
PPTX
build a Convolutional Neural Network (CNN) using TensorFlow in Python
PPTX
intro-to-cnn-April_2020.pptx
PDF
Deep Neural Networks Presentation
PPTX
conv_nets.pptx
PDF
物件偵測與辨識技術
PDF
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
PPTX
Introduction to Computer Vision and its Applications
PDF
dl-unit-4-deep-learning deep-learning.pdf
PPTX
Deep learning and computer vision
PPTX
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Introduction to Convolutional Neural Networks (CNNs).pptx
Introduction to computer vision
Introduction to computer vision with Convoluted Neural Networks
PyDresden 20170824 - Deep Learning for Computer Vision
Introduction to Convolutional Neural Networks (CNNs).pptx
“Introduction to Computer Vision with CNNs,” a Presentation from Mohammad Hag...
“Introduction to Computer Vision with Convolutional Neural Networks,” a Prese...
PyConZA'17 Deep Learning for Computer Vision
DLD meetup 2017, Efficient Deep Learning
Convolutional-Neural-Networks-Revolutionizing-Computer-Vision (1).pptx
build a Convolutional Neural Network (CNN) using TensorFlow in Python
intro-to-cnn-April_2020.pptx
Deep Neural Networks Presentation
conv_nets.pptx
物件偵測與辨識技術
Convolutional Neural Networks for Image Classification (Cape Town Deep Learni...
Introduction to Computer Vision and its Applications
dl-unit-4-deep-learning deep-learning.pdf
Deep learning and computer vision
Lecture 29 Convolutional Neural Networks - Computer Vision Spring2015
Ad

More from georgejustymirobi1 (20)

PPT
JanData-mining-to-knowledge-discovery.ppt
PDF
Network IP Security.pdf
PPT
How To Write A Scientific Paper
PPT
writing_the_research_paper.ppt
PPT
Bluetooth.ppt
PDF
ABCD15042603583.pdf
PDF
ch18 ABCD.pdf
PPT
ch13 ABCD.ppt
PPT
BluetoothSecurity.ppt
PDF
1682302951397_PGP.pdf
PPTX
applicationlayer.pptx
PDF
Fair Bluetooth.pdf
PPTX
Bluetooth.pptx
PDF
Research Score.pdf
PPT
educational_technology_meena_arora.ppt
PPT
PPTX
PYTHON-PROGRAMMING-UNIT-II (1).pptx
PPT
cprogrammingoperator.ppt
PPT
cprogrammingarrayaggregatetype.ppt
PPTX
TECHNOLOGY_IN_TEACHING_AND_LEARNING.pptx
JanData-mining-to-knowledge-discovery.ppt
Network IP Security.pdf
How To Write A Scientific Paper
writing_the_research_paper.ppt
Bluetooth.ppt
ABCD15042603583.pdf
ch18 ABCD.pdf
ch13 ABCD.ppt
BluetoothSecurity.ppt
1682302951397_PGP.pdf
applicationlayer.pptx
Fair Bluetooth.pdf
Bluetooth.pptx
Research Score.pdf
educational_technology_meena_arora.ppt
PYTHON-PROGRAMMING-UNIT-II (1).pptx
cprogrammingoperator.ppt
cprogrammingarrayaggregatetype.ppt
TECHNOLOGY_IN_TEACHING_AND_LEARNING.pptx

Recently uploaded (20)

PDF
Recent Trends in Network Security - 2025
PPTX
Soumya Das post quantum crypot algorithm
PDF
ForSee by Languify Teardown final product management
PDF
Application of smart robotics in the supply chain
PPTX
240409 Data Center Training Programs by Uptime Institute (Drafting).pptx
PDF
BBC NW_Tech Facilities_30 Odd Yrs Ago [J].pdf
PPTX
Embedded Systems Microcontrollers and Microprocessors.pptx
PDF
August 2025 Top Read Articles in - Bioscience & Engineering Recent Research T...
PPTX
Retail.pptx internet of things mtech 2 nd sem
PDF
IoT-Based Hybrid Renewable Energy System.pdf
PPTX
Electric vehicle very important for detailed information.pptx
PPTX
quantum theory on the next future in.pptx
PPTX
22ME926Introduction to Business Intelligence and Analytics, Advanced Integrat...
PPTX
PPT-HEART-DISEASE[1].pptx presentationss
PDF
Thesis of the Fruit Harvesting Robot .pdf
PDF
02. INDUSTRIAL REVOLUTION & Cultural, Technical and territorial transformatio...
PPTX
sinteringn kjfnvkjdfvkdfnoeneornvoirjoinsonosjf).pptx
PDF
Water Industry Process Automation & Control Monthly - September 2025
PPTX
Downstream processing_in Module1_25.pptx
PDF
Project_Mgmt_Institute_- Marc Marc Marc.pdf
Recent Trends in Network Security - 2025
Soumya Das post quantum crypot algorithm
ForSee by Languify Teardown final product management
Application of smart robotics in the supply chain
240409 Data Center Training Programs by Uptime Institute (Drafting).pptx
BBC NW_Tech Facilities_30 Odd Yrs Ago [J].pdf
Embedded Systems Microcontrollers and Microprocessors.pptx
August 2025 Top Read Articles in - Bioscience & Engineering Recent Research T...
Retail.pptx internet of things mtech 2 nd sem
IoT-Based Hybrid Renewable Energy System.pdf
Electric vehicle very important for detailed information.pptx
quantum theory on the next future in.pptx
22ME926Introduction to Business Intelligence and Analytics, Advanced Integrat...
PPT-HEART-DISEASE[1].pptx presentationss
Thesis of the Fruit Harvesting Robot .pdf
02. INDUSTRIAL REVOLUTION & Cultural, Technical and territorial transformatio...
sinteringn kjfnvkjdfvkdfnoeneornvoirjoinsonosjf).pptx
Water Industry Process Automation & Control Monthly - September 2025
Downstream processing_in Module1_25.pptx
Project_Mgmt_Institute_- Marc Marc Marc.pdf

CNN Algorithm

  • 1. 6.874, 6.802, 20.390, 20.490, HST.506 Computational Systems Biology Deep Learning in the Life Sciences Lecture 3: Convolutional Neural Networks Prof. Manolis Kellis http://mit6874.github.io Slides credit: 6.S191, Dana Erlich, Param Vir Singh, David Gifford, Alexander Amini, Ava Soleimany
  • 2. Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 3. 1a. What do you see, and how? Can we teach machines to see?
  • 4. What do you see?
  • 5. How do you see? How can we help computers see?
  • 6. What computers ‘see’: Images as Numbers What the computer "sees" Levin Image Processing & Computer Vision An image is just a matrix of numbers [0,255].i.e.,1080x1080x3 for an RGB image. Question: is this Lincoln?Washington? Jefferson? Obama? How can the computer answer this question? What you see Input Image Input Image + values Pixel intensity values (“pix-el”=picture-element) What you both see Can I just do classification on the 1,166400-long image vector directly? No. Instead: exploit image spatial structure. Learn patches. Build them up
  • 7. 1b. Classical machine vision roots in study of human/animal brains
  • 8. Inspiration: human/animal visual cortex • Layers of neurons: pixels, edges, shapes, primitives, scenes • E.g. Layer 4 responds to bands w/ given slant, contrasting edges
  • 9. Primitives: Neurons & action potentials •Chemical accumulation across dendritic connections •Pre-synaptic axon  post-synaptic dendrite  neuronal cell body •Each neuron receives multiple signals from its many dendrites •When threshold crossed, it fires •Its axon then sends outgoing signal to downstream neurons •Weak stimuli ignored •Sufficiently strong cross activation threshold •Non-linearity within each neuronal level •Neurons connected into circuits (neural networks): emergent properties, learning, memory •Simple primitives arranged in simple, repetitive, and extremely large networks •86 billion neurons, each connects to 10k neurons, 1 quadrillion (1012) connections
  • 10. Abstraction layers: edges, bars, dir., shapes, objects, scenes LGN: Small dots V1: Orientation, disparity, some color V4: Color, basic shapes, 2D/3D, curvature VTC: Complex features and objects(VTC: ventral temporal cortex •Abstraction layers  visual cortex layers •Complex concepts from simple parts, hierarchy •Primitives of visual concepts encoded in neuronal connection in early cortical layers
  • 11. • Massive recent expanse of human brain has re-used a relatively simple but general learning architecture General “learning machine”, reused widely • Hearing, taste, smell, sight, touch all re- use similar learning architecture Motor Cortex Visual Cortex • Interchangeable circuitry • Auditory cortex learns to ‘see’ if sent visual signals • Injury area tasks shift to uninjured areas • Not fully-general learning, but well-adapted to our world • Humans co-opted this circuitry to many new applications • Modern tasks accessible to any homo sapiens (<70k years) • ML primitives not too different from animals: more to come? human chimp Hardware expansion
  • 12. Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 13. 2a. Spatial structure for image recognition
  • 14. Using Spatial Structure Idea: connect patches of input to neurons in hidden layer. Neuron connected to region of input. Only “sees”these values. Input: 2D image. Array of pixel values
  • 15. Using Spatial Structure Connect patch in input layer to a single neuron in subsequent layer. Use a sliding window to define connections. How can we weight the patch to detect particular features?
  • 16. Feature Extraction with Convolution - Filter of size 4x4 :16 different weights - Apply this same filter to 4x4 patches in input - Shift by 2 pixels for next patch This“patchy” operation isconvolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 17. Fully Connected Neural Network Fully Connected: • Each neuron in hidden layer connected to all neurons in input layer • No spatial information • Many, many parameters Input: • 2D image • Vector of pixel values Key idea: Use spatial structure in input to inform architecture of the network
  • 18. High Level Feature Detection Let’s identify key features in each image category Wheels,License Plate, Headlights Door,Windows,Steps Nose,Eyes,Mouth
  • 21. Convolution operation is element wise multiply and add Filter / Kernel
  • 22. Producing Feature Maps Original Sharpen Edge Detect “Strong” Edge Detect
  • 23. A simple pattern: Edges How can we detect edges with a kernel? Input -1 -1 Filter Output (Goodfellow 2016)
  • 24. Simple Kernels / Filters
  • 25. X or X? Image is represented as matrix of pixel values… and computers are literal! We want to be able to classify an X as an X even if it’s shifted,shrunk,rotated, deformed. Rohrer How do CNNs work?
  • 26. There are three approaches to edge cases in convolution
  • 27. (Goodfellow 2016) Zero Padding Controls Output Size • Full convolution: zero pad input so output is produced whenever an output value contains at least one input value (expands output) • Valid-only convolution: output only when entire kernel contained in input (shrinks output) • Same convolution: zero pad input so output is same size as input dimensions x = tf.nn.conv2d(x, W, strides=[1,strides,strides,1],padding='SAME') • TF convolution operator takes stride and zero fill option as parameters • Stride is distance between kernel applications in each dimension • Padding can be SAME or VALID
  • 29. Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (819 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients  fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 30. 3a. Learning Visual Features de novo
  • 31. Key idea: learn hierarchy of features directly from the data (rather than hand-engineering them) Low level features Mid level features High level features Lee+ ICML 2009 Eyes,ears,nose Edges,dark spots Facial structure
  • 32. Key idea: re-use parameters Convolution shares parameters Example 3x3 convolution on a 5x5 image
  • 33. Feature Extraction with Convolution 1) Apply a set of weights – a filter – to extract local features 2) Use multiple filters to extract different features 3) Spatially share parameters of each filter
  • 34. LeNet-5 • Gradient Based Learning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
  • 35. LeNet-5 ⋮ ⋮ � 𝑦𝑦 32×32×1 28×28×6 14×14×6 10×10×16 5×5×16 120 84 5 × 5 s = 1 f = 2 s = 2 avg pool 5 × 5 s = 1 avg pool f = 2 s = 2 . . . . . . Reminder: Output size = (N+2P-F)/stride + 1 10 conv conv FC FC [LeCun et al., 1998] This slide is taken from Andrew Ng
  • 36. LeNet-5 • Only 60K parameters • As we go deeper in the network: 𝑁𝑁𝐻𝐻 ↓, 𝑁𝑁𝑊𝑊↓, 𝑁𝑁𝐶𝐶 ↑ • General structure: conv->pool->conv->pool->FC->FC->output • Different filters look at different channels • Sigmoid and Tanh nonlinearity [LeCun et al., 1998]
  • 37. Backpropagation of convolution Slide taken from Forward And Backpropagation in Convolutional Neural Network. - Medium
  • 40. Representation Learning in Deep CNNs Mid level features Low level features High level features Edges,dark spots Conv Layer 1 Lee+ ICML 2009 Eyes,ears,nose Conv Layer 2 Facial structure Conv Layer 3
  • 41. CNNs for Classification 1. Convolution:Apply filters to generate feature maps. 2. Non-linearity:Often ReLU. 3. Pooling:Downsampling operation on each feature map. Trainmodel with image data. Learn weights of filters in convolutional layers. tf.keras.layers.Conv2 D tf.keras.activations. * tf.keras.layers.MaxPool2 D
  • 42. Example – Six convolutional layers
  • 43. Convolutional Layers: Local Connectivity For a neuron in hidden layer: - Take inputs from patch - Compute weighted sum - Apply bias tf.keras.layers. Conv2D
  • 44. Convolutional Layers: Local Connectivity For a neuron in hidden layer: • Take inputs from patch • Compute weighted sum • Apply bias 4x4 filter: matrix of weights wij for neuron (p,q) in hidden layer 1) applying a window of weights 2) computing linear combinations 3) activating with non-linear function tf.keras.layers.Conv2D
  • 45. CNNs: Spatial Arrangement of Output Volume depth width height Layer Dimensions: ℎ  w d where h and w are spatial dimensions d (depth) = number of filters Stride: Filter step size Receptive Field: Locations in input image that a node is path connected to tf.keras.layers.Conv2D( filters=d, kernel_size=(h,w), strides=s )
  • 46. Introducing Non-Linearity Rectified Linear Unit (ReLU) - Apply after every convolution operation (i.e.,after convolutional layers) - ReLU:pixel-by-pixel operation that replaces all negative values by zero. - Non-linear operation tf.keras.layers.ReLU Karn Intuitive CNNs
  • 47. Pooling Max Pooling,average pooling 1) Reduced dimensionality 2) Spatial invariance tf.keras.layers.Max Pool2D( pool_size=(2,2), strides=2 )
  • 48. The REctified Linear Unit (RELU) is a common non-linear detector stage after convolution x = tf.nn.conv2d(x, W, strides=[1, strides, strides, 1], padding='SAME') x = tf.nn.bias_add(x, b) x= tf.nn.relu(x) f(x) = max(0, x) When will we backpropagate through this? Once it “dies” what happens to it?
  • 49. Pooling reduces dimensionality by giving up spatial location • max pooling reports the maximum output within a defined neighborhood • Padding can be SAME or VALID x = tf.nn.max_pool(x, ksize=[1, k, k, 1], strides=[1, k, k, 1], padding='SAME') Output Input Pooling Batch H W Input channel Neighborhood [batch, height, width, channels]
  • 51. 91 CNNs for Classification: Feature Learning 1. Learn features in input image through convolution 2. Introduce non-linearity through activation function (real-world data is non-linear!) 3. Reduce dimensionality and preserve spatial invariance with pooling
  • 52. CNNs for Classification: Class Probabilities - CONV and POOL layers output high-level features of input - Fully connected layer uses these features for classifying input image - Express output as probability of image belonging to a particular class
• 53. Putting it all together
import tensorflow as tf

def generate_model():
    model = tf.keras.Sequential([
        # first convolutional layer
        tf.keras.layers.Conv2D(32, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # second convolutional layer
        tf.keras.layers.Conv2D(64, kernel_size=3, activation='relu'),
        tf.keras.layers.MaxPool2D(pool_size=2, strides=2),
        # fully connected classifier
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(1024, activation='relu'),
        tf.keras.layers.Dense(10, activation='softmax')   # 10 outputs
    ])
    return model
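A hedged usage sketch for the model above (the optimizer, loss, and MNIST-style input shape are assumptions, not from the slides):

model = generate_model()
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# x_train: float images of shape (num_images, 28, 28, 1)
# y_train: integer class labels in [0, 10)
# model.fit(x_train, y_train, batch_size=64, epochs=5)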
• 54. Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 55. 4a. Real-world feature invariance is hard
  • 56. How can computers recognize objects?
  • 57. How can computers recognize objects? Challenge: • Objects can be anywhere in the scene, in any orientation, rotation, color hue, etc. • How can we overcome this challenge? Answer: • Learn a ton of features (millions) from the bottom up • Learn the convolutional filters, rather than pre-computing them
  • 60. LeNet-5 • Gradient Based Learning Applied To Document Recognition - Y. Lecun, L. Bottou, Y. Bengio, P. Haffner; 1998 • Helped establish how we use CNNs today • Replaced manual feature extraction [LeCun et al., 1998]
  • 61. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 62. AlexNet • ImageNet Classification with Deep Convolutional Neural Networks - Alex Krizhevsky, Ilya Sutskever, Geoffrey E. Hinton; 2012 • Facilitated by GPUs, a highly optimized convolution implementation, and large datasets (ImageNet) • One of the largest CNNs to date • Has 60 million parameters, compared to the 60k parameters of LeNet-5 [Krizhevsky et al., 2012]
• 63. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners • The annual “Olympics” of computer vision. • Teams from across the world compete to see who has the best computer vision model for tasks such as classification, localization, detection, and more. • 2012 marked the first year a CNN was used to achieve a top-5 test error rate of 15.3%. • The next best entry achieved an error of 26.2%.
  • 64. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 65. AlexNet [Krizhevsky et al., 2012] Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8 • Input: 227x227x3 images (224x224 before padding) • First layer: 96 11x11 filters applied at stride 4 • Output volume size? (N-F)/s+1 = (227-11)/4+1 = 55 -> [55x55x96] • Number of parameters in this layer? (11*11*3)*96 = 35K Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 67. AlexNet [Krizhevsky et al., 2012] • Input: 227x227x3 images (224x224 before padding) • After CONV1: 55x55x96 • Second layer (MAX POOL1): 3x3 filters applied at stride 2 • Output volume size? (N-F)/s+1 = (55-3)/2+1 = 27 -> [27x27x96] • Number of parameters in this layer? 0! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. Architecture CONV1 MAX POOL1 NORM1 CONV2 MAX POOL2 NORM2 CONV3 CONV4 CONV5 Max POOL3 FC6 FC7 FC8
• 68. AlexNet [Krizhevsky et al., 2012] Architecture flow: 227×227×3 → conv 11×11, s=4, P=0 → 55×55×96 → max pool 3×3, s=2 → 27×27×96 → conv 5×5, s=1, P=2 → 27×27×256 → max pool 3×3, s=2 → 13×13×256 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×384 → conv 3×3, s=1, P=1 → 13×13×256 → max pool 3×3, s=2 → 6×6×256 → … This slide is taken from Andrew Ng
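The whole chain can be verified with the output-size rule; a small sketch (the layer list simply mirrors the slide above):

def out_size(n, f, s, p=0):
    # Output size = (N + 2P - F) / stride + 1
    return (n + 2 * p - f) // s + 1

n = 227
for name, f, s, p in [('conv1 11x11', 11, 4, 0), ('pool1 3x3', 3, 2, 0),
                      ('conv2 5x5', 5, 1, 2), ('pool2 3x3', 3, 2, 0),
                      ('conv3 3x3', 3, 1, 1), ('conv4 3x3', 3, 1, 1),
                      ('conv5 3x3', 3, 1, 1), ('pool3 3x3', 3, 2, 0)]:
    n = out_size(n, f, s, p)
    print(name, '->', n)   # 55, 27, 27, 13, 13, 13, 13, 6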
• 69. AlexNet [Krizhevsky et al., 2012] … → FC 4096 → FC 4096 → Softmax 1000. This slide is taken from Andrew Ng
  • 70. AlexNet [Krizhevsky et al., 2012] Details/Retrospectives: • first use of ReLU • used Norm layers (not common anymore) • heavy data augmentation • dropout 0.5 • batch size 128 • 7 CNN ensemble Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 71. AlexNet [Krizhevsky et al., 2012] • Trained on GTX 580 GPU with only 3 GB of memory. • Network spread across 2 GPUs, half the neurons (feature maps) on each GPU. • CONV1, CONV2, CONV4, CONV5: Connections only with feature maps on same GPU. • CONV3, FC6, FC7, FC8: Connections with all feature maps in preceding layer, communication across GPUs. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 72. AlexNet AlexNet was the coming-out party for CNNs in the computer vision community. This was the first time a model performed so well on the historically difficult ImageNet dataset. This paper illustrated the benefits of CNNs and backed them up with record-breaking performance in the competition. [Krizhevsky et al., 2012]
  • 73. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 74. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 75. VGGNet • Very Deep Convolutional Networks For Large Scale Image Recognition - Karen Simonyan and Andrew Zisserman; 2015 • The runner-up at the ILSVRC 2014 competition • Significantly deeper than AlexNet • 140 million parameters [Simonyan and Zisserman, 2014]
• 76. VGGNet [Simonyan and Zisserman, 2014] • Smaller filters: only 3x3 CONV filters, stride 1, pad 1, and 2x2 MAX POOL, stride 2 • Deeper network: AlexNet: 8 layers; VGGNet: 16–19 layers • ZFNet: 11.7% top-5 error in ILSVRC’13 • VGGNet: 7.3% top-5 error in ILSVRC’14. Architecture: Input → [3x3 conv, 64] ×2 → pool/2 → [3x3 conv, 128] ×2 → pool/2 → [3x3 conv, 256] ×2 → pool/2 → [3x3 conv, 512] ×3 → pool/2 → [3x3 conv, 512] ×3 → pool/2 → FC 4096 → FC 4096 → FC 1000 → Softmax. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 77. VGGNet [Simonyan and Zisserman, 2014] • Why use smaller filters (3x3 conv)? A stack of three 3x3 conv (stride 1) layers has the same effective receptive field as one 7x7 conv layer. • What is the effective receptive field of three 3x3 conv (stride 1) layers? 7x7. But deeper, with more non-linearities, and fewer parameters: 3 × (3²C²) = 27C² vs. 7²C² = 49C², for C channels per layer. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
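The parameter arithmetic spelled out as a quick sketch (C = 256 is an arbitrary example value; biases ignored):

C = 256                              # channels per layer (example value)
stacked_3x3 = 3 * (3 * 3 * C * C)    # three 3x3 conv layers: 27 * C^2
single_7x7 = 7 * 7 * C * C           # one 7x7 conv layer:    49 * C^2
print(stacked_3x3, single_7x7)       # 1,769,472 vs 3,211,264 weights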
• 78. VGGNet [Simonyan and Zisserman, 2014] VGG16: TOTAL memory: 24M * 4 bytes ~= 96MB / image; TOTAL params: 138M parameters. Architecture: Input → [3x3 conv, 64] ×2 → pool → [3x3 conv, 128] ×2 → pool → [3x3 conv, 256] ×3 → pool → [3x3 conv, 512] ×3 → pool → [3x3 conv, 512] ×3 → pool → FC 4096 → FC 4096 → FC 1000 → Softmax. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 79. [Simonyan and Zisserman, 2014] VGG16 layer-by-layer memory and parameters:
Input: memory 224*224*3=150K, params 0
3x3 conv, 64: memory 224*224*64=3.2M, params (3*3*3)*64 = 1,728
3x3 conv, 64: memory 224*224*64=3.2M, params (3*3*64)*64 = 36,864
Pool: memory 112*112*64=800K, params 0
3x3 conv, 128: memory 112*112*128=1.6M, params (3*3*64)*128 = 73,728
3x3 conv, 128: memory 112*112*128=1.6M, params (3*3*128)*128 = 147,456
Pool: memory 56*56*128=400K, params 0
3x3 conv, 256: memory 56*56*256=800K, params (3*3*128)*256 = 294,912
3x3 conv, 256: memory 56*56*256=800K, params (3*3*256)*256 = 589,824
3x3 conv, 256: memory 56*56*256=800K, params (3*3*256)*256 = 589,824
Pool: memory 28*28*256=200K, params 0
3x3 conv, 512: memory 28*28*512=400K, params (3*3*256)*512 = 1,179,648
3x3 conv, 512: memory 28*28*512=400K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 28*28*512=400K, params (3*3*512)*512 = 2,359,296
Pool: memory 14*14*512=100K, params 0
3x3 conv, 512: memory 14*14*512=100K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 14*14*512=100K, params (3*3*512)*512 = 2,359,296
3x3 conv, 512: memory 14*14*512=100K, params (3*3*512)*512 = 2,359,296
Pool: memory 7*7*512=25K, params 0
FC 4096: memory 4096, params 7*7*512*4096 = 102,760,448
FC 4096: memory 4096, params 4096*4096 = 16,777,216
FC 1000: memory 1000, params 4096*1000 = 4,096,000
Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
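Each row follows a single rule: a conv layer's weight count is (filter height × filter width × input channels) × output channels. A small sketch reproducing a few of the entries above:

def conv_params(k, c_in, c_out):
    # Weights of a k x k conv layer, ignoring biases: (k*k*c_in)*c_out
    return k * k * c_in * c_out

print(conv_params(3, 3, 64))      # first conv:    1,728
print(conv_params(3, 64, 64))     # second conv:   36,864
print(conv_params(3, 512, 512))   # late convs:    2,359,296
print(7 * 7 * 512 * 4096)         # first FC 4096: 102,760,448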
  • 80. VGGNet [Simonyan and Zisserman, 2014] Details/Retrospectives : • ILSVRC’14 2nd in classification, 1st in localization • Similar training procedure as AlexNet • No Local Response Normalisation (LRN) • Use VGG16 or VGG19 (VGG19 only slightly better, more memory) • Use ensembles for best results • FC7 features generalize well to other tasks • Trained on 4 Nvidia Titan Black GPUs for two to three weeks. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 81. VGGNet VGGNet reinforced the notion that convolutional neural networks need a deep stack of layers for this hierarchical representation of visual data to work. Keep it deep. Keep it simple. [Simonyan and Zisserman, 2014]
  • 82. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 83. GoogleNet • Going Deeper with Convolutions - Christian Szegedy et al.; 2015 • ILSVRC 2014 competition winner • Also significantly deeper than AlexNet • 12× fewer parameters than AlexNet • Focused on computational efficiency [Szegedy et al., 2014]
  • 84. GoogleNet • 22 layers • Efficient “Inception” module - strayed from the general approach of simply stacking conv and pooling layers on top of each other in a sequential structure • No FC layers • Only 5 million parameters! • ILSVRC’14 classification winner (6.7% top 5 error) [Szegedy et al., 2014]
• 85. GoogleNet “Inception module”: design a good local network topology (a network within a network) and then stack these modules on top of each other. The previous layer feeds four parallel branches: 1x1 convolution; 1x1 convolution → 3x3 convolution; 1x1 convolution → 5x5 convolution; 3x3 max pooling → 1x1 convolution. The branch outputs are merged by filter concatenation. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
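A minimal Keras functional-API sketch of such a module (filter counts are illustrative placeholders, not the paper's exact values):

import tensorflow as tf
from tensorflow.keras import layers

def inception_module(x, f1=64, f3=128, f5=32, fp=32, fr=96):
    # Branch 1: 1x1 conv
    b1 = layers.Conv2D(f1, 1, padding='same', activation='relu')(x)
    # Branch 2: 1x1 conv (dimension reduction) -> 3x3 conv
    b2 = layers.Conv2D(fr, 1, padding='same', activation='relu')(x)
    b2 = layers.Conv2D(f3, 3, padding='same', activation='relu')(b2)
    # Branch 3: 1x1 conv (dimension reduction) -> 5x5 conv
    b3 = layers.Conv2D(fr, 1, padding='same', activation='relu')(x)
    b3 = layers.Conv2D(f5, 5, padding='same', activation='relu')(b3)
    # Branch 4: 3x3 max pool -> 1x1 conv
    b4 = layers.MaxPool2D(3, strides=1, padding='same')(x)
    b4 = layers.Conv2D(fp, 1, padding='same', activation='relu')(b4)
    # Filter concatenation along the channel axis
    return layers.Concatenate(axis=-1)([b1, b2, b3, b4])

# Example: x = tf.keras.Input(shape=(28, 28, 192)); y = inception_module(x)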
  • 86. GoogleNet Details/Retrospectives : • Deeper networks, with computational efficiency • 22 layers • Efficient “Inception” module • No FC layers • 12x less params than AlexNet • ILSVRC’14 classification winner (6.7% top 5 error) Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [Szegedy et al., 2014]
• 87. GoogleNet Introduced the idea that CNN layers don't always have to be stacked up sequentially. By coming up with the Inception module, the authors showed that creative structuring of layers can lead to improved performance and computational efficiency. [Szegedy et al., 2014]
  • 88. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 89. ResNet • Deep Residual Learning for Image Recognition - Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; 2015 • Extremely deep network – 152 layers • Deeper neural networks are more difficult to train. • Deep networks suffer from vanishing and exploding gradients. • Present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. [He et al., 2015]
• 90. ResNet • ILSVRC’15 classification winner (3.57% top-5 error; humans generally hover around a 5–10% error rate) • Swept all classification and detection competitions in ILSVRC’15 and COCO’15! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 91. ResNet • What happens when we continue stacking deeper layers on a convolutional neural network? • 56-layer model performs worse on both training and test error -> The deeper model performs worse (not caused by overfitting)! Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
• 92. ResNet • Hypothesis: this is an optimization problem; very deep networks are harder to optimize. • Solution: use network layers to fit a residual mapping instead of directly trying to fit the desired underlying mapping. • Skip connections allow us to take the activation from one layer and feed it into another layer much deeper in the network. • Use layers to fit the residual F(x) = H(x) – x instead of H(x) directly. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9. [He et al., 2015]
  • 93. ResNet Residual Block Input x goes through conv-relu-conv series and gives us F(x). That result is then added to the original input x. Let’s call that H(x) = F(x) + x. In traditional CNNs, H(x) would just be equal to F(x). So, instead of just computing that transformation (straight from x to F(x)), we’re computing the term that we have to add, F(x), to the input, x. [He et al., 2015]
• 94. ResNet Shortcut / skip connection [He et al., 2015]
Plain path (Linear → ReLU → Linear → ReLU):
z[l+1] = W[l+1] a[l] + b[l+1]; a[l+1] = g(z[l+1])
z[l+2] = W[l+2] a[l+1] + b[l+2]; a[l+2] = g(z[l+2])
With a skip connection carrying a[l] forward:
a[l+2] = g(z[l+2] + a[l]) = g(W[l+2] a[l+1] + b[l+2] + a[l])
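These equations map directly onto a small Keras sketch of a two-layer residual block (a minimal version for illustration; the actual ResNet also uses batch normalization, omitted here). It assumes the input already has `filters` channels so the addition is shape-compatible; otherwise the shortcut needs its own 1x1 conv:

import tensorflow as tf
from tensorflow.keras import layers

def residual_block(x, filters):
    shortcut = x                                       # a[l], carried by the skip connection
    y = layers.Conv2D(filters, 3, padding='same',
                      activation='relu')(x)            # a[l+1] = g(z[l+1])
    y = layers.Conv2D(filters, 3, padding='same')(y)   # z[l+2]
    y = layers.Add()([y, shortcut])                    # z[l+2] + a[l]
    return layers.Activation('relu')(y)                # a[l+2] = g(z[l+2] + a[l])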
  • 95. ResNet Full ResNet architecture: • Stack residual blocks • Every residual block has two 3x3 conv layers • Periodically, double # of filters and downsample spatially using stride 2 (in each dimension) • Additional conv layer at the beginning • No FC layers at the end (only FC 1000 to output classes) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 96. ResNet • Total depths of 34, 50, 101, or 152 layers for ImageNet • For deeper networks (ResNet-50+), use “bottleneck” layer to improve efficiency (similar to GoogLeNet) [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 97. ResNet Experimental Results: • Able to train very deep networks without degrading • Deeper networks now achieve lower training errors as expected [He et al., 2015] Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 98. ResNet Among the best CNN architectures to date, and a great innovation: the idea of residual learning. Even better than human performance! [He et al., 2015]
• 99. Accuracy comparison. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 100. Forward pass time and power consumption. Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
  • 101. ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners Slide taken from Fei-Fei & Justin Johnson & Serena Yeung. Lecture 9.
• 102. Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
  • 104. An Architecture for Many Applications Detection Semantic segmentation End-to-end robotic control
• 105. Semantic Segmentation: Fully Convolutional Networks FCN: Fully Convolutional Network. Network designed with all convolutional layers, with downsampling and upsampling operations. tf.keras.layers.Conv2DTranspose (Long+ CVPR 2015)
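A minimal sketch of the FCN idea in Keras (a toy downsample/upsample pair; the input size and class count are placeholders, and real FCNs use deeper backbones plus skip connections):

import tensorflow as tf
from tensorflow.keras import layers

num_classes = 21   # placeholder class count
fcn = tf.keras.Sequential([
    layers.Input(shape=(128, 128, 3)),
    # Downsampling path: strided convolutions
    layers.Conv2D(32, 3, strides=2, padding='same', activation='relu'),
    layers.Conv2D(64, 3, strides=2, padding='same', activation='relu'),
    # Upsampling path: transposed convolutions back to input resolution
    layers.Conv2DTranspose(32, 3, strides=2, padding='same', activation='relu'),
    layers.Conv2DTranspose(num_classes, 3, strides=2, padding='same'),
    layers.Softmax(axis=-1),   # per-pixel class probabilities
])
fcn.summary()   # output shape: (None, 128, 128, num_classes)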
  • 106. Facial Detection & Recognition
• 108. Self-Driving Cars: Navigation from Visual Perception Raw perception I (e.g., camera) + coarse maps M (e.g., GPS) → possible control commands. Amini+ ICRA 2019
  • 109. End-to-End Framework for Autonomous Navigation Entire model trained end-to-end without any human labelling or annotations Amini+ ICRA 2019
  • 110. Automatic Colorization of Black and White Images
• 111. Optimizing Images Post-processing feature optimization: illumination; color curves and details; color tone (warmness)
• 114. Breast Cancer Screening Breast cancer case missed by radiologist readers but detected by AI. CNN-based system outperformed expert radiologists at detecting breast cancer from mammograms
• 115. Semantic Segmentation: Biomedical Image Analysis Brain tumors (Dong+ MIUA 2017); malaria infection (Soleimany+ arXiv 2019). Panels: original, ground truth, segmentation, uncertainty
• 118. Today: Convolutional Neural Networks (CNNs) 1. Scene understanding and object recognition for machines (and humans) – Scene/object recognition challenge. Illusions reveal primitives, conflicting info – Human neurons/circuits. Visual cortex layers==abstraction. General cognition 2. Classical machine vision foundations: features, scenes, filters, convolution – Spatial structure primitives: edge detectors & other filters, feature recognition – Convolution: basics, padding, stride, object recognition, architectures 3. CNN foundations: LeNet, de novo feature learning, parameter sharing – Key ideas: learn features, hierarchy, re-use parameters, back-prop filter learning – CNN formalization: representations(Conv+ReLU+Pool)*N layers + Fully-connected 4. Modern CNN architectures: millions of parameters, dozens of layers – Feature invariance is hard: apply perturbations, learn for each variation – ImageNet progression of best performers – AlexNet: First top performer CNN, 60M parameters (from 60k in LeNet-5), ReLU – VGGNet: simpler but deeper (8→19 layers), 140M parameters, ensembles – GoogleNet: new primitive=inception module, 5M params, no FC, efficiency – ResNet: 152 layers, vanishing gradients → fit residuals to enable learning 5. Countless applications: General architecture, enormous power – Semantic segmentation, facial detection/recognition, self-driving, image colorization, optimizing pictures/scenes, up-scaling, medicine, biology, genomics
• 119. Deep Learning for Computer Vision: Summary Foundations • Why computer vision? • Representing images • Convolutions for feature extraction CNNs • CNN architecture • Application to classification • ImageNet Applications • Segmentation, image captioning, control • Security, medicine, robotics