Evolution of CNN Architecture
Introduction
• A Convolutional Neural Network (CNN, or ConvNet) is a special kind of multi-layer neural network, designed to recognize visual patterns directly from pixel images with minimal preprocessing.
• The ImageNet project is a large visual database designed
for use in visual object recognition software research.
• The ImageNet project runs an annual software contest, the
ImageNet Large Scale Visual Recognition Challenge
(ILSVRC), where software programs compete to correctly
classify and detect objects and scenes.
• Training set of 1.2M labelled images from 1000 categories (732 to 1300 training samples per class)
LeNet in 1998
• LeNet is a 7-level convolutional network introduced by LeCun in 1998 to classify digits; it was used by several banks to recognize hand-written numbers on cheques, digitized as 32x32-pixel greyscale input images (a sketch of its layer stack follows below).
• Processing higher-resolution images requires larger and more numerous convolutional layers, so the technique was constrained by the available computing resources.
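A minimal sketch of the LeNet-5 layer stack, for orientation. PyTorch and the exact layer sizes are assumptions for a modern re-implementation; the 1998 network used trainable subsampling and sigmoid/tanh-style activations, which plain average pooling and Tanh approximate here.

import torch
import torch.nn as nn

class LeNet5(nn.Module):
    """Sketch of LeNet-5 for 32x32 greyscale digit images (assumed layout)."""
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5),   # 32x32 -> 28x28
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5),  # 14x14 -> 10x10
            nn.Tanh(),
            nn.AvgPool2d(2),                  # 10x10 -> 5x5
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(16 * 5 * 5, 120), nn.Tanh(),
            nn.Linear(120, 84), nn.Tanh(),
            nn.Linear(84, num_classes),
        )

    def forward(self, x):
        return self.classifier(self.features(x))

logits = LeNet5()(torch.randn(1, 1, 32, 32))  # shape: (1, 10)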
AlexNet in 2012
• 8-layer CNN: 5 Conv layers, 3 FC layers
• 227×227 input
• Max pooling, ReLU nonlinearity
• Trained on two GTX 580 GPUs
Structure of AlexNet
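A hedged PyTorch sketch of the 5-conv / 3-FC stack, assuming the commonly cited 227x227 single-stream variant; the original split channels across the two GPUs and used local response normalization, both omitted here.

import torch
import torch.nn as nn

alexnet = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=11, stride=4),    # 227 -> 55
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2),                     # 55 -> 27
    nn.Conv2d(96, 256, kernel_size=5, padding=2),  # 27 -> 27
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2),                     # 27 -> 13
    nn.Conv2d(256, 384, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 384, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(384, 256, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(3, stride=2),                     # 13 -> 6
    nn.Flatten(),
    nn.Dropout(0.5), nn.Linear(256 * 6 * 6, 4096), nn.ReLU(inplace=True),  # FC1
    nn.Dropout(0.5), nn.Linear(4096, 4096), nn.ReLU(inplace=True),         # FC2
    nn.Linear(4096, 1000),                         # FC3: 1000 ImageNet classes
)

print(alexnet(torch.randn(1, 3, 227, 227)).shape)  # torch.Size([1, 1000])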
ZFNet in 2013
• ZFNet is a modified version of AlexNet that achieves better accuracy.
• One major difference between the two is that ZFNet used 7x7 filters in the first layer, whereas AlexNet used 11x11 filters.
• Bigger first-layer filters risk losing a lot of fine pixel information, which smaller filter sizes in the earlier conv layers retain.
• The number of filters increases as we go deeper.
• This network also used ReLU activations and was trained using mini-batch stochastic gradient descent.
Architecture of ZFNet
• A 224 by 224 crop of an image is presented as the input.
• This is convolved with 96 different first-layer filters, each of size 7 by 7, using a stride of 2 in both x and y.
• The resulting feature maps are then:
– Passed through a rectified linear function (ReLU)
– Pooled (max, within 3x3 regions, using stride 2)
– Contrast normalized across feature maps to give 96 different 55 by 55 element feature maps (this first-layer pipeline is sketched below).
• Similar operations are repeated in layers 2, 3, 4, and 5.
• The last two layers are fully connected, taking features from the top convolutional layer as input.
• The final layer is a C-way softmax function, C being the number of classes.
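The first-layer pipeline described above, sketched in PyTorch. The padding value and the use of LocalResponseNorm as a stand-in for the paper's contrast normalization are assumptions, chosen so the stated 55 by 55 map size comes out exactly.

import torch
import torch.nn as nn

zfnet_layer1 = nn.Sequential(
    nn.Conv2d(3, 96, kernel_size=7, stride=2, padding=1),  # 224x224 -> 110x110
    nn.ReLU(inplace=True),                                  # rectified linear function
    nn.MaxPool2d(3, stride=2, ceil_mode=True),              # 110x110 -> 55x55
    nn.LocalResponseNorm(5),                                # normalize across feature maps
)

print(zfnet_layer1(torch.randn(1, 3, 224, 224)).shape)  # torch.Size([1, 96, 55, 55])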
VGG in 2014
• VGG Net used 3x3 filters, compared to 11x11 filters in AlexNet and 7x7 in ZFNet.
• Two consecutive 3x3 filters give an effective receptive field of 5x5, and three 3x3 filters give a receptive field of 7x7, at a lower parameter cost and with extra non-linearities in between (verified numerically below).
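A small helper (hypothetical name, for illustration) that verifies the receptive-field arithmetic:

def stacked_receptive_field(num_layers, kernel_size=3, stride=1):
    """Receptive field of a stack of identical convolution layers."""
    rf, jump = 1, 1
    for _ in range(num_layers):
        rf += (kernel_size - 1) * jump  # each layer widens the field
        jump *= stride                  # stride compounds across layers
    return rf

print(stacked_receptive_field(2))  # 5 -> two 3x3 convs see a 5x5 region
print(stacked_receptive_field(3))  # 7 -> three 3x3 convs see a 7x7 region

The stack is also cheaper: for C input and output channels, three 3x3 layers use 27C^2 weights versus 49C^2 for a single 7x7 layer.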
GoogLeNet in 2014
• GoogLeNet proposed the inception module: parallel convolution and pooling branches whose outputs are concatenated, forming a mini-network that is repeated throughout the architecture (a sketch of one module follows).
• GoogLeNet uses 9 inception modules and eliminates all fully connected layers, using global average pooling to go from 7x7x1024 to 1x1x1024. This saves a lot of parameters.
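A sketch of one inception module in PyTorch; the parameter names and the example channel split (borrowed from the paper's inception (3a) stage) are illustrative assumptions.

import torch
import torch.nn as nn

class Inception(nn.Module):
    """Four parallel branches, concatenated along the channel dimension."""
    def __init__(self, in_ch, c1, c3_red, c3, c5_red, c5, pool_proj):
        super().__init__()
        self.b1 = nn.Sequential(nn.Conv2d(in_ch, c1, 1), nn.ReLU(inplace=True))
        self.b2 = nn.Sequential(
            nn.Conv2d(in_ch, c3_red, 1), nn.ReLU(inplace=True),  # 1x1 bottleneck
            nn.Conv2d(c3_red, c3, 3, padding=1), nn.ReLU(inplace=True))
        self.b3 = nn.Sequential(
            nn.Conv2d(in_ch, c5_red, 1), nn.ReLU(inplace=True),
            nn.Conv2d(c5_red, c5, 5, padding=2), nn.ReLU(inplace=True))
        self.b4 = nn.Sequential(
            nn.MaxPool2d(3, stride=1, padding=1),
            nn.Conv2d(in_ch, pool_proj, 1), nn.ReLU(inplace=True))

    def forward(self, x):
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

m = Inception(192, 64, 96, 128, 16, 32, 32)   # 192 -> 64+128+32+32 = 256 channels
print(m(torch.randn(1, 192, 28, 28)).shape)   # torch.Size([1, 256, 28, 28])

gap = nn.AdaptiveAvgPool2d(1)                 # global average pooling: 7x7x1024 -> 1x1x1024
print(gap(torch.randn(1, 1024, 7, 7)).shape)  # torch.Size([1, 1024, 1, 1])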
ResNet in 2015
• There are 152 layers in the Microsoft ResNet.
• In principle, increasing the number of layers should keep decreasing the error rate, but plain networks degrade beyond a certain depth; ResNet's identity skip connections let each block learn only a residual, which is what makes 152-layer training feasible (see the sketch below).
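A sketch of a basic residual block in PyTorch. ResNet-152 itself stacks three-layer bottleneck blocks; this two-layer variant shows the same skip-connection idea.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """output = ReLU(F(x) + x): the identity shortcut lets gradients
    bypass the conv layers, which is what makes very deep nets trainable."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + x)  # skip connection: add the input back

block = ResidualBlock(64)
print(block(torch.randn(1, 64, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])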
Comparison