WINSEM2024-25 BMEE407L TH VL2024250503563 2025-03-28 Reference-Material-I

Convolutional Neural Networks (CNNs)

Convolutional Neural Networks

A convolutional neural network is a type of supervised deep learning algorithm that
uses three-dimensional data for image classification, speech and audio recognition,
and object recognition tasks.
• Prior to CNNs, manual, time-consuming feature extraction methods were used to identify
objects in images.
• CNNs now provide a more scalable approach to image classification and object recognition
tasks, leveraging principles from linear algebra, specifically matrix multiplication, to
identify patterns within an image.
• They can be computationally demanding, requiring graphical processing units (GPUs) to
train models.
They have three main types of layers, which are:
• Convolutional layer
• Pooling layer
• Fully-connected (FC) layer
The convolutional layer is the first layer of a convolutional network. While
convolutional layers can be followed by additional convolutional layers or pooling
layers, the fully-connected layer is the final layer.
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks

• Convolutional neural networks (CNNs) were primarily designed for image data
• CNNs use a convolutional operator for extracting data features
 Allows parameter sharing
 Efficient to train
 Have less parameters than NNs with fully-connected layers
• CNNs are robust to spatial translations of objects in images
• A convolutional filter slides (i.e., convolves) across the image

[Figure: a 3x3 convolutional filter sliding across an input matrix.
Picture from: http://deeplearning.stanford.edu/wiki/index.php/Feature_extraction_using_convolution]


Convolutional Neural Networks (CNNs)
Convolutional Neural Networks

• When the convolutional filters are scanned over the image, they capture useful
features
 E.g., edge detection by convolutions

Filter:
0  1  0
1 -4  1
0  1  0

[Figure: a grayscale input image (left) and the convolved image (right) after applying the edge-detection filter above; only the edges of the original image remain.]

Input Image / Convoluted Image
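As a sketch of how such a filter produces an edge map, the following minimal NumPy example applies the Laplacian edge filter above by hand; the 5x5 image (a bright vertical bar on a dark background) is a made-up illustration, not from the slides.

```python
import numpy as np

# Hypothetical 5x5 grayscale "image": a bright vertical bar on a dark background.
image = np.array([
    [0., 0., 1., 0., 0.],
    [0., 0., 1., 0., 0.],
    [0., 0., 1., 0., 0.],
    [0., 0., 1., 0., 0.],
    [0., 0., 1., 0., 0.],
])

# The Laplacian edge-detection filter from the slide.
kernel = np.array([
    [0.,  1., 0.],
    [1., -4., 1.],
    [0.,  1., 0.],
])

def convolve2d(img, k):
    """'Valid' convolution: slide the kernel over every 3x3 patch."""
    kh, kw = k.shape
    oh, ow = img.shape[0] - kh + 1, img.shape[1] - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            out[i, j] = np.sum(img[i:i+kh, j:j+kw] * k)
    return out

edges = convolve2d(image, kernel)
print(edges)  # nonzero responses only around the edges of the bar
```

Flat regions of the image produce zero responses; only the transitions between dark and bright pixels survive, which is exactly the edge-detection behavior described above.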

Slide credit: Param Vir Singh – Deep Learning


Convolutional Neural Networks (CNNs)
Convolutional layer
The convolutional layer is the core building block of a CNN, and it is where the
majority of computation occurs.
It requires input data, a filter, and a feature map.
E.g., take a color image as the input:
• a 3-D matrix of pixels (height, width, and depth), where the depth corresponds to the RGB channels of the image;
• a feature detector, also known as a kernel or a filter, which moves across the receptive
fields of the image, checking whether the feature is present. This process is known as a
convolution.

Convolutional Neural Networks (CNNs)
Convolutional layer
• The feature detector is a two-dimensional (2-D) array of weights, which represents part of
the image. Filters can vary in size, but a 3x3 matrix is typical;
• The filter is then applied to an area of the image, and a dot product is calculated between
the input pixels and the filter.
• This dot product is then fed into an output array.
• Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept
across the entire image.
• The final output from the series of dot products from the input and the filter is known as a
feature map, activation map, or a convolved feature.
• Note that the weights in the feature detector remain fixed as it moves across the image,
which is also known as parameter sharing.
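The sliding dot product and parameter sharing described above can be sketched in a few lines of NumPy; the input values and the 3x3 averaging filter here are illustrative assumptions, not from the slides.

```python
import numpy as np

def conv2d(x, w, stride=1):
    """Slide filter w over x; each output cell is a dot product of a patch
    with the SAME weights w (parameter sharing)."""
    f = w.shape[0]
    n = x.shape[0]
    out_n = (n - f) // stride + 1
    out = np.zeros((out_n, out_n))
    for i in range(out_n):
        for j in range(out_n):
            patch = x[i*stride:i*stride+f, j*stride:j*stride+f]
            out[i, j] = np.sum(patch * w)   # dot product of patch and filter
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # toy 4x4 "image"
w = np.ones((3, 3)) / 9.0                      # assumed 3x3 averaging filter
print(conv2d(x, w))                            # 2x2 feature map
```

Because the same `w` is reused at every position, the layer needs only 9 weights here, regardless of the input size; a fully connected layer mapping 16 inputs to 4 outputs would need 64.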

Convolutional Neural Networks (CNNs)
Convolutional layer
The weight values adjust during training through the process of backpropagation and
gradient descent. However, there are three hyperparameters which affect the volume
size of the output that need to be set before the training of the neural network begins.
These include:
1. The number of filters affects the depth of the output. For example, three distinct
filters would yield three different feature maps, creating a depth of three.
2. Stride is the distance, or number of pixels, that the kernel moves over the input
matrix. While stride values of two or greater are rare, a larger stride yields a smaller
output.
3. Zero-padding is usually used when the filters do not fit the input image. This sets all
elements that fall outside of the input matrix to zero, producing a larger or equally
sized output.

After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation
to the feature map, introducing nonlinearity to the model.
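The ReLU step can be illustrated directly: it simply zeroes out the negative entries of the feature map (the sample values below are made up).

```python
import numpy as np

feature_map = np.array([[ 1.5, -2.0],
                        [-0.5,  3.0]])
relu = np.maximum(0.0, feature_map)   # negative responses are zeroed
print(relu)
```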

Convolutional Neural Networks (CNNs)
Pooling layers
Pooling layers, also known as downsampling, conduct dimensionality reduction,
reducing the number of parameters in the input.
Similar to the convolutional layer, the pooling operation sweeps a filter across the
entire input, but the difference is that this filter does not have any weights.
Instead, the kernel applies an aggregation function to the values within the
receptive field, populating the output array.

There are two main types of pooling:


• Max pooling: As the filter moves across the input, it selects the pixel with the
maximum value to send to the output array. As an aside, this approach tends to be
used more often compared to average pooling.
• Average pooling: As the filter moves across the input, it calculates the average
value within the receptive field to send to the output array.
While a lot of information is lost in the pooling layer, it also brings a number of
benefits to the CNN: it helps to reduce complexity, improve efficiency, and limit the
risk of overfitting.

Convolutional Neural Networks (CNNs)
Fully-connected layer

• In the fully-connected layer, each node in the output layer connects directly to a
node in the previous layer.
• This layer performs the task of classification based on the features extracted
through the previous layers and their different filters.
• While convolutional and pooling layers tend to use ReLU functions, FC layers
usually leverage a softmax activation function to classify inputs appropriately,
producing a probability from 0 to 1.
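As a sketch of the softmax step, assuming three hypothetical class scores coming out of the FC layer:

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # hypothetical FC-layer outputs for 3 classes
probs = softmax(logits)
print(probs)                          # probabilities in (0, 1), summing to 1
```

The class with the largest score receives the largest probability, which is what the final classification reads off.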

Steps for creating a CNN for image recognition

1. Image channels
2. Convolution
3. Pooling
4. Flattening
5. Full connection

CNN - Image channels
Image channels
• Find a way to represent the image in a numerical format - making an image
compatible with the CNN algorithm.
• The image is represented using its pixels, each mapped to a number between 0 and 255.
• Each number represents the pixel intensity, ranging from 0 for black to 255 for white.

For a black-and-white image, an image with length m and width n is represented as a
2-D array of size m × n. Each cell within this array contains its corresponding pixel
value.

For a colored image of the same size, a 3-D array of size m × n × 3 is used. Each pixel
from the image is represented by its corresponding pixel values in three different
channels, one each for red, green, and blue.
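A quick sketch of the two representations, with toy dimensions chosen purely for illustration:

```python
import numpy as np

m, n = 4, 6                        # assumed toy height and width
gray = np.zeros((m, n))            # black-and-white image: one 2-D array of size m x n
color = np.zeros((m, n, 3))        # color image: 3-D array, one channel each for R, G, B
print(gray.shape, color.shape)
```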

CNN - Image channels
Image channels

The image is represented as a 3-dimensional array, with each channel representing
red, green, and blue values, respectively, as shown in the image.
CNN - Convolution
Convolution

Now that the image has been represented as a combination of numbers, the next step
in the process is to identify the key features within the image.

These features are extracted using a method known as convolution.

• Convolution is an operation where one function modifies (or convolves) the shape
of another.
• Convolutions in images are generally applied for various reasons such as to
sharpen, smooth, and intensify.
• In CNN, convolutions are applied to extract the prominent features within the
images.

CNN - Convolution
How are features detected
To extract key features within an image, a filter or a kernel is used. A filter is an
array that represents the feature to be extracted.
• This filter is strided over the input array, and the resulting convolution is a 2-D array that
contains the correlation of the image with respect to the filter that was applied.
• The output array is referred to as the feature map.
For simplicity, the following example shows how an edge detector filter is applied
to just the blue channel output from the previous step.

CNN - Convolution
How are features detected

The resulting image contains just the edges present in the original input. In summary,
for an input image of size n × n and a filter of size f × f, the resulting feature map
is of size (n − f + 1) × (n − f + 1).

CNN - Convolution
Strided convolutions
During the process of convolution, you can see how the input array is transformed
into a smaller array while still maintaining the spatial correlation between the pixels
by applying filters. How can the size of the input array be compressed even further?
• In the previous section, you saw how the filter is applied to each 3x3 section of the input
image. The window slides by one column to the right each time and, at the end of each
row, slides down by one row.
• In this case, sliding of the filter over the input was done one step at a time. This is referred
to as striding. The following example shows the same convolution, but strided with 2 steps.

For an input image of size n × n and a filter of size f × f with stride = s, the
resulting output will be of size (⌊(n − f)/s⌋ + 1) × (⌊(n − f)/s⌋ + 1).
CNN - Convolution
Padding
• During convolution, notice that the size of the feature map is reduced drastically
when compared to the input. Also, notice that the filter touches the cells in the
corners just once, while the cells toward the center are covered several times.
• To ensure that the size of the feature map retains its original input size and
enables equal assessment of all pixels, you apply one or more layers of padding to
the original input array. Padding refers to the process of adding extra layers of
zeros to the outer rows and columns of the input array.

CNN - Convolution
Padding

The below image shows how 1 layer of padding is added to the input array before a
filter is applied. In general, for an input image of size n × n and a filter of size
f × f with padding = p, the resulting output is of size (n + 2p − f + 1) × (n + 2p − f + 1),
assuming stride = 1.
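The stride and padding formulas above fold into a single output-size helper; the sketch below checks it on a few illustrative sizes (the specific numbers are assumptions for the example, not from the slides).

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Output side length for an n x n input and an f x f filter."""
    return (n + 2 * padding - f) // stride + 1

# 6x6 input, 3x3 filter, stride 1, no padding -> 4x4
assert conv_output_size(6, 3) == 4
# stride 2 shrinks the output further -> 2x2
assert conv_output_size(6, 3, stride=2) == 2
# one layer of zero-padding restores the input size ("same" convolution) -> 6x6
assert conv_output_size(6, 3, padding=1) == 6
print("all size checks pass")
```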

CNN - Convolution
How are convolutions applied over the RGB channels
• For an image represented over 3 channels, the filter is now replicated three
times, once for each channel.
• The input image is an n × n × 3 array, and the filter is an f × f × 3 array. However, the
output map is still a 2-D array.
• The convolutions on the same pixel through the different channels are added and are
collectively represented within each cell.

CNN - Convolution
How are convolutions applied over the RGB channels
For an input image of size n × n and a filter of size f × f over N channels, the image
and filters are converted into arrays of sizes n × n × N and f × f × N, respectively,
and the feature map produced is of size (n − f + 1) × (n − f + 1), assuming stride = 1.
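A minimal sketch of this per-channel summation, using random values as a stand-in for a real image and filter (shapes chosen to match the n × n × N and f × f × N description):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.random((5, 5, 3))      # 5x5 RGB image (n x n x N)
w = rng.random((3, 3, 3))      # one 3x3 filter replicated per channel (f x f x N)

out = np.zeros((3, 3))          # (n - f + 1) x (n - f + 1): a single 2-D map
for i in range(3):
    for j in range(3):
        # the per-channel dot products are summed into one output cell
        out[i, j] = np.sum(x[i:i+3, j:j+3, :] * w)

print(out.shape)
```

Even though both inputs are 3-D, the channel dimension is summed away, so each filter contributes exactly one 2-D feature map.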
CNN - Convolution
How are convolutions applied to more than one filter
In reality, the CNN model needs to use multiple filters at the same time to observe
and extract key features

CNN - Convolution
How are convolutions applied to more than one filter
In the below image, you see that applying convolution using three filters over the RGB
channels produces three arrays. Thus, for an input image of size n × n and filters
of size f × f over N channels and F filters, the feature map produced is of size
(n − f + 1) × (n − f + 1) × F, assuming that stride = 1.

CNN - Pooling layers
Pooling layers
To further reduce the size of the feature map generated from convolution, pooling is
applied before further processing. This helps to further compress the dimensions of
the feature map. For this reason, pooling is also referred to
as subsampling/downsampling.

• Pooling is the process of summarizing the features within a group of cells in the
feature map.
• This summary of cells can be acquired by taking the maximum, minimum, or
average within a group of cells.
• Each of these methods is referred to as max, min, and average pooling,
respectively.

CNN - Pooling layers
Pooling layers
• Max pooling: reports the maximum output within a rectangular neighborhood
• Average pooling: reports the average output of a rectangular neighborhood
• Pooling layers reduce the spatial size of the feature maps
 Reduce the number of parameters, prevent overfitting

MaxPool with a 2×2 filter with stride of 2

Input Matrix:       Output Matrix:
1  3  5  3          4  5
4  2  3  1          3  4
3  1  1  3
0  1  0  4
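The max-pooling example above can be reproduced with a short NumPy sketch:

```python
import numpy as np

x = np.array([[1, 3, 5, 3],
              [4, 2, 3, 1],
              [3, 1, 1, 3],
              [0, 1, 0, 4]])

def max_pool(x, size=2, stride=2):
    """Take the maximum of each size x size window, moving by `stride`."""
    n = (x.shape[0] - size) // stride + 1
    out = np.zeros((n, n), dtype=x.dtype)
    for i in range(n):
        for j in range(n):
            out[i, j] = x[i*stride:i*stride+size, j*stride:j*stride+size].max()
    return out

print(max_pool(x))  # [[4 5] [3 4]]
```

Each 2×2 block of the input collapses to its largest value, halving both spatial dimensions.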

Slide credit: Param Vir Singh – Deep Learning


CNN - Pooling layers
Pooling layers
• The below image shows how max pooling is applied with a filter of size 2 (2 × 2)
and stride = 1.
• This means that for every 2 × 2 cell group within the feature map, the
maximum value within this region is extracted into the output cell.
• It should be noted that the example shows a pooling-applied outcome of just
one feature map. However, in CNN, pooling is applied to feature maps that
result from each filter.

CNN - Flattening
Flattening
You can think of CNN as a sequence of steps that are performed to effectively capture
the important aspects of an image before applying ANN on it. In the previous steps,
you saw the different transitions that are applied to the original image.

CNN - Flattening
Flattening
The final step in this process is to make the outcomes of the CNN compatible with an
ANN. The inputs to an ANN should be in the form of a vector. To support that, flattening
is applied, which is the step that converts the multidimensional array into an n × 1
vector, as shown previously (n = total number of elements in the feature map).

Note that this example shows flattening applied to just one feature map. However, in
a CNN, flattening is applied to the feature maps that result from each filter.
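A minimal sketch of flattening, assuming a 2x2 feature map:

```python
import numpy as np

fmap = np.array([[1., 2.],
                 [3., 4.]])   # a 2x2 feature map
vec = fmap.flatten()          # -> vector input for the ANN (n = 4 elements here)
print(vec)                    # [1. 2. 3. 4.]
```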

CNN - Full connection
Full connection: a simple convolutional network
The following image shows a sample CNN that is built to recognize an apple image.
To begin, the original input image is represented using its pixel values over the RGB
channel. Convolutions and pooling are then applied to help identify feature mapping
and compress the data further.

Note that convolutions and pooling can be applied in a CNN many times. The
performance of the generated model depends on finding the right number of times
these steps should be repeated.

Convolutional Neural Networks (CNNs)
Convolutional Neural Networks

• Feature extraction architecture


 After 2 convolutional layers, a max-pooling layer reduces the size of the feature maps
(typically by 2)
 A fully connected layer and a softmax layer are added last to perform classification

[Figure: a CNN feature-extraction architecture with convolutional layers of 64, 128,
256, and 512 channels interleaved with max-pooling layers, ending in a fully connected
layer with softmax outputs for scene classes such as Living Room, Bedroom, Kitchen,
Bathroom, and Outdoor. Legend: Conv layer, Max Pool, Fully Connected Layer.]

Slide credit: Param Vir Singh – Deep Learning


Recurrent Neural Networks (RNNs)
Recurrent Neural Networks

• Recurrent NNs are used for modeling sequential data or time series and data
with varying length of inputs and outputs
 Videos, text, speech, DNA sequences, human skeletal data
• Like feedforward and CNNs, RNNs utilize training data to learn. They are distinguished
by their “memory” as they take information from prior inputs to influence the current
input and output.
• RNNs introduce recurrent connections between the neurons
 This allows processing sequential data one element at a time by selectively passing
information across a sequence
 Memory of the previous inputs is stored in the model’s internal state and affect the
model predictions
 Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
• Examples: Siri, voice search, and Google Translate

Recurrent Neural Networks (RNNs)
Recurrent Neural Networks

Recurrent networks share parameters across each layer of the network.


 While feedforward networks have different weights across each node, RNNs share the
same weight parameter within each layer of the network.
 That said, these weights are still adjusted through the processes of
backpropagation and gradient descent to facilitate learning.

Recurrent Neural Networks (RNNs)
Recurrent Neural Networks

• RNNs use the same set of weights W and U across all time steps
 A sequence of hidden states h1, h2, ..., hT is learned, which represents the memory
of the network
 The hidden state at step t, ht, is calculated based on the previous hidden state
ht-1 and the input at the current step xt, i.e., ht = f(W·ht-1 + U·xt)
 The function f is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time:

HIDDEN STATES SEQUENCE: h0 → h1 → h2 → h3 → OUTPUT
INPUT SEQUENCE: x1, x2, x3
Slide credit: Param Vir Singh – Deep Learning
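The hidden-state recurrence ht = f(W·ht-1 + U·xt) can be sketched with toy dimensions; the random weights and inputs below are purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_h = 3, 4                    # assumed toy input and hidden sizes
W = rng.normal(size=(d_h, d_h))     # hidden-to-hidden weights, shared across time
U = rng.normal(size=(d_h, d_in))    # input-to-hidden weights, shared across time

xs = [rng.normal(size=d_in) for _ in range(3)]   # input sequence x1, x2, x3
h = np.zeros(d_h)                                # initial hidden state h0
for x in xs:
    h = np.tanh(W @ h + U @ x)      # h_t = tanh(W h_{t-1} + U x_t)
print(h)
```

Note that the same W and U are reused at every step; only the hidden state h changes, which is exactly the "memory" the slides describe.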
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks

• RNNs can have one of many inputs and one of many outputs

RNN Application       Input                                  Output
Image Captioning      (an image)                             "A person riding a motorbike on dirt road"
Sentiment Analysis    "Awesome movie. Highly recommended."   Positive
Machine Translation   "Happy Diwali"                         "शुभ दीपावली"

Slide credit: Param Vir Singh – Deep Learning


Bidirectional RNNs
Recurrent Neural Networks

• Bidirectional RNNs incorporate both forward and backward passes through


sequential data
 The output may not only depend on the previous elements in the sequence, but also
on future elements in the sequence
 It resembles two RNNs stacked on top of each other

[Figure: a bidirectional RNN; a forward and a backward hidden-state sequence are
combined at each step, so the output draws on both past and future elements.]

Slide credit: Param Vir Singh – Deep Learning


LSTM Networks
Recurrent Neural Networks

• Long Short-Term Memory (LSTM) networks are a variant of RNNs


• LSTM mitigates the vanishing/exploding gradient problem
 Solution: a Memory Cell, updated at each step in the sequence
• Three gates control the flow of information to and from the Memory Cell
 Input Gate: protects the current step from irrelevant inputs
 Output Gate: prevents current step from passing irrelevant information to later steps
 Forget Gate: limits information passed from one cell to the next
• Most modern RNN models use either LSTM units or other more advanced types
of recurrent units (e.g., GRU units)
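A minimal single-step sketch of the cell with the three gates above, using toy dimensions and random weights, with biases omitted for brevity (an illustration of the gating structure, not a production LSTM implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
d_in, d_h = 3, 4                       # assumed toy input and hidden sizes
# one weight matrix per gate, each acting on the concatenation [h_prev; x]
Wf, Wi, Wo, Wc = (rng.normal(size=(d_h, d_h + d_in)) for _ in range(4))

def lstm_step(x, h_prev, c_prev):
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z)                 # forget gate: limits info carried from c_prev
    i = sigmoid(Wi @ z)                 # input gate: screens out irrelevant inputs
    o = sigmoid(Wo @ z)                 # output gate: filters what is passed onward
    c = f * c_prev + i * np.tanh(Wc @ z)   # memory cell update
    h = o * np.tanh(c)
    return h, c

h, c = np.zeros(d_h), np.zeros(d_h)
for x in [rng.normal(size=d_in) for _ in range(3)]:
    h, c = lstm_step(x, h, c)
print(h.shape, c.shape)
```

Because the cell state c is updated additively (f * c_prev + ...), gradients can flow across many steps without vanishing as quickly as in a plain RNN.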

LSTM Networks
Recurrent Neural Networks

• LSTM cell
 Input gate, output gate, forget gate, memory cell
 LSTM can learn long-term correlations within data sequences

References

1. Hung-yi Lee – Deep Learning Tutorial
2. Ismini Lourentzou – Introduction to Deep Learning
3. CS231n Convolutional Neural Networks for Visual Recognition (Stanford CS course) (link)
4. James Hays, Brown – Machine Learning Overview
5. Param Vir Singh, Shunyuan Zhang, Nikhil Malik – Deep Learning
6. Sebastian Ruder – An Overview of Gradient Descent Optimization Algorithms (link)

