WINSEM2024-25 BMEE407L TH VL2024250503563 2025-03-28 Reference-Material-I
• Convolutional neural networks (CNNs) were primarily designed for image data
• CNNs use a convolutional operator for extracting data features
Allows parameter sharing
Efficient to train
Have fewer parameters than NNs with fully-connected layers
• CNNs are robust to spatial translations of objects in images
• A convolutional filter slides (i.e., convolves) across the image
[Figure: a 3x3 convolutional filter sliding over an input matrix]
• When the convolutional filters are scanned over the image, they capture useful
features
E.g., edge detection by convolutions
Filter (Laplacian edge detector):
 0  1  0
 1 -4  1
 0  1  0
[Figure: grid of normalized (0-1) grayscale pixel values of the example input image used for the edge-detection demonstration]
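As an illustration (not from the original slides), the following minimal Python sketch applies the Laplacian filter above to a hypothetical grayscale image using SciPy's convolve2d; the image contents here are random placeholders.

    import numpy as np
    from scipy.signal import convolve2d

    # Laplacian edge-detection filter from the slide
    laplacian = np.array([[0,  1, 0],
                          [1, -4, 1],
                          [0,  1, 0]])

    # Hypothetical grayscale image, pixel values normalized to [0, 1]
    image = np.random.rand(20, 20)

    # 'valid' keeps only positions where the filter fully overlaps the image
    edges = convolve2d(image, laplacian, mode="valid")
    print(edges.shape)  # (18, 18): (20 - 3 + 1) on each side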
Convolutional Neural Networks (CNNs)
Convolutional layer
• The feature detector is a two-dimensional (2-D) array of weights, which represents part of
the image. While filters can vary in size, a 3x3 matrix is typical.
• The filter is then applied to an area of the image, and a dot product is calculated between
the input pixels and the filter.
• This dot product is then fed into an output array.
• Afterwards, the filter shifts by a stride, repeating the process until the kernel has swept
across the entire image.
• The final output from the series of dot products from the input and the filter is known as a
feature map, activation map, or a convolved feature.
• Note that the weights in the feature detector remain fixed as it moves across the image,
which is also known as parameter sharing.
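A minimal NumPy sketch of the sliding dot product described above (illustrative, assuming a square image and kernel); note that the same kernel weights are reused at every position, which is the parameter sharing mentioned in the last bullet.

    import numpy as np

    def conv2d(image, kernel):
        # Slide the kernel over the image; at each position take the
        # dot product of the kernel with the patch it covers.
        n, f = image.shape[0], kernel.shape[0]
        out = np.zeros((n - f + 1, n - f + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                patch = image[i:i + f, j:j + f]
                out[i, j] = np.sum(patch * kernel)  # same weights everywhere
        return out

    feature_map = conv2d(np.random.rand(5, 5), np.random.rand(3, 3))
    print(feature_map.shape)  # (3, 3)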
Convolutional Neural Networks (CNNs)
Convolutional layer
The weight values adjust during training through the process of backpropagation and
gradient descent. However, there are three hyperparameters which affect the volume
size of the output that need to be set before the training of the neural network begins.
These include:
1. The number of filters affects the depth of the output. For example, three distinct
filters would yield three different feature maps, creating a depth of three.
2. Stride is the distance, or number of pixels, that the kernel moves over the input
matrix. While stride values of two or greater are rare, a larger stride yields a smaller
output.
3. Zero-padding is usually used when the filters do not fit the input image. This sets all
elements that fall outside of the input matrix to zero, producing a larger or equally
sized output.
After each convolution operation, a CNN applies a Rectified Linear Unit (ReLU) transformation
to the feature map, introducing nonlinearity to the model.
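These three hyperparameters determine the output size through the standard formula (n - f + 2p)/s + 1 for input width n, filter width f, padding p, and stride s. A short sketch with illustrative numbers:

    import numpy as np

    def output_size(n, f, p, s):
        # Standard formula for the spatial size of a convolution output
        return (n - f + 2 * p) // s + 1

    print(output_size(n=28, f=3, p=1, s=1))  # 28: this padding preserves size

    def relu(x):
        # ReLU transformation applied to the feature map after convolution
        return np.maximum(0, x)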
Convolutional Neural Networks (CNNs)
Pooling layers
Pooling layers, also known as downsampling, conduct dimensionality reduction,
reducing the number of parameters in the input.
Similar to the convolutional layer, the pooling operation sweeps a filter across the
entire input, but the difference is that this filter does not have any weights.
Instead, the kernel applies an aggregation function to the values within the
receptive field, populating the output array.
Convolutional Neural Networks (CNNs)
Fully-connected layer
• In the fully-connected layer, each node in the output layer connects directly to a
node in the previous layer.
• This layer performs the task of classification based on the features extracted
through the previous layers and their different filters.
• While convolutional and pooling layers tend to use ReLU functions, FC layers
usually leverage a softmax activation function to classify inputs appropriately,
producing a probability from 0 to 1.
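A minimal sketch of the softmax function mentioned above (the score values are illustrative):

    import numpy as np

    def softmax(scores):
        # Subtract the max for numerical stability; the result is unchanged
        exp = np.exp(scores - np.max(scores))
        return exp / np.sum(exp)

    probs = softmax(np.array([2.0, 1.0, 0.1]))
    print(probs, probs.sum())  # probabilities in (0, 1) that sum to 1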
Steps for creating a CNN for image recognition
1. Image channels
2. Convolution
3. Pooling
4. Flattening
5. Full connection
CNN - Image channels
Image channels
• Find a way to represent the image in a numerical format, making the image
compatible with the CNN algorithm.
• The image is represented using its pixels, each mapped to a number between 0 and 255.
• Each number represents an intensity ranging from 0 for black to 255 for white.
For a black-and-white image, an image with length m and width n is represented as a
2-D array of size m x n. Each cell within this array contains its corresponding pixel
value.
For a colored image of the same size, a 3-D array of size m x n x 3 is used. Each pixel from the image
is represented by its corresponding pixel values in three different channels,
pertaining to the red, green, and blue channels.
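A small sketch of the two representations (the image size is illustrative):

    import numpy as np

    m, n = 28, 28                  # illustrative image size
    gray = np.zeros((m, n))        # one pixel value per cell
    rgb = np.zeros((m, n, 3))      # three channels: red, green, blue
    print(gray.shape, rgb.shape)   # (28, 28) (28, 28, 3)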
CNN - Image channels
Image channels
Now that the image has been represented as a combination of numbers, the next step
in the process is to identify the key features within the image.
• Convolution is an operation where one function modifies (or convolves) the shape
of another.
• Convolutions in images are generally applied for various reasons such as to
sharpen, smooth, and intensify.
• In CNN, convolutions are applied to extract the prominent features within the
images.
CNN - Convolution
How are features detected?
To extract key features within an image, a filter or a kernel is used. A filter is an
array that represents the feature to be extracted.
• This filter is strided over the input array, and the resulting convolution is a 2-D array that
contains the correlation of the image with respect to the filter that was applied.
• The output array is referred to as the feature map.
For simplicity, the following example shows how an edge-detector filter is applied
to just the blue-channel output from the previous step.
CNN - Convolution
How are features detected?
The resulting image contains just the edges present in the original input. The filter
used in the previous example is of size 3x3 and is applied to an input image of
size n x n; the resulting feature map is of size (n - 2) x (n - 2). In summary, for an input
image of size n x n and a filter of size f x f,
the resulting output is of size (n - f + 1) x (n - f + 1).
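For example (illustrative numbers), a 6x6 input convolved with a 3x3 filter yields a (6 - 3 + 1) x (6 - 3 + 1) = 4x4 feature map.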
CNN - Convolution
Strided convolutions
During the process of convolution, you can see how the input array is transformed
into a smaller array while still maintaining the spatial correlation between the pixels
by applying filters. How can the size of the input array be compressed even further?
• In the previous section, you saw how the filter is applied to each 3x3 section of the input
image. This window slides one column to the right each time and, at the end of each row,
moves down by one row.
• In this case, the filter was slid over the input one step at a time. This is referred
to as striding. The following example shows the same convolution, but strided with 2 steps.
For an input image of size n x n and a filter of size f x f with stride = s, the
resulting output will be of size (⌊(n - f)/s⌋ + 1) x (⌊(n - f)/s⌋ + 1).
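For example (illustrative numbers), a 7x7 input with a 3x3 filter and stride = 2 yields an output of size ⌊(7 - 3)/2⌋ + 1 = 3, i.e., a 3x3 feature map.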
CNN - Convolution
Padding
• During convolution, notice that the size of the feature map is reduced drastically
when compared to the input. Also, notice that the filter touches the cells in
the corners just once, while the cells toward the center are covered many times.
• To ensure that the size of the feature map retains its original input size and
enables equal assessment of all pixels, you apply one or more layers of padding to
the original input array. Padding refers to the process of adding extra layers of
zeros to the outer rows and columns of the input array.
CNN - Convolution
Padding
The image below shows how 1 layer of padding is added to the input array before a
filter is applied. For an input array of size n x n with padding set to one and
a filter of size 3x3, the output is again an n x n array. In general, for an input
image of size n x n and a filter of size f x f with padding = p, the resulting output
is of size (n + 2p - f + 1) x (n + 2p - f + 1).
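For example (illustrative numbers), a 5x5 input with one layer of padding (p = 1) and a 3x3 filter yields an output of size 5 + 2(1) - 3 + 1 = 5, i.e., the original 5x5 size is preserved.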
CNN - Convolution
How are convolutions applied over the RGB channels
• For an image represented over 3 channels, the filter is now replicated three
times, once for each channel.
• The input image is an n x n x 3 array, and the filter is an f x f x 3 array. However, the
output map is still a 2-D array.
• The convolutions on the same pixel across the different channels are summed and are
collectively represented within each cell.
CNN - Convolution
How are convolutions applied over the RGB channels
For an input image of size n x n and a filter of size f x f over N channels, the image
and filters are converted into arrays of sizes n x n x N and f x f x N, respectively,
and the feature map produced is of size (n - f + 1) x (n - f + 1), assuming
stride = 1.
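A minimal NumPy sketch of this channel-wise convolution (sizes illustrative): each channel is convolved with its own slice of the filter, and the per-channel products are summed into a single 2-D feature map.

    import numpy as np

    def conv2d_multichannel(image, kernel):
        # image: (n, n, N), kernel: (f, f, N) -> feature map: (n-f+1, n-f+1)
        n, f = image.shape[0], kernel.shape[0]
        out = np.zeros((n - f + 1, n - f + 1))
        for i in range(out.shape[0]):
            for j in range(out.shape[1]):
                # multiply patch and kernel across all channels, then sum
                out[i, j] = np.sum(image[i:i + f, j:j + f, :] * kernel)
        return out

    fmap = conv2d_multichannel(np.random.rand(5, 5, 3), np.random.rand(3, 3, 3))
    print(fmap.shape)  # (3, 3): still a 2-D array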
CNN - Convolution
How are convolutions applied to more than one filter
In reality, the CNN model needs to use multiple filters at the same time to observe
and extract key features
CNN - Convolution
How are convolutions applied to more than one filter
In the image below, you see that applying convolution using three filters over the RGB
channels produces three arrays. Thus, for an input image of size n x n and filters
of size f x f over N channels and F filters, the feature map produced is of size
(n - f + 1) x (n - f + 1) x F, assuming that stride = 1.
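For example (illustrative numbers), a 5x5 RGB input (N = 3) with 3x3 filters and F = 10 filters produces a feature map of size (5 - 3 + 1) x (5 - 3 + 1) x 10 = 3x3x10.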
CNN - Pooling layers
Pooling layers
To further reduce the size of the feature map generated from convolution, pooling is
applied before further processing. This helps to further compress the dimensions of
the feature map. For this reason, pooling is also referred to
as subsampling/downsampling.
• Pooling is the process of summarizing the features within a group of cells in the
feature map.
• This summary of cells can be acquired by taking the maximum, minimum, or
average within a group of cells.
• Each of these methods is referred to as max, min, and average pooling,
respectively.
CNN - Pooling layers
Pooling layers
• Max pooling: reports the maximum output within a rectangular neighborhood
• Average pooling: reports the average output of a rectangular neighborhood
• Pooling layers reduce the spatial size of the feature maps
Reduce the number of parameters, prevent overfitting
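A minimal NumPy sketch of max pooling with a 2x2 window and stride 2 (a common setting; sizes illustrative): each window is summarized by its maximum value.

    import numpy as np

    def max_pool(fmap, size=2):
        # Summarize each size x size window by its maximum value
        n = fmap.shape[0] // size
        out = np.zeros((n, n))
        for i in range(n):
            for j in range(n):
                window = fmap[i*size:(i+1)*size, j*size:(j+1)*size]
                out[i, j] = window.max()
        return out

    print(max_pool(np.random.rand(4, 4)).shape)  # (2, 2): spatial size halved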
CNN - Flattening
Flattening
You can think of a CNN as a sequence of steps that are performed to effectively capture
the important aspects of an image before applying an ANN to it. In the previous steps,
you saw the different transformations that are applied to the original image.
CNN - Flattening
Flattening
The final step in this process is to make the outcomes of the CNN compatible with an
ANN. The inputs to an ANN should be in the form of a vector. To support that, flattening
is applied, which is the step that converts the multidimensional array into an n x 1 vector,
as shown previously (n = total number of elements in the feature map).
However, in a CNN, flattening is applied to the feature maps that result from each filter.
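A one-line NumPy sketch of flattening (shapes illustrative):

    import numpy as np

    fmaps = np.random.rand(3, 3, 10)   # e.g., ten 3x3 feature maps
    vector = fmaps.reshape(-1)         # flatten to a vector, n = 90 elements
    print(vector.shape)                # (90,)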
CNN - Full connection
Full connection: a simple convolutional network
The following image shows a sample CNN that is built to recognize an apple image.
To begin, the original input image is represented using its pixel values over the RGB
channels. Convolutions and pooling are then applied to help identify feature maps
and compress the data further.
Note that convolutions and pooling can be applied many times within a CNN. The performance
of the resulting model depends on finding the right number of times these steps
should be repeated; a minimal sketch of such a stack follows below.
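A minimal Keras sketch of such a convolution-pooling stack (an illustration, not the exact model from the slides; the layer sizes, input shape, and five output classes are assumptions):

    import tensorflow as tf
    from tensorflow.keras import layers

    # Illustrative stack: conv + pool repeated, then flatten and full connection
    model = tf.keras.Sequential([
        layers.Input(shape=(64, 64, 3)),          # RGB input (size assumed)
        layers.Conv2D(32, 3, activation="relu"),  # convolution
        layers.MaxPooling2D(2),                   # pooling
        layers.Conv2D(64, 3, activation="relu"),  # conv + pool applied again
        layers.MaxPooling2D(2),
        layers.Flatten(),                         # flattening
        layers.Dense(64, activation="relu"),      # full connection
        layers.Dense(5, activation="softmax"),    # class probabilities (5 assumed)
    ])
    model.summary()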
Convolutional Neural Networks (CNNs)
Convolutional Neural Networks
[Figure: example CNN for classifying scenes as Living Room, Bedroom, Kitchen, Bathroom, or Outdoor; stacked Conv layers (64 to 512 channels) interleaved with Max Pool layers]
Recurrent Neural Networks (RNNs)
• Recurrent NNs are used for modeling sequential data or time series and data
with varying lengths of inputs and outputs
Videos, text, speech, DNA sequences, human skeletal data
• Like feedforward NNs and CNNs, RNNs utilize training data to learn. They are distinguished
by their "memory," as they take information from prior inputs to influence the current
input and output.
• RNNs introduce recurrent connections between the neurons
This allows processing sequential data one element at a time by selectively passing
information across a sequence
Memory of the previous inputs is stored in the model's internal state and affects the
model predictions
Can capture correlations in sequential data
• RNNs use backpropagation-through-time for training
• RNNs are more sensitive to the vanishing gradient problem than CNNs
• Examples: Siri, voice search, and Google Translate
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• RNNs use the same set of weights W and U across all time steps
A sequence of hidden states h_1, h_2, ..., h_T is learned, which represents the memory
of the network
The hidden state at step t, h_t, is calculated based on the previous hidden state
h_(t-1) and the input at the current step x_t, i.e., h_t = f(W h_(t-1) + U x_t)
The function f is a nonlinear activation function, e.g., ReLU or tanh
• RNN shown unrolled over time for an input sequence x1, x2, x3
[Figure: unrolled RNN diagram; slide credit: Param Vir Singh, Deep Learning]
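A minimal NumPy sketch of the recurrence h_t = f(W h_(t-1) + U x_t) rolled over a short sequence (all sizes and values are illustrative; the same W and U are shared across steps):

    import numpy as np

    hidden, inputs, steps = 4, 3, 5       # illustrative sizes
    W = np.random.randn(hidden, hidden)   # hidden-to-hidden weights (shared)
    U = np.random.randn(hidden, inputs)   # input-to-hidden weights (shared)
    h = np.zeros(hidden)                  # initial hidden state

    for t in range(steps):                # same W, U at every time step
        x_t = np.random.randn(inputs)     # input at step t
        h = np.tanh(W @ h + U @ x_t)      # update the internal state (memory)
    print(h)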
Recurrent Neural Networks (RNNs)
Recurrent Neural Networks
• RNNs can have one or many inputs and one or many outputs
Image captioning example (one input, many outputs): an input image is captioned "A person riding a motorbike on dirt road"
Machine translation example (many inputs, many outputs): "Happy Diwali" is translated to "शुभ दीपावली" (Hindi for "Happy Diwali")
LSTM Networks
Recurrent Neural Networks
• LSTM cell
Input gate, output gate, forget gate, memory cell
LSTMs can learn long-term correlations within data sequences
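For reference, one standard formulation of these gates (a common variant, not necessarily the exact notation pictured in the slides), where [h_(t-1), x_t] is the concatenation of the previous hidden state and the current input, σ is the sigmoid function, and ⊙ is elementwise multiplication:
f_t = σ(W_f [h_(t-1), x_t] + b_f)   (forget gate)
i_t = σ(W_i [h_(t-1), x_t] + b_i)   (input gate)
o_t = σ(W_o [h_(t-1), x_t] + b_o)   (output gate)
c_t = f_t ⊙ c_(t-1) + i_t ⊙ tanh(W_c [h_(t-1), x_t] + b_c)   (memory cell)
h_t = o_t ⊙ tanh(c_t)
The forget gate controls what is retained in the cell state c_t, which is what allows the LSTM to carry correlations across long stretches of a sequence.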