Unit-2
Deep Learning
Introduction to CNNs
Let's see the working flow of a CNN, based on the image of a flower:-
A Convolutional Neural Network is also called a ConvNet.
Working flow of CNN in pictorial representation:-
Now, suppose I want to see the output in probability format; in that case, I will use an activation function
before the output. Since I am dealing with the house and the tree simultaneously, and both are important
features, I will use softmax. If I were dealing with only a house or only a tree, I would use sigmoid; but here,
because both classes matter, softmax is the right choice.
Here, the machine will automatically calculate the raw score of each feature. After that, it will apply the
softmax formula to find the probabilities.
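The softmax step above can be sketched in a few lines. The raw scores below are hypothetical, just to show how the formula turns them into probabilities that sum to 1:

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability, then exponentiate and normalize.
    exp = np.exp(logits - np.max(logits))
    return exp / exp.sum()

# Hypothetical raw scores for the two features: house and tree.
raw = np.array([2.0, 1.0])
probs = softmax(raw)
print(probs)        # two probabilities that sum to 1
print(probs.sum())  # 1.0
```

The larger raw score always maps to the larger probability, which is why the machine can rank the features directly from these outputs.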
We can draw either of them; it's up to us.
Now, let's go into more detail:-
In the previous example, we discussed an image of a house and a tree with a background. Now, suppose I
have an image of the digit '8' with no background, just the '8' itself. In this case, how will convolution break
the image into smaller parts, and how will pooling select the relevant features?
You have a plain image with the digit "8". Even though it's a single object with no
background, convolution and pooling still apply effectively. The goal here is to extract
features from the shape and structure of the "8".
Input:- First, the network takes the image as input.
Convolution:-
The image is divided into smaller sections, called pixels, and
each pixel has a value.
• In grayscale images, each pixel has a single intensity value representing shades of gray, typically
ranging from 0 (black) to 255 (white).
• In colorful images, each pixel is represented by three values corresponding to the RGB channels (Red,
Green, and Blue), each with its own intensity ranging from 0 to 255. These values combine to form the
pixel's color.
• How to combine these three values to form a single value:-
Since we have a black-and-white image, the pixel values remain as they are (a single value per pixel).
The CNN model will automatically read the number in each pixel; if the model is dealing with a colour image,
it will automatically apply a formula to convert the three numbers into a single number.
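The slide leaves the conversion formula to the model; in practice, a common fixed convention is the luminosity weighting shown below (the ITU-R BT.601 weights — an assumption here, not something the slide specifies):

```python
# One hypothetical RGB pixel (each channel 0-255).
r, g, b = 120, 200, 50

# Common luminosity convention: green contributes most, blue least,
# roughly matching how the human eye perceives brightness.
gray = 0.299 * r + 0.587 * g + 0.114 * b
print(round(gray, 2))  # 158.98 -> one grayscale value for the pixel
```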
After getting all the numbers present in each pixel, it will automatically generate the kernel.
•In a Convolutional Neural Network (CNN), the kernel values are learned by the model during the training process.
•Initially, the kernel values (weights) are randomly initialized.
•During training, these kernel weights are updated through backpropagation to minimize the error (loss) and better extract
features from the input data.
•Over time, the model "learns" the optimal kernel values that help identify relevant features such as edges, patterns, or other
characteristics in the image.
For small images (like 3x3), use a 2x2 kernel to capture local features efficiently. For larger images, use a 3x3 or 5x5 kernel.
Larger kernels increase computation and may cause overfitting. A 3x3 kernel is common in many CNNs because it balances
feature extraction and efficiency, while 5x5 can be used for more complex patterns but should be used cautiously.
Although the model automatically generates the values of the kernel (weights), we have to define whether we should
implement a 2 x 2 or 3 x 3 kernel. In most cases, a 3 x 3 kernel is recommended. As:-
Now let's see how the kernel is used with the pixel values to generate the Convolved Feature, also called an Activation Map or Feature Map.
Suppose we have an input image of 4 x 4, and we want to use a 2 x 2 kernel:-
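The element-wise multiply-and-sum can be sketched directly. The input values and kernel weights below are made up for illustration (real kernel weights are learned, as noted above):

```python
import numpy as np

def convolve2d(image, kernel, stride=1):
    """Slide the kernel over the image; at each step, multiply
    element-wise with the patch underneath and sum the result."""
    kh, kw = kernel.shape
    out_h = (image.shape[0] - kh) // stride + 1
    out_w = (image.shape[1] - kw) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = image[i*stride:i*stride+kh, j*stride:j*stride+kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# Hypothetical 4x4 input and 2x2 kernel.
img = np.array([[1, 2, 3, 0],
                [4, 5, 6, 1],
                [7, 8, 9, 2],
                [0, 1, 2, 3]], dtype=float)
kernel = np.array([[1, 0],
                   [0, -1]], dtype=float)
fmap = convolve2d(img, kernel)
print(fmap.shape)  # (3, 3): a 4x4 input with a 2x2 kernel and stride 1
```

Note how the feature map shrinks from 4x4 to 3x3: the kernel can only sit at 3 positions along each axis without padding.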
The behavior of the convolutional layer is primarily governed by the following main hyperparameters, which we set ourselves:
1. Kernel Size: The kernel size is the size of the filter window that slides over the image (e.g., 3x3, 5x5). It defines
how many pixels the kernel looks at during each step of the convolution operation.
2. Stride: Stride refers to how far the kernel moves with each step.
For example, with a stride of 1, the kernel moves one pixel at a time, covering every pixel in the image.
With a stride of 2, the kernel skips every other pixel, making the output smaller and faster to compute.
A stride of 1 is commonly used because the kernel captures more detail, which helps prevent underfitting.
3. Padding: Padding involves adding extra pixels (usually zeros) around the edges of the image to ensure that the
kernel can fully process all parts of the image, including the edges.
•Padding helps preserve the dimensions of the output feature map and allows the kernel to cover all areas of the input
image.
4. Number of Filters/Depth: If you use 3 filters, the convolutional layer will produce 3 different feature maps, each
representing a different characteristic of the image.
After the convolution layer creates the feature map by breaking the image into smaller parts and performing
element-wise multiplication and summation, the resulting feature map is passed to the activation layer (ReLU),
where the edges are analyzed (via the pixel values with positive numbers). ReLU sets all negative values in the
feature map to zero, introducing non-linearity into the model and enabling it to learn complex patterns.
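ReLU is a one-liner; the feature-map values below are hypothetical:

```python
import numpy as np

# ReLU replaces every negative value in the feature map with zero
# and leaves positive values untouched.
feature_map = np.array([[-4.0,  2.0],
                        [ 3.0, -1.0]])
activated = np.maximum(feature_map, 0)
print(activated)  # negatives become 0; 2.0 and 3.0 pass through
```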
The resulting feature map is forwarded to the pooling layer.
• The goal of the pooling layer is to retain the most relevant information and discard irrelevant details.
• In this scenario, the pooling layer processes the pixels of the feature map directly.
The two primary types of pooling are max pooling and average pooling.
1. Max Pooling:-
Max Pooling (2x2) with Stride 2 in 4 x 4 input:-
Coming to average pooling:-
After seeing both pooling examples, here are our observations:-
In most cases, max pooling is chosen because pixels with larger values are considered more relevant. Larger
values indicate stronger activations, which are more important for feature detection. Max pooling ignores weak
activations and focuses on the strongest features.
In contrast, average pooling computes the average by including even noisy or irrelevant values.
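The contrast between the two pooling modes is easy to see side by side. The 4x4 feature map below is made up for illustration; both versions use a 2x2 window with stride 2:

```python
import numpy as np

def pool2d(fmap, size=2, stride=2, mode="max"):
    """Slide a pooling window over the feature map, keeping either
    the strongest activation (max) or the mean of the window (average)."""
    out_h = (fmap.shape[0] - size) // stride + 1
    out_w = (fmap.shape[1] - size) // stride + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            window = fmap[i*stride:i*stride+size, j*stride:j*stride+size]
            out[i, j] = window.max() if mode == "max" else window.mean()
    return out

fmap = np.array([[1, 3, 2, 4],
                 [5, 6, 1, 2],
                 [7, 2, 9, 1],
                 [3, 4, 1, 8]], dtype=float)
print(pool2d(fmap, mode="max"))      # keeps only the strongest value per window
print(pool2d(fmap, mode="average"))  # blends every value in the window
```

Notice that max pooling keeps the 6, 7 and 9 (the strong activations), while average pooling dilutes them with the weaker neighbours — exactly the trade-off described above.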
Suppose we want to express the output in probability format; in that case, we can either
use sigmoid or softmax. Since we are dealing with one feature, we will use sigmoid as
the activation function in the output.
The model will automatically generate a raw value based on the data it has. Using that value, we can
substitute it into the activation function to obtain the probability. For example, suppose the raw value is 9:-
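Plugging the raw value 9 from the example into the sigmoid formula:

```python
import math

def sigmoid(x):
    # Sigmoid squashes any raw value into the range (0, 1).
    return 1 / (1 + math.exp(-x))

raw = 9  # the raw value from the example above
print(round(sigmoid(raw), 4))  # 0.9999 -> almost certainly the positive class
```

A raw value of 0 would give exactly 0.5, and large negative values push the probability toward 0.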
Till now, we have seen 2 methods to classify an image by implementing CNN.
Till now, in the convolution step, we generate either a 3x3 kernel or a 5x5 kernel, with the values automatically
generated. Let me explain: to apply the kernel in the convolution method, we either use correlation (where the
kernel matrix is used as it is) or convolution (where the kernel matrix is flipped). Then, we apply the
activation function as usual on the activation map. In most situations, we use convolution only. Even if we use
correlation, we still use convolution alongside it.
Convolution and Correlation
We primarily use convolution in most cases because of its unique advantages and properties,
especially in the context of Convolutional Neural Networks (CNNs).
Here’s why:-
• 1. Feature Extraction: Convolution flips the kernel, which helps in better capturing patterns like edges, textures, and
other local features in an image.
• 2. Translation Invariance: Convolution ensures that features are detected regardless of their position in the input.
Flipping the kernel helps achieve this by analyzing patterns more comprehensively.
• 3. Efficient Computation: Convolution is computationally efficient as it reduces the dimensions of the input while
preserving essential features. This makes it a preferred choice for feature reduction before applying pooling.
• 4. Edge Detection: Convolution is excellent for edge detection due to the way kernels are flipped and applied. Edges are
crucial in understanding the structure of images, and convolution captures this effectively.
• 5. Compatibility with Activation Functions: After applying convolution, the output is often passed through an
activation function like ReLU to introduce non-linearity, which improves the model’s ability to learn complex patterns.
• 6. Correlation is Less Informative: Correlation does not flip the kernel, so it might miss some spatial relationships that
convolution can capture. While correlation can be used, convolution is generally preferred because of its robustness in
detecting features.
• 7. Standard Practice: Most pre-trained models and standard CNN architectures are designed with convolution layers,
making it a widely adopted practice in machine learning and computer vision.
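The flip is the only difference between the two operations, which a small sketch makes concrete. (Worth noting as a hedge: many deep-learning frameworks actually compute correlation in their "convolution" layers, since for learned kernels the flip makes no practical difference.)

```python
import numpy as np

def correlate(image, kernel):
    """Cross-correlation: slide the kernel as-is (no flip)."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def convolve(image, kernel):
    """True convolution: flip the kernel both ways, then correlate."""
    return correlate(image, kernel[::-1, ::-1])

img = np.array([[1, 2, 3],
                [4, 5, 6],
                [7, 8, 9]], dtype=float)
k = np.array([[1, 0],
              [0, -1]], dtype=float)
print(correlate(img, k))  # kernel applied as-is
print(convolve(img, k))   # same as correlating with the 180-degree-flipped kernel
```

For a kernel that is symmetric under a 180-degree rotation the two outputs coincide; the asymmetric kernel here makes them differ.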
Let’s see how we use convolution and correlation together (especially for edge detection):-
Suppose we have an original picture of "8":-
Pooling Layer
CNN Architecture
So, till now, we saw how:-
Detection and Segmentation
Image Classification
Image (or Text) Classification and Hyper-parameter tuning.
Advanced CNNs for computer vision.
Deep Learning Unit Two Power Point Presentation