Deep Learning
Unit - 7
Concept of Localization
Definition
Localization involves identifying the exact position of one or more objects in an image by placing
bounding boxes or masks around them.
Purpose
While classification labels an entire image, localization focuses on identifying specific areas where
objects are present.
Types of Localization Algorithms
Single-Shot Detectors (e.g., YOLO)
Description: Directly predicts bounding boxes and class probabilities for multiple objects in a
single pass over the image.
Faster R-CNN
1. A Region Proposal Network (RPN) first proposes candidate object regions.
2. A separate network refines these regions for precise classification and localization.
Definition
Regression predicts continuous numerical outputs from input data by finding the best-fit line
(or curve) that predicts the output as accurately as possible. It is a statistical, supervised
learning approach used to analyze the relationship between a dependent variable and one or
more independent variables.
Example
Predicting house prices, temperatures, or stock market trends.
Key Characteristics
Objective: A regression algorithm maps the input variable (x) to a continuous
output variable (y).
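As a minimal sketch of this mapping, the best-fit line y = w·x + b can be found in closed form with ordinary least squares (the toy data below is hypothetical; real projects would use a library such as scikit-learn):

```python
# Minimal least-squares linear regression: fit y = w*x + b to sample points.

def fit_line(xs, ys):
    """Return slope w and intercept b minimizing squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # w = covariance(x, y) / variance(x)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    w = cov / var
    b = mean_y - w * mean_x
    return w, b

# Toy data lying exactly on the line y = 2x + 1.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
w, b = fit_line(xs, ys)
print(w, b)  # slope 2.0, intercept 1.0
```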
Applications
Embeddings
Definition
Embeddings map discrete or high-dimensional data into dense, low-dimensional vectors so
that semantically similar items lie close together in the vector space.
Examples
1. Word Embeddings (e.g., Word2Vec, GloVe, BERT) capture relationships between words
based on their context in a corpus.
2. Image Embeddings: Map image features into vector spaces for comparison or clustering.
Purpose:
DrLIM (Dimensionality Reduction by Learning an Invariant Mapping) creates embeddings
that stay consistent (invariant) even if the data is rotated, scaled, or translated.
It learns to group similar data points together, no matter how they are transformed, making it
easier for models to analyze and work with such data.
Process:
For similar pairs of inputs: Reduces the distance between their embeddings.
For dissimilar pairs: Increases the distance beyond a margin.
Learns relationships between data points, making it especially effective for structured
tasks.
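The two rules above (shrink the distance for similar pairs, grow it past a margin for dissimilar pairs) are exactly the contrastive loss used by DrLIM. A plain-Python sketch for a single pair (the embeddings and margin below are illustrative):

```python
import math

def contrastive_loss(emb1, emb2, similar, margin=1.0):
    """DrLIM-style contrastive loss for one pair of embeddings.
    similar=True pulls the pair together; similar=False pushes them
    apart until their distance exceeds `margin`."""
    d = math.dist(emb1, emb2)  # Euclidean distance
    if similar:
        return d ** 2                     # penalize any separation
    return max(0.0, margin - d) ** 2      # penalize only pairs inside the margin

# A similar pair that is already close incurs little loss...
low = contrastive_loss([0.0, 0.0], [0.1, 0.0], similar=True)
# ...while a dissimilar pair inside the margin is penalized.
high = contrastive_loss([0.0, 0.0], [0.1, 0.0], similar=False)
print(low, high)
```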
Applications:
Supervised Learning
Clustering and Retrieval
Definition
Inverse problems involve reconstructing inputs (causes) from given outputs (effects).
Instability: Small changes in the output can lead to large variations in the
reconstructed input.
Applications
Example Problems
Definition
In deep learning, traditional methods often work in Euclidean domains, where data lies in flat,
regular spaces like grids (e.g., images or time series). However, many real-world data types, such as
graphs, manifolds, or irregular networks, exist in non-Euclidean domains, which have complex
structures or geometries.
GNNs update a node by using information from its neighbors through the edges.
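That neighbor-based update can be sketched as one round of message passing; here nodes carry scalar features and the aggregation is a simple mean (real GNNs use learned weight matrices and nonlinearities, so this is only a structural sketch):

```python
# One round of message passing: each node's new feature is the mean of
# its own feature and its neighbors' features.

def message_pass(features, adjacency):
    """features: {node: value}; adjacency: {node: [neighbor nodes]}."""
    updated = {}
    for node, neighbors in adjacency.items():
        pooled = [features[node]] + [features[n] for n in neighbors]
        updated[node] = sum(pooled) / len(pooled)
    return updated

# A 3-node path graph: a -- b -- c
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}
feats = {"a": 0.0, "b": 3.0, "c": 6.0}
out = message_pass(feats, graph)
print(out)  # node "b" averages 0, 3 and 6 -> 3.0
```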
Applications
RNNs are a type of neural network designed to process sequential data, where the current output
depends on previous computations. They are particularly useful for tasks involving time series,
language, and any data with temporal or sequential relationships. Feedback loops enable the
network to retain memory of previous computations.
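The feedback loop described above can be sketched as a scalar "vanilla" RNN step, h_t = tanh(w_x·x_t + w_h·h_{t-1} + b); the weights here are arbitrary illustrative values, not trained ones:

```python
import math

def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """One step of a vanilla RNN with scalar state.
    The recurrent term w_h * h_prev is the memory feedback loop."""
    return math.tanh(w_x * x + w_h * h_prev + b)

sequence = [1.0, 0.0, -1.0]
h = 0.0
for x in sequence:
    h = rnn_step(x, h)   # state carries information from earlier inputs
print(h)
```

Because tanh squashes the state into (-1, 1), the final state is bounded no matter how long the sequence is; repeated multiplication by w_h is also the root of the vanishing/exploding gradient problem discussed next.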
Challenges of RNNs
Types of RNNs
LSTM (Long Short-Term Memory): Introduces gates to control information flow, mitigating the vanishing gradient problem.
Applications
Deep Learning is a subfield of machine learning inspired by the structure and function of the human
brain, particularly artificial neural networks (ANNs). It uses multi-layered neural networks to model
complex patterns in data.
1. Deep Neural Networks: Utilizes multiple hidden layers between input and output layers.
2. Feature Learning: Automatically extracts and learns features from raw data.
Example:
To classify images of cats and dogs, a convolutional neural network (CNN) learns features like edges,
textures, and higher-level patterns directly from the images without manual feature engineering.
Practical Applications:
1. Image Recognition: Used in facial recognition, medical imaging, and autonomous vehicles.
2. Natural Language Processing (NLP): Powers chatbots, translation tools, and sentiment
analysis.
3. Speech Recognition: Converts spoken words to text, as used in virtual assistants like Alexa
and Siri.
Bayesian Learning
Bayesian Learning is a probabilistic approach to learning based on Bayes' Theorem. It updates the
probability of a hypothesis as more evidence or data becomes available.
Bayes' Theorem: Bayes' Theorem calculates the probability of an event A occurring given that
another event B has occurred:
P(A|B) = P(B|A) · P(A) / P(B)
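A quick numeric illustration of P(A|B) = P(B|A)·P(A) / P(B) on the classic diagnostic-test setup (all numbers below are hypothetical):

```python
# P(disease) = 1%, sensitivity P(pos|disease) = 99%,
# false-positive rate P(pos|no disease) = 5%.

def bayes_posterior(prior, likelihood, false_pos):
    """P(A|B) = P(B|A) P(A) / P(B), with P(B) via total probability."""
    evidence = likelihood * prior + false_pos * (1 - prior)
    return likelihood * prior / evidence

posterior = bayes_posterior(prior=0.01, likelihood=0.99, false_pos=0.05)
print(round(posterior, 3))  # ~0.167: a positive test still leaves only ~17%
```

This shows the Bayesian update in action: strong evidence (a positive test) shifts a 1% prior to about 17%, not to certainty, because false positives dominate when the disease is rare.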
Practical Applications:
Decision Surfaces are boundaries that separate different classes in a feature space. They help in
visualizing and understanding how a model distinguishes between classes.
Characteristics:
Linear Decision Surface: A straight line/hyperplane, formed by simple models such as
Logistic Regression or a linear SVM.
Non-Linear Decision Surface: Formed by complex models like Neural Networks and SVMs
(with kernels).
Example:
Practical Applications:
Unit - 2
A linear classifier tries to draw a straight line (or a plane/hyperplane) to separate data points belonging
to different classes. These classifiers assume that the decision boundary between different classes
can be represented as a straight line (in 2D), plane (in 3D), or hyperplane (in higher dimensions). It is
widely used in machine learning for classification tasks because of its simplicity, efficiency, and
interpretability.
1. Logistic Regression
Logistic regression is a fundamental statistical model used for binary classification. Logistic
regression is a type of linear classifier, but it models the probability of a data point belonging to a
particular class using the logistic function (sigmoid). It gives probabilities between 0 and 1.
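The sigmoid mapping from a linear score to a probability can be sketched directly (weights and bias below are arbitrary illustrative values, not fitted ones):

```python
import math

def sigmoid(z):
    """Logistic function: squashes any real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability that the point belongs to the positive class:
    sigmoid of the linear score w . x + b."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return sigmoid(z)

p = predict_proba([2.0, 1.0], weights=[0.5, -0.25], bias=0.0)
print(p)                      # a value strictly between 0 and 1
label = 1 if p >= 0.5 else 0  # threshold at 0.5 for the class decision
```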
Applications:
Support Vector Machines (SVMs) are another popular linear classifier that aims to find the
maximum-margin hyperplane that best separates the data into classes.
How it works:
o The SVM tries to find the hyperplane that maximizes the margin between the two
classes. The margin is the distance between the hyperplane and the nearest data
points from either class (called support vectors).
Key Features:
o SVMs are often used in both linear and non-linear classification tasks. For linear
classification, it directly applies a linear hyperplane. For non-linear classification, the
kernel trick is used to map data into higher dimensions where a linear decision
boundary can be found.
Applications:
o Handwriting recognition
3. Perceptron
The Perceptron is one of the simplest types of neural networks and can be viewed as a linear
classifier. It is a supervised learning method.
Applications:
Hinge Loss:
Hinge loss, also known as max-margin loss, is a loss function primarily used in Support Vector
Machines (SVMs) for classification tasks. Hinge loss is widely used for binary classification, though it
can be adapted to multi-class classification as well.
1. Correct Classification: The predicted class label should be the same as the true label.
2. Margin: The classifier’s decision boundary should be sufficiently far from the data points of
both classes.
The hinge loss works by penalizing points that are on the wrong side of the margin or are too close to
the decision boundary. The max(0, .) part ensures that the loss is zero when the classifier is correct
and confidently far from the decision boundary
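The max(0, ·) behavior can be sketched for a single sample with label y ∈ {+1, −1} and raw score w·x + b (scores below are illustrative):

```python
def hinge_loss(score, y):
    """Hinge loss for one sample: max(0, 1 - y * score).
    Zero when the sample is on the correct side with margin >= 1."""
    return max(0.0, 1.0 - y * score)

print(hinge_loss(2.5, +1))   # correct and beyond the margin -> 0.0
print(hinge_loss(0.3, +1))   # correct but inside the margin -> 0.7
print(hinge_loss(-1.0, +1))  # wrong side of the boundary   -> 2.0
```

The three cases mirror the text: confident correct predictions cost nothing, while points inside the margin or on the wrong side are penalized in proportion to how badly they violate it.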
Unit – 6
1. Transformer Architectures: Used in NLP and vision tasks (e.g., BERT, GPT, ViT); they use
self-attention mechanisms to focus on important parts of the input.
2. Self-Supervised Learning (SSL): Learning representations without labeled data (e.g., SimCLR,
BYOL).
3. Graph Neural Networks (GNNs): Applied to graph-structured data (e.g., node classification,
recommendation systems).
4. Neural Architecture Search (NAS): Automates the design of neural networks; instead of
manually tuning architectures, it uses search algorithms to find an optimal model.
5. Federated Learning: Federated learning allows training a model on multiple devices without
sharing raw data, sending only updates to improve the model while maintaining privacy.
6. Neural Radiance Fields (NeRF): NeRF is a neural network-based method for generating 3D
scenes from 2D images.
9. Generative Models and Diffusion Models: Generative and Diffusion Models create high-
quality content, with Diffusion Models often outperforming GANs.
10. Multimodal Learning: Combining different types of data (e.g., CLIP, DALL·E).
Residual Networks (ResNet)
Problem:
Vanishing Gradient
When gradients become too small, earlier layers in the network stop learning, making it hard
for the model to improve.
Exploding Gradient
When gradients become too large, the training process becomes unstable and the model
doesn’t work properly.
Traditional deep networks (e.g., VGG) show increased training and testing errors as
their depth increases beyond a certain point.
ResNet solves this issue by introducing Residual Blocks that use skip connections to
bypass some layers, ensuring better gradient flow and learning.
Residual Block
The input x goes through a series of convolution (conv) and ReLU activation,
which gives us F(x).
Then, the result F(x) is added to the original input x. We call this H(x) = F(x) + x.
In regular CNNs, H(x) would just be F(x) (no addition to the input).
Now Instead of learning the full mapping H(x), ResNet forces the network to learn a
residual function F(x):
F(x)=H(x)−x⟹H(x)=F(x)+x
Skip Connection:
o A skip connection (or shortcut) directly connects the input of a residual block
to its output, bypassing one or more layers.
o This bypass allows the gradient to flow directly through the skip connection
during backpropagation, mitigating the vanishing gradient problem.
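The identity H(x) = F(x) + x can be sketched with a toy residual block; the "layers" here are stand-in elementwise operations with hypothetical near-zero weights, chosen to show that even when F(x) contributes almost nothing, the skip connection carries the input through intact:

```python
def relu(v):
    return [max(0.0, x) for x in v]

def layer(v, weight):
    """Stand-in for a conv layer: elementwise scaling (hypothetical weights)."""
    return [weight * x for x in v]

def residual_block(x, w1=0.01, w2=0.01):
    """H(x) = F(x) + x: the skip connection adds the input back, so even
    a near-zero F(x) leaves an identity path for activations and gradients."""
    fx = layer(relu(layer(x, w1)), w2)   # F(x) = conv -> ReLU -> conv
    return [f + xi for f, xi in zip(fx, x)]

x = [1.0, -2.0, 3.0]
out = residual_block(x)
print(out)  # close to x itself because F(x) is tiny
```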
ResNet Architecture
The ResNet architecture is inspired by earlier networks like VGG-19, but it includes shortcut
connections that transform it into a residual network.
Input ---> [Conv Layer] ---> [Batch Norm] ---> [ReLU] ---> [Conv Layer] ---> [Batch Norm] ---> (+ Skip Connection) ---> Output
  \_________________________________________________________________________________________/
ResNet
ResNet is built by stacking residual blocks.
Each residual block has two 3x3 convolution layers.
As we go deeper, the network:
o Doubles the number of filters to capture more details.
o Reduces image size using stride 2 (shrinks height and width).
There’s an extra convolution layer at the start of the network.
No fully connected (FC) layers except the last one, which produces the final output.
ResNet Versions
Comes in depths like 34, 50, 101, or 152 layers for ImageNet tasks.
For deeper networks (ResNet-50 and beyond), it uses a bottleneck layer for
efficiency (similar to GoogLeNet).
In a traditional neural network, each layer is connected to the next one, with the output from
one layer serving as the input to the next. However, this deep architecture can cause problems
like the vanishing gradient problem and difficulty in training very deep networks.
Skip connections are a technique where outputs from earlier layers are passed directly to later
layers, bypassing intermediate layers. These connections help improve gradient flow and
reduce the vanishing gradient problem.
1. Identity Skip Connections: This is the simplest form where the input X is directly
added to the output of the convolutional layers F(X) without any change.
2. Projection Shortcut: If the dimensions of the input and output don't match a
projection (such as a 1x1 convolution) is used to match the dimensions before adding
the skip connection.
3. Bottleneck Architecture: In very deep networks, a "bottleneck" structure is often
used where a 1x1 convolution is applied to reduce the number of features, followed
by a 3x3 convolution, and then another 1x1 convolution to restore the original number
of features.
Input ---> [Conv Layer] ---> [Conv Layer] ---> (+ Skip Connection) ---> Output
  \___________________________________________/
1. Deeper Networks: Without skip connections, training a very deep network (e.g.,
100+ layers) is often impractical due to problems with vanishing gradients. Skip
connections allow the network to be trained effectively even with hundreds of layers.
2. Better Gradient Flow: The addition of the skip connection makes the network more
stable during backpropagation, as gradients can flow directly through the skip
connection.
3. Improved Performance: Skip connections help the network generalize better and
often lead to improved performance on tasks like image classification and
segmentation.
Fully CNN:
A Fully Connected Convolutional Neural Network (CNN) is a type of neural network that combines
the principles of both Convolutional Neural Networks (CNNs) and Fully Connected (FC) layers. This
architecture is typically used for tasks like image classification, object detection, and segmentation.
Here’s a detailed breakdown of how it works:
1. Convolutional Layers (Feature Extraction)
Convolution is the first step in a CNN where filters (kernels) are applied to the input image.
These filters slide across the image to detect patterns like edges, textures, or colors.
The Convolutional Layer is responsible for extracting features like lines, shapes, or complex
structures by performing the convolution operation.
It’s important to note that filter weights are shared across the entire image, making the
process more efficient than fully connected networks that learn separate weights for each
pixel.
After applying convolution, the output is passed through an activation function, such as
ReLU (Rectified Linear Unit), to introduce non-linearity.
The ReLU function is often used because it’s computationally efficient and helps the model
learn complex patterns.
Pooling layers are used to reduce the dimensionality of the data after each convolution,
which reduces the number of parameters and computation.
The most common pooling method is Max Pooling, which selects the maximum value from a
patch of the image (usually 2x2 or 3x3).
Pooling helps retain the most important features while discarding irrelevant details.
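The 2x2 max-pooling step described above can be sketched on a small feature map (the numbers are arbitrary activations):

```python
def max_pool_2x2(image):
    """2x2 max pooling with stride 2 on a 2-D list: keeps the strongest
    activation in each patch and halves both height and width."""
    pooled = []
    for i in range(0, len(image), 2):
        row = []
        for j in range(0, len(image[0]), 2):
            patch = [image[i][j], image[i][j + 1],
                     image[i + 1][j], image[i + 1][j + 1]]
            row.append(max(patch))
        pooled.append(row)
    return pooled

feature_map = [[1, 3, 2, 0],
               [4, 2, 1, 1],
               [0, 1, 5, 6],
               [2, 2, 7, 8]]
print(max_pool_2x2(feature_map))  # [[4, 2], [2, 8]]
```

A 4x4 map becomes 2x2: a 4x reduction in values, while each surviving number is still the strongest response in its region.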
After several convolutional and pooling layers, the data is flattened into a one-dimensional
vector. This step is necessary because the Fully Connected (FC) Layers require input in this
form.
The fully connected layers are traditional dense layers found in any neural network. These
layers connect every neuron to every other neuron in the next layer.
Each neuron in the fully connected layer computes a weighted sum of inputs and applies an
activation function (like ReLU or Softmax) to produce an output.
5. Final Layer and Output
The output layer typically has a number of neurons equal to the number of classes in a
classification problem. For binary classification, one output neuron with a sigmoid activation
might be used. For multi-class classification, a Softmax activation function is common, which
converts the output into probabilities.
In the case of regression tasks, the output layer might contain just one neuron with a linear
activation function.
Disadvantages:
2. High Training Time: Requires significant computational resources for large datasets.
Here’s a detailed comparison of CNN and Fully Connected CNN:

Aspect | CNN (Convolutional Neural Network) | Fully Connected CNN (FCCNN)
Purpose | Used for feature extraction and classification tasks, often in image recognition. | Combines feature extraction with dense (fully connected) layers to refine decision-making.
Fully Connected Layers | May or may not include fully connected layers, depending on the task. | Always includes fully connected layers after convolutional layers.
Spatial Information | May lose spatial relationships when flattening data for fully connected layers. | Preserves spatial features to some extent but still flattens data before dense layers.
Output Type | Single class label, probability vector, or feature representation. | Single class label or output based on fully connected layer decisions.
Common Applications | Image classification, object detection, and feature extraction. | Classification tasks requiring rich feature representation or regression tasks.
Advantages | Efficient for tasks requiring hierarchical feature extraction. | Combines the strengths of convolutional layers and fully connected layers for richer outputs.
Unit – 5
A Convolutional Neural Network (CNN) is a special type of deep learning model designed
primarily for analyzing images. CNNs are used in tasks like image recognition, object
detection, and segmentation. They are also important in other fields like autonomous driving,
security systems, and even medical imaging.
CNNs are a form of neural network, but they are unique because they can automatically learn
features from raw image data. Unlike traditional machine learning models that need manual
feature extraction (like identifying edges, shapes, etc.), CNNs can do this automatically,
saving time and improving performance.
They are particularly useful for tasks that involve visual data like recognizing objects in
images. For example:
In self-driving cars, CNNs help the car "see" and recognize objects like pedestrians,
other vehicles, and traffic signs.
In security cameras, CNNs can identify unusual activity or people based on the
patterns they recognize.
2. Inspiration Behind CNN and How They Mimic the Human Brain
CNNs are inspired by the human visual system, specifically the way our brain processes
images. Just like how our brain looks for simple features like lines and curves and combines
them to recognize more complex objects, CNNs do the same through their layers.
1. Convolutional Layers:
o These layers apply filters (small grids of numbers) to the input image, looking
for specific features like edges, shapes, or textures.
o For example, in recognizing a digit like '5', one filter might look for straight
lines, another for curves, etc.
o This process helps the network understand patterns in the image.
2. ReLU Activation Function:
o After convolution, the network uses an activation function called ReLU
(Rectified Linear Unit) to add non-linearity. This helps the network learn
more complex patterns. It also speeds up learning by avoiding problems like
the vanishing gradient (a problem where learning slows down significantly).
3. Pooling Layers:
o Pooling helps reduce the size of the image (feature map), making the network
more efficient and faster.
o Max pooling is commonly used, where the highest value in a grid of pixels is
selected to represent that part of the image. This step reduces the amount of
data the network needs to process.
4. Fully Connected Layers:
o After pooling, the data is flattened (converted into a one-dimensional vector)
and passed through fully connected layers. These layers make the final
prediction, such as classifying the image into a category (e.g., a '5' in digit
recognition).
o The final layer often uses a Softmax function to predict the probabilities of
each class.
Overfitting
Overfitting happens when a model learns the training data too well, including the noise or
random patterns that don’t generalize well to new data. This results in poor performance on
new, unseen data.
Invariance in CNNs
Invariance refers to the property of a model that allows it to recognize an object or pattern in an
image regardless of certain transformations, such as translation (movement), rotation, or scaling. In
other words, if an object in an image is shifted, rotated, or scaled, the CNN should still be able to
identify it correctly.
Translation Invariance:
o Translation invariance means that if an object is shifted (translated) in the image, the
CNN can still detect it. For example, if a car is located at different positions in an
image, the CNN should still be able to identify it as a car, no matter where it appears
in the image.
Rotation Invariance:
o Rotation invariance means that the CNN can recognize an object even if it is rotated
at different angles.
Scale Invariance:
o Scale invariance refers to the ability of the CNN to detect objects regardless of their
size (whether the object is small or large in the image).
Color Invariance:
o This refers to the network's ability to recognize objects regardless of changes in color
(e.g., a red car vs. a blue car).
Stability in CNNs
Stability in CNNs refers to how well the network can handle small changes or
disturbances in the input data. A stable model will give similar outputs when the
inputs are almost the same, even if there are slight changes or noise. This means the
network is robust and not easily affected by small alterations in the data.
Group Formalism
Group formalism is a mathematical framework for understanding the symmetries and invariances of
data. It helps design neural networks that stay consistent under changes like rotations or
translations.
Group: A set of elements (e.g., transformations like rotations, translations) with a defined operation
(e.g., composition) that satisfies certain properties (closure, associativity, identity, inverse).
Equivariance: When the input changes in a certain way (like shifting an image), the output changes in
the same predictable way (the output shifts too).
1. Better Generalization
Neural networks using group formalism are better at handling unseen data because
they understand symmetries in the input.
2. Fewer Data Requirements
Since the model already knows the transformations, it doesn’t need as much training
data to learn them.
3. Improved Interpretability
Group formalism helps reveal patterns in data and the structure of what the model
learns.
Unit – 4
Autoencoders
Autoencoders are a type of unsupervised learning where neural networks are used for
representation learning.
They create a "bottleneck" in the network that forces the data to be compressed,
capturing its most essential features.
Structure
Encoder: Compresses the input into a smaller latent space representation: h = f(x).
Decoder: Reconstructs the input from the latent space: r = g(f(x)).
Loss Function
Measures the difference between the input x and the reconstructed output r.
Example: Mean Squared Error (MSE): L(x, r) = (1/n) Σ (x_i − r_i)²
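The MSE reconstruction loss is simple to sketch directly (the "reconstruction" below is a hypothetical lossy output, not from a trained model):

```python
def mse(x, r):
    """Mean squared error between input x and reconstruction r."""
    return sum((xi - ri) ** 2 for xi, ri in zip(x, r)) / len(x)

original      = [1.0, 2.0, 3.0]
reconstructed = [1.1, 1.9, 3.2]   # slightly lossy, as autoencoder outputs are
loss = mse(original, reconstructed)
print(loss)  # ~0.02: small per-element errors give a small average penalty
```

Training an autoencoder means adjusting the encoder and decoder weights to push this average squared error down across the whole dataset.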
Properties of Autoencoders:
1. Data-Specific: Work well only on data like their training set (e.g., trained on cat
images, not tree images).
2. Lossy Compression: Reconstructed outputs may lose some quality, like MP3 or
JPEG.
3. Learned Representation: Automatically learn how to compress and reconstruct data.
Applications of Autoencoders
1. Denoising:
o Train to remove noise from data. Example: Clean noisy images.
2. Image Colorization:
o Convert black-and-white images into colored ones.
3. Watermark Removal:
o Remove watermarks from images or videos.
4. Data Compression:
o Compress data efficiently for specific types, like images or audio.
Types of Autoencoders
1. Standard Autoencoder
Overview:
Limitations:
May overfit if the latent space is too large (memorizes data instead of generalizing).
Struggles with noise in the input data, leading to poor performance on real-world data.
2. Denoising Autoencoder
Benefits:
Trains on corrupted inputs and learns to reconstruct the clean data, making the learned
features robust to noise.
3. Contractive Autoencoder
Benefits:
Adds a penalty to ensure learned representations are robust to small changes in input
data.
Focuses on stability by reducing sensitivity to minor input variations.
Ways to Constrain an Autoencoder:
1. Bottleneck Layer:
o Reduces the network's capacity by limiting the latent space.
2. Denoising:
o Trains the model to reconstruct clean data from noisy input.
3. Contractive Penalty:
o Penalizes large changes in activations for small changes in input.
Comparison Table

Feature | Standard Autoencoder | Denoising Autoencoder | Contractive Autoencoder
Input Data | Clean | Noisy | Clean
Noise Handling | Poor | Excellent | Not specifically for noise
Regularization | None | Implicit via noise | Explicit (contractive penalty)
Focus | Reconstruction | Robust feature extraction | Stability in encoding
Applications | Compression, Feature Extraction | Denoising, Robust Features | Clustering, Semi-Supervised Learning
Variational Autoencoders (VAEs) are a type of generative model that combines the concepts
of probabilistic modeling and neural networks.
VAEs are used to generate new data samples similar to the training data, making them
widely applicable in tasks like image generation, anomaly detection, and drug discovery.
1. Standard Autoencoders aim to learn a deterministic mapping of input to a latent space and
back.
2. VAEs instead learn a probabilistic mapping, treating the latent space as a probability
distribution, which enables sampling new data points and smooth interpolation.
Structure of a VAE
1. Encoder:
o Maps input to a latent space by generating mean (μ) and variance (σ^2).
2. Latent Space:
o Represents the input as a distribution; a latent vector z is sampled as
z = μ + σ · ε with ε ~ N(0, 1) (the reparameterization trick).
3. Decoder:
o Reconstructs the input from the sampled latent vector .
Advantages of VAEs
1. Generative Capability:
o VAEs can generate new data by sampling from the latent space.
o Similar inputs map to nearby points in the latent space, making interpolation and
exploration possible.
3. Probabilistic Framework:
Limitations of VAEs
1. Blurred Outputs:
o Generated outputs may be less sharp compared to other generative models like
GANs.
2. KL Divergence Trade-off:
o Balancing the reconstruction loss against the KL regularizer is delicate;
overweighting either term hurts sample quality or latent structure.
3. Computational Complexity:
o Training involves sampling and optimizing two networks (encoder and decoder),
making it costlier than a plain autoencoder.
Generative capability: Standard Autoencoder – No; VAE – Yes.
Generative Adversarial Networks (GANs) are a type of generative model designed to create
new data samples that resemble the training data.
They consist of two neural networks: the Generator and the Discriminator, which are trained
simultaneously in an adversarial manner.
Key Components
1. Generator (G):
o Input: Random noise vector (e.g., sampled from a Gaussian or uniform distribution).
2. Discriminator (D):
o Purpose: To distinguish between real data samples (from the dataset) and fake data
samples (produced by the generator).
3. Adversarial Training:
The Generator tries to create data that the Discriminator cannot distinguish
from real data.
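The adversarial objective can be sketched with the standard GAN losses: the discriminator wants D(real) → 1 and D(fake) → 0, while the generator wants D(fake) → 1. The probabilities below are hypothetical discriminator outputs, not from trained networks:

```python
import math

def d_loss(d_real, d_fake):
    """Discriminator loss: reward D(real) -> 1 and D(fake) -> 0."""
    return -(math.log(d_real) + math.log(1.0 - d_fake))

def g_loss(d_fake):
    """Generator loss: reward fooling the discriminator, D(fake) -> 1."""
    return -math.log(d_fake)

# Early in training the discriminator easily spots fakes...
early = g_loss(0.1)
# ...later, convincing fakes drive the generator's loss down.
late = g_loss(0.9)
print(early, late)
```

The two losses pull in opposite directions, which is the "adversarial" part: improving one network raises the pressure on the other.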
Advantages of GANs
1. High-Quality Outputs:
o Can produce sharp, realistic samples (e.g., photorealistic images).
2. Flexibility:
o Applicable to many data types, including images, audio, and text.
3. Unsupervised Learning:
o Can learn without explicit labels, making them useful for unstructured data.
Disadvantages of GANs
1. Training Complexity:
o Adversarial training is unstable and requires carefully balancing the two networks.
2. Mode Collapse:
o The generator may produce only a few modes of the data distribution, losing diversity.
3. Sensitivity to Hyperparameters:
o GANs require careful tuning of parameters like learning rate, architecture, and loss
function.
Entropy measures the uncertainty or randomness of the distribution. The principle of maximum
entropy states that among all possible distributions that satisfy the given constraints, we should
choose the one with the highest entropy.
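The principle can be checked numerically with Shannon entropy, H(p) = −Σ pᵢ log₂ pᵢ: among distributions over the same outcomes, the uniform one maximizes H.

```python
import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(pi * math.log2(pi) for pi in p if pi > 0)

# Among distributions over 4 outcomes, the uniform one has maximum entropy:
uniform = entropy([0.25, 0.25, 0.25, 0.25])   # 2.0 bits
skewed  = entropy([0.7, 0.1, 0.1, 0.1])       # less uncertain, so lower
print(uniform, skewed)
```

Any constraint-free choice therefore defaults to the uniform distribution; adding constraints (e.g., a fixed mean) shifts the maximum-entropy solution away from uniform, but it remains the least-committed distribution consistent with what is known.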