Deep Learning Module-03
Definition
• Optimization: Adjusting model parameters (weights, biases) to minimize the loss
function.
• Loss Function: Measures the error between predicted outputs and actual targets.
• Goal: Find parameters that reduce the error and improve predictions.
Key Objective
• Generalization: The model should perform well on unseen data, avoiding:
o Overfitting: Model is too complex, learns noise, performs poorly on new data.
o Underfitting: Model is too simple and fails to capture the underlying patterns.
21CS743 | DEEP LEARNING
Challenges
o Loss surfaces are complex with many local minima and saddle points.
▪ Local Minima: Points where the loss is low, but not the lowest.
Strategies to Overcome Challenges
o Adam, RMSprop: Advanced methods that adapt learning rates during training.
• Regularization Techniques:
o Dropout: Randomly deactivates neurons during training so the network does not rely on individual neurons.
o Adaptive Methods: Adjust learning rates based on gradient history for stable
training.
Empirical Risk Minimization (ERM)
Concept
• It involves minimizing the average loss on the training data to approximate the true risk
or error on the entire data distribution.
• The objective of ERM is to train a model that performs well on unseen data by minimizing the empirical risk derived from the training set.
Mathematical Formulation
The empirical risk is calculated as the average loss over the training set:
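In standard notation, with training pairs (x_i, y_i), model f_θ, and loss function L, the empirical risk is:

```latex
R_{\text{emp}}(\theta) = \frac{1}{n} \sum_{i=1}^{n} L\big(f_\theta(x_i),\, y_i\big)
```

Minimizing R_emp over θ is the ERM objective; it approximates the true (expected) risk over the full data distribution.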
Overfitting vs. Generalization
1. Overfitting:
o Occurs when the model performs extremely well on the training data but poorly on unseen test data.
o The model learns the noise and specific patterns in the training set, which do not
generalize.
2. Generalization:
o A generalized model strikes a balance between fitting the training data and
maintaining good performance on the test data.
Regularization Techniques
To combat overfitting and enhance generalization, several regularization techniques are employed:
1. L1/L2 Regularization (Weight Decay):
o Adds a penalty on large weights to the loss function, discouraging overly complex models.
2. Dropout:
o During each training iteration, some neurons are randomly ignored (set to zero), which helps reduce overfitting.
o This prevents units from co-adapting too much, forcing the network to learn more robust features.
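The mechanism above can be sketched as "inverted dropout" in NumPy. This is an illustrative sketch, not a library API; the function name `dropout_forward` and the scaling-by-1/(1−p) convention are assumptions:

```python
import numpy as np

def dropout_forward(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each unit with probability p during training,
    scaling survivors by 1/(1-p) so the expected activation is unchanged."""
    if not training or p == 0.0:
        return x  # at inference time the layer is a no-op
    rng = np.random.default_rng() if rng is None else rng
    mask = (rng.random(x.shape) >= p) / (1.0 - p)
    return x * mask

x = np.ones((4, 8))
y = dropout_forward(x, p=0.5, rng=np.random.default_rng(0))  # roughly half zeros
z = dropout_forward(x, p=0.5, training=False)                # unchanged
```

Because survivors are rescaled during training, no extra scaling is needed at test time.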
1. Non-Convexity
• Challenges:
o Multiple Local Minima: Loss is low but not the lowest globally.
o Saddle Points: Gradients are zero but not at minima or maxima, causing slow
convergence.
• Visualization: Loss landscape diagrams show complex terrains with hills, valleys, and flat
regions.
2. Vanishing and Exploding Gradients
• Vanishing Gradients:
o Gradients become extremely small as they propagate backward through many layers, so early layers learn very slowly.
• Exploding Gradients:
o Gradients grow uncontrollably large, causing unstable updates during training.
• Solutions:
V
o Gradient Clipping: Caps gradients to prevent them from becoming too large.
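Gradient clipping by global norm can be sketched as follows (the helper name `clip_by_global_norm` is illustrative; deep learning frameworks provide their own versions of this operation):

```python
import numpy as np

def clip_by_global_norm(grads, max_norm=1.0):
    """Scale a list of gradient arrays so their combined L2 norm
    does not exceed max_norm; leave them untouched otherwise."""
    total = np.sqrt(sum(float((g ** 2).sum()) for g in grads))
    if total <= max_norm:
        return grads
    scale = max_norm / total
    return [g * scale for g in grads]

grads = [np.array([3.0, 4.0]), np.array([12.0])]   # global norm = 13
clipped = clip_by_global_norm(grads, max_norm=1.0)  # rescaled to norm 1
```

Scaling every array by the same factor preserves the gradient's direction while capping its magnitude.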
3. Ill-Conditioned Problems
• Impact: Inefficient training, with some parameters updating too quickly or too slowly.
• Solution:
o Normalization Techniques:
▪ Batch Normalization: Normalizes layer inputs for consistent scaling.
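The training-mode forward pass of batch normalization can be sketched in NumPy (a minimal sketch: learnable gamma/beta are shown as scalars, and the running statistics used at inference are omitted):

```python
import numpy as np

def batch_norm(x, gamma=1.0, beta=0.0, eps=1e-5):
    """Standardize each feature over the batch axis, then apply a
    learnable scale (gamma) and shift (beta)."""
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    return gamma * x_hat + beta

# Features on very different scales become comparably scaled:
x = np.array([[1.0, 100.0], [3.0, 300.0], [5.0, 500.0]])
y = batch_norm(x)  # each column now has ~zero mean and ~unit variance
```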
1. Gradient Descent (GD)
• Concept: Gradient Descent is an optimization algorithm used to minimize a loss function by updating the model's parameters iteratively.
• Process: Compute the gradient of the loss with respect to the parameters, then update each parameter in the opposite direction: θ ← θ − η ∇θL(θ), where η is the learning rate.
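The update rule can be sketched on a one-dimensional quadratic (the function `gradient_descent` and the choice of lr=0.1 are illustrative):

```python
import numpy as np

def gradient_descent(grad_fn, theta0, lr=0.1, steps=100):
    """Plain gradient descent: theta <- theta - lr * grad(theta)."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(steps):
        theta = theta - lr * grad_fn(theta)
    return theta

# Minimize f(theta) = (theta - 3)^2, whose gradient is 2*(theta - 3):
theta_star = gradient_descent(lambda t: 2 * (t - 3.0), theta0=[0.0])
```

On this convex loss the iterates converge to the minimizer θ = 3; on the non-convex losses of deep networks the same rule only finds a local minimum or saddle region.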
2. Stochastic Gradient Descent (SGD)
• Concept:
Stochastic Gradient Descent improves upon standard GD by updating the model
parameters using a randomly selected mini-batch of the training data rather than the
entire dataset.
• Advantages:
o Faster Updates: Each update is quicker since it uses a small batch of data.
o Efficiency: Reduces computational cost, especially for large datasets.
• Challenges:
o Noisy Updates: Mini-batch gradients are noisy estimates of the full gradient, so the loss can fluctuate from step to step.
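Mini-batch SGD can be sketched on a synthetic linear-regression problem (batch size 32, lr 0.1, and the data shapes are all illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_w = np.array([2.0, -1.0])
y = X @ true_w + 0.01 * rng.normal(size=200)

w = np.zeros(2)
lr, batch_size = 0.1, 32
for epoch in range(50):
    idx = rng.permutation(len(X))            # shuffle each epoch
    for start in range(0, len(X), batch_size):
        b = idx[start:start + batch_size]    # one random mini-batch
        err = X[b] @ w - y[b]                # residuals on this batch
        grad = 2 * X[b].T @ err / len(b)     # gradient of mean squared error
        w -= lr * grad                       # one cheap update per batch
```

Each epoch performs several parameter updates instead of one, which is exactly the speed advantage listed above; the price is the batch-to-batch noise in `grad`.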
3. Learning Rate
• Definition: The learning rate controls the size of the step taken towards minimizing the
loss during each update.
• Impact:
o Too high a learning rate can cause divergence or oscillation; too low a rate makes convergence very slow.
• Strategies:
o Learning Rate Decay: Gradually reduce the learning rate as training progresses.
o Warm Restarts: Periodically reset the learning rate to a higher value to escape
local minima.
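The two strategies above can be sketched as simple schedule functions (the names, the drop factor 0.5, and the cosine form of the warm restart are illustrative choices):

```python
import math

def step_decay(lr0, epoch, drop=0.5, every=10):
    """Learning rate decay: halve the rate every `every` epochs."""
    return lr0 * (drop ** (epoch // every))

def cosine_warm_restart(lr0, epoch, period=10):
    """Cosine decay that resets to lr0 at the start of each period,
    a simple form of warm restarts."""
    t = epoch % period
    return 0.5 * lr0 * (1 + math.cos(math.pi * t / period))

lrs = [step_decay(0.1, e) for e in range(30)]  # 0.1, ..., 0.05, ..., 0.025
```

The periodic jump back to `lr0` in the warm-restart schedule is what gives the optimizer a chance to escape a local minimum.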
4. Momentum
• Update Rule:
o v_t = γ·v_{t−1} + η·∇θL(θ), then θ ← θ − v_t, where γ (e.g., 0.9) is the momentum coefficient.
• Benefits:
o Smooths the update direction, dampening oscillations and accelerating progress along consistent gradient directions.
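The update rule can be sketched as follows (the function name, lr=0.05, and the quadratic test loss are illustrative):

```python
import numpy as np

def momentum_step(theta, v, grad, lr=0.01, gamma=0.9):
    """Classical momentum: v accumulates an exponentially decaying
    sum of past gradients, then theta moves along -v."""
    v = gamma * v + lr * grad
    theta = theta - v
    return theta, v

# Minimize f(theta) = theta^2 (gradient 2*theta) with momentum:
theta, v = np.array([5.0]), np.array([0.0])
for _ in range(200):
    theta, v = momentum_step(theta, v, 2 * theta, lr=0.05, gamma=0.9)
```

Because `v` remembers past gradients, consecutive steps in the same direction build up speed, while alternating gradients partially cancel, which is the dampening effect described above.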
1. Importance of Proper Initialization
• Prevents Vanishing/Exploding Gradients:
o Poor initialization can lead to gradients that either vanish (become too small) or explode (become too large), hindering effective learning.
• Accelerates Convergence:
o Ensures that the model starts training with meaningful gradients, leading to efficient optimization.
2. Initialization Strategies
• Xavier (Glorot) Initialization:
o Ensures that the variance of the outputs of a layer remains roughly constant across layers.
• Benefits:
o Balances the scale of gradients flowing in both forward and backward directions.
• He Initialization:
o Concept: Accounts for the fact that ReLU activation outputs are not symmetrically distributed around zero.
• Benefits:
o Prevents the dying ReLU problem (where neurons output zero for all inputs).
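Both strategies amount to choosing the standard deviation of random weights from the layer's fan-in/fan-out; a minimal NumPy sketch (the function names are illustrative, the variance formulas are the standard Xavier and He ones):

```python
import numpy as np

def xavier_init(fan_in, fan_out, rng):
    """Xavier/Glorot: variance 2/(fan_in + fan_out) keeps activation and
    gradient scales roughly constant across tanh/sigmoid layers."""
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

def he_init(fan_in, fan_out, rng):
    """He: variance 2/fan_in compensates for ReLU zeroing half its inputs."""
    std = np.sqrt(2.0 / fan_in)
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = he_init(512, 256, rng)  # weights for a 512 -> 256 ReLU layer
```

Note the He variance is larger than the Xavier variance for the same fan-in, reflecting the lost activation mass under ReLU.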
3. Practical Impact
• Faster Convergence:
o Proper initialization provides a good starting point for optimization, reducing the number of training iterations needed to converge.
o Empirical studies show that networks with proper initialization not only converge
faster but also achieve better final accuracy.
1. Motivation
o Fixed learning rates can be ineffective as they do not account for the varying
characteristics of different layers or the nature of the training data.
o Certain parameters may require larger updates, while others may need smaller
adjustments. Adaptive learning rates enable the model to adjust learning based on
the training dynamics.
2. AdaGrad
• Concept:
o AdaGrad (Adaptive Gradient Algorithm) adapts the learning rate for each parameter based on the past gradients. It increases the learning rate for infrequent features and decreases it for frequent features, making it particularly effective for sparse data scenarios.
• Advantages:
o Good for Sparse Data: AdaGrad performs well in scenarios where features have
varying frequencies, such as in natural language processing tasks.
• Challenges:
o Rapid Learning Rate Decay: The learning rate can decrease too quickly, leading
to premature convergence and potentially suboptimal solutions.
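The per-parameter update and the decay issue can both be seen in a small sketch (function name and lr=0.5 are illustrative):

```python
import numpy as np

def adagrad_step(theta, cache, grad, lr=0.5, eps=1e-8):
    """AdaGrad: divide the step by the root of the accumulated sum of
    squared gradients, so frequently-updated parameters slow down."""
    cache = cache + grad ** 2          # grows monotonically, forever
    theta = theta - lr * grad / (np.sqrt(cache) + eps)
    return theta, cache

# Minimize f(theta) = theta^2 (gradient 2*theta):
theta, cache = np.array([5.0]), np.array([0.0])
for _ in range(100):
    theta, cache = adagrad_step(theta, cache, 2 * theta)
```

Because `cache` only grows, the effective learning rate lr/√cache shrinks throughout training, which is exactly the premature-convergence risk noted above.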
3. RMSProp
• Concept:
o RMSProp (Root Mean Square Propagation) improves upon AdaGrad by using a moving average of squared gradients, addressing the rapid decay issue of AdaGrad's learning rate.
• Advantages:
o Keeps effective learning rates from shrinking toward zero, so training does not stall on long runs.
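The only change from AdaGrad is replacing the ever-growing sum with an exponential moving average; a sketch (function name, lr=0.05, and beta=0.9 are illustrative):

```python
import numpy as np

def rmsprop_step(theta, avg, grad, lr=0.01, beta=0.9, eps=1e-8):
    """RMSProp: exponential moving average of squared gradients, so the
    denominator forgets old gradients instead of growing forever."""
    avg = beta * avg + (1 - beta) * grad ** 2
    theta = theta - lr * grad / (np.sqrt(avg) + eps)
    return theta, avg

# Minimize f(theta) = theta^2 (gradient 2*theta):
theta, avg = np.array([5.0]), np.array([0.0])
for _ in range(300):
    theta, avg = rmsprop_step(theta, avg, 2 * theta, lr=0.05)
```

Because `avg` decays, the step size stays near lr·grad/|grad| even late in training, unlike AdaGrad's vanishing steps.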
1. Factors to Consider
• Data Size:
o Large datasets may require optimization algorithms that can handle more frequent
updates (e.g., SGD or mini-batch variants).
o Smaller datasets may benefit from adaptive methods that adjust learning rates (e.g., AdaGrad or Adam).
• Model Complexity:
o Complex models (deep networks) can benefit from algorithms that adjust learning rates dynamically (e.g., RMSProp or Adam) to navigate complex loss surfaces effectively.
• Computational Resources:
o Resource availability may dictate the choice of algorithm. Some algorithms (e.g.,
Adam) are more computationally intensive due to maintaining additional state
information (like momentum and moving averages).
2. Comparison of Common Optimizers
• SGD:
o Cons: Requires careful tuning of learning rates and may converge slowly.
• AdaGrad:
o Pros: Adapts learning rates based on parameter frequency; effective for sparse data.
o Cons: Tends to slow down learning too quickly due to rapid decay of learning rates.
• RMSProp:
o Pros: Uses a moving average of squared gradients to adapt learning rates; performs well in non-stationary problems.
• Adam:
o Pros: Combines momentum with adaptive learning rates; generally performs well across a wide range of tasks and is robust to hyperparameter settings.
o Cons: More complex to implement and requires careful tuning for optimal performance.
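Adam's update combines the momentum and RMSProp ideas compared above; a sketch with the standard bias-corrected moments (function name, lr=0.1, and the quadratic test loss are illustrative):

```python
import numpy as np

def adam_step(theta, m, v, grad, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    """Adam: momentum (m) plus an RMSProp-style second moment (v),
    with bias correction for the zero-initialized averages."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)   # bias correction, t starts at 1
    v_hat = v / (1 - b2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

# Minimize f(theta) = theta^2 (gradient 2*theta):
theta, m, v = np.array([5.0]), np.array([0.0]), np.array([0.0])
for t in range(1, 501):
    theta, m, v = adam_step(theta, m, v, 2 * theta, t)
```

The extra state (`m` and `v` per parameter) is the memory cost mentioned under Computational Resources above.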
3. Practical Tips
o For most tasks, beginning with the Adam optimizer is recommended due to its robust default behavior and adaptive learning rates.
o Experiment with different learning rates to find the best fit for your specific model and data. A common approach is to perform a learning rate search or use techniques like cyclical learning rates.
1. Image Classification with a CNN (CIFAR-10)
• Objective:
o Compare SGD and RMSProp when training a CNN on CIFAR-10 in terms of learning curves, loss, and accuracy.
• Dataset:
o CIFAR-10 consists of 60,000 32x32 color images in 10 classes, with 6,000 images
per class. The classes include airplanes, cars, birds, cats, deer, dogs, frogs, horses,
and trucks.
• Model Architecture:
o Use a simple CNN architecture with convolutional layers, ReLU activation, pooling layers, and a fully connected output layer.
• Training Process:
o Implement two training runs: one using SGD and the other using RMSProp.
o Hyperparameters:
▪ Learning Rate: Set initial values (e.g., 0.01 for SGD, 0.001 for RMSProp).
• Comparison Metrics:
o Learning Curves: Plot training and validation accuracy and loss over epochs for
both optimizers.
o Loss and Accuracy: Analyze final training and validation loss and accuracy after
training completion.
• Expected Results:
o RMSProp is expected to converge faster and more smoothly than SGD, which may require more epochs or more careful learning-rate tuning to reach comparable accuracy.
2. NLP Task with RNN/Transformer
• Objective:
o Compare the convergence behavior of different optimizers (e.g., SGD, AdaGrad, RMSProp) on a sequence-modeling task.
• Dataset:
o Use a text dataset such as IMDB reviews for sentiment analysis or any sequence data suitable for RNNs or Transformers.
• Model Architecture:
o Include layers such as LSTM or GRU for RNNs, or attention mechanisms for
Transformers.
• Training Process:
o Hyperparameters:
▪ Learning Rates: Start with different learning rates for each optimizer.
• Comparison Metrics:
o Loss Curves: Visualize the loss curves for each optimizer to show convergence
behavior.
o Training Performance: Analyze the final training and validation accuracy and
loss.
• Expected Results:
o RMSProp and AdaGrad may show better performance than SGD, particularly in tasks where the data is sparse or where gradients can vanish, leading to slower convergence.
3. Visualization
• Loss Curves:
o Plot the training and validation loss curves for each optimizer used in both case
studies. This visualization will demonstrate:
▪ Convergence Speed: How quickly each optimizer reaches a low loss value.
▪ Stability: The stability of loss reduction over time and the presence of
fluctuations.
• Learning Curves:
o Include plots of training and validation accuracy over epochs for visual comparison of model performance across different optimizers.