UNIT-IV: Improving Deep Neural Networks: Data Augmentation - Under-fitting vs.
over-fitting. Training Aspects of CNNs, Regularization, Weight Initialization, Activation
Functions, Normalization, Hyperparameters in CNNs, Transfer Learning, and Fine Tuning in
CNNs
Bias
● Bias is the error from wrong assumptions in the model.
● A high-bias model is too simple to learn the true patterns.
● It underfits the data (poor performance on both training and test sets).
📌 Example: Using a straight line to fit curved data.
Variance
● Variance is the error from the model being too sensitive to training data.
● A high-variance model memorizes the training data and overfits.
● It performs well on training data but poorly on new data.
📌 Example: A deep neural net trained on very few images.
Bias-Variance Trade-off
● You need to balance bias and variance to build a good model.
○ Too much bias → underfitting.
○ Too much variance → overfitting.
● The goal is to find a sweet spot where the model generalizes well to unseen data.
Student Analogy – Understanding Bias & Variance
| Student | Behavior | Outcome in Class Test | Outcome in Monthly Test | Conclusion |
|---------|----------|-----------------------|-------------------------|------------|
| A | Distracted, pays no attention | ~50% (random guessing) | ~50% | Underfitting |
| B | Memorizes everything | 98% | Low score | Overfitting |
| C | Learns concepts & solves problems | 92% | Consistent score | Good Generalization |
➡️ Similar to:
● A = Simple model (can’t learn)
● B = Overly complex model (memorizes)
● C = Balanced model (generalizes well)
Underfitting
● The model is too simple to learn the underlying pattern in the data.
● Occurs due to high bias.
● Poor performance on both training and test data.
● Fails to capture important trends in the data.
● Causes: Simple model, insufficient training, too few parameters.
● Example: Using a linear model to classify complex image data (a 1-layer CNN on MNIST).
Overfitting
● The model is too complex and learns noise along with patterns.
● Occurs due to high variance.
● Performs very well on training data, but poorly on test data.
● Fails to generalize to unseen data.
● Causes: Complex model, small training dataset, too many parameters.
● Example: A deep neural network memorizing a small set of images (a deep CNN on 100 cat images).
Real-Life Analogy: Fruit Recognition
● Underfitting: Seen only one apple & banana → can't identify new fruit.
● Overfitting: Memorized exact training images → fails on rotated banana.
● Good Fit: Seen various types → generalizes to unseen examples.
Data Augmentation
● Data augmentation is a technique used to artificially increase the size and
diversity of the training dataset.
● It involves applying various transformations to the original data to create new,
modified examples.
● Helps improve model generalization and prevents overfitting, especially when the
dataset is small.
● Commonly used in training Convolutional Neural Networks (CNNs) for image
classification tasks.
Common Techniques of Data Augmentation
1. Flipping – Horizontally or vertically flipping the image.
2. Rotation – Rotating the image by a small angle (e.g., ±10°).
3. Scaling – Zooming in or out of the image.
4. Cropping – Randomly cutting parts of the image.
5. Brightness/Contrast Adjustment – Modifying lighting conditions.
6. Translation – Shifting the image left, right, up, or down.
7. Noise Injection – Adding random noise to make the model robust.
Purpose:
● Prevents overfitting
● Exposes model to variations
● Encourages robust feature learning
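A minimal sketch of such an augmentation pipeline using torchvision.transforms is shown below; the specific parameter values (rotation angle, crop size, jitter strengths) are illustrative assumptions, not prescribed settings.
import torchvision.transforms as T

# Illustrative augmentation pipeline for image classification
train_transforms = T.Compose([
    T.RandomHorizontalFlip(p=0.5),                      # flipping
    T.RandomRotation(degrees=10),                       # rotation by ±10°
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),         # scaling + cropping
    T.ColorJitter(brightness=0.2, contrast=0.2),        # brightness/contrast adjustment
    T.RandomAffine(degrees=0, translate=(0.1, 0.1)),    # translation (shifting)
    T.ToTensor(),                                       # convert PIL image to tensor
])
# Applied to each training image, e.g. augmented = train_transforms(pil_image)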
Training Aspects of CNNs:
1. Regularization Techniques
Purpose: Prevents overfitting by limiting the complexity of the model.
1. Dropout
○ Definition: Randomly disables a percentage of neurons during training.
○ Goal: Prevents overfitting by ensuring that the model doesn't rely too much
on any single neuron.
2. L2 Regularization (Weight Decay)
○ Definition: Adds a penalty term to the loss function based on the squared
values of the weights.
○ Goal: Prevents the model from learning excessively large weights, which
could lead to overfitting.
○ Formula: Loss_total = Loss + λ Σ wᵢ², where λ is the regularization parameter.
3. L1 Regularization
○ Definition: Adds a penalty term to the loss function based on the absolute
values of the weights.
○ Goal: Encourages sparsity in the model (many weights become zero), leading
to simpler models.
○ Formula: Loss_total = Loss + λ Σ |wᵢ|, where λ is the regularization parameter.
4. Data Augmentation
○ Definition: Artificially increases the size of the training dataset by applying
random transformations to the images (e.g., flipping, rotation, scaling,
cropping).
○ Goal: Increases the variety of data the model sees, preventing it from
memorizing the training examples and improving generalization.
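A minimal PyTorch sketch combining Dropout and L2 weight decay; the layer sizes, dropout rate, and λ value here are illustrative assumptions.
import torch.nn as nn
import torch.optim as optim

# Dropout inside a small classifier head (illustrative sizes)
model = nn.Sequential(
    nn.Linear(256, 128),
    nn.ReLU(),
    nn.Dropout(p=0.5),      # randomly disables 50% of neurons during training
    nn.Linear(128, 10),
)

# L2 regularization (weight decay) is applied through the optimizer's weight_decay term
optimizer = optim.SGD(model.parameters(), lr=0.01, weight_decay=1e-4)  # λ = 1e-4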
2. Weight Initialization Techniques in CNNs
1. Zero Initialization
○ Definition: All weights are initialized to zero.
○ Problem: Fails to break symmetry: every neuron in the layer receives identical updates and learns the same thing, making the network ineffective.
○ Conclusion: Not used in practice.
2. Random Initialization
○ Definition: Weights are initialized with small random values drawn from a
normal or uniform distribution.
○ Problem: If the values are too small, it can lead to vanishing gradients; if
too large, it can lead to exploding gradients.
3. Normal Distribution Initialization
○ Definition: Weights are drawn from a normal (Gaussian) distribution,
centered around a mean (usually 0) with a certain standard deviation.
○ Example: torch.nn.init.normal_(tensor, mean=0.0, std=0.02)
○ Advantage: Helps in maintaining variance in the model and prevents both
vanishing and exploding gradients.
4. Uniform Distribution Initialization
○ Definition: Weights are initialized using a uniform distribution where all
values within a range are equally likely.
○ Example: torch.nn.init.uniform_(tensor, a=-0.1, b=0.1)
○ Advantage: Ensures that all neurons start with different values, which is
important for symmetry breaking.
5. Xavier Initialization (Glorot Initialization)
○ Used For: Sigmoid and tanh activation functions.
○ Definition: Weights are initialized with values that maintain the variance
across layers. The goal is to keep the variance of activations and gradients
the same across layers.
○ Formula (n_in = number of inputs to the layer, n_out = number of outputs):
■ For Uniform Distribution: W ~ U[-√(6 / (n_in + n_out)), +√(6 / (n_in + n_out))]
■ For Normal Distribution: W ~ N(0, 2 / (n_in + n_out))
○ Advantage: Helps in preventing vanishing/exploding gradients.
6. He Initialization (Kaiming Initialization)
○ Used For: ReLU and its variants (Leaky ReLU, ELU).
○ Definition: Weights are initialized with higher variance to account for the fact
that ReLU neurons "kill" half the activations (negative ones).
○ Formula: W ~ N(0, 2 / n_in), where n_in is the number of inputs to the layer.
○ Advantage: Keeps the variance large enough to prevent dying ReLU
problems.
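A short PyTorch sketch of the initializers discussed above; the layer shapes are illustrative assumptions.
import torch.nn as nn

conv = nn.Conv2d(3, 64, kernel_size=3)   # ReLU layer → He initialization
fc = nn.Linear(512, 10)                  # tanh/sigmoid layer → Xavier initialization

nn.init.kaiming_normal_(conv.weight, nonlinearity='relu')   # He (Kaiming)
nn.init.xavier_uniform_(fc.weight)                          # Xavier (Glorot)
nn.init.normal_(fc.bias, mean=0.0, std=0.02)                # plain normal initialization
nn.init.zeros_(conv.bias)                                   # biases often start at zero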
3. Activation Functions
Sigmoid:
● Formula: σ(x) = 1 / (1 + e^(-x))
● Range: 0 to 1
● Use: Binary classification
● Issue: Vanishing gradients.
Tanh:
● Range: -1 to 1
● Use: Hidden layers
● Issue: Vanishing gradients.
ReLU:
● Formula: ReLU(x)=max(0,x)
● Range: 0 to ∞
● Use: Hidden layers
● Issue: Dying ReLU problem.
Leaky ReLU:
● Formula: Leaky ReLU(x)=max(0.01x,x)
● Range: (-∞ to ∞)
● Use: Hidden layers, solves dying ReLU.
Softmax:
● Formula: softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)
● Range: 0 to 1 (probabilities)
● Use: Multi-class classification.
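The functions above can be compared directly in PyTorch; the input values below are arbitrary placeholders.
import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, 0.0, 3.0])

print(torch.sigmoid(x))        # squashes values into (0, 1)
print(torch.tanh(x))           # squashes values into (-1, 1)
print(F.relu(x))               # negatives become 0
print(F.leaky_relu(x, 0.01))   # negatives scaled by 0.01
print(F.softmax(x, dim=0))     # probabilities that sum to 1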
4. Normalization
Normalization refers to the technique of adjusting the input data or activations in neural networks so that they lie within a certain range, helping to improve the efficiency and stability of the training process.
Common Normalization Techniques:
1. Batch Normalization (BN):
○ Purpose: Reduces internal covariate shift by normalizing activations for each
mini-batch.
○ How: For each layer, normalize the input features to have zero mean and unit
variance.
○ Benefits: Speeds up training, allows higher learning rates, reduces
overfitting.
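A minimal sketch of Batch Normalization inside a convolutional block; the channel counts are illustrative assumptions.
import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),   # normalizes each channel to zero mean, unit variance per mini-batch
    nn.ReLU(),
)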
5. Loss Function
● Purpose: Measures how far the model’s predictions are from the true labels.
● Examples:
○ Cross-Entropy Loss: For classification.
○ Mean Squared Error (MSE): For regression.
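In PyTorch these loss functions look as follows; the logits and targets are random placeholder values.
import torch
import torch.nn as nn

criterion_cls = nn.CrossEntropyLoss()   # classification (expects raw logits + class indices)
criterion_reg = nn.MSELoss()            # regression

logits = torch.randn(4, 3)              # 4 samples, 3 classes (placeholder values)
targets = torch.tensor([0, 2, 1, 0])
loss = criterion_cls(logits, targets)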
6. Optimizer
● Purpose: Adjusts the weights based on the gradient of the loss function.
● Common Optimizers:
○ Adam
○ SGD (Stochastic Gradient Descent)
○ RMSprop
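A minimal sketch of creating these optimizers; the model and learning rates are placeholders.
import torch.nn as nn
import torch.optim as optim

model = nn.Linear(10, 2)   # placeholder model

optimizer = optim.Adam(model.parameters(), lr=0.001)
# Alternatives:
# optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optim.RMSprop(model.parameters(), lr=0.001)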
7. Learning Rate
● Purpose: Controls the step size when updating the weights.
● Impact:
○ Too high → Can lead to unstable learning.
○ Too low → Training can be very slow and might get stuck in suboptimal
solutions.
8. Early Stopping
● Purpose: Stops training when the performance on the validation set stops improving.
● Goal: Prevents overfitting by halting training before the model memorizes the training
data.
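A minimal sketch of early-stopping logic; the per-epoch validation losses and the patience value are hypothetical placeholders.
# Stop when validation loss has not improved for `patience` consecutive epochs
val_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64]  # placeholder per-epoch values
best_val_loss = float('inf')
patience, wait = 3, 0

for epoch, val_loss in enumerate(val_losses):
    if val_loss < best_val_loss:
        best_val_loss = val_loss
        wait = 0                 # improvement: reset the counter
    else:
        wait += 1
        if wait >= patience:
            print(f"Early stopping at epoch {epoch}")
            break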
9. Hyperparameters in CNNs
Hyperparameters are parameters set before training that greatly affect how well a CNN
learns and performs. Tuning them properly is essential for optimal model performance.
1. Number of Layers
● Defines the depth of the CNN (how many convolutional + pooling layers).
● Deeper networks can learn more complex features.
● But too deep → risk of overfitting.
● Start with fewer layers and increase gradually based on performance.
2. Filter Size
● Determines the receptive field (how much of the image is seen by a neuron).
● Larger filters (e.g., 5x5) → capture more spatial info, but more parameters.
● Smaller filters (e.g., 3x3) → fewer parameters, but might miss global features.
● Common practice: use 3x3 filters in most modern CNNs.
3. Stride
● Controls how much the filter moves across the input.
● Stride = 1 (default) → more detailed output.
● Stride > 1 → reduces feature map size (downsampling).
● Trade-off: smaller stride = more info, larger stride = faster but info loss.
4. Padding
● Adds zeros around the input to preserve its size after convolution.
● Types:
○ Same padding: output size = input size.
○ Valid padding: no padding; output shrinks.
● Helps retain edge information.
● Slightly increases computation and memory use.
5. Learning Rate
● Controls the step size during weight updates.
● High learning rate → fast but unstable learning.
● Low learning rate → stable but slow learning.
● Needs fine-tuning; often adjusted dynamically during training.
6. Batch Size
● Number of training samples processed before the model updates weights.
● Large batch size:
○ Stable gradients.
○ Higher memory usage.
● Small batch size:
○ Lower memory.
○ Faster updates, but noisier gradients.
● Common values: 16, 32, 64, 128 (depending on GPU capacity).
7. Number of Epochs
● One epoch = one full pass over the entire dataset.
● Too few epochs → underfitting (model hasn’t learned enough).
● Too many epochs → overfitting (model memorizes training data).
● Use early stopping to halt training when validation performance stops improving.
8. Dropout Rate
● Regularization method to prevent overfitting.
● Randomly "drops" a fraction of neurons during training.
● Common dropout rates: 0.2 to 0.5.
● Helps ensure neurons don’t become overly reliant on each other (co-adaptation).
Hyperparameter Optimization Techniques
These techniques help in finding the best set of hyperparameters for training a CNN
effectively.
1. Grid Search
● Definition: Exhaustively tries all possible combinations of hyperparameters from a
predefined set (grid).
● ✅ Simple and systematic.
● ❌ Computationally expensive, especially with many parameters or large datasets.
2. Random Search
● Definition: Randomly samples a fixed number of hyperparameter combinations from
defined ranges.
● ✅ More efficient than grid search in high-dimensional spaces.
● ✅ Often finds good results faster.
● ❌ May miss optimal combinations.
3. Bayesian Optimization
● Definition: Builds a probabilistic model of the objective function and uses it to
choose the next set of hyperparameters.
● ✅ Smart and efficient search.
● ✅ Focuses on promising regions of the parameter space.
● ❌ More complex to implement.
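A minimal random-search sketch in plain Python; the search space and the number of trials are hypothetical assumptions, and the score here is a placeholder for a real train-and-validate run.
import random

search_space = {
    'learning_rate': [1e-4, 1e-3, 1e-2],
    'batch_size': [16, 32, 64, 128],
    'dropout_rate': [0.2, 0.3, 0.5],
}

best_score, best_config = 0.0, None
for trial in range(10):   # number of random trials
    config = {name: random.choice(values) for name, values in search_space.items()}
    score = random.random()   # placeholder: replace with real validation accuracy
    if score > best_score:
        best_score, best_config = score, config

print(best_config, best_score)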
Transfer Learning with Convolutional Neural Networks (CNNs)
Transfer learning allows a CNN trained on a large dataset (like ImageNet) to be reused for a
related but different task, saving time, computation, and data.
Why Use Transfer Learning?
● Pre-trained CNNs learn general visual features (edges, textures, shapes).
● Saves time and resources.
● Improves performance on small datasets.
● Reduces the need for extensive training.
Key Idea:
● Use pre-trained CNN as a feature extractor.
● Freeze pre-trained layers (do not update them).
● Add new task-specific layers on top and train them on your dataset.
Common Pre-trained Models:
● VGG
● ResNet
● Inception
● MobileNet
(Available in TensorFlow, PyTorch, etc.)
Steps to Implement Transfer Learning
1. Select Pre-trained Model
○ Choose a model suited to your problem (e.g., image classification, detection).
2. Load Model without Top Layers
○ Remove the original fully-connected layers (used for previous task).
3. Customize the Model
○ Add your own layers (e.g., Dense, Dropout, Softmax) for the new
classification task.
4. Freeze Pre-trained Layers
○ Prevent updates to these layers during training.
5. Prepare the Dataset
○ Resize, normalize, and augment images to match the input format of the
model.
6. Train the Model
○ Only newly added layers are trained. Use appropriate optimizer and loss
function.
7. Fine-tune (Optional)
○ Unfreeze some top layers of the pre-trained model and re-train with a low
learning rate.
8. Evaluate the Model
○ Use validation/test data. Assess performance using metrics like accuracy,
loss, precision, or recall.
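A minimal sketch of these steps using a pre-trained ResNet-18 from torchvision; the 5-class target task is an assumption, and the weights argument requires a recent torchvision version (older releases used pretrained=True).
import torch.nn as nn
import torch.optim as optim
from torchvision import models

# Steps 1-2: load a pre-trained model without its original top classifier behaviour
model = models.resnet18(weights='IMAGENET1K_V1')

# Step 4: freeze all pre-trained layers
for param in model.parameters():
    param.requires_grad = False

# Step 3: replace the final fully-connected layer for the new task (assume 5 classes)
model.fc = nn.Linear(model.fc.in_features, 5)   # new layer has requires_grad=True by default

# Step 6: train only the newly added layer
optimizer = optim.Adam(model.fc.parameters(), lr=0.001)
criterion = nn.CrossEntropyLoss()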
Fine-tuning Strategies in CNNs (Transfer Learning)
1. Freezing Layers
● Purpose: Preserve previously learned features.
● How:
○ Freeze early layers (low-level features like edges/textures).
○ Optionally freeze all layers except the final one.
○ Only train the last few or new layers.
2. Modifying Layers
● Output Layer:
○ Replace if the number of classes is different from the original task.
● Input Layer:
○ Adjust if the input feature size or shape has changed.
● Add New Layers:
○ Add custom layers (Dense, Dropout, etc.) and train only these initially.
3. Adjusting the Learning Rate
● Use a smaller learning rate during fine-tuning:
○ Allows gradual adaptation.
○ Prevents drastic changes to useful learned features.
● Typical strategy:
○ Use 1/10th of the original learning rate.
○ Example:
■ Original task: lr = 0.01
■ Fine-tuning: lr = 0.001
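A short sketch of these strategies, continuing the ResNet-18 example from the transfer-learning steps above; which block to unfreeze and the exact learning rates are assumptions.
# Unfreeze the last residual block so its weights can adapt to the new task
for param in model.layer4.parameters():
    param.requires_grad = True

# Fine-tune with a learning rate roughly 1/10th of the one used for the new head
optimizer = optim.Adam(
    [p for p in model.parameters() if p.requires_grad],
    lr=0.0001,
)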
Why are Hyperparameters Important in CNNs?
● Influence training speed – Determines how fast the model learns.
● Affect model accuracy – Key to achieving high prediction performance.
● Control overfitting and underfitting – Help balance model complexity.
● Crucial for generalization – Ensure good performance on unseen/test data.
● Enable faster convergence – Reduce training time by optimizing learning steps.
● Improve overall model performance – Well-tuned settings lead to better results.
Example: CNN Hyperparameter Setup
# Sample CNN Hyperparameters
learning_rate = 0.001
batch_size = 32
epochs = 50
optimizer = 'Adam'
num_filters = [32, 64, 128]
filter_size = (3, 3)
stride = 1
padding = 'same'
activation = 'ReLU'
dropout_rate = 0.5
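A hedged PyTorch sketch of a small CNN wired to these settings; the 32x32 RGB input size, two conv blocks, and 10 output classes are assumptions (the third entry of num_filters is omitted for brevity), and padding='same' requires PyTorch 1.9+.
import torch.nn as nn
import torch.optim as optim

model = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, stride=1, padding='same'),   # num_filters[0], 3x3 filter, stride 1, same padding
    nn.ReLU(),                                                   # activation
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, stride=1, padding='same'),  # num_filters[1]
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Dropout(p=0.5),                                           # dropout_rate
    nn.Linear(64 * 8 * 8, 10),                                   # assumes 32x32 inputs, 10 classes
)

optimizer = optim.Adam(model.parameters(), lr=0.001)             # optimizer, learning_rate
# batch_size and epochs would be used in the DataLoader and the training loop respectively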