
UNIT-IV: Improving Deep Neural Networks: Data Augmentation - Under-fitting vs. Over-fitting, Training Aspects of CNNs, Regularization, Weight Initialization, Activation Functions, Normalization, Hyperparameters in CNNs, Transfer Learning, and Fine-Tuning in CNNs

Bias

●​ Bias is the error from wrong assumptions in the model.​

●​ A high-bias model is too simple to learn the true patterns.​

●​ It underfits the data (poor performance on both training and test sets).​

📌 Example: Using a straight line to fit curved data.


Variance

●​ Variance is the error from the model being too sensitive to training data.​

●​ A high-variance model memorizes the training data and overfits.​

●​ It performs well on training data but poorly on new data.​

📌 Example: A deep neural net trained on very few images.


Bias-Variance Trade-off

●​ You need to balance bias and variance to build a good model.​

○​ Too much bias → underfitting.​

○​ Too much variance → overfitting.​

●​ The goal is to find a sweet spot where the model generalizes well to unseen data.​

Student Analogy – Understanding Bias & Variance

| Student | Behavior | Outcome in Class Test | Outcome in Monthly Test | Conclusion |
|---------|----------|-----------------------|-------------------------|------------|
| A | Distracted, pays no attention | ~50% (random guessing) | ~50% | Underfitting |
| B | Memorizes everything | 98% | Low score | Overfitting |
| C | Learns concepts & solves problems | 92% | Consistent score | Good Generalization |

➡️ Similar to:
●​ A = Simple model (can’t learn)​

●​ B = Overly complex model (memorizes)​

●​ C = Balanced model (generalizes well)​

Underfitting

●​ The model is too simple to learn the underlying pattern in the data.​

●​ Occurs due to high bias.​

●​ Poor performance on both training and test data.​

●​ Fails to capture important trends in the data.​

●​ Causes: Simple model, insufficient training, too few parameters.​

● Example: Using a linear model to classify complex image data (e.g., a 1-layer CNN on MNIST).

Overfitting

●​ The model is too complex and learns noise along with patterns.​

●​ Occurs due to high variance.​

●​ Performs very well on training data, but poorly on test data.​

●​ Fails to generalize to unseen data.​

●​ Causes: Complex model, small training dataset, too many parameters.​

● Example: A deep neural network memorizing a small set of images (e.g., a deep CNN trained on only 100 cat images).

Real-Life Analogy: Fruit Recognition

●​ Underfitting: Seen only one apple & banana → can't identify new fruit.​
●​ Overfitting: Memorized exact training images → fails on rotated banana.​

●​ Good Fit: Seen various types → generalizes to unseen examples.

Data Augmentation

● Data augmentation is a technique used to artificially increase the size and diversity of the training dataset.

● It involves applying various transformations to the original data to create new, modified examples.

● Helps improve model generalization and prevents overfitting, especially when the dataset is small.

● Commonly used in training Convolutional Neural Networks (CNNs) for image classification tasks.

Common Techniques of Data Augmentation

1.​ Flipping – Horizontally or vertically flipping the image.​

2.​ Rotation – Rotating the image by a small angle (e.g., ±10°).​

3.​ Scaling – Zooming in or out of the image.​

4.​ Cropping – Randomly cutting parts of the image.​

5.​ Brightness/Contrast Adjustment – Modifying lighting conditions.​

6.​ Translation – Shifting the image left, right, up, or down.​

7.​ Noise Injection – Adding random noise to make the model robust.
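
A minimal sketch of such an augmentation pipeline using torchvision transforms (the library choice and parameter values are illustrative assumptions, not part of the original notes):

import torch
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),                       # flipping
    transforms.RandomRotation(degrees=10),                        # rotation (±10°)
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),          # scaling + cropping
    transforms.ColorJitter(brightness=0.2, contrast=0.2),         # brightness/contrast
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),     # translation
    transforms.ToTensor(),
    transforms.Lambda(lambda x: x + 0.01 * torch.randn_like(x)),  # noise injection
])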


Purpose:

●​ Prevents overfitting​

●​ Exposes model to variations​

●​ Encourages robust feature learning​

Training Aspects of CNNs:

1. Regularization Techniques

Purpose: Prevents overfitting by limiting the complexity of the model. (A combined code sketch follows the techniques listed below.)
1.​ Dropout​

○​ Definition: Randomly disables a percentage of neurons during training.​

○​ Goal: Prevents overfitting by ensuring that the model doesn't rely too much
on any single neuron.​

2.​ L2 Regularization (Weight Decay)​

○​ Definition: Adds a penalty term to the loss function based on the squared
values of the weights.​

○​ Goal: Prevents the model from learning excessively large weights, which
could lead to overfitting.​

○ Formula: Loss_total = Loss_original + λ Σ wᵢ², where λ is the regularization parameter.

3.​ L1 Regularization​

○​ Definition: Adds a penalty term to the loss function based on the absolute
values of the weights.​

○​ Goal: Encourages sparsity in the model (many weights become zero), leading
to simpler models.​

○ Formula: Loss_total = Loss_original + λ Σ |wᵢ|, where λ is the regularization parameter.

4.​ Data Augmentation​

○ Definition: Artificially increases the size of the training dataset by applying random transformations to the images (e.g., flipping, rotation, scaling, cropping).

○​ Goal: Increases the variety of data the model sees, preventing it from
memorizing the training examples and improving generalization.​
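
A small PyTorch sketch combining these ideas (the layer sizes and coefficient values are assumptions for illustration):

import torch
import torch.nn as nn

# Dropout randomly disables 50% of this layer's neurons during training.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.ReLU(),
    nn.Dropout(p=0.5),
    nn.Linear(256, 10),
)

# L2 regularization (weight decay) is applied via the optimizer's weight_decay term.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, weight_decay=1e-4)

# L1 regularization can be added to the loss manually.
def l1_penalty(model, lam=1e-5):
    return lam * sum(p.abs().sum() for p in model.parameters())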

2. Weight Initialization Techniques in CNNs

1.​ Zero Initialization​

○​ Definition: All weights are initialized to zero.​

○ Problem: The network fails to break symmetry; every neuron in the layer learns the same thing, making the network ineffective.

○ Conclusion: Not used in practice.

2.​ Random Initialization​

○​ Definition: Weights are initialized with small random values drawn from a
normal or uniform distribution.​

○​ Problem: If the values are too small, it can lead to vanishing gradients; if
too large, it can lead to exploding gradients.​

3.​ Normal Distribution Initialization​

○ Definition: Weights are drawn from a normal (Gaussian) distribution, centered around a mean (usually 0) with a certain standard deviation.

○ Example: torch.nn.init.normal_(tensor, mean=0.0, std=0.02)

○ Advantage: Helps in maintaining variance in the model and prevents both vanishing and exploding gradients.

4.​ Uniform Distribution Initialization​

○ Definition: Weights are initialized using a uniform distribution where all values within a range are equally likely.

○​ Example: torch.nn.init.uniform_(tensor, a=-0.1, b=0.1)​

○​ Advantage: Ensures that all neurons start with different values, which is
important for symmetry breaking.

5.​ Xavier Initialization (Glorot Initialization)​

○​ Used For: Sigmoid and tanh activation functions.​

○​ Definition: Weights are initialized with values that maintain the variance
across layers. The goal is to keep the variance of activations and gradients
the same across layers.​

○ Formula:

■ For Uniform Distribution: W ~ U[-√(6/(n_in + n_out)), √(6/(n_in + n_out))]

■ For Normal Distribution: W ~ N(0, 2/(n_in + n_out))

○​ Advantage: Helps in preventing vanishing/exploding gradients.​

6.​ He Initialization (Kaiming Initialization)​

○​ Used For: ReLU and its variants (Leaky ReLU, ELU).​

○​ Definition: Weights are initialized with higher variance to account for the fact
that ReLU neurons "kill" half the activations (negative ones).​

○ Formula: W ~ N(0, 2/n_in), where n_in is the number of input units.

○ Advantage: Keeps the variance large enough to prevent dying ReLU problems.
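
A brief sketch of applying these initializers with torch.nn.init (which layer gets which scheme is an assumption for illustration):

import torch.nn as nn

def init_weights(m):
    if isinstance(m, nn.Conv2d):
        # He (Kaiming) initialization for ReLU-based convolutional layers
        nn.init.kaiming_normal_(m.weight, nonlinearity='relu')
        if m.bias is not None:
            nn.init.zeros_(m.bias)
    elif isinstance(m, nn.Linear):
        # Xavier (Glorot) initialization, suited to sigmoid/tanh layers
        nn.init.xavier_uniform_(m.weight)
        nn.init.zeros_(m.bias)

# model.apply(init_weights)  # applies init_weights to every submodule of a model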

3. Activation Functions

Sigmoid:​

● Formula: σ(x) = 1 / (1 + e^(-x))

●​ Range: 0 to 1​

●​ Use: Binary classification​

●​ Issue: Vanishing gradients.


Tanh:

●​ Range: -1 to 1​

●​ Use: Hidden layers​

●​ Issue: Vanishing gradients.​

ReLU:​

●​ Formula: ReLU(x)=max(0,x)​

●​ Range: 0 to ∞​

●​ Use: Hidden layers​

●​ Issue: Dying ReLU problem.​

Leaky ReLU:​

●​ Formula: Leaky ReLU(x)=max(0.01x,x)​

●​ Range: (-∞ to ∞)​

●​ Use: Hidden layers, solves dying ReLU.​


Softmax:

● Formula: softmax(xᵢ) = e^(xᵢ) / Σⱼ e^(xⱼ)

●​ Range: 0 to 1 (probabilities)​

●​ Use: Multi-class classification.​
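
A quick sketch comparing these activations on the same input tensor (the input values are chosen arbitrarily for illustration):

import torch
import torch.nn.functional as F

x = torch.tensor([-2.0, -0.5, 0.0, 1.5])

print(torch.sigmoid(x))        # squashed into (0, 1)
print(torch.tanh(x))           # squashed into (-1, 1)
print(F.relu(x))               # negatives clipped to 0
print(F.leaky_relu(x, 0.01))   # negatives scaled by 0.01 instead of clipped
print(F.softmax(x, dim=0))     # probabilities summing to 1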

4. Normalization

Normalization refers to the technique of adjusting the input data or activations in neural networks so that they lie within a certain range, helping to improve the efficiency and stability of the training process.

Common Normalization Techniques:


1.​ Batch Normalization (BN):​

○ Purpose: Reduces internal covariate shift by normalizing activations for each mini-batch.

○ How: For each layer, normalize the input features to have zero mean and unit variance.

○ Benefits: Speeds up training, allows higher learning rates, reduces overfitting.
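
A typical Conv → BatchNorm → ReLU block in PyTorch (channel counts are illustrative assumptions):

import torch.nn as nn

block = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=32, kernel_size=3, padding=1),
    nn.BatchNorm2d(32),   # normalizes each channel over the mini-batch
    nn.ReLU(),
)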

5. Loss Function

●​ Purpose: Measures how far the model’s predictions are from the true labels.​

●​ Examples:​

○​ Cross-Entropy Loss: For classification.​

○​ Mean Squared Error (MSE): For regression.​

6. Optimizer

●​ Purpose: Adjusts the weights based on the gradient of the loss function.​

●​ Common Optimizers:​

○​ Adam​

○​ SGD (Stochastic Gradient Descent)​

○​ RMSprop​
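
A combined sketch of choosing a loss function and an optimizer in PyTorch (the tiny linear model is only a stand-in assumption):

import torch
import torch.nn as nn

model = nn.Linear(10, 3)                   # stand-in network for illustration

criterion = nn.CrossEntropyLoss()          # classification loss
# criterion = nn.MSELoss()                 # regression alternative

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# other common choices:
# optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)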

7. Learning Rate

●​ Purpose: Controls the step size when updating the weights.​

●​ Impact:​

○​ Too high → Can lead to unstable learning.​

○​ Too low → Training can be very slow and might get stuck in suboptimal
solutions.​
8. Early Stopping

●​ Purpose: Stops training when the performance on the validation set stops improving.​

●​ Goal: Prevents overfitting by halting training before the model memorizes the training
data.
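
A minimal early-stopping loop sketch; train_one_epoch and evaluate are hypothetical helpers, and the patience value is an assumption:

best_val_loss = float('inf')
patience, patience_left = 5, 5

for epoch in range(100):
    train_one_epoch(model, optimizer)      # hypothetical training step
    val_loss = evaluate(model)             # hypothetical validation-loss computation

    if val_loss < best_val_loss:
        best_val_loss = val_loss
        patience_left = patience           # improvement: reset the counter
    else:
        patience_left -= 1                 # no improvement this epoch
        if patience_left == 0:
            break                          # stop before the model starts overfitting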

9. Hyperparameters in CNNs

Hyperparameters are parameters set before training that greatly affect how well a CNN
learns and performs. Tuning them properly is essential for optimal model performance.

1. Number of Layers

●​ Defines the depth of the CNN (how many convolutional + pooling layers).​

●​ Deeper networks can learn more complex features.​

●​ But too deep → risk of overfitting.​

●​ Start with fewer layers and increase gradually based on performance.

2. Filter Size

●​ Determines the receptive field (how much of the image is seen by a neuron).​

●​ Larger filters (e.g., 5x5) → capture more spatial info, but more parameters.​

●​ Smaller filters (e.g., 3x3) → fewer parameters, but might miss global features.​

●​ Common practice: use 3x3 filters in most modern CNNs.


3. Stride

●​ Controls how much the filter moves across the input.​

●​ Stride = 1 (default) → more detailed output.​

●​ Stride > 1 → reduces feature map size (downsampling).​

●​ Trade-off: smaller stride = more info, larger stride = faster but info loss.

4. Padding

●​ Adds zeros around the input to preserve its size after convolution.​

●​ Types:​

○​ Same padding: output size = input size.​

○​ Valid padding: no padding; output shrinks.​

●​ Helps retain edge information.​

●​ Slightly increases computation and memory use.
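
A short sketch showing how filter size, stride, and padding affect the output shape of a PyTorch Conv2d layer (the input size and channel counts are illustrative assumptions):

import torch
import torch.nn as nn

x = torch.randn(1, 3, 32, 32)              # one 32x32 RGB image

same = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=1)    # "same"-style padding
valid = nn.Conv2d(3, 16, kernel_size=3, stride=1, padding=0)   # "valid": no padding
down = nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1)    # stride 2 downsamples

print(same(x).shape)    # torch.Size([1, 16, 32, 32]): size preserved
print(valid(x).shape)   # torch.Size([1, 16, 30, 30]): output shrinks
print(down(x).shape)    # torch.Size([1, 16, 16, 16]): feature map halved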


5. Learning Rate

●​ Controls the step size during weight updates.​

●​ High learning rate → fast but unstable learning.​

●​ Low learning rate → stable but slow learning.​

●​ Needs fine-tuning; often adjusted dynamically during training.​

6. Batch Size

●​ Number of training samples processed before the model updates weights.​

●​ Large batch size:​

○​ Stable gradients.​

○​ Higher memory usage.​

●​ Small batch size:​

○​ Lower memory.​

○​ Faster updates, but noisier gradients.​

●​ Common values: 16, 32, 64, 128 (depending on GPU capacity).​


7. Number of Epochs

●​ One epoch = one full pass over the entire dataset.​

●​ Too few epochs → underfitting (model hasn’t learned enough).​

●​ Too many epochs → overfitting (model memorizes training data).​

●​ Use early stopping to halt training when validation performance stops improving.​

8. Dropout Rate

●​ Regularization method to prevent overfitting.​

●​ Randomly "drops" a fraction of neurons during training.​

●​ Common dropout rates: 0.2 to 0.5.​

●​ Helps ensure neurons don’t become overly reliant on each other (co-adaptation).​

Hyperparameter Optimization Techniques

These techniques help in finding the best set of hyperparameters for training a CNN
effectively.

1. Grid Search

● Definition: Exhaustively tries all possible combinations of hyperparameters from a predefined set (grid).

● ✅ Simple and systematic.

● ❌ Computationally expensive, especially with many parameters or large datasets.
2. Random Search

● Definition: Randomly samples a fixed number of hyperparameter combinations from defined ranges.

● ✅ More efficient than grid search in high-dimensional spaces.

● ✅ Often finds good results faster.

● ❌ May miss optimal combinations.
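
A toy random-search sketch (the search space and the train_and_score helper are hypothetical assumptions):

import random

search_space = {
    'learning_rate': [1e-1, 1e-2, 1e-3, 1e-4],
    'batch_size':    [16, 32, 64, 128],
    'dropout_rate':  [0.2, 0.3, 0.4, 0.5],
}

best_score, best_config = -1.0, None
for _ in range(10):                                   # 10 random trials
    config = {k: random.choice(v) for k, v in search_space.items()}
    score = train_and_score(config)                   # hypothetical: trains a CNN, returns validation accuracy
    if score > best_score:
        best_score, best_config = score, config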
3. Bayesian Optimization

● Definition: Builds a probabilistic model of the objective function and uses it to choose the next set of hyperparameters.

● ✅ Smart and efficient search.

● ✅ Focuses on promising regions of the parameter space.

● ❌ More complex to implement.
Transfer Learning with Convolutional Neural Networks (CNNs)

Transfer learning allows a CNN trained on a large dataset (like ImageNet) to be reused for a
related but different task, saving time, computation, and data.

Why Use Transfer Learning?

●​ Pre-trained CNNs learn general visual features (edges, textures, shapes).​

●​ Saves time and resources.​

●​ Improves performance on small datasets.​

●​ Reduces the need for extensive training.​

Key Idea:

●​ Use pre-trained CNN as a feature extractor.​

●​ Freeze pre-trained layers (do not update them).​

●​ Add new task-specific layers on top and train them on your dataset.​

Common Pre-trained Models:

●​ VGG​

●​ ResNet​

●​ Inception​

●​ MobileNet​
(Available in TensorFlow, PyTorch, etc.)​
Steps to Implement Transfer Learning

1.​ Select Pre-trained Model​

○​ Choose a model suited to your problem (e.g., image classification, detection).​

2.​ Load Model without Top Layers​

○​ Remove the original fully-connected layers (used for previous task).​

3.​ Customize the Model​

○​ Add your own layers (e.g., Dense, Dropout, Softmax) for the new
classification task.​

4.​ Freeze Pre-trained Layers​

○​ Prevent updates to these layers during training.​

5.​ Prepare the Dataset​

○​ Resize, normalize, and augment images to match the input format of the
model.​

6.​ Train the Model​

○​ Only newly added layers are trained. Use appropriate optimizer and loss
function.​

7.​ Fine-tune (Optional)​

○​ Unfreeze some top layers of the pre-trained model and re-train with a low
learning rate.​

8.​ Evaluate the Model​

○ Use validation/test data. Assess performance using metrics like accuracy, loss, precision, or recall.
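
A compact sketch of steps 1-6 using a torchvision ResNet-18 (the 5-class output size is an illustrative assumption):

import torch
import torch.nn as nn
from torchvision import models

# Steps 1-2: load a pre-trained model; its convolutional base is reused as a feature extractor.
model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)

# Step 4: freeze all pre-trained layers.
for param in model.parameters():
    param.requires_grad = False

# Step 3: replace the top fully-connected layer for the new task.
model.fc = nn.Linear(model.fc.in_features, 5)

# Step 6: train only the newly added layer.
optimizer = torch.optim.Adam(model.fc.parameters(), lr=1e-3)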
Fine-tuning Strategies in CNNs (Transfer Learning)

1. Freezing Layers

●​ Purpose: Preserve previously learned features.​

●​ How:​

○​ Freeze early layers (low-level features like edges/textures).​

○​ Optionally freeze all layers except the final one.​

○​ Only train the last few or new layers.​

2. Modifying Layers

●​ Output Layer:​

○​ Replace if the number of classes is different from the original task.​

●​ Input Layer:​

○​ Adjust if the input feature size or shape has changed.​

●​ Add New Layers:​

○​ Add custom layers (Dense, Dropout, etc.) and train only these initially.​

3. Adjusting the Learning Rate

●​ Use a smaller learning rate during fine-tuning:​

○​ Allows gradual adaptation.​

○​ Prevents drastic changes to useful learned features.​


●​ Typical strategy:​

○​ Use 1/10th of the original learning rate.​

○​ Example:​

■​ Original task: lr = 0.01​

■​ Fine-tuning: lr = 0.001
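
Continuing the transfer-learning sketch above, fine-tuning might unfreeze the top convolutional block and give it a learning rate one tenth of the new head's (the specific layers and values are assumptions):

import torch

for param in model.layer4.parameters():   # ResNet-18's top convolutional block
    param.requires_grad = True

optimizer = torch.optim.Adam([
    {'params': model.layer4.parameters(), 'lr': 1e-4},   # fine-tuned layers: small lr
    {'params': model.fc.parameters(),     'lr': 1e-3},   # newly added head: larger lr
])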

Why are Hyperparameters Important in CNNs?

●​ Influence training speed – Determines how fast the model learns.​

●​ Affect model accuracy – Key to achieving high prediction performance.​

●​ Control overfitting and underfitting – Help balance model complexity.​

●​ Crucial for generalization – Ensure good performance on unseen/test data.​

●​ Enable faster convergence – Reduce training time by optimizing learning steps.​

●​ Improve overall model performance – Well-tuned settings lead to better results.​

Example: CNN Hyperparameter Setup

# Sample CNN Hyperparameters

learning_rate = 0.001

batch_size = 32

epochs = 50

optimizer = 'Adam'

num_filters = [32, 64, 128]

filter_size = (3, 3)

stride = 1

padding = 'same'

activation = 'ReLU'

dropout_rate = 0.5
